
2004/10/27

ECMWF Workshop 2004 - Parallelisation of HIRLAM

Parallelization of HIRLAM

Model Parallelization and Asynchronous I/O

HIRLAM Optimizations

• Two major projects:
  – Model scalability improvement
  – Asynchronous I/O optimization
• Both projects funded by DMI and NEC


HIRLAM Parallelization

Message Passing Optimization

HIRLAM Scalability Optimization

• Methods
• Implementation
• Performance

Optimization Focus

• Data transposition
  – from 2D to FFT distribution and reverse
  – from FFT to TRI distribution and reverse
• Exchange of halo points
  – between north and south
  – between east and west
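The halo exchange named above can be illustrated with a minimal sketch. It assumes two neighbouring sub grids stored as lists of rows with a one-point halo column on each side; the function name `swap_east_west` and the halo width are hypothetical, and the real model passes these columns as messages between processes rather than copying them in one address space.

```python
def swap_east_west(west, east):
    """Exchange one-point halos between two east-west neighbours.

    Each grid is a list of rows. Column 0 and column -1 are halo points;
    the columns in between are interior points owned by that sub grid.
    """
    for row_w, row_e in zip(west, east):
        # East halo of the west grid receives the first interior column
        # of the east grid.
        row_w[-1] = row_e[1]
        # West halo of the east grid receives the last interior column
        # of the west grid.
        row_e[0] = row_w[-2]


# Two sub grids with 2 rows and 3 interior columns each; halos start empty.
west = [[None, 1, 2, 3, None], [None, 4, 5, 6, None]]
east = [[None, 7, 8, 9, None], [None, 10, 11, 12, None]]
swap_east_west(west, east)
```

A north-south swap would follow the same pattern with whole rows instead of columns.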

Approach

• First attempt: straightforward conversion from SHMEM to MPI-2 put/get calls
  – It works, but:
  – Too much overhead due to fine granularity
• Original plan shattered, now what?
  – Panic (mild form)
• More to do: in-depth analysis of how things really work
• Human memory: retention at least 6 years

Approach

• Redesign of transposition based on the method used in PARLAM, an initiative by met.no (Dag Bjørge and Roar Skålin, 1995)
• Redesign of halo swap routines
• Benefits:
  ➼ fewer and larger messages
  ➼ independent message passing process groups

2D Sub Grids

• HIRLAM sub grid definition in TWOD data distribution
• Processors: nproc = nprocx · nprocy

[Figure: grid divided into 12 sub grids, one per processor, numbered row by row:
   0  1  2
   3  4  5
   6  7  8
   9 10 11
 Axes: longitude, latitude, levels]
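As a sketch of the TWOD layout, the code below maps a processor number to its position in the nprocx by nprocy layout and to the index range of its sub grid. The row-by-row numbering follows the figure; the helper names and the rule for spreading leftover points over the first blocks are assumptions, not taken from the HIRLAM source. The horizontal grid size is the test grid given later in the talk.

```python
def proc_coords(rank, nprocx):
    """Return (ix, iy): column and row of a processor in the 2D layout."""
    return rank % nprocx, rank // nprocx


def block_range(npoints, nparts, part):
    """Half-open index range [start, stop) of one block along an axis.

    Leftover points are given to the lowest-numbered blocks (an assumption).
    """
    base, extra = divmod(npoints, nparts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def sub_grid(rank, nprocx, nprocy, nlon, nlat):
    """Longitude and latitude index ranges owned by one processor."""
    ix, iy = proc_coords(rank, nprocx)
    return block_range(nlon, nprocx, ix), block_range(nlat, nprocy, iy)


# 12 processors as drawn on the slide, 602 x 568 horizontal grid points.
lon_range, lat_range = sub_grid(4, nprocx=3, nprocy=4, nlon=602, nlat=568)
```

Every processor keeps all vertical levels of its sub grid in this distribution.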

Original FFT Sub Grids

• HIRLAM sub grid definition in FFT data distribution
• Each processor handles slabs of full longitude lines

[Figure: FFT sub grids; axes: longitude, latitude, levels]

2D ↔ FFT Redistribution

Sub grid data to be distributed to all processors: nproc² send-receive pairs

[Figure: sub grid of processor 4 highlighted; axes: longitude, latitude, levels]

2D ↔ FFT Redistribution

• Sub grids in east-west direction form full longitude lines
• nprocy independent sets of nprocx² send-receive pairs
• Total number of pairs: nprocy · nprocx², or: nproc² / nprocy
• nprocy times fewer messages

[Figure: sub grids of processors 3, 4 and 5 highlighted as one east-west group; axes: longitude, latitude, levels]
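The message counts above can be checked with a short sketch. It counts send-receive pairs for the original all-to-all redistribution and for the redesigned one, where each east-west row of processors exchanges data only within its own group. The function names are hypothetical; the 3 by 4 layout is the one drawn on the slides.

```python
def pairs_all_processors(nprocx, nprocy):
    """Every processor exchanges with every processor: nproc squared pairs."""
    nproc = nprocx * nprocy
    return nproc * nproc


def pairs_row_groups(nprocx, nprocy):
    """nprocy independent groups, each with nprocx squared pairs."""
    return nprocy * nprocx * nprocx


def row_groups(nprocx, nprocy):
    """Processor numbers of each east-west group, numbered row by row."""
    return [[iy * nprocx + ix for ix in range(nprocx)] for iy in range(nprocy)]


nprocx, nprocy = 3, 4
old_pairs = pairs_all_processors(nprocx, nprocy)   # 144
new_pairs = pairs_row_groups(nprocx, nprocy)       # 36
groups = row_groups(nprocx, nprocy)
```

The ratio of the two counts equals nprocy, and because the groups share no processors they can exchange data independently of each other.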

Transpositions 2D ↔ FFT ↔ TRI

[Figure: layout of processors 0 to 11 in the 2D, FFT and TRI distributions; axes: longitude, latitude, levels]

MPI Methods

• Three transfer methods tried:
  – Remote Memory Access: mpi_put, mpi_get
  – Async Point-to-Point: mpi_isend, mpi_irecv
  – All-to-All: mpi_alltoallv, mpi_alltoallw
• Buffering vs. direct
  – Explicit buffering
  – MPI derived types

(Method selection by environment variables)
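To illustrate what explicit buffering means here, the sketch below copies a rectangular block out of a field stored row by row into one contiguous buffer, and copies it back on the receiving side. With MPI derived types the same block would instead be described to the library and sent directly from the field. This is a plain Python stand-in with hypothetical function names, not the model's Fortran code, and it performs no message passing itself.

```python
def pack_block(field, nlon, lat_range, lon_range):
    """Copy a block of a row-major flat field into a contiguous buffer."""
    buf = []
    for j in range(*lat_range):
        row_start = j * nlon
        buf.extend(field[row_start + lon_range[0]:row_start + lon_range[1]])
    return buf


def unpack_block(buf, field, nlon, lat_range, lon_range):
    """Copy a contiguous buffer back into a block of a row-major flat field."""
    width = lon_range[1] - lon_range[0]
    for k, j in enumerate(range(*lat_range)):
        row_start = j * nlon
        field[row_start + lon_range[0]:row_start + lon_range[1]] = (
            buf[k * width:(k + 1) * width]
        )


# A 3 x 4 field (3 latitude rows, 4 longitude points), values 0..11.
source = list(range(12))
send_buffer = pack_block(source, nlon=4, lat_range=(0, 3), lon_range=(1, 3))

target = [0] * 12
unpack_block(send_buffer, target, nlon=4, lat_range=(0, 3), lon_range=(1, 3))
```

The packed buffer is what would be handed to a single send call, which is how buffering leads to fewer and larger messages.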

SHMEM Support

• Integrated in new design
• Not much specific SHMEM code left: same buffer structures used for MPI and SHMEM
• Very similar to MPI-2 RMA code part
• New SHMEM code ported to SHMEM architecture by Ole Vignes, met.no

Performance

Test grid        Details
Longitude        602 points
Latitude         568 points
Vertical         60 levels
NSTOP            40 steps
Initialization   none
Time step        180 seconds

Parallel Speedup on NEC SX-6

• Cluster of 8 NEC SX-6 nodes at DMI
• Up to 60 processors:
  – 7 nodes with 8 processors per node
  – 1 node with 4 processors
• Parallel efficiency 78% on 60 processors

[Figure: speedup (0 to 8) versus number of processors (0 to 60); curves: ideal, original, optimized]


Performance – Observations on SX-6

• New data redistribution method much more efficient (78% vs. 45% on 60 processors)

• No performance advantage with RMA (one-sided MP) or All-to-All over plain Point-to-Point method

• Elegant code with MPI derived types, but explicit buffering faster
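The efficiency figures can be related to speedup with a small sketch, assuming the usual definition of parallel efficiency as measured speedup divided by ideal speedup. The slides do not state the baseline run; the speedup axis of the plot ends at 8, which would be consistent with a baseline of one 8-processor node, but that is an assumption, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, nproc, nproc_ref=1):
    """Measured speedup divided by the ideal speedup nproc / nproc_ref."""
    return speedup / (nproc / nproc_ref)


def speedup_from_efficiency(efficiency, nproc, nproc_ref=1):
    """Speedup implied by a given efficiency relative to the baseline."""
    return efficiency * nproc / nproc_ref


# Efficiencies quoted for 60 processors; baseline of 8 processors is assumed.
optimized = speedup_from_efficiency(0.78, 60, nproc_ref=8)
original = speedup_from_efficiency(0.45, 60, nproc_ref=8)
ideal = 60 / 8
```

Under this assumption the optimized code would reach a speedup of about 5.85 against an ideal of 7.5, and the original code about 3.4.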

Performance – Observations on SGI Origin

• Tests executed at met.no with grid size 468x378x40 points on 196 processors
• Time step times:
  – Old SHMEM: 1.4 s
  – New SHMEM: 1.15 s
  – MPI: 0.85 s
• Conclusions regarding SHMEM:
  – New SHMEM code faster than old SHMEM code
  – New MPI code faster than new SHMEM code
  – End of SHMEM version?
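The time step times above translate into relative speedups as follows. This is plain arithmetic on the three quoted numbers; the dictionary keys and the helper name are hypothetical.

```python
# Time per time step in seconds, as quoted for the SGI Origin tests.
step_time = {"old SHMEM": 1.4, "new SHMEM": 1.15, "MPI": 0.85}


def relative_speedup(slow, fast):
    """How many times faster the 'fast' version is than the 'slow' one."""
    return step_time[slow] / step_time[fast]


new_shmem_vs_old = round(relative_speedup("old SHMEM", "new SHMEM"), 2)
mpi_vs_old = round(relative_speedup("old SHMEM", "MPI"), 2)
mpi_vs_new_shmem = round(relative_speedup("new SHMEM", "MPI"), 2)
```

With these figures the new SHMEM code is about 1.22 times faster than the old one, and the MPI code about 1.65 times faster than the old SHMEM code.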


HIRLAM GRIBfile Server

Asynchronous I/O

HGS Overview

HIRLAM GRIBfile Server
• Purpose: take both input and output processing out of the time stepping loop
• Server is a virtual concept, not a separate application or system
• Integrated in model source
• Enabled/disabled at runtime
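The idea of taking output out of the time stepping loop can be sketched as below. A worker thread and a queue stand in for the HGS tasks and the message passing between model and server; this substitution, the function names and the toy "model state" are all assumptions made for illustration. The `use_server` flag mirrors the runtime enable/disable switch.

```python
import queue
import threading


def run_model(nsteps, output_every, use_server=True):
    """Toy time stepping loop; returns the 'files' written, in order."""
    written = []

    def write_file(step, data):
        # Stands in for encoding the fields and writing them to disk.
        written.append((step, sum(data)))

    requests = queue.Queue()

    def server():
        while True:
            item = requests.get()
            if item is None:          # shutdown request
                break
            write_file(*item)

    worker = None
    if use_server:
        worker = threading.Thread(target=server)
        worker.start()

    state = [0.0] * 4
    for step in range(1, nsteps + 1):
        state = [x + 1.0 for x in state]            # one model time step
        if step % output_every == 0:
            if use_server:
                requests.put((step, list(state)))   # hand over a copy, continue
            else:
                write_file(step, state)             # I/O inside the loop

    if worker is not None:
        requests.put(None)
        worker.join()
    return written


asynchronous = run_model(6, 2, use_server=True)
sequential = run_model(6, 2, use_server=False)
```

Both modes produce the same files; the difference is that with the server enabled the loop only pays for handing over the data, not for writing it.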

HGS Overview

Brief history
1. HGS originally written for shared memory architecture (IPC) (Sun, Jan Boerhout, 2001)
2. Similar initiative for distributed memory architecture (MPI), output only (CSC, Jussi Heikonen and FMI, Kalle Eerola, 2001)
3. HGS rewrite for MPI, based on ideas from 1 and 2 (met.no, Ole Vignes, 2002)
4. Recently optimized for faster data exchange (NEC, Jan Boerhout, 2004; funded by DMI)

HIRLAM I/O Activity - Sequential

• 3-hour forecast
• 610 x 568 x 40 points
• 15 SX-6 processors (8 GF peak)
• Hourly input and output
• Time step 360 seconds
• Graph shows function call stack on time line

[Figure: function call stack of the model over time (s), with input, output and time step phases marked]
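For orientation, the run described above can be put into numbers with a short sketch. It assumes that hourly I/O falls exactly on time step boundaries and counts the I/O events at the end of each forecast hour; the function name is hypothetical.

```python
def io_schedule(forecast_hours, time_step_s, io_interval_s=3600):
    """Return the number of time steps and the steps at which I/O is due."""
    nsteps = forecast_hours * 3600 // time_step_s
    steps_per_io = io_interval_s // time_step_s
    io_steps = list(range(steps_per_io, nsteps + 1, steps_per_io))
    return nsteps, io_steps


nsteps, io_steps = io_schedule(forecast_hours=3, time_step_s=360)
grid_points = 610 * 568 * 40
```

This gives 30 time steps with I/O due after steps 10, 20 and 30, on a grid of 13,859,200 points.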

HIRLAM I/O Activity - Asynchronous

[Figure: function call stacks of HGS and model over time (s); HGS phases: disk input, disk output; model phases: model input, model output]

HIRLAM I/O Activity - Asynchronous

HGS:
1. Receives pre-read request and reads first input file
2. Reads second input file

Model:
A. Sends HGS request to pre-read first two input files and waits for HGS availability confirmation

[Figure: HGS and model call stacks over time (s), with steps 1, 2 and A marked]

HIRLAM I/O Activity - Asynchronous

HGS:
3. Sends data from first two input files

Model:
B. Receives data from first two input files

[Figure: HGS and model call stacks over time (s), with steps 3 and B marked]

HIRLAM I/O Activity - Asynchronous

HGS:
4. Collects data for first output files

Model:
C. First time step
D. Sends data for first three output files

[Figure: HGS and model call stacks over time (s), with steps 4, C and D marked]

HIRLAM Model and HGS Animation

• Demonstration of output (input very similar)
  – Four model sub grids
  – One HGS process
  – Buffer block size: 4 fields (in reality: 100)
• Model running...
• Next slides: model writes output, HGS collects and stores the data

[Figure: four model sub grids, labelled Grid 0 to Grid 3]

HGS Collects Output

[Animation in four frames: the GRIBfile Server collects the output data]

HIRLAM I/O Activity - Output

Output: short model communication time
Will be even shorter with DMI’s new MSLP algorithm

[Figure: HGS and model call stacks over time (s) during output]

HIRLAM I/O Activity - Asynchronous

HGS:
5. Writes output files to disk

Model:
E. Sends requests to store output and pre-read next input, then continues

[Figure: HGS and model call stacks over time (s), with steps 5 and E marked]

Model continues while HGS Writes to Disk

[Animation in nine frames: the model continues time stepping while the GRIBfile Server writes to disk]

HIRLAM I/O Activity - Asynchronous

HGS:
6. Reads next input file
7. Sends input data

Model:
F. Sends requests to receive input data
G. Next time step

[Figure: HGS and model call stacks over time (s), with steps 6, 7, F and G marked]
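The pre-read pattern in the steps above can be sketched as follows: the next input is requested early, the model keeps time stepping, and the data is only waited for when it is needed. A single-worker thread pool stands in for the HGS task, and `read_input`, `forecast` and the toy data are hypothetical names for illustration; the real exchange uses message passing between model and HGS processes.

```python
from concurrent.futures import ThreadPoolExecutor


def read_input(hour):
    """Stand-in for reading and decoding one input file."""
    return {"hour": hour, "fields": [float(hour)] * 3}


def forecast(hours, steps_per_hour):
    """Toy forecast loop that overlaps reading input with time stepping."""
    log = []
    with ThreadPoolExecutor(max_workers=1) as hgs:
        pending = hgs.submit(read_input, 0)        # initial pre-read request
        for hour in range(hours):
            boundary = pending.result()            # receive the input data
            if hour + 1 < hours:
                # Ask for the next input now, then continue with the model.
                pending = hgs.submit(read_input, hour + 1)
            for _ in range(steps_per_hour):
                log.append(("step", boundary["hour"]))
    return log


steps = forecast(hours=3, steps_per_hour=2)
```

Because the read of the next file runs while the model is time stepping, the model's own share of the input cost shrinks to receiving data that is already in memory.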

HIRLAM I/O Activity - Input

Input: very short model communication time

[Figure: HGS and model call stacks over time (s) during input]

HIRLAM I/O Activity - Asynchronous

HGS:
8. Collects output data (repeat of 4)

Model:
H. Sends output data (repeat of D)

[Figure: HGS and model call stacks over time (s), with steps 8 and H marked next to the earlier steps 4 and D]

Multiple HGS Tasks Supported

[Figure: call stacks over time (s) with multiple HGS tasks: 3 HGS, 12 model (1 shown); input and output phases marked]

HGS Documentation

• More detailed descriptions in HGS documentation
• To be published as HIRLAM Newsletter, see: http://hirlam.knmi.nl
• Graphs generated with Vftrace, free tool for NEC SX users, see http://vftrace.jboerhout.nl

[Figure 4.4 – Model sends output data - overview: HGS and model call stacks over time (s), with model steps A to O and HGS steps 1 to 7 marked]


Questions?

• Thank you!