Preparing NWP-Models for Tera-Computing Ulrich Schättler, Elisabeth Krenzien, Michael Czajkowski Deutscher Wetterdienst

Preparing NWP-Models for Tera-Computing

Ulrich Schättler, Elisabeth Krenzien,

Michael Czajkowski

Deutscher Wetterdienst

FE 13 / TI 15 HPC Workshop / ECMWF 25.-29.10.2004

Contents

• T3E → IBM: Late Lament

• Upgrade of NWP-System at DWD

• Optimizations for Boundary Exchange and I/O

• Environment for the NWP-System: Now / Future

• LM_RAPS_3.0

• Conclusions

T3E → IBM: Late Lament

• When profiling various applications on a big IBM system, which MPI routine do you expect to take the most time?

• It is MPI_BARRIER!

• The reason is that most codes were developed on a T3E:

– Barriers took almost no time

– MPI_SEND and MPI_RECV were not really blocking, and a barrier could help to ensure correct program execution

• Is IBM communication really so bad?

Timings for LM on T3E / IBM

Timings in seconds               T3E    IBM (pwr3)
Dyn.   Computations          1146.58      1105.09
       Communications         324.90       568.19
       Barrier waiting        187.09       373.63
Phys.  Computations           708.80       712.22
       Communications          30.84        60.05
       Barrier waiting        200.41       241.18
I/O                           576.05       358.05
# Processors                     484          160

Timings for LM on T3E / IBM

• Today's operational domain size: 325 × 325 × 35

• 48 hour forecast (should be finished within 1 hour)

• Timings: "Not so good, but acceptable"

• But what happens if O(1000) processors are used?
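The one-hour target can be checked against the T3E / IBM table with a few lines of arithmetic. The following is a minimal sketch in Python, assuming that the seven timing rows are disjoint and may simply be added (the slides do not state this); the names `timings`, `summarize` and `LIMIT_SECONDS` are made up for this illustration.

```python
# Timings in seconds, copied from the T3E / IBM table.
# Assumption: the rows are disjoint categories that may be summed.
timings = {
    "T3E": {
        "dyn_comp": 1146.58, "dyn_comm": 324.90, "dyn_barrier": 187.09,
        "phys_comp": 708.80, "phys_comm": 30.84, "phys_barrier": 200.41,
        "io": 576.05,
    },
    "IBM": {
        "dyn_comp": 1105.09, "dyn_comm": 568.19, "dyn_barrier": 373.63,
        "phys_comp": 712.22, "phys_comm": 60.05, "phys_barrier": 241.18,
        "io": 358.05,
    },
}

LIMIT_SECONDS = 3600.0  # "should be finished within 1 hour"


def summarize(rows):
    """Return (sum of all rows, barrier waiting, barrier share of the sum)."""
    total = sum(rows.values())
    barrier = rows["dyn_barrier"] + rows["phys_barrier"]
    return total, barrier, barrier / total


for machine, rows in timings.items():
    total, barrier, share = summarize(rows)
    print(f"{machine}: sum {total:8.2f} s, barrier waiting {barrier:7.2f} s "
          f"({100 * share:4.1f} %), within limit: {total <= LIMIT_SECONDS}")
```

Under that assumption both runs stay below 3600 s, and barrier waiting makes up a larger share of the summed time on the IBM than on the T3E. Note that the two runs use different processor counts (484 vs. 160), so the columns are not a like-for-like comparison.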

Upgrade of NWP-System

• New model components

– Cloud ice scheme (GME / LM)

– Multi-layer soil model (GME / LM in QI 2005)

– Sea-ice model (GME)

– Prognostic Precipitation (LM)

– 2 time level Runge-Kutta numerical core

  • 3rd order in time; 5th order (horizontal) in space

  • at the moment tested for very high resolution runs

Upgrade of NWP-System

• In Preparation

– 3D Var Physical Space Assimilation System (GME)

– Assimilation of radar data (latent heat nudging; LM)

– 3D Turbulence scheme (LM)

– Graupel scheme (LM)

– Parameterization of shallow convection (LM)

– Lake-Model (LM)

Upgrade of NWP-System

• Local Model (LM):

– The LM is used and further developed within the Consortium for Small Scale Modeling (COSMO)

– The aim is to run the LM at very high resolution (≤ 3 km): LMK (LM Kürzestfrist, i.e. very short range)

– But coming up next: running the LM with 7 km resolution over the whole of Europe (LME)

• Global Model (GME):

– GME is now run with a resolution of about 40 km and 40 vertical layers

LMK

• 2.8 km grid spacing

• 421×461 grid points

• 50 vertical layers

• 30 s time step

• 2 time level Runge-Kutta

• continuous upgrade by new components

• operational in 2006

LME

• Still 7 km grid spacing

• 665×657 grid points

• 40 vertical layers

• 40 s time step

• 3 time level Leapfrog

• (or Runge-Kutta: 72 s)

• operational in 2005 (QII)
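To get a feeling for the problem sizes of LMK and LME, the grid dimensions and time steps listed on the two slides can be multiplied out. This is a rough sketch only: the class `Setup` and its field names are invented for the illustration, and "grid-point updates per forecast hour" is a crude proxy that ignores the different cost per step of the Leapfrog and Runge-Kutta schemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    name: str
    nx: int          # grid points in one horizontal direction
    ny: int          # grid points in the other horizontal direction
    nz: int          # vertical layers
    dt_seconds: int  # time step

    def grid_points(self) -> int:
        return self.nx * self.ny * self.nz

    def steps_per_forecast_hour(self) -> float:
        return 3600 / self.dt_seconds


# Values taken from the LMK and LME slides.
setups = [
    Setup("LMK", 421, 461, 50, 30),
    Setup("LME (Leapfrog)", 665, 657, 40, 40),
    Setup("LME (Runge-Kutta)", 665, 657, 40, 72),
]

for s in setups:
    updates = s.grid_points() * s.steps_per_forecast_hour()
    print(f"{s.name:18s} {s.grid_points():>11,d} grid points, "
          f"{s.steps_per_forecast_hour():5.0f} steps/h, "
          f"{updates:.3e} grid-point updates per forecast hour")
```

LME has the larger grid, while LMK needs more time steps per forecast hour because of its 30 s step.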

LME: First Timings

Dyn.   Computations           367.51
       Communications          85.30
       Barrier waiting        130.37
Phys.  Computations           188.16
       Communications           7.31
       Barrier waiting         31.76
Input                         126.62
Output                        258.70
Total Time for the Job       1285.46

LME: First Timings

• 12 hour forecast on 900+1 processors (should take ≤ 900 s)

• Boundary exchange with MPI_ISEND and MPI_WAIT on the sender side and MPI_RECV on the receiver side

• Explicit buffering of data

• Extra processor for I/O (asynchronous I/O), but implemented with blocking communication

• Still using all those barriers!
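"Explicit buffering" means that the boundary values are copied into a contiguous send buffer before the send and copied out of a receive buffer afterwards. The sketch below mimics these two copies in plain Python for two neighbouring subdomains. It contains no MPI calls: the message transfer is replaced by a list copy, and the halo width of 1 as well as all function names are assumptions made for the illustration, not taken from the LM code.

```python
def make_subdomain(ny, nx, halo, value):
    """Create a ny x (nx + 2*halo) field: interior = value, halo cells = None."""
    return [[None] * halo + [value] * nx + [None] * halo for _ in range(ny)]


def pack_east_boundary(field, halo):
    """Copy the easternmost `halo` interior columns into one contiguous buffer."""
    return [v for row in field for v in row[-2 * halo:-halo]]


def unpack_west_halo(field, buffer, halo):
    """Copy a received buffer into the western halo columns."""
    for j, row in enumerate(field):
        row[:halo] = buffer[j * halo:(j + 1) * halo]


HALO = 1  # assumed halo width; not given on the slides
left = make_subdomain(ny=3, nx=4, halo=HALO, value="L")
right = make_subdomain(ny=3, nx=4, halo=HALO, value="R")

send_buffer = pack_east_boundary(left, HALO)  # explicit copy on the sender
recv_buffer = list(send_buffer)               # stands in for the MPI message
unpack_west_halo(right, recv_buffer, HALO)    # explicit copy on the receiver

print(right[0])
```

After the exchange the western halo column of `right` holds the eastern boundary values of `left`; the eastern halo of `right` is still unset because only one direction is exchanged here.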

LM Time Step with Trace Analyzer

Output Step

Input Step

Optimizations

• Boundary Exchange with

– MPI_ISEND, MPI_WAIT and MPI_RECV

– MPI_IRECV, MPI_WAIT and MPI_SEND

– MPI_SENDRECV

• Implicit buffering of data by using MPI_DATATYPES

• Non-blocking communication for extra I/O processor

• Try to do a look-ahead reading

• There is really no need for barriers!
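The idea behind look-ahead reading is to start fetching the input for the next step while the current step is still being computed. A minimal sketch in Python, in which a background thread stands in for the extra I/O processor; `run_with_lookahead`, `read_input` and `compute` are hypothetical names, and nothing here reflects how the LM actually implements it.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_lookahead(read_input, compute, n_steps):
    """Overlap reading the input of step k+1 with the computation of step k."""
    if n_steps <= 0:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as reader:
        pending = reader.submit(read_input, 0)  # start the first read
        for step in range(n_steps):
            data = pending.result()             # wait for this step's input
            if step + 1 < n_steps:
                # Look-ahead: request the next input before computing.
                pending = reader.submit(read_input, step + 1)
            results.append(compute(data))
    return results


def fake_read(step):
    """Stand-in for reading boundary data from disk."""
    return [step, step, step]


print(run_with_lookahead(fake_read, sum, 3))
```

The compute loop only blocks in `pending.result()` if the read has not finished by the time the data is needed, which is the point where waiting time would show up in a trace.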

Optimized LM Time Step

Optimized Output Step

Optimized Input Step

Optimized Input Step - 2

LME: Old vs. Optimized Timings

                                 Old    Comm + Output    + Input
Dyn.   Computations           367.51        349.29        349.65
       Communications          85.30        138.96        221.58
       Barrier waiting        130.37
Phys.  Computations           188.16        187.36        187.50
       Communications           7.31         25.40         26.09
       Barrier waiting         31.76
Input                         126.62        135.29         80.70
Output                        258.70        142.21        142.19
Total Time for the Job       1285.46       1066.35       1095.01

(In the two optimized columns a single value is given for communications and barrier waiting together.)
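The effect of the optimizations on the total job time, and the remaining distance to the 900 s target for the 12 hour forecast, follow directly from the totals in the table. A small sketch, with the dictionary `totals` and the helper `relative_saving` introduced only for this illustration:

```python
TARGET_SECONDS = 900.0  # "should be <= 900 s" for the 12 hour forecast

# Total job times in seconds, copied from the table.
totals = {
    "old": 1285.46,
    "comm + output": 1066.35,
    "+ input": 1095.01,
}


def relative_saving(old, new):
    """Fraction of the old run time saved by the new version."""
    return (old - new) / old


for label, total in totals.items():
    saving = relative_saving(totals["old"], total)
    print(f"{label:14s} {total:8.2f} s, saving {100 * saving:5.1f} %, "
          f"{total - TARGET_SECONDS:6.2f} s above the target")
```

All three totals are still above 900 s. The run with the input optimization has a lower input time (80.70 s instead of 135.29 s) but a slightly higher total than the run without it.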

Profiling with MPI_Trace (IBM)

Domain size: 325 × 325 × 35

                  Old Version    Optimized
MPI_ALLREDUCE         0.24          3.71
MPI_BARRIER          30.08          0.01
MPI_ISEND             0.40          0.00
MPI_RECV              8.07          3.18
MPI_SEND              2.46          0.00
MPI_SENDRECV           -           21.49
MPI_WAIT              0.77          0.15
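Summing the MPI_Trace columns shows how the time moved from MPI_BARRIER to MPI_SENDRECV. The sketch below only adds up the numbers from the table; the slide does not state the unit, the "-" entry is treated as 0.0, and the names `profile` and `dominant` are assumptions of this example.

```python
# (old version, optimized) per routine, copied from the MPI_Trace table.
# The slide gives no unit; "-" for MPI_SENDRECV (old) is stored as 0.0.
profile = {
    "MPI_ALLREDUCE": (0.24, 3.71),
    "MPI_BARRIER": (30.08, 0.01),
    "MPI_ISEND": (0.40, 0.00),
    "MPI_RECV": (8.07, 3.18),
    "MPI_SEND": (2.46, 0.00),
    "MPI_SENDRECV": (0.0, 21.49),
    "MPI_WAIT": (0.77, 0.15),
}


def dominant(column):
    """Name of the routine with the largest value in the given column (0 or 1)."""
    return max(profile, key=lambda name: profile[name][column])


old_total = sum(old for old, _ in profile.values())
new_total = sum(new for _, new in profile.values())

print(f"old version: total {old_total:.2f}, dominated by {dominant(0)}")
print(f"optimized:   total {new_total:.2f}, dominated by {dominant(1)}")
```

In the optimized version nearly all of the barrier time is gone, and the boundary exchange via MPI_SENDRECV becomes the largest single entry.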

Tools for Performance Analysis

• Trace Analyzer (VAMPIR)

– good to detect problems in communication

– helps to understand your code

• MPI_Trace

– to detect hot-spots in the communication

– helps to understand MPI performance

• HPMCOUNT (Hardware Performance Monitor)

– produces a lot of data (at least on IBM)

– results are not easy to understand

NWP Environment: IT Structure 2004

[Figure: diagram of the IT structure in 2004]

IBM SP 9076 550 (NH II, 375 MHz)

• 120 nodes: Login 4, Compute 84, GPFS 32

• Login nodes: cos4, cos5, cos6, login backup

• Production (cos4), Development (cos5, cos6)

• Production filesystems: 4344 GB (SSA, GPFS)

• Development filesystems: 1629 GB (SSA, GPFS)

• TEMP (gtmp) filesystems: 1086 GB (SSA, GPFS)

• Test system: ixt54 nodes, filesystems 280 GB (SSA, GPFS)

• Network: Gigabit Ethernet switches; internal network: colony switch, cisco router

• Connected systems: rus3 (AIX), rus4 (AIX), das3 (AIX), das4 (AIX), das1 (IRIX), CWS1/2

• Storage and data management: STK-Silos (9840-FC, 9940 FC), AMASS, DataMgr, Oracle

• System software: AIX, PSSP, GPFS, LoadLeveler, f90, MPI, OpenMP, C, JAVA, UNICORE

• Applications: Assimilation, GME, GM2LM, LM, GSM, LSM, MSM, Trajektorien (trajectories), LPDM RLM, csobank to das3/4, ecfs to das1, projects with external partners

Future Plans for DMRZ

2005: Application for funds

2006: Invitation to tender

2008: Start of operation in a new building

What will the "forecast process" look like in 2008?

Performance enhancement by a factor of 8-10 (ca. 30 TeraFlop/s peak performance)

Investigate possible use of Linux clusters

LM_RAPS_3.0

• To reflect the recent model changes, a new RAPS benchmark has been released

– Changes in communication for boundary exchange are included

– Changes for asynchronous I/O are not included

• The benchmark is available to vendors (and other interested parties) if an "Agreement" is signed (⇒ change in the RAPS structure)

Conclusions

• A programmer's work is never done

• Getting good communication performance on the IBM is not impossible, but takes time

• Optimizations may be machine dependent (offer a choice between the variants)

• Next: How to optimize the computations (Flop/s)?
