
Page 1: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Parallel CC & Petaflop Applications

Ryan Olson, Cray, Inc.

Page 2: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Did you know …

Teraflop - Current
Petaflop - Imminent
What's next?

Exaflop, Zettaflop, YOTTAflop!

Page 3: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Outline

Sanibel Symposium:
   Programming Models
   Parallel CC Implementations
   Benchmarks
   Petascale Applications

This Talk:
   Distributed Data Interface
   GAMESS MP-CCSD(T)
   O vs. V
   Local & Many-Body Methods

Page 4: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Programming Models: The Distributed Data Interface (DDI)

A programming interface, not a programming model.

Choose the key functionality from the best programming models and provide:
   a common interface
   simple and portable
   a general implementation

Provide an interface to:
   SPMD: TCGMSG, MPI
   AMOs: SHMEM, GA
   SMPs: OpenMP, pThreads
   SIMD: GPUs, vector directives, SSE, etc.

Use the best models for the underlying hardware.

Page 5: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Overview

Application level:  GAMESS
High-level API:     Distributed Data Interface (DDI)
Implementation:     native and non-native implementations -
                    SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP,
                    System V IPC, hardware APIs (Elan, GM, etc.)

Page 6: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Programming Models: The Distributed Data Interface

Overview
   Virtual shared-memory model (native)
   Cluster implementation (non-native)

Shared memory / SMP awareness
   Clusters of SMPs (DDI versions 2-3)

Goal: multilevel parallelism
   Intra- and inter-node parallelism
   Maximize data locality
   Minimize latency / maximize bandwidth

Page 7: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Virtual Shared Memory Model

[Figure: four processes (CPU 0 - CPU 3) each own one section (0-3) of the distributed memory storage. A distributed matrix created with DDI_Create(Handle,NRows,NCols) is NRows x NCols, divided column-wise across CPU0-CPU3, and any process can address a subpatch of it.]

Key point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
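To make the model concrete, here is a minimal sketch of how a distributed matrix might be used. Only the DDI_Create signature is taken from the slide; the DDI_Get/DDI_Acc/DDI_Destroy names and argument orders are assumptions for illustration (not the exact GAMESS DDI API), and library start-up/shutdown calls are omitted.

C     Minimal DDI-style sketch (illustrative only): create an
C     NROWS x NCOLS distributed matrix, fetch a subpatch one-sidedly,
C     accumulate a local contribution, and destroy the array.
C     DDI_Get/DDI_Acc/DDI_Destroy and their argument orders are
C     assumptions; DDI_Create follows the slide.
      SUBROUTINE DDISKETCH(NROWS,NCOLS)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      INTEGER HANDLE
      DIMENSION BUFF(NROWS,10),WORK(NROWS,10)
C
C     Every process refers to the same global array through HANDLE.
      CALL DDI_CREATE(HANDLE,NROWS,NCOLS)
C
C     One-sided GET of the subpatch rows 1..NROWS, columns 11..20
C     (no cooperation required from the owning process).
      CALL DDI_GET(HANDLE,1,NROWS,11,20,BUFF)
C
C     ... form a local contribution in WORK ...
C
C     One-sided accumulate (+=) of WORK back into the same subpatch.
      CALL DDI_ACC(HANDLE,1,NROWS,11,20,WORK)
C
      CALL DDI_DESTROY(HANDLE)
      RETURN
      END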

Page 8: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Non-Native Implementations (and lost opportunities …)

[Figure: two 2-CPU nodes (Node 0: CPU0+CPU1; Node 1: CPU2+CPU3). Compute processes 0-3 issue GET, PUT and ACC(+=) operations against distributed memory storage held by separate data-server processes 4-7.]

Page 9: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

DDI till 2003 …

Page 10: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

System V Shared Memory (Fast Model)

[Figure: the same two-node layout (compute processes 0-3, data servers 4-7), but the distributed memory storage now resides in System V shared-memory segments on each node; GET, PUT and ACC(+=) operate on those segments.]

Page 11: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

DDI v2 - Full SMP Awareness

[Figure: distributed memory storage on separate System V shared-memory segments; on each node the compute processes (0-3) and data servers (4-7) attach to the segments, and GET, PUT and ACC(+=) operate on them.]

Page 12: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Proof of Principle - 2003

                   8       16      32      64      96
DDI v2          18283   12978    8024    5034    3718
DDI-Fast        27400   19534   14809   11424    9010
DDI v1 Limit   109839   95627   85972     N/A

UMP2 gradient calculation, 380 basis functions.
Dual AMD MP2200 cluster using the SCI network (2003 results).

Note: DDI v1 was especially problematic on the SCI network.

Page 13: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

DDI v2

The DDI library is SMP aware.

It offers new interfaces to make applications SMP aware.

DDI programs inherit improvements in the library, but they do not automatically become SMP aware unless they use the new interfaces.

Page 14: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Parallel CC and Threads (Shared-Memory Parallelism)

Bentz and Kendall: parallel BLAS3 (WOMPAT '05).

OpenMP parallelization of the remaining terms - a proof of principle (a minimal sketch follows).
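The sketch below illustrates the kind of shared-memory parallelization meant here: threading an outer loop of independent DGEMM contractions with OpenMP. It is not the actual GAMESS code; RDVPP and the array layout are modeled on the CCSD excerpt shown later, the integral read is assumed to be thread-safe, and a single-threaded BLAS is assumed.

C     Illustrative OpenMP threading of an outer loop of independent
C     DGEMM contractions.  Each iteration reads its own integral
C     block TI and accumulates into its own slice T2(:,:,I), so the
C     updates are disjoint and no locking is required.
      SUBROUTINE VVPP_OMP(NO,NU,O2,T2)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      PARAMETER (ONE=1.0D+00)
      DIMENSION O2(NO*NO,NU*NU),T2(NO*NO,NU,NU)
      DIMENSION TI(NU*NU,NU)
      NO2 = NO*NO
      NU2 = NU*NU
C$OMP PARALLEL DO PRIVATE(I,TI) SCHEDULE(DYNAMIC)
      DO I = 1, NU
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,
     &              TI,NU2,ONE,T2(1,1,I),NO2)
      END DO
C$OMP END PARALLEL DO
      RETURN
      END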

Page 15: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Results

Au4 ==> GOOD
   • CCSD ≈ (T) (comparable cost)
   • No disk I/O problems
   • Both CCSD and (T) scale well

Au+(C3H6) ==> POOR/AVERAGE
   • CCSD scales poorly due to the I/O vs. FLOP balance
   • (T) scales well, but is overshadowed by the poor CCSD performance

Au8 ==> GOOD
   • CCSD scales reasonably (greater FLOP count, about equal I/O)
   • The N^7 (T) step dominates the relatively small CCSD time
   • (T) scales well, so the overall performance is good

Page 16: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Detailed Speedups …

Au4            CCSD   T3WT2   T3SQTOT    (T)   CCSD(T)
  1            1.00    1.00     1.00    1.00     1.00
  2            1.91    1.80     2.17    1.88     1.90
  4            3.18    3.55     4.20    3.70     3.39
  8            4.60    5.30     6.29    5.52     4.97

Au+(C3H6)      CCSD   T3WT2   T3SQTOT    (T)   CCSD(T)
  1            1.00    1.00     1.00    1.00     1.00
  8            1.99    5.40     6.07    5.52     2.61

Au8            CCSD   T3WT2   T3SQTOT    (T)   CCSD(T)
  1            1.0     1.0      1.0     1.0      1.0
  8            4.5     5.6      6.8     5.8      5.2

Page 17: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

DDI v3 - Shared Memory for ALL

[Figure: on each node the compute processes and data servers attach to common shared-memory segments, which together across nodes form the aggregate distributed storage.]

Replicated storage:   ~ 500 MB - 1 GB
Shared memory:        ~ 1 GB - 12 GB
Distributed memory:   ~ 10 - 1000 GB

Page 18: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

DDI v3

Memory hierarchy
   Replicated, shared and distributed

Programming models
   Traditional DDI
   Multilevel model
   DDI groups (a different talk)

Multilevel models
   Intra- and inter-node parallelism
   A superset of the MPI/OpenMP and/or MPI/pThreads models
   (MPI lacks "true" one-sided messaging)

Page 19: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Parallel Coupled Cluster (Topics)

Data distribution for CCSD(T)
   Integrals distributed
   Amplitudes in shared memory, once per node
   Direct [vv|vv] term

Parallelism based on data locality

First-generation algorithm
   Ignore I/O
   Focus on data and FLOP parallelism

Page 20: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Important Array Sizes (in GB)

o \ v     300   400   500   600   700   800   900   1000
 10       0.1   0.1   0.2   0.3   0.4   0.5   0.6    0.7
 15       0.2   0.3   0.4   0.6   0.8   1.1   1.4    1.7
 20       0.3   0.5   0.7   1.1   1.5   1.9   2.4    3.0
 25       0.4   0.7   1.2   1.7   2.3   3.0   3.8    4.7
 30       0.6   1.1   1.7   2.4   3.3   4.3   5.4    6.7
 35       0.8   1.5   2.3   3.3   4.5   5.8   7.4    9.1
 40       1.1   1.9   3.0   4.3   5.8   7.6   9.7   11.9
 45       1.4   2.4   3.8   5.4   7.4   9.7  12.2   15.1
 50       1.7   3.0   4.7   6.7   9.1  11.9  15.1   18.6
 55       2.0   3.6   5.6   8.1  11.0  14.4  18.3   22.5
 60       2.4   4.3   6.7   9.7  13.1  17.2  21.7   26.8

[vv|oo], [vo|vo], T2 arrays (rows: o = occupied orbitals; columns: v = virtual orbitals)

o \ v     300   400   500   600   700   800   900   1000
 10         1     2     5     8    13    19    27     37
 15         2     4     7    12    19    29    41     56
 20         2     5     9    16    26    38    54     75
 25         3     6    12    20    32    48    68     93
 30         3     7    14    24    38    57    82    112
 35         4     8    16    28    45    67    95    131
 40         4    10    19    32    51    76   109    149
 45         5    11    21    36    58    86   122    168
 50         5    12    23    40    64    95   136    186
 55         6    13    26    44    70   105   150    205
 60         6    14    28    48    77   115   163    224

[vv|vo] array (rows: o; columns: v)
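For reference, the tabulated values are consistent with 8-byte elements, a packed vv pair index for [vv|vo], and GiB = 2^30 bytes (labeled GB on the slide); this is an inference from the numbers, not a statement from the slide:

\mathrm{size}\big([vv|oo]\big)=\mathrm{size}\big([vo|vo]\big)=\mathrm{size}(T_2)=\frac{8\,o^{2}v^{2}}{2^{30}}\ \mathrm{GiB},
\qquad
\mathrm{size}\big([vv|vo]\big)=\frac{8\,o\,v\cdot\tfrac{v(v+1)}{2}}{2^{30}}\ \mathrm{GiB}.

For example, o = 50 and v = 800 give 11.9 GiB and 95 GiB, matching the two tables.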

Page 21: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

MO-Based Terms

Page 22: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Some code …

C     For each virtual index I, read an integral block (RDVPP) and
C     accumulate a DGEMM product into the corresponding block of T2.
      DO 123 I=1,NU
         IOFF=NO2U*(I-1)+1
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE

C     MO-based terms: a sequence of NOU x NOU DGEMM contractions
C     accumulating into VL and T2, interleaved with scaling (VECMUL)
C     and array manipulations (TRMD, TRANMD, ADT12).
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)

      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)

Page 23: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

MO Parallelization

[Figure: node 0 (processes 0,1) and node 1 (processes 2,3) each hold their own portions of [vo*|vo*], [vv|o*o*] and [vv|v*o*] (the * marks the distributed index), and each accumulates into its own block of the T2 solution.]

Goal: disjoint updates to the solution matrix; avoid locking/critical sections whenever possible (a minimal sketch of such a partition follows).
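The sketch below illustrates the disjoint-update idea under simple assumptions: a block distribution of occupied-pair (ij) columns over processes, and a DDI_NPROC-style query for the process count and rank (treated here as an assumption). It is not the GAMESS code.

C     Sketch: each process updates only its own contiguous range of
C     occupied-pair (ij) columns of the solution, so no two processes
C     touch the same column and no locking is needed.  DDI_NPROC is
C     assumed to return the process count and this process's rank.
      SUBROUTINE MY_IJ_RANGE(NOPAIR,IJLO,IJHI)
      INTEGER NOPAIR,IJLO,IJHI,NPROC,ME,NBLK
      CALL DDI_NPROC(NPROC,ME)
C     Simple block distribution of the NOPAIR occupied-pair columns.
      NBLK = (NOPAIR + NPROC - 1)/NPROC
      IJLO = ME*NBLK + 1
      IJHI = MIN(NOPAIR,(ME+1)*NBLK)
      RETURN
      END
C
C     Caller: loop only over the locally owned columns.
C        CALL MY_IJ_RANGE(NOPAIR,IJLO,IJHI)
C        DO IJ = IJLO, IJHI
C           ... update solution column IJ (owned exclusively) ...
C        END DO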

Page 24: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Direct [VV|VV] Term

[Figure: the half-transformed integrals I_ij^{νσ} are PUT into a distributed array whose columns run over atomic-orbital (shell-pair) indices across processes 0 … P-1; the back-transformed result is gathered by occupied-pair columns 11, 21, 22, ….]

do ν = 1, nshell
   do σ = 1, ν
      compute:    the AO integral block (νσ)
      transform:  (aν|σ) = C_a (νσ)
                  v_ab^{νσ} = (aν|bσ) = C_b (aν|σ)
      contract:   I_ij^{νσ} = Σ_ab v_ab^{νσ} c_ij^{ab}
      PUT I_ij^{νσ} and I_ij^{σν} for all ij
   end do
end do

synchronize

for each "local" ij column do
   GET I_ij^{νσ}
   reorder: shell order --> AO order
   transform: I_ij^{ab} = Σ_{νσ} C_{νa} C_{σb} I_ij^{νσ}
   STORE in the "local" solution vector
end do
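The per-column back-transformation I_ij^{ab} = Σ_{νσ} C_{νa} C_{σb} I_ij^{νσ} is just two matrix multiplies; a minimal sketch is given below. The routine name, array names and dimensions are illustrative assumptions, not taken from GAMESS.

C     Back-transform one "local" ij column: AIJ(nbf,nbf) holds
C     I_ij^{nu,sigma} in AO order, C(nbf,nv) holds the virtual MO
C     coefficients.  Two DGEMMs give RESULT = C^T * AIJ * C.
      SUBROUTINE BACKTRANS_IJ(NBF,NV,AIJ,C,SCR,RESULT)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      PARAMETER (ONE=1.0D+00, ZERO=0.0D+00)
      DIMENSION AIJ(NBF,NBF),C(NBF,NV),SCR(NBF,NV),RESULT(NV,NV)
C     SCR = AIJ * C            (nbf x nv)
      CALL DGEMM('N','N',NBF,NV,NBF,ONE,AIJ,NBF,C,NBF,ZERO,SCR,NBF)
C     RESULT = C^T * SCR       (nv x nv)
      CALL DGEMM('T','N',NV,NV,NBF,ONE,C,NBF,SCR,NBF,ZERO,RESULT,NV)
      RETURN
      END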

Page 25: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

(T) Parallelism

Trivial -- in theory:
   [vv|vo] distributed
   v^3 work arrays - at large v, stored in shared memory
   disjoint updates where both quantities are shared

Page 26: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Timings …

1 Node                       Processes per node
                       1            2            4             8
                     S     E      S     E      S     E      S      E
CCSD-AO            1.00  100%   1.90   95%   3.70   92%   6.18    77%
CCSD-MO            1.00  100%   1.87   93%   3.11   78%   4.21    53%
CCSD-Total         1.00  100%   1.86   93%   3.58   89%   5.68    71%
Triples Correction (T)
                   1.00  100%   1.78   89%   2.59   65%   4.06    51%

2 Nodes                      Processes per node
                       1            2            4             8
                     S     E      S     E      S     E      S      E
CCSD-AO            2.00  100%   3.76   94%   7.43   93%  12.31    77%
CCSD-MO            1.38   69%   2.46   62%   4.10   51%   6.21    39%
CCSD-Total         1.88   94%   3.34   84%   6.53   82%   9.56    60%
Triples Correction (T)
                   1.94   97%   3.38   85%   4.73   59%   7.13    45%

3 Nodes                      Processes per node
                       1            2            4             8
                     S     E      S     E      S     E      S      E
CCSD-AO            3.00  100%   5.85   97%  11.07   92%  18.48    77%
CCSD-MO            1.68   56%   2.96   49%   4.56   38%   6.91    29%
CCSD-Total         2.55   85%   4.80   80%   8.28   69%  14.57    61%
Triples Correction (T)
                   2.95   98%   5.24   87%   7.63   64%  11.82    49%

(S = speedup, E = parallel efficiency)

(H2O)6 prism, aug'-cc-pVTZ. Fastest timing: < 6 hours on 8x8 Power5.

Page 27: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Improvements …

Semi-Direct [vv|vv] term (IKCUT)

Concurrent MO terms

Generalized amplitude storage

Page 28: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Semi-Direct [VV|VV] Term

As in the direct algorithm, except that selected shell-pair blocks are stored rather than recomputed:

do ν = 1, nshell     ! I-SHELL
   do σ = 1, ν       ! K-SHELL
      compute / transform / contract as before:
         (aν|σ) = C_a (νσ)
         v_ab^{νσ} = (aν|bσ) = C_b (aν|σ)
         I_ij^{νσ} = Σ_ab v_ab^{νσ} c_ij^{ab}
      PUT I_ij^{νσ} and I_ij^{σν} for all ij
   end do
end do

Define IKCUT. Store a block if: LEN(I) + LEN(K) > IKCUT (see the sketch below).

Automatic contention avoidance.

Adjustable: from fully direct to fully conventional.
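A minimal sketch of the IKCUT decision follows, under stated assumptions: LENSH(i) stands for the shell lengths LEN(I)/LEN(K) on the slide, and COMPUTE_BLOCK/STORE_BLOCK/READ_BLOCK are hypothetical stand-ins for the integral and I/O routines.

C     Sketch of the semi-direct IKCUT decision.  Large shell-pair
C     blocks (LENSH(ISH)+LENSH(KSH) > IKCUT) are stored on the first
C     pass and read back on later iterations; small blocks are always
C     recomputed.  COMPUTE_BLOCK, STORE_BLOCK and READ_BLOCK are
C     hypothetical stand-ins for the actual integral/I/O routines.
      SUBROUTINE GET_BLOCK(ISH,KSH,LENSH,IKCUT,FIRSTPASS,BLK)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      INTEGER ISH,KSH,IKCUT,LENSH(*)
      LOGICAL FIRSTPASS,KEEP
      DIMENSION BLK(*)
      KEEP = (LENSH(ISH)+LENSH(KSH)) .GT. IKCUT
      IF (FIRSTPASS) THEN
         CALL COMPUTE_BLOCK(ISH,KSH,BLK)
         IF (KEEP) CALL STORE_BLOCK(ISH,KSH,BLK)
      ELSE IF (KEEP) THEN
         CALL READ_BLOCK(ISH,KSH,BLK)
      ELSE
         CALL COMPUTE_BLOCK(ISH,KSH,BLK)
      END IF
      RETURN
      END
C     IKCUT = 0 stores every block (fully conventional); a very large
C     IKCUT stores nothing (fully direct).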

Page 29: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Semi-Direct [vv|vv] Timings

IKCUT                  Direct     12      8      6   Save All
CCSD - 64 cores          3122   2563   1805   1710       1702
CCSD - 32 cores          5076   4088   2620   2363
Storage (GB)                     7.6   18.8   21.3       25.6
Seconds per MB - 64                73     70     66         55
Seconds per MB - 32               129    131    127

However: GPUs generate AO integrals much faster than they can be read off disk.

Water tetramer / aug'-cc-pVTZ.

Storage: a shared NFS mount (a bad example). Local disk or a higher-quality parallel file system (Lustre, etc.) should perform better.

Page 30: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Concurrency

Everything N-ways parallel? NO.

Biggest mistake: parallelizing every MO term over all cores.

Fix: concurrency.

Page 31: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Concurrent MO Terms

[Figure: the nodes are split between the MO terms and the [vv|vv] term.]

MO terms: parallelized over the minimum number of nodes that is still efficient and fast.

MO nodes join the [vv|vv] term already in progress … dynamic load balancing.

Page 32: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Adaptive Computing

Self-adjusting / self-tuning:
   concurrent MO terms
   the value of IKCUT

Use the iterations to improve the calculation:
   adjust the initial node assignments
   increase IKCUT

A Monte Carlo approach to tuning parameters.

Page 33: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Conclusions …

Good first start …
   [vv|vv] scales perfectly with node count
   multilevel parallelism
   adjustable I/O usage

A lot to do …
   improve intra-node memory bottlenecks
   concurrent MO terms
   generalized amplitude storage
   adaptive computing

Use the knowledge from these hand-coded methods to refine the CS structure in automated methods.

Page 34: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Acknowledgements

People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell

Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF, MSI

Page 35: Parallel CC & Petaflop Applications Ryan Olson Cray, Inc

Petaflop Applications (benchmarks, too)

Petaflop = ~125,000 2.2 GHz AMD Opteron cores.
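As a rough check, assuming 4 double-precision flops per clock (an assumption, not stated on the slide):

125{,}000\ \text{cores} \times 2.2\times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \times 4\ \tfrac{\text{flops}}{\text{cycle}} = 1.1\times 10^{15}\ \tfrac{\text{flops}}{\text{s}} \approx 1\ \text{petaflop (peak)}.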

O vs. V
   small O, big V ==> CBS limit
   big O ==> see below

Local and Many-Body Methods
   FMO, EE-MB, etc. - use existing parallel methods
   Sampling