Parallel CC & Petaflop Applications
Ryan Olson, Cray, Inc.
Did you know …
Teraflop - Current
Petaflop - Imminent
What's next?
Exaflop, Zettaflop, YOTTAflop!
Outline
Sanibel Symposium
Programming Models
Parallel CC Implementations
Benchmarks
Petascale Applications
This Talk
Distributed Data Interface
GAMESS MP-CCSD(T)
O vs. V
Local & Many-Body Methods
Programming Models: The Distributed Data Interface (DDI)
Programming Interface, not Programming Model
Choose the key functionality from the best programming models and provide:
• A common interface
• Simplicity and portability
• A general implementation
Provide an interface to:
• SPMD: TCGMSG, MPI
• AMOs: SHMEM, GA
• SMPs: OpenMP, pThreads
• SIMD: GPUs, vector directives, SSE, etc.
Use the best models for the underlying hardware.
Overview
GAMESS (Application Level)
Distributed Data Interface (DDI) (High-Level API)
Implementation layer (native and non-native): hardware APIs (Elan, GM, etc.), SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC
Programming Models: The Distributed Data Interface
Overview:
• Virtual shared-memory model (native)
• Cluster implementation (non-native)
• Shared-memory/SMP awareness: clusters of SMPs (DDI versions 2-3)
Goal: multilevel parallelism
• Intra-/inter-node parallelism
• Maximize data locality
• Minimize latency / maximize bandwidth
Virtual Shared Memory Model

[Figure: CPUs 0-3 each hold one segment (0-3) of the distributed memory storage; every CPU can access every segment]
Distributed Matrix: DDI_Create(Handle,NRows,NCols)

[Figure: an NRows x NCols matrix distributed by columns across CPU0-CPU3; a "subpatch" is a rectangular block that may span the column ranges of several CPUs]

Key point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
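To make the model concrete, here is a minimal sketch of how a DDI program touches a distributed matrix. Only DDI_Create(Handle,NRows,NCols) is taken from the slide; the other calls (DDI_NPROC, DDI_DISTRIB, DDI_GET, DDI_PUT, DDI_ACC, DDI_SYNC, DDI_DESTROY) exist in DDI, but the exact argument orders shown here are illustrative assumptions.

C     Minimal DDI sketch; argument orders are assumptions, not the
C     exact GAMESS DDI API.
      SUBROUTINE DDIDEMO(NROWS,NCOLS)
      IMPLICIT NONE
      INTEGER NROWS,NCOLS,HANDLE,NP,ME,ILO,IHI,JLO,JHI
      DOUBLE PRECISION COL(NROWS)
C     How many compute processes are there, and which one am I?
      CALL DDI_NPROC(NP,ME)
C     Create an NROWS x NCOLS matrix, distributed by columns.
      CALL DDI_CREATE(HANDLE,NROWS,NCOLS)
C     Which subpatch of the matrix lives in my distributed storage?
      CALL DDI_DISTRIB(HANDLE,ME,ILO,IHI,JLO,JHI)
C     One-sided access to any patch, local or remote:
      CALL DDI_GET(HANDLE,1,NROWS,JLO,JLO,COL)
      COL(1) = COL(1) + 1.0D0
      CALL DDI_PUT(HANDLE,1,NROWS,JLO,JLO,COL)
      CALL DDI_ACC(HANDLE,1,NROWS,JLO,JLO,COL)
C     Synchronize the compute processes, then free the matrix.
      CALL DDI_SYNC(10)
      CALL DDI_DESTROY(HANDLE)
      END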
Non-Native Implementations (and lost opportunities …)
[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3) each run compute processes (0-3) paired with data servers (4-7); the distributed memory storage lives on the data servers, which service GET, PUT, and ACC(+=) requests]

DDI until 2003 …
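In the non-native model, one-sided GET/PUT/ACC must be emulated with two-sided messages: each data server owns a slice of the distributed storage and sits in a service loop. A hypothetical MPI-1 sketch of that loop follows (the message layout, tags, and names are invented for illustration; this is not DDI's actual code):

      SUBROUTINE DATASRV(DD,NLOCAL,BUF,MAXBUF)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER NLOCAL,MAXBUF,IERR,FROM,ILO,N,I
      INTEGER REQ(3),STAT(MPI_STATUS_SIZE)
      DOUBLE PRECISION DD(NLOCAL),BUF(MAXBUF)
C     DD is this server's slice of the distributed storage.
C     (A real server would also handle a termination message.)
   10 CONTINUE
C     A request is (operation, first element, element count).
      CALL MPI_RECV(REQ,3,MPI_INTEGER,MPI_ANY_SOURCE,1,
     &              MPI_COMM_WORLD,STAT,IERR)
      FROM = STAT(MPI_SOURCE)
      ILO  = REQ(2)
      N    = REQ(3)
      IF (REQ(1).EQ.1) THEN
C        GET: ship the requested patch back to the compute process.
         CALL MPI_SEND(DD(ILO),N,MPI_DOUBLE_PRECISION,FROM,2,
     &                 MPI_COMM_WORLD,IERR)
      ELSE IF (REQ(1).EQ.2) THEN
C        PUT: receive new data and overwrite the patch.
         CALL MPI_RECV(DD(ILO),N,MPI_DOUBLE_PRECISION,FROM,2,
     &                 MPI_COMM_WORLD,STAT,IERR)
      ELSE IF (REQ(1).EQ.3) THEN
C        ACC(+=): receive data and accumulate into the patch.
         CALL MPI_RECV(BUF,N,MPI_DOUBLE_PRECISION,FROM,2,
     &                 MPI_COMM_WORLD,STAT,IERR)
         DO I = 1, N
            DD(ILO+I-1) = DD(ILO+I-1) + BUF(I)
         END DO
      END IF
      GO TO 10
      END

The lost opportunity: every access, even to data sitting on the same node, pays this request/reply round trip through a data server.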
System V Shared Memory (Fast Model)

[Figure: the distributed memory storage now lives in System V shared-memory segments on Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3); compute processes 0-3 GET directly from their node's segments, while PUT and ACC(+=) requests still go through data servers 4-7]
DDI v2 - Full SMP Awareness

[Figure: the distributed memory storage sits in System V shared-memory segments; compute processes 0-3 perform GET, PUT, and ACC(+=) directly on their node's segments, leaving data servers 4-7 to service only off-node requests]
Proof of Principle - 2003
Processes        8      16      32      64      96
DDI v2        18283   12978    8024    5034    3718
DDI-Fast      27400   19534   14809   11424    9010
DDI v1 Limit 109839   95627   85972     N/A     N/A

UMP2 gradient calculation, 380 basis functions; dual AMD MP2200 cluster using the SCI network (2003 results).
Note: DDI v1 was especially problematic on the SCI network.
DDI v2
• The DDI library is SMP aware.
• It offers new interfaces to make applications SMP aware.
• Existing DDI programs inherit improvements in the library.
• DDI programs do not automatically become SMP aware unless they use the new interfaces.
Parallel CC and Threads (Shared-Memory Parallelism)
• Bentz and Kendall: parallel BLAS3 (WOMPAT '05)
• OpenMP parallelized the remaining terms
• Proof of principle
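As a rough illustration of this style of shared-memory parallelism, the serial MO-term loop shown later in this deck (under "Some code …") could be threaded with OpenMP along these lines. This is a sketch, assuming a thread-safe BLAS; VVVP is a hypothetical in-memory integral array standing in for the real code's disk reads.

      SUBROUTINE MOTERM(NO2,NU,NU2,O2,VVVP,T2)
      IMPLICIT NONE
      INTEGER NO2,NU,NU2,I
      DOUBLE PRECISION O2(NO2,NU2),VVVP(NU2*NU,NU),T2(NO2*NU,NU)
      DOUBLE PRECISION ONE
      PARAMETER (ONE=1.0D0)
C$OMP PARALLEL DO PRIVATE(I)
      DO I = 1, NU
C        Each thread contracts one NU2 x NU integral slab into its own
C        disjoint NO2 x NU block of T2, so no locking is needed.
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,VVVP(1,I),NU2,
     &              ONE,T2(1,I),NO2)
      END DO
C$OMP END PARALLEL DO
      END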
Results
• Au4 ==> GOOD
  • CCSD cost ≈ (T) cost
  • No disk I/O problems
  • Both CCSD and (T) scale well
• Au+(C3H6) ==> POOR/AVERAGE
  • CCSD scales poorly due to the I/O vs. FLOP balance
  • (T) scales well, but is overshadowed by the poor CCSD performance
• Au8 ==> GOOD
  • CCSD scales reasonably (greater FLOP count, about equal I/O)
  • The N^7 (T) step dominates the relatively small CCSD time
  • (T) scales well, so the overall performance is good
Detailed Speedups …
Au4
Threads   CCSD   T3WT2   T3SQTOT   (T)    CCSD(T)
1         1.00   1.00    1.00      1.00   1.00
2         1.91   1.80    2.17      1.88   1.90
4         3.18   3.55    4.20      3.70   3.39
8         4.60   5.30    6.29      5.52   4.97

Au+(C3H6)
Threads   CCSD   T3WT2   T3SQTOT   (T)    CCSD(T)
1         1.00   1.00    1.00      1.00   1.00
8         1.99   5.40    6.07      5.52   2.61

Au8
Threads   CCSD   T3WT2   T3SQTOT   (T)    CCSD(T)
1         1.0    1.0     1.0       1.0    1.0
8         4.5    5.6     6.8       5.8    5.2
DDI v3: Shared Memory for ALL

[Figure: compute processes and data servers on each node share memory segments that hold the aggregate distributed storage]

• Replicated storage: ~500 MB - 1 GB
• Shared memory: ~1 GB - 12 GB
• Distributed memory: ~10 - 1000 GB
DDI v3
• Memory hierarchy: replicated, shared, and distributed
• Programming models: traditional DDI, the multilevel model, and DDI groups (a different talk)
• Multilevel model: intra-/inter-node parallelism; a superset of the MPI/OpenMP and/or MPI/pThreads models (MPI lacks "true" one-sided messaging)
Parallel Coupled Cluster (Topics)
• Data distribution for CCSD(T): integrals distributed; amplitudes in shared memory, once per node; direct [vv|vv] term
• Parallelism based on data locality
• First-generation algorithm: ignore I/O; focus on data and FLOP parallelism
Important Array Sizes (in GB)

[vv|oo], [vo|vo], and T2 (o²v²-sized arrays); rows: O, columns: V:

O\V    300   400   500   600   700   800   900  1000
10     0.1   0.1   0.2   0.3   0.4   0.5   0.6   0.7
15     0.2   0.3   0.4   0.6   0.8   1.1   1.4   1.7
20     0.3   0.5   0.7   1.1   1.5   1.9   2.4   3.0
25     0.4   0.7   1.2   1.7   2.3   3.0   3.8   4.7
30     0.6   1.1   1.7   2.4   3.3   4.3   5.4   6.7
35     0.8   1.5   2.3   3.3   4.5   5.8   7.4   9.1
40     1.1   1.9   3.0   4.3   5.8   7.6   9.7  11.9
45     1.4   2.4   3.8   5.4   7.4   9.7  12.2  15.1
50     1.7   3.0   4.7   6.7   9.1  11.9  15.1  18.6
55     2.0   3.6   5.6   8.1  11.0  14.4  18.3  22.5
60     2.4   4.3   6.7   9.7  13.1  17.2  21.7  26.8

[vv|vo] (an ov³-scale array); rows: O, columns: V:

O\V    300   400   500   600   700   800   900  1000
10       1     2     5     8    13    19    27    37
15       2     4     7    12    19    29    41    56
20       2     5     9    16    26    38    54    75
25       3     6    12    20    32    48    68    93
30       3     7    14    24    38    57    82   112
35       4     8    16    28    45    67    95   131
40       4    10    19    32    51    76   109   149
45       5    11    21    36    58    86   122   168
50       5    12    23    40    64    95   136   186
55       6    13    26    44    70   105   150   205
60       6    14    28    48    77   115   163   224
MO Based Terms
Some code …
      DO 123 I = 1, NU
         IOFF = NO2U*(I-1) + 1
C        Read integral slab I (an NU2 x NU block, presumably of
C        [vv|vo]-type integrals) into TI.
         CALL RDVPP(I,NO,NU,TI)
C        Accumulate into the I-th NO2 x NU block of T2:
C        T2(IOFF) += O2 (NO2 x NU2) * TI (NU2 x NU)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE

C     Reorder arrays (TRMD/TRANMD, with TI as scratch), scale (VECMUL),
C     fold in T1 contributions (ADT12), and form intermediates with
C     (NO*NU) x (NO*NU) matrix multiplies.
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)

      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
MO Parallelization

[Figure: processes 0-3, in two node groups, each hold [vo*|vo*], [vv|o*o*], and [vv|v*o*] blocks and accumulate into their own patch of the T2 solution]

Goal: disjoint updates to the solution matrix. Avoid locking/critical sections whenever possible.
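A minimal sketch of that idea (names and DDI argument orders are illustrative assumptions): statically partition the ij columns of the distributed T2 solution so that each process is the sole writer of its own range.

      SUBROUTINE T2OWNED(D_T2,NO,NUNU,COL)
      IMPLICIT NONE
      INTEGER D_T2,NO,NUNU,NP,ME,NIJ,NPER,IJLO,IJHI,IJ
      DOUBLE PRECISION COL(NUNU)
      CALL DDI_NPROC(NP,ME)
C     Block-partition the NIJ = NO*(NO+1)/2 ij columns over processes.
      NIJ  = NO*(NO+1)/2
      NPER = (NIJ + NP - 1)/NP
      IJLO = ME*NPER + 1
      IJHI = MIN((ME+1)*NPER,NIJ)
      DO IJ = IJLO, IJHI
C        ... fill COL with this process's update for column IJ ...
C        Only this process ever writes columns IJLO..IJHI, so the
C        PUTs are disjoint and need no locks or critical sections.
         CALL DDI_PUT(D_T2,1,NUNU,IJ,IJ,COL)
      END DO
      END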
Direct [VV|VV] Term
[Figure: the I_ij^νσ supermatrix distributed by columns over processes 0 … P-1; columns are indexed by atomic-orbital pairs 11, 12, 13, … N_bf², and each process PUTs I_ij^νσ columns into the distributed storage]

do ν = 1,nshell
   do σ = 1,ν
      do μ = 1,nshell
         do λ = 1,nshell
            compute:   (μν|λσ)
            transform: (aν|λσ) = Σ_μ C_μa (μν|λσ)
         end do
      end do
      transform: v_ab^νσ = (aν|bσ) = Σ_λ C_λb (aν|λσ)
      contract:  I_ij^νσ = Σ_ab v_ab^νσ c_ij^ab
      PUT I_ij^νσ and I_ij^σν for each ij
   end do
end do
synchronize
for each "local" ij column do
   GET I_ij^νσ
   reorder: shell order --> AO order
   transform: I_ij^ab = Σ_νσ C_νa C_σb I_ij^νσ
   STORE I_ij^ab in the "local" solution vector
end do

[Figure: the local solution vector with columns indexed by occupied pairs 11, 21, 22, … (NoNo)*]
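The shell-pair loop above lends itself to dynamic load balancing with DDI's shared global counter (DDI_DLBRESET / DDI_DLBNEXT). A sketch follows; DOPAIR is a hypothetical stand-in for the compute/transform/contract/PUT work for one pair, and the call signatures are from memory, so treat the details as assumptions.

      SUBROUTINE VVVVDLB(NSHELL)
      IMPLICIT NONE
      INTEGER NSHELL,ISH,KSH,ITASK,NEXT
C     Reset the shared counter, then grab my first task number.
      CALL DDI_DLBRESET()
      CALL DDI_DLBNEXT(NEXT)
      ITASK = 0
      DO ISH = 1, NSHELL
         DO KSH = 1, ISH
            IF (ITASK .EQ. NEXT) THEN
C              This process claimed shell pair (ISH,KSH): do the
C              work, then grab the next available task.
               CALL DOPAIR(ISH,KSH)
               CALL DDI_DLBNEXT(NEXT)
            END IF
            ITASK = ITASK + 1
         END DO
      END DO
C     Everyone must finish its PUTs before the ij-column phase begins.
      CALL DDI_SYNC(300)
      END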
(T) Parallelism
• Trivial -- in theory
• [vv|vo] distributed
• v³ work arrays: at large v, stored in shared memory
• Disjoint updates where both quantities are shared
Timings … (S = speedup, E = parallel efficiency)

1 Node                     Processes per node
                           1           2            4            8
                         S    E      S    E      S     E      S     E
CCSD-AO                1.00 100%   1.90  95%   3.70   92%   6.18  77%
CCSD-MO                1.00 100%   1.87  93%   3.11   78%   4.21  53%
CCSD-Total             1.00 100%   1.86  93%   3.58   89%   5.68  71%
Triples Correction (T) 1.00 100%   1.78  89%   2.59   65%   4.06  51%

2 Nodes                    Processes per node
                           1           2            4            8
                         S    E      S    E      S     E      S     E
CCSD-AO                2.00 100%   3.76  94%   7.43   93%  12.31  77%
CCSD-MO                1.38  69%   2.46  62%   4.10   51%   6.21  39%
CCSD-Total             1.88  94%   3.34  84%   6.53   82%   9.56  60%
Triples Correction (T) 1.94  97%   3.38  85%   4.73   59%   7.13  45%

3 Nodes                    Processes per node
                           1           2            4            8
                         S    E      S    E      S     E      S     E
CCSD-AO                3.00 100%   5.85  97%  11.07   92%  18.48  77%
CCSD-MO                1.68  56%   2.96  49%   4.56   38%   6.91  29%
CCSD-Total             2.55  85%   4.80  80%   8.28   69%  14.57  61%
Triples Correction (T) 2.95  98%   5.24  87%   7.63   64%  11.82  49%

(H2O)6 prism, aug'-cc-pVTZ. Fastest timing: < 6 hours on 8x8 Power5.
Improvements …
Semi-Direct [vv|vv] term (IKCUT)
Concurrent MO terms
Generalized amplitude storage
Semi-Direct [VV|VV] Term

do ν = 1,nshell      ! I-SHELL
   do σ = 1,ν        ! K-SHELL
      do μ = 1,nshell
         do λ = 1,nshell
            compute:   (μν|λσ)
            transform: (aν|λσ) = Σ_μ C_μa (μν|λσ)
         end do
      end do
      transform: v_ab^νσ = (aν|bσ) = Σ_λ C_λb (aν|λσ)
      contract:  I_ij^νσ = Σ_ab v_ab^νσ c_ij^ab
      PUT I_ij^νσ and I_ij^σν for each ij
   end do
end do

Define IKCUT:
Store the (I,K) block if: LEN(I) + LEN(K) > IKCUT
Automatic contention avoidance
Adjustable: Fully direct to fully conventional.
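A sketch of the store-or-recompute decision (WRTBLK and the array names are hypothetical): a shell-pair block is written out only when the pair is expensive, i.e. when the combined shell lengths exceed IKCUT; small pairs are recomputed each iteration. IKCUT = 0 saves everything (fully conventional), while a very large IKCUT saves nothing (fully direct).

      SUBROUTINE IKSTORE(ISH,KSH,LENSH,IKCUT,BLK,NBLK)
      IMPLICIT NONE
      INTEGER ISH,KSH,IKCUT,NBLK
      INTEGER LENSH(*)
      DOUBLE PRECISION BLK(NBLK)
C     Store the block for pair (ISH,KSH) only if it is expensive:
      IF (LENSH(ISH)+LENSH(KSH) .GT. IKCUT) THEN
C        Write once; later CCSD iterations read the block back
C        instead of recomputing the integrals for this pair.
         CALL WRTBLK(ISH,KSH,BLK,NBLK)
      END IF
C     (Pairs failing the test are simply recomputed each iteration.)
      END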
Semi-Direct [vv|vv] Timings
IKCUT                            Direct     12      8      6   Save All
CCSD, 64 cores (s)                 3122   2563   1805   1710       1702
CCSD, 32 cores (s)                 5076   4088   2620   2363
Storage (GB)                               7.6   18.8   21.3       25.6
Seconds saved per GB, 64 cores              73     70     66         55
Seconds saved per GB, 32 cores             129    131    127
However:
GPUs generate AOs much faster than they can be read off the disk.
Water Tetramer / aug’-cc-pVTZ
Storage: a shared NFS mount (a bad example).
Local disk or a higher-quality parallel file system (Lustre, etc.) should perform better.
Concurrency
• Everything N-ways parallel? NO.
• Biggest mistake: parallelizing every MO term over all cores.
• The fix: concurrency.

Concurrent MO terms
• MO terms: parallelized over the minimum number of nodes that is still efficient and fast.
• [vv|vv]: the MO-term nodes join the [vv|vv] term already in progress … dynamic load balancing.
Adaptive Computing
• Self-adjusting / self-tuning: the concurrent MO terms and the value of IKCUT
• Use the iterations to improve the calculation:
  • Adjust the initial node assignments
  • Increase IKCUT
• Monte Carlo approach to tuning parameters.
Conclusions …
A good first start …
• [vv|vv] scales perfectly with node count
• Multilevel parallelism
• Adjustable I/O usage
A lot to do …
• Improve intra-node memory bottlenecks
• Concurrent MO terms
• Generalized amplitude storage
• Adaptive computing
Use the knowledge from these hand-coded methods to refine the CS structure in automated methods.
Acknowledgements
People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell
Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF, MSI
Petaflop Applications (benchmarks, too)
Petaflop = ~125,000 2.2 GHz AMD Opteron cores.
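For reference, that core count follows from peak-rate arithmetic, assuming 4 floating-point operations per core per clock (an assumption about the Opteron generation intended):

125,000 cores × 2.2 × 10^9 cycles/s × 4 flops/cycle ≈ 1.1 × 10^15 flop/s ≈ 1 petaflop/s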
O vs. V:
• Small O, big V ==> CBS limit
• Big O ==> see below
Local and Many-Body Methods:
• FMO, EE-MB, etc. - use existing parallel methods
• Sampling