Parallel Tensor Computations in Python or C++ Using Cyclops
Edgar Solomonik
Department of Computer Science, University of Illinois at Urbana-Champaign
PASC 2018
July 3, 2018
PASC 2018 Parallel Tensor Computations Using Cyclops 1/15
A library for parallel tensor computations
Cyclops Tensor Framework (github.com/cyclops-community/ctf)
distributed-memory symmetric/sparse/dense tensor objects

    Matrix<int> A(n, n, AS|SP, World(MPI_COMM_WORLD));
    Tensor<float> T(order, is_sparse, dims, syms, ring, world);
    T.read(...); T.write(...); T.slice(...); T.permute(...);
parallel contraction/summation of tensors

    C["ij"] = A["ik"]*B["kj"];              // matmul
    C["ijl"] += A["ikl"]*B["kjl"];          // batched matmul
    Z["abij"] += V["ijab"];                 // tensor transpose
    T["wxyz"] += U["uw"]*T["uxyz"];         // TTM
    T["abij"] = T["abij"]*D["abij"];        // Hadamard product
    S["ii"] = v["i"];                       // S = diag(v)
    v["i"] += S["ii"];                      // v += diag(S)
    M["ij"] += Function<>([](double x){ return 1/x; })(v["j"]);
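For readers coming from numpy, a serial sketch of a few of these contractions using numpy.einsum (plain numpy, not CTF; the Cyclops Python interface shown later follows similar conventions):

```python
import numpy as np

n = 4
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# C["ij"] = A["ik"]*B["kj"]  -- matrix multiplication
C = np.einsum("ik,kj->ij", A, B)
assert np.allclose(C, A @ B)

# T["wxyz"] += U["uw"]*T["uxyz"]  -- TTM (multiply along one tensor mode)
T = np.random.rand(n, n, n, n)
U = np.random.rand(n, n)
T2 = np.einsum("uw,uxyz->wxyz", U, T)
assert T2.shape == (n, n, n, n)

# v["i"] += S["ii"]  -- extract the diagonal of S
S = np.random.rand(n, n)
v = np.einsum("ii->i", S)
assert np.allclose(v, np.diag(S))
```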
∼2000 commits since 2011, open source since 2013
Electronic structure calculations with Cyclops
Extracted from Aquarius (led by Devin Matthews): https://github.com/devinamatthews/aquarius
    FMI["mi"] += 0.5*WMNEF["mnef"]*T2["efin"];
    WMNIJ["mnij"] += 0.5*WMNEF["mnef"]*T2["efij"];
    FAE["ae"] -= 0.5*WMNEF["mnef"]*T2["afmn"];
    WAMEI["amei"] -= 0.5*WMNEF["mnef"]*T2["afin"];
    Z2["abij"] = WMNEF["ijab"];
    Z2["abij"] += FAE["af"]*T2["fbij"];
    Z2["abij"] -= FMI["ni"]*T2["abnj"];
    Z2["abij"] += 0.5*WABEF["abef"]*T2["efij"];
    Z2["abij"] += 0.5*WMNIJ["mnij"]*T2["abmn"];
    Z2["abij"] -= WAMEI["amei"]*T2["ebmj"];
CTF has been integrated with QChem, VASP (CC4S), and PySCF
It is also being used for other applications, e.g. by an IBM+LLNL collaboration to perform 49-qubit quantum circuit simulation
Electronic structure calculations with Cyclops

CCSD up to 55 (50) water molecules with cc-pVDZ
CCSDT up to 10 water molecules with cc-pVDZ
[Figures: weak scaling on BlueGene/Q (512-32768 nodes) and on Edison (32-4096 nodes), plotting Teraflops and Gigaflops/node vs. #nodes for Aquarius-CTF CCSD and Aquarius-CTF CCSDT]
Compares well to NWChem (up to 10x speed-ups for CCSDT)
MP3 method
    Tensor<> Ea, Ei, Fab, Fij, Vabij, Vijab, Vabcd, Vijkl, Vaibj;
    ... // compute above 1-e and 2-e integrals
    Tensor<> T(4, Vabij.lens, *Vabij.wrld);
    T["abij"] = Vabij["abij"];
    divide_EaEi(Ea, Ei, T);
    Tensor<> Z(4, Vabij.lens, *Vabij.wrld);
    Z["abij"] = Vijab["ijab"];
    Z["abij"] += Fab["af"]*T["fbij"];
    Z["abij"] -= Fij["ni"]*T["abnj"];
    Z["abij"] += 0.5*Vabcd["abef"]*T["efij"];
    Z["abij"] += 0.5*Vijkl["mnij"]*T["abmn"];
    Z["abij"] += Vaibj["amei"]*T["ebmj"];
    divide_EaEi(Ea, Ei, Z);
    double MP3_energy = Z["abij"]*Vabij["abij"];
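The divide_EaEi helper above applies the orbital-energy denominator to a 4-index tensor. A serial numpy sketch of that step (broadcasting shapes and variable names are illustrative assumptions, not the CTF implementation):

```python
import numpy as np

def divide_EaEi(Ea, Ei, T):
    """Divide T["abij"] elementwise by (Ei[i] + Ei[j] - Ea[a] - Ea[b]),
    where Ea are virtual and Ei are occupied orbital energies."""
    # Build the 4-index denominator by broadcasting the 1-index energies
    D = (Ei[None, None, :, None] + Ei[None, None, None, :]
         - Ea[:, None, None, None] - Ea[None, :, None, None])
    T /= D

nv, no = 3, 2
Ea = np.random.rand(nv) + 1.0    # virtual energies (positive)
Ei = -np.random.rand(no) - 1.0   # occupied energies (negative)
T = np.random.rand(nv, nv, no, no)
divide_EaEi(Ea, Ei, T)
# D is strictly negative here, so a nonnegative T becomes nonpositive
```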
Sparse MP3 code
Strong and weak scaling of sparse MP3 code, with (1) dense V and T, (2) sparse V and dense T, (3) sparse V and T
[Figures: strong scaling (24-768 cores) and weak scaling (24-6144 cores) of MP3 with no=40, nv=160, plotting seconds/iteration vs. #cores for dense, 10%/1%/.1% sparse*dense, and 10%/1%/.1% sparse*sparse]
Custom tensor element types
Cyclops permits arbitrary element types and custom functions

CombBLAS/GraphBLAS-like functionality

See examples for SSSP, APSP, betweenness centrality, MIS, MIS-2

Functionality to handle serialization of pointers within user-defined types is under development

Block-sparsity via sparse tensor (local) of dense tensors (parallel)
    // Define Monoid tmon to perform matrix summation as addition
    ...
    Matrix< Matrix<> > C(nblk, nblk, SP, self_world, tmon);
    C["ij"] = Function< Matrix<> >([](Matrix<> mA, Matrix<> mB){
      Matrix<> mC(mA.nrow, mB.ncol);
      mC["ij"] += mA["ik"]*mB["kj"];
      return mC;
    })(A["ik"],B["kj"]);
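A serial analogue of this matrix-of-matrices product can be sketched with a dict of dense numpy blocks, multiplying only the blocks that are stored (block-sparse matmul; the dict-of-blocks representation is an illustrative assumption, not CTF's data structure):

```python
import numpy as np

def block_sparse_matmul(A, B, nblk):
    """C[(i,j)] += A[(i,k)] @ B[(k,j)], iterating only over stored blocks."""
    C = {}
    for (i, k), Aik in A.items():
        for j in range(nblk):
            if (k, j) in B:
                blk = Aik @ B[(k, j)]
                if (i, j) in C:
                    C[(i, j)] += blk
                else:
                    C[(i, j)] = blk
    return C

bs, nblk = 2, 3
A = {(0, 1): np.ones((bs, bs)), (2, 0): np.eye(bs)}
B = {(1, 2): 2 * np.ones((bs, bs)), (0, 0): np.eye(bs)}
C = block_sparse_matmul(A, B, nblk)
# Only block pairs with matching inner index k contribute:
# (0,1)x(1,2) -> C[(0,2)], (2,0)x(0,0) -> C[(2,0)]
```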
Symmetry and sparsity by cyclicity
for sparse tensors, a cyclic layout provides a load-balanced distribution
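A toy illustration of why a cyclic layout balances load for structured sparsity: distributing the rows of a lower-triangular matrix over 4 processors in contiguous blocks vs. cyclically (the 1D row distribution is an assumed simplification):

```python
n, p = 16, 4  # matrix dimension, number of processors (1D row distribution)
nnz_blocked = [0] * p
nnz_cyclic = [0] * p
for i in range(n):
    for j in range(i + 1):               # lower-triangular nonzeros in row i
        nnz_blocked[i // (n // p)] += 1  # contiguous row blocks
        nnz_cyclic[i % p] += 1           # cyclic row assignment
print("blocked:", nnz_blocked)  # [10, 26, 42, 58] -- heavily skewed
print("cyclic: ", nnz_cyclic)   # [28, 32, 36, 40] -- nearly balanced
```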
Parallel contraction in Cyclops
Cyclops uses nested parallel matrix multiplication variants
1D variants
perform a different matrix-vector product on each processor
perform a different outer product on each processor
2D variants
perform a different inner product on each processor
scale a vector on each processor then sum
3D variants
perform a different scalar product on each processor then sum
can be achieved by nesting 1D+1D+1D or 2D+1D or 1D+2D
All variants are blocked in practice, and naturally generalize to sparse matrix products
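The 1D outer-product variant can be sketched serially: each "processor" forms the outer products for its slice of the contracted index, and the partial results are summed (standing in for an allreduce; the cyclic split of k is an assumed choice):

```python
import numpy as np

n, p = 6, 3
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Each processor r owns a disjoint subset of the contracted index k
partials = []
for r in range(p):
    ks = range(r, n, p)  # cyclic split of k over p processors
    Cr = sum(np.outer(A[:, k], B[k, :]) for k in ks)
    partials.append(Cr)

# Summing the partial results (the allreduce step) recovers A @ B
C = sum(partials)
assert np.allclose(C, A @ B)
```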
Tensor blocking/virtualization
Preserving symmetric-packed layout using cyclic distribution constrains possible tensor blockings
subdivision into more blocks than there are processors (virtualization)
Data mapping and redistribution
Transitions between contractions require redistribution and refolding
1D/2D/3D variants naturally map to 1D/2D/3D processor grids
Initial tensor distribution is oblivious of contraction
by default each tensor distributed over all processors
user can specify any processor grid mapping
Global redistribution done by one of three methods
reassign tensor blocks to processors (easy+fast)
reorder and reshuffle data to satisfy new blocking (fast)
treat tensors as sparse and sort globally by function of index
Matricization/transposition is then done locally
dense tensor transpose done using HPTT (by Paul Springer)
sparse tensor converted to CSR sparse matrix format
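The local matricization step can be sketched serially as a transpose followed by a reshape (the transpose being the part HPTT accelerates; the mode grouping chosen here is an arbitrary example):

```python
import numpy as np

T = np.random.rand(3, 4, 5, 6)

# Unfold with modes (0, 2) as rows and modes (1, 3) as columns:
# transpose so the row modes lead, then reshape into a matrix
M = T.transpose(0, 2, 1, 3).reshape(3 * 5, 4 * 6)
assert M.shape == (15, 24)

# Entry correspondence: M[i*5 + k, j*6 + l] == T[i, j, k, l]
assert M[1 * 5 + 2, 3 * 6 + 4] == T[1, 3, 2, 4]
```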
Local summation and contraction
For contractions, local summation and contraction are done via BLAS, including batched GEMM
Threading is used via OpenMP and threaded BLAS
GPU offloading is available but not yet fully robust
For sparse matrices, MKL provides fast sparse matrix routines
To support general (mixed-type, user-defined) elementwisefunctions, manual implementations are available
User can specify blocked implementation of their function toimprove performance
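The batched-matmul contraction from the first slide (C["ijl"] += A["ikl"]*B["kjl"]) maps directly onto a batched GEMM over the index l; a serial numpy sketch:

```python
import numpy as np

n, b = 4, 5
A = np.random.rand(n, n, b)
B = np.random.rand(n, n, b)

# Move the batch index l to the front so np.matmul batches over it,
# then move it back to the trailing position
C = np.matmul(A.transpose(2, 0, 1), B.transpose(2, 0, 1)).transpose(1, 2, 0)

# Agrees with the einsum form of the contraction
assert np.allclose(C, np.einsum("ikl,kjl->ijl", A, B))
```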
Performance modeling and intelligent mapping
Performance models used to select best contraction algorithm
Based on linear cost model for each kernel
T ≈ αS (latency) + βW (comm. bandwidth) + νQ (mem. bandwidth) + γF (flops)
Scaling of S, W , Q, F is a function of parameters of each kernel
Coefficients for all kernels depend on compiler/architecture
Linear regression with Tikhonov regularization used to select coefficients x*
Model training done by a benchmark suite that executes various end-functionality for growing problem sizes, collecting observations of parameters in rows of A and execution timings in t
x* = argmin_x (||Ax − t||² + λ||x||²)
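This regularized least-squares problem has the closed-form solution x* = (AᵀA + λI)⁻¹Aᵀt; a minimal sketch with synthetic timing data (the problem sizes and noise level are illustrative assumptions):

```python
import numpy as np

def ridge(A, t, lam):
    """Tikhonov-regularized least squares:
    x* = argmin_x ||Ax - t||^2 + lam*||x||^2 = (A^T A + lam*I)^-1 A^T t."""
    m = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(m), A.T @ t)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))        # observed kernel parameters (rows)
x_true = np.array([1.0, 2.0, 0.5, 3.0])  # "true" cost coefficients
t = A @ x_true + 0.01 * rng.standard_normal(100)  # noisy execution timings
x = ridge(A, t, lam=1e-6)
assert np.allclose(x, x_true, atol=0.1)  # coefficients recovered
```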
Cyclops with Python
Using Cython, we have provided a Python interface for Cyclops
Follows numpy.ndarray conventions, plus sparsity and MPI execution
    Z["abij"] += V["ijab"];                      // C++
    Z.i("abij") << V.i("ijab")                   # Python
    W["mnij"] += 0.5*W["mnef"]*T["efij"];        // C++
    W.i("mnij") << 0.5*W.i("mnef")*T.i("efij")   # Python
    einsum("mnef,efij->mnij",W,T)                # numpy-style Python
Python interface is under active development, but is functional andavailable (DEMO)
Future directions and acknowledgements
Future/ongoing directions in Cyclops development
General abstractions for tensor decompositions
Concurrent scheduling of multiple contractions
Fourier transforms along tensor modes
Improvements to functionality and performance for linear algebra
Acknowledgements
Devin Matthews (UT Austin), Jeff Hammond (Intel Corp.), James Demmel (UC Berkeley), Torsten Hoefler (ETH Zurich), Zecheng Zhang, Eric Song, Eduardo Yap, Linjian Ma (UIUC)
Computational resources at NERSC, CSCS, ALCF, NCSA, TACC
Backup slides