BEAST/BEAGLE Phylogenetics Software
Mitch Horton – Keeneland Advanced Application Development Team
GTC 2014, March 24-27, 2014 | San Jose, CA
BEAST/BEAGLE Phylogenetics Software
Enormous code base written in Java, C++, and CUDA. BEAST version 1.4.6 consists of 81,000 lines of Java, 779 classes, and 81 packages. Scaled to run on 120 Keeneland nodes (360 GPUs).
Keeneland Project
Keeneland is a project investigating the use of GPU accelerators with commodity microprocessors for high-performance scientific computing. A significant component of the Keeneland project is to reach out to teams developing applications that might map well to this innovative architecture.
Phylogenetics
In biology, phylogenetics is the study of evolutionary relationships among groups of organisms, which are discovered through molecular sequencing data.
Phylogenetics
[Figure: a multiple sequence alignment of ten taxa related by a tree with branch lengths t1–t19:
AGTTCGATCCG
AGTGCGATCCG
AGTCCGAACAG
AGTCCGATGCC
AGTCCGAACCG
GGTCCGATCCG
AGTCAGAGCCG
AGTAAGAGCCG
AGTCCGACCAG
AGTCCGAGCCG]
Phylogenetics
Define T(n) as the number of possible unrooted bifurcating topologies for n taxa:

T(n) = ∏_{i=3}^{n} (2i − 5)

Thus, T(n) grows factorially with increasing n, and becomes almost unimaginably large for n > 50 (e.g. T(50) ≈ 2.84 × 10^74). – Derrick Joel Zwickl (Ph.D. thesis, 2006)
Enumerating and evaluating every possible topology would be computationally foolhardy.
Markov Chain Monte Carlo Phylogenetics
Startling recent advances in sequencing technology are fueling a concomitant increase in the scale and ambition of phylogenetic analyses. Effort skyrockets with the number of sequences, the complexity of sequence characters, and the complexity of the sequence-evolution model.
Felsenstein’s Algorithm for Likelihood
INITIALIZATION: set k = 2n − 1.
RECURSION: compute P(L_k | a) for all a as follows.
If k is a leaf node: set P(L_k | a) = 1 if a = x_u (the observed base at leaf k), and P(L_k | a) = 0 if a ≠ x_u.
If k is not a leaf node: compute P(L_i | a), P(L_j | a) for all a at the daughter nodes i, j, and set

P(L_k | a) = Σ_{b,c} P(b | a, t_i) P(L_i | b) P(c | a, t_j) P(L_j | c)

TERMINATION: likelihood at site u = P(x_u* | T, t*) = Σ_a P(L_{2n−1} | a) q_a
The concluding step in computing the likelihood is to use the assumption of independence at sites to write:

P(x* | T, t*) = ∏_{u=1}^{N} P(x_u* | T, t*)
[Figure: an example tree for one site, with observed bases A, C, T, G, C at the leaves, internal nodes x, y, z, w, and branch lengths t1–t8.]
nvvp – 100 MCMC Steps
[Figures: NVIDIA Visual Profiler (nvvp) timelines for a 100-step MCMC run, across two slides.]
Single GPU – 24 Hour Run
100 GPUs – 2 Hour Run
360 GPUs
GPU-Based Bayesian MCMC Phylogenetic Inference at Scale
[Figure: Performance of BEAST/BEAGLE phylogenetics software on 125 sequences, 2,968 sites, 5 rate categories, nucleotide model. Time to solution in hours (log scale, 1–10,000) versus number of compute nodes (0–100), comparing five software configurations: single node / CPU / single core; single node / CPU / multi-core; single node / single GPU; multi-node / single GPU; multi-node / multi-GPU. Hardware: 12 CPU cores per compute node (2 × 6-core 2.8 GHz Xeon X5660, 23 GB, 270 Gflop/s peak); 3 GPUs per compute node (Tesla M2090, 1.3 GHz, 5.4 GB, 1.33 Tflop/s peak); 120 compute nodes.]
Substitution Probability Matrix for an Infinitesimally Short Interval of Time
Each site evolves according to a Markov process in which a base (T, C, A, or G) is replaced by another base in an infinitesimally short interval of time, dt, with a probability P_ij(dt) that is a mathematical manifestation of the Markovian nature of the process. – Masami Hasegawa
Batch Matrix Multiply
In genetics, a transition is a point mutation that changes a purine nucleotide (A, G) to the other purine, or a pyrimidine nucleotide (C, T) to the other pyrimidine. Although there are twice as many possible transversions, approximately two out of every three single-nucleotide mutations are transitions. (Wikipedia)

For bases i ≠ j, the probability of substitution in an infinitesimally short interval of time dt is:

P_ij(dt) = α π_j dt (for a transition)
P_ij(dt) = β π_j dt (for a transversion)
Batch Matrix Multiply
Finite-time transition probabilities P_sj(t) characterize how state s mutates to state j along a branch of length t:

P(t) = exp(tA) = E × diag(e^{tλ_1}, …, e^{tλ_S}) × E^{−1} = E D_t E^{−1}

Matrix exponentiation is defined to be:

e^X = Σ_{k=0}^{∞} (1/k!) X^k

For some simple cases, the above can be computed explicitly; otherwise, diagonalization:

A = E D E^{−1} ⇒ A^n = E D^n E^{−1} ⇒ I + (1/1!) A + (1/2!) A^2 + … = E e^D E^{−1}
Batch Matrix Multiply
[Figure: Performance of cublasSgemmBatched, streams, hand-written CUDA, and MKL on 4×4 matrices. Gflop/s (log scale, 0.01–1,000) versus number of matrices (100–100,000). Hardware: CPU 2 × 8-core 2.6 GHz Xeon E5-2670, 32 GB, 332 Gflop/s double-precision peak; Fermi (Tesla M2090), 1.3 GHz, 5.4 GB, 665 Gflop/s double-precision peak; Kepler (Tesla K20X), 0.732 GHz, 5.4 GB, 1,320 Gflop/s double-precision peak. Curves: hand-written CUDA (Kepler, Fermi), cublasSgemmBatched (Kepler, Fermi), streams (Kepler, Fermi), MKL.]
Batch Matrix Multiply
[Figure: Performance of hand-written CUDA versus optimized CUDA on 4×4 matrices. Gflop/s (0–350) versus number of matrices (100–1,000,000). Hardware: Xeon E5-2670 CPU, Tesla M2090 (Fermi), Tesla K20X (Kepler). Curves: optimized CUDA (Kepler, Fermi), hand-written CUDA (Kepler, Fermi).]
Batch Matrix Multiply
[Figure: Performance of hand-written CUDA, optimized CUDA, and cublasSgemmBatched on 20×20 matrices. Gflop/s (0–500) versus number of matrices (100–100,000). Hardware: Xeon E5-2670 CPU, Tesla M2090 (Fermi), Tesla K20X (Kepler). Curves: optimized CUDA (Kepler, Fermi), hand-written CUDA (Kepler, Fermi), cublasSgemmBatched (Kepler, Fermi).]
Batch Matrix Multiply
[Figure: Performance of Lagrange interpolation, Newton interpolation, and optimized CUDA on 4×4 matrices. Gflop/s (0–200) versus number of matrices (0–60,000). Hardware: Xeon E5-2670 CPU, Tesla M2090 (Fermi), Tesla K20X (Kepler). Curves: optimized CUDA (Kepler, Fermi), Newton interpolation (Kepler, Fermi), Lagrange interpolation (Kepler, Fermi).]
Batch Matrix Multiply
One matrix multiply of size N: N·N·N flops for N·N memory accesses.
1024 matrix multiplies of size N/32: 1024·(N/32)·(N/32)·(N/32) flops for 1024·(N/32)·(N/32) memory accesses — that is, (N·N·N)/32 flops for N·N memory accesses. The batched case moves the same amount of data but does 32× less arithmetic per access, which is why small-matrix batches are memory-bound.