BEAST/BEAGLE Phylogenetics Software
Mitch Horton – Keeneland Advanced Application Development Team
GTC 2014, March 24-27, 2014 | San Jose, CA
BEAST/BEAGLE Phylogenetics Software
Enormous code base written in Java, C++, and CUDA. BEAST version 1.4.6 consists of 81,000 lines of Java, 779 classes, and 81 packages. Scaled to run on 120 Keeneland nodes (360 GPUs).
Keeneland Project
Keeneland is a project investigating the use of GPU accelerators with commodity microprocessors for high-performance scientific computing. A significant component of the Keeneland project is to reach out to teams developing applications that might map well to this innovative architecture.
Phylogenetics
In biology, phylogenetics is the study of evolutionary relationships among groups of organisms, which are discovered through molecular sequencing data.
Phylogenetics
[Figure: a multiple sequence alignment of ten taxa related by a tree with branch lengths t1–t19:
AGTTCGATCCG
AGTGCGATCCG
AGTCCGAACAG
AGTCCGATGCC
AGTCCGAACCG
GGTCCGATCCG
AGTCAGAGCCG
AGTAAGAGCCG
AGTCCGACCAG
AGTCCGAGCCG]
Phylogenetics
Define T(n) as the number of possible unrooted bifurcating topologies for n taxa:

T(n) = ∏_{i=3}^{n} (2i − 5)

Thus, T(n) grows factorially with increasing n, and becomes almost unimaginably large for n > 50 (e.g. T(50) ≈ 2.84 × 10^74). – Derrick Joel Zwickl (Ph.D. thesis, 2006)
Enumerating and evaluating every possible topology would be computationally foolhardy.
Markov Chain Monte Carlo Phylogenetics
Startling recent advances in sequencing technology are fueling a concomitant increase in the scale and ambition of phylogenetic analyses. Effort skyrockets with the number of sequences, the complexity of sequence characters, and the complexity of the sequence-evolution model.
Felsenstein’s Algorithm for Likelihood
INITIALIZATION: set k = 2n − 1.
RECURSION: compute P(L_k | a) for all a as follows.
If k is a leaf node: set P(L_k | a) = 1 if a = x_u (the observed base at leaf k), and P(L_k | a) = 0 if a ≠ x_u.
If k is not a leaf node: compute P(L_i | a), P(L_j | a) for all a at the daughter nodes i, j, and set

P(L_k | a) = Σ_{b,c} P(b | a, t_i) P(L_i | b) P(c | a, t_j) P(L_j | c)

TERMINATION: likelihood at site u = P(x_u* | T, t*) = Σ_a P(L_{2n−1} | a) q_a
The concluding step in computing the likelihood is to use the assumption of independence at sites to write:

P(x* | T, t*) = ∏_{u=1}^{N} P(x_u* | T, t*)
[Figure: an example tree for one site, with observed bases A, C, T, G, C at the leaves, internal nodes x, y, z, w, and branch lengths t1–t8.]
nvvp – 100 MCMC Steps
[Figures: NVIDIA Visual Profiler (nvvp) timelines for a 100-step MCMC run, across two slides.]
Single GPU – 24 Hour Run
100 GPUs – 2 Hour Run
360 GPUs
GPU-Based Bayesian MCMC Phylogenetic Inference at Scale
[Figure: Performance of BEAST/BEAGLE phylogenetics software on 125 sequences, 2,968 sites, 5 rate categories, nucleotide model. Time to solution in hours (log scale, 1–10,000) versus number of compute nodes (0–100), comparing five software configurations: single node / CPU / single core; single node / CPU / multi-core; single node / single GPU; multi-node / single GPU; multi-node / multi-GPU. Hardware: 12 CPU cores per compute node (2 × 6-core 2.8 GHz Xeon X5660, 23 GB, 270 Gflop/s peak); 3 GPUs per compute node (Tesla M2090, 1.3 GHz, 5.4 GB, 1.33 Tflop/s peak); 120 compute nodes.]
Substitution Probability Matrix for an Infinitesimally Short Interval of Time
Each site evolves according to a Markov process in which a base (T, C, A, or G) is replaced by another base in an infinitesimally short interval of time, dt, with a probability P_ij(dt) that is a mathematical manifestation of the Markovian nature of the process. – Masami Hasegawa
Batch Matrix Multiply
In genetics, a transition is a point mutation that changes a purine nucleotide (A, G) to the other purine, or a pyrimidine nucleotide (C, T) to the other pyrimidine. Although there are twice as many possible transversions, approximately two out of every three single-nucleotide mutations are transitions. (Wikipedia)

For bases i ≠ j, the probability of substitution in an infinitesimally short interval of time dt is:

P_ij(dt) = α π_j dt (for a transition)
P_ij(dt) = β π_j dt (for a transversion)
Batch Matrix Multiply
Finite-time transition probabilities P_sj(t) characterize how state s mutates to state j along a branch of length t:

P(t) = exp(tA) = E × diag(e^{tλ_1}, …, e^{tλ_S}) × E^{−1} = E D_t E^{−1}

Matrix exponentiation is defined to be:

e^X = Σ_{k=0}^{∞} (1/k!) X^k

For some simple cases, the above can be computed explicitly; otherwise, diagonalization:

A = E D E^{−1} ⇒ A^n = E D^n E^{−1} ⇒ I + (1/1!) A + (1/2!) A^2 + … = E e^D E^{−1}
Batch Matrix Multiply
[Figure: Performance of cublasSgemmBatched, streams, hand-written CUDA, and MKL on 4×4 matrices. Gflop/s (log scale, 0.01–1,000) versus number of matrices (100–100,000). Hardware: CPU 2 × 8-core 2.6 GHz Xeon E5-2670, 32 GB, 332 Gflop/s double-precision peak; Fermi (Tesla M2090), 1.3 GHz, 5.4 GB, 665 Gflop/s double-precision peak; Kepler (Tesla K20X), 0.732 GHz, 5.4 GB, 1,320 Gflop/s double-precision peak. Curves: hand-written CUDA (Kepler, Fermi), cublasSgemmBatched (Kepler, Fermi), streams (Kepler, Fermi), MKL.]
Batch Matrix Multiply
[Figure: Performance of hand-written CUDA versus optimized CUDA on 4×4 matrices. Gflop/s (0–350) versus number of matrices (100–1,000,000). Hardware: Xeon E5-2670 CPU, Tesla M2090 (Fermi), Tesla K20X (Kepler). Curves: optimized CUDA (Kepler, Fermi), hand-written CUDA (Kepler, Fermi).]
Batch Matrix Multiply
[Figure: Performance of hand-written CUDA, optimized CUDA, and cublasSgemmBatched on 20×20 matrices. Gflop/s (0–500) versus number of matrices (100–100,000). Hardware: Xeon E5-2670 CPU, Tesla M2090 (Fermi), Tesla K20X (Kepler). Curves: optimized CUDA (Kepler, Fermi), hand-written CUDA (Kepler, Fermi), cublasSgemmBatched (Kepler, Fermi).]
Batch Matrix Multiply
[Figure: Performance of Lagrange interpolation, Newton interpolation, and optimized CUDA on 4×4 matrices. Gflop/s (0–200) versus number of matrices (0–60,000). Hardware: Xeon E5-2670 CPU, Tesla M2090 (Fermi), Tesla K20X (Kepler). Curves: optimized CUDA (Kepler, Fermi), Newton interpolation (Kepler, Fermi), Lagrange interpolation (Kepler, Fermi).]
Batch Matrix Multiply
One matrix multiply of size N: N·N·N flops for N·N memory accesses.
1024 matrix multiplies of size N/32: 1024·(N/32)·(N/32)·(N/32) flops for 1024·(N/32)·(N/32) memory accesses — that is, (N·N·N)/32 flops for N·N memory accesses. The batched case moves the same amount of data but does 32× less arithmetic per access, which is why small-matrix batches are memory-bound.