BEAST/BEAGLE Phylogenetics Software
Mitch Horton, Keeneland Advanced Application Development Team
GTC 2014, March 24-27, 2014 | San Jose, CA

GPU-Based Bayesian Phylogenetic Inference Beyond Extreme … · 2014. 4. 18.



BEAST/BEAGLE Phylogenetics Software

Enormous code base written in Java, C++, and CUDA
BEAST version 1.4.6 consists of 81,000 lines of Java, 779 classes, and 81 packages
Scaled to run on 120 Keeneland nodes (360 GPUs)

Keeneland Project

Keeneland is a project investigating the use of GPU accelerators with commodity microprocessors for high-performance scientific computing.

A significant component of the Keeneland project is to reach out to teams developing applications that might map well to this innovative architecture.

Phylogenetics

In biology, phylogenetics is the study of evolutionary relationships among groups of organisms, which are inferred from molecular sequencing data.

Phylogenetics

AGTTCGATCCG
AGTGCGATCCG
AGTCCGAACAG
AGTCCGATGCC
AGTCCGAACCG
GGTCCGATCCG
AGTCAGAGCCG
AGTAAGAGCCG
AGTCCGACCAG
AGTCCGAGCCG

[Figure: tree relating the ten sequences above, with labels t1 through t19]


Phylogenetics

Define $T(n)$ as the number of possible unrooted bifurcating topologies for $n$ taxa.

$$T(n) = \prod_{i=3}^{n} (2i - 5)$$

Thus, $T(n)$ grows factorially with increasing $n$, and becomes almost unimaginably large for $n > 50$ (e.g. $T(50) \approx 2.84 \times 10^{74}$).
- Derrick Joel Zwickl (Ph.D. Thesis, 2006)

Enumerating and evaluating every possible topology would be computationally foolhardy.
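To get a feel for how quickly the topology count grows, the product formula can be evaluated directly. The following is a minimal sketch; the function name `unrooted_topologies` is illustrative and does not come from BEAST or BEAGLE.

```python
from math import prod


def unrooted_topologies(n: int) -> int:
    """Number of unrooted bifurcating topologies for n taxa: prod_{i=3}^{n} (2i - 5)."""
    if n < 3:
        raise ValueError("the product formula is defined for n >= 3 taxa")
    return prod(2 * i - 5 for i in range(3, n + 1))


# Python integers are arbitrary precision, so the count stays exact.
for n in (3, 4, 5, 10, 50):
    print(f"T({n}) = {unrooted_topologies(n):.3e}")
```

Ten taxa already give 2,027,025 topologies, and fifty give a 75-digit number, which is why exhaustive enumeration is ruled out.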

Monte Carlo Markov Chain Phylogenetics

Startling, recent advances in sequencing technology are fueling a concomitant increase in the scale and ambition of phylogenetic analyses.

Effort skyrockets with the number of sequences, the complexity of sequence characters, and the complexity of the sequence evolution model.

Felsenstein's Algorithm for Likelihood

Initialization:
  Set $k = 2n - 1$.

Recursion: compute $P(L_k \mid a)$ for all $a$ as follows.
  If $k$ is a leaf node:
    Set $P(L_k \mid a) = 1$ if $a = x^u_k$, and $P(L_k \mid a) = 0$ if $a \neq x^u_k$.
  If $k$ is not a leaf node:
    Compute $P(L_i \mid a)$, $P(L_j \mid a)$ for all $a$ at the daughter nodes $i, j$, and set
    $$P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$$

Termination:
  Likelihood at site $u$ $= P(x^u_{*} \mid T, t_{*}) = \sum_a P(L_{2n-1} \mid a)\, q_a$

The concluding step in computing the likelihood is to use the assumption of independence at sites to write:
$$P(x^{*} \mid T, t_{*}) = \prod_{u=1}^{N} P(x^u_{*} \mid T, t_{*})$$

[Figure: example tree with leaves A, C, T, G, C, internal nodes w, x, y, z, and labels t1 through t8]
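The recursion above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not BEAGLE's implementation: the substitution probabilities $P(b \mid a, t)$ use the Jukes-Cantor model (chosen only because it has a closed form), the root distribution $q_a$ is uniform, and the tree encoding, the function names, and the three-taxon example are all hypothetical.

```python
import math

BASES = "ACGT"


def jc69(a, b, t):
    """P(b | a, t) under Jukes-Cantor (an assumed model, not one named on the slide)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return {a: P(L_k | a)} for the subtree rooted at node k.

    A leaf is a one-letter string; an internal node is a tuple
    (left_child, t_left, right_child, t_right).
    """
    if isinstance(node, str):
        # Leaf: 1 for the observed base, 0 for every other base.
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_i, right, t_j = node
    l_i = partial_likelihoods(left)
    l_j = partial_likelihoods(right)
    result = {}
    for a in BASES:
        # The double sum over (b, c) factorises into two single sums.
        s_i = sum(jc69(a, b, t_i) * l_i[b] for b in BASES)
        s_j = sum(jc69(a, c, t_j) * l_j[c] for c in BASES)
        result[a] = s_i * s_j
    return result


def site_likelihood(root, q=None):
    """Termination step: sum over root states a of P(L_root | a) * q_a."""
    q = q or {a: 0.25 for a in BASES}  # uniform q_a is an assumption
    l_root = partial_likelihoods(root)
    return sum(l_root[a] * q[a] for a in BASES)


def log_likelihood(site_trees):
    """Sites are independent, so the product over sites becomes a sum of logs."""
    return sum(math.log(site_likelihood(tree)) for tree in site_trees)


def tree_for_site(x1, x2, x3):
    """Hypothetical three-taxon tree; topology and branch lengths are made up."""
    return ((x1, 0.1, x2, 0.2), 0.05, x3, 0.3)


sequences = ["AGT", "AGC", "ACT"]  # one string per taxon, three sites each
sites = [tree_for_site(*column) for column in zip(*sequences)]
print(log_likelihood(sites))
```

Each site repeats the same tree traversal with different leaf states, and within a node the work for each state $a$ is independent. That per-site, per-state independence is what makes the computation a natural fit for data-parallel hardware.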

[Figure slides, titles only:]

nvvp – 100 MCMC steps (two slides)
Single GPU – 24 Hour Run
100 GPUs – 2 Hour Run (four slides)
360 GPUs (two slides)

GPU-Based Bayesian MCMC Phylogenetic Inference at Scale

[Figure: Performance of BEAST/BEAGLE Phylogenetics Software, comparing different software configurations. 125 sequences, 2968 sites, 5 rate categories, nucleotide model. X axis: Number of Compute Nodes (0 to 100). Y axis: Time to Solution (Hours), logarithmic ticks from 1 to 10000. Series: Single node, CPU, single core; Single node, CPU, multi-core; Single node, single GPU; Multi-node, single GPU; Multi-node, multi-GPU.]

120 Compute Nodes
12 CPU Cores per Compute Node (2 x 6-cores) 2.8 GHz, X5660, 23 GB, 270 Gflop/s Peak
3 GPUs per Compute Node (Tesla M2090) 1.3 GHz, 5.4 GB, 1.33 Tflop/s Peak

Batch Matrix Multiply

Each site evolves according to a Markov process in which a base $i$ (T, C, A, or G) is replaced by another base $j$ in an infinitesimally short interval of time, $dt$, with a probability $P_{ij}(dt)$ as follows:

Substitution probability matrix for an infinitesimally short interval of time:
$$P_{ij}(dt) = \alpha \pi_j \, dt \quad \text{(for transition)}$$
$$P_{ij}(dt) = \beta \pi_j \, dt \quad \text{(for transversion)}$$
(From mathematical manifestation of the Markovian nature of the process. Masami Hasegawa)

In genetics, a transition is a point mutation that changes a purine nucleotide (A, G) to another purine or a pyrimidine nucleotide (C, T) to another pyrimidine. Although there are twice as many possible transversions, approximately two out of every three single nucleotide mutations are transitions.
(Wikipedia)
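The transition/transversion split maps directly onto a 4x4 rate table. The sketch below builds one from $\alpha$, $\beta$ and the frequencies $\pi_j$. The function names are illustrative, the numeric parameter values are hypothetical, and setting each diagonal entry so that the row sums to zero is a standard convention assumed here rather than something stated on the slide.

```python
PURINES = {"A", "G"}
BASES = "TCAG"


def is_transition(i, j):
    """True when i -> j stays within the purines (A, G) or within the pyrimidines (C, T)."""
    return i != j and ((i in PURINES) == (j in PURINES))


def rate_matrix(alpha, beta, pi):
    """Rates r[i][j] such that P_ij(dt) = r[i][j] * dt for i != j.

    The diagonal is set so that every row sums to zero (an assumed convention),
    which keeps the row probabilities summing to one.
    """
    r = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                r[i][j] = (alpha if is_transition(i, j) else beta) * pi[j]
        r[i][i] = -sum(r[i][j] for j in BASES if j != i)
    return r


# Hypothetical parameters: uniform base frequencies, transitions twice as fast.
pi = {b: 0.25 for b in BASES}
rates = rate_matrix(alpha=2.0, beta=1.0, pi=pi)
for i in BASES:
    print(i, [round(rates[i][j], 3) for j in BASES])
```

Of the twelve ordered pairs of distinct bases, four are transitions and eight are transversions, which matches the "twice as many possible transversions" remark.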

Batch Matrix Multiply

Finite-time transition probabilities $P_{sj}(t)$ characterize how state $s$ mutates to state $j$ along a branch of length $t$:
$$P(t) = \exp(tA) = E \times \mathrm{diag}(e^{t\lambda_1}, \ldots, e^{t\lambda_S}) \times E^{-1} = E D_t E^{-1}$$

Matrix exponentiation is defined to be:
$$e^{X} = \sum_{k=0}^{\infty} \frac{1}{k!} X^{k}$$

For some simple cases, the above can be computed explicitly; otherwise, diagonalization is used:
$$A = E D E^{-1} \Rightarrow A^{n} = E D^{n} E^{-1} \Rightarrow I + \frac{1}{1!} A + \frac{1}{2!} A^{2} + \ldots = E e^{D} E^{-1}$$
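As a sanity check on the series definition, the sketch below sums a truncated power series for $e^{tA}$ and compares it with one of the "simple cases" that has an explicit solution. It uses the truncated series rather than the diagonalization route, and the Jukes-Cantor rate matrix is an assumption picked only because its closed form is well known. All function names are illustrative.

```python
import math


def matmul(x, y):
    """Plain triple-loop product of two square matrices stored as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_series(x, terms=30):
    """Approximate e^X = sum_{k>=0} X^k / k! using its first `terms` terms."""
    n = len(x)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # k = 0: I
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = matmul(term, x)
        term = [[v / k for v in row] for row in term]  # term is now X^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def jc69_rate_matrix():
    """Jukes-Cantor rates (assumed example): -1 on the diagonal, 1/3 elsewhere."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def transition_probabilities(a, t):
    """P(t) = exp(tA), evaluated with the truncated series above."""
    return expm_series([[t * v for v in row] for row in a])


t = 0.5
p = transition_probabilities(jc69_rate_matrix(), t)
closed_form_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
print(p[0][0], closed_form_same)
```

The truncated series is adequate for short branches and small matrices. The diagonalization on the slide pays for one eigendecomposition of $A$ up front, after which each branch length $t$ costs only the two matrix products in $E D_t E^{-1}$.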

Batch Matrix Multiply

[Figure slide, no text]

Batch Matrix Multiply

[Figure: Performance of cublasSgemmBatched, streams, hand-written CUDA, MKL, 4x4. X axis: Number of Matrices (logarithmic ticks from 100 to 100000). Y axis: Gflop/s (logarithmic ticks from 0.01 to 1000). Series: Hand-written CUDA, Kepler; Hand-written CUDA, Fermi; cublasSgemmBatched, Kepler; cublasSgemmBatched, Fermi; MKL; streams, Fermi; streams, Kepler.]

CPU Cores (2 x 8-cores) 2.6 GHz, Xeon E5-2670, 32 GB, 332 Gflop/s Double Precision Peak
Fermi (Tesla M2090) 1.3 GHz, 5.4 GB, 665 Gflop/s Double Precision Peak
Kepler (Tesla K20X) 0.732 GHz, 5.4 GB, 1320 Gflop/s Double Precision Peak

Batch Matrix Multiply

[Figure: Performance of Hand-written CUDA, Optimized CUDA, 4x4. X axis: Number of Matrices (logarithmic ticks from 100 to 1e+06). Y axis: Gflop/s (0 to 350). Series: Optimized CUDA, Kepler; Optimized CUDA, Fermi; Hand-written CUDA, Kepler; Hand-written CUDA, Fermi.]

CPU Cores (2 x 8-cores) 2.6 GHz, Xeon E5-2670, 32 GB, 332 Gflop/s Double Precision Peak
Fermi (Tesla M2090) 1.3 GHz, 5.4 GB, 665 Gflop/s Double Precision Peak
Kepler (Tesla K20X) 0.732 GHz, 5.4 GB, 1320 Gflop/s Double Precision Peak

Batch Matrix Multiply

[Figure: Performance of Hand-written CUDA, Optimized CUDA, cublasSgemmBatched, 20x20. X axis: Number of Matrices (logarithmic ticks from 100 to 100000). Y axis: Gflop/s (0 to 500). Series: Optimized CUDA, Kepler; Optimized CUDA, Fermi; Hand-written CUDA, Kepler; Hand-written CUDA, Fermi; cublasSgemmBatched, Kepler; cublasSgemmBatched, Fermi.]

CPU Cores (2 x 8-cores) 2.6 GHz, Xeon E5-2670, 32 GB, 332 Gflop/s Double Precision Peak
Fermi (Tesla M2090) 1.3 GHz, 5.4 GB, 665 Gflop/s Double Precision Peak
Kepler (Tesla K20X) 0.732 GHz, 5.4 GB, 1320 Gflop/s Double Precision Peak

Batch Matrix Multiply

[Figure: Performance of Lagrange Interpolation, Newton Interpolation, Optimized CUDA, 4x4. X axis: Number of Matrices (0 to 60000). Y axis: Gflop/s (0 to 200). Series: Optimized CUDA, Kepler; Optimized CUDA, Fermi; Newton Interpolation, Kepler; Lagrange Interpolation, Fermi; Newton Interpolation, Fermi; Lagrange Interpolation, Kepler.]

CPU Cores (2 x 8-cores) 2.6 GHz, Xeon E5-2670, 32 GB, 332 Gflop/s Double Precision Peak
Fermi (Tesla M2090) 1.3 GHz, 5.4 GB, 665 Gflop/s Double Precision Peak
Kepler (Tesla K20X) 0.732 GHz, 5.4 GB, 1320 Gflop/s Double Precision Peak

Batch Matrix Multiply

[Figure slide, no text]

Batch Matrix Multiply

One matrix multiply of size N:
N*N*N flops for N*N memory accesses

1024 matrix multiplies of size N/32:
1024*(N/32)*(N/32)*(N/32) flops for 1024*(N/32)*(N/32) memory accesses
= (N*N*N)/32 flops for N*N memory accesses
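The arithmetic above can be checked with a few lines of Python. The sketch uses the slide's own simplified counts (size cubed flops and size squared memory accesses per multiply), not an exact operation count. The function name `matmul_counts` and the choice N = 1024 are illustrative.

```python
def matmul_counts(size, count=1):
    """Flops and memory accesses for `count` multiplies of size x size matrices,
    using the simplified model of size**3 flops and size**2 accesses each."""
    flops = count * size ** 3
    accesses = count * size ** 2
    return flops, accesses


N = 1024  # illustrative size, chosen so that N / 32 is an integer

big_flops, big_accesses = matmul_counts(N)
small_flops, small_accesses = matmul_counts(N // 32, count=1024)

print("one large multiply:   ", big_flops / big_accesses, "flops per access")
print("1024 small multiplies:", small_flops / small_accesses, "flops per access")
```

Both cases touch the same amount of memory, but the batch of small multiplies does 32 times less arithmetic per access, so under this model batched small-matrix multiplication is far more likely to be limited by memory bandwidth than one large multiply.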