Pangenome-wide association studies (PWAS) with frequented regions

Indika Kahanda, Buwani Manuweera, Brendan Mumey
Gianforte School of Computing, Montana State University, Bozeman, MT, USA

Alan Cleary, Joann Mudge, Thiruvarangan Ramaraj
National Center for Genome Resources, Santa Fe, NM, USA

Pangenome-wide association studies (PWAS) with frequented … · 2020. 12. 31. · T. Beller and E. Ohlebusch. 2016. A representation of a compressed de Bruijn graph for pan-genome

Pangenomic data

• Increasingly common to sequence multiple genomes per species.

• ...creating pangenomic data sets.

Compressed de Bruijn Graphs

Slide from: T. Beller, E. Ohlebusch, “Efficient construction of a compressed de Bruijn graph for pan-genome analysis”

Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k = 25

Slide from: S. Marcus, “splitMEM: graphical pan-genome analysis with suffix skips”

• Software:
– SplitMem (uses suffix trees): S. Marcus, H. Lee, and M. C. Schatz. Splitmem: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483, 2014.
– E-SplitMem (uses FM-index): T. Beller and E. Ohlebusch. 2016. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms for Molecular Biology 11, 1 (2016), 20.
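To make the notion of a compressed de Bruijn graph concrete, here is a minimal Python sketch. It is a naive in-memory construction for toy inputs, not the suffix-tree or FM-index approach used by the tools above. The function name `compressed_de_bruijn` is hypothetical, and the sketch assumes a node-centric graph (nodes are k-mers) and ignores reverse complements.

```python
from collections import defaultdict

def compressed_de_bruijn(sequences, k):
    """Return (unitigs, edges) for a node-centric de Bruijn graph of order k."""
    succ, pred, kmers = defaultdict(set), defaultdict(set), set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
            if i + k < len(seq):  # link consecutive k-mers of the same sequence
                succ[seq[i:i + k]].add(seq[i + 1:i + k + 1])
                pred[seq[i + 1:i + k + 1]].add(seq[i:i + k])

    def starts_chain(kmer):
        # A k-mer opens a chain unless its single predecessor has it as sole successor.
        if len(pred[kmer]) != 1:
            return True
        (p,) = pred[kmer]
        return len(succ[p]) != 1

    chains, owner = [], {}

    def walk(start):
        # Extend a non-branching chain of k-mers as far as possible.
        chain = [start]
        owner[start] = len(chains)
        while len(succ[chain[-1]]) == 1:
            (nxt,) = succ[chain[-1]]
            if len(pred[nxt]) != 1 or nxt in owner:
                break
            owner[nxt] = len(chains)
            chain.append(nxt)
        chains.append(chain)

    for kmer in sorted(kmers):
        if kmer not in owner and starts_chain(kmer):
            walk(kmer)
    for kmer in sorted(kmers):  # anything left lies on an isolated cycle
        if kmer not in owner:
            walk(kmer)

    # Spell each chain as one string; connect chain ends to chain starts.
    unitigs = [c[0] + "".join(km[-1] for km in c[1:]) for c in chains]
    edges = sorted({(owner[a], owner[b]) for a in kmers for b in succ[a]
                    if chains[owner[a]][-1] == a and chains[owner[b]][0] == b})
    return unitigs, edges
```

For example, `compressed_de_bruijn(["ACGTT", "ACGTA"], 3)` collapses the shared prefix into a single node "ACGT" that branches to "GTA" and "GTT".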

Pangenomic graphs

[Figure: genomic sequences drawn as paths through a pangenomic graph]

…can have millions of vertices and edges

The data:
– A cDB graph G
– A set of paths P within G

A frequented region (FR) is a tuple (C, S):
– C is a set of de Bruijn nodes
– S is a set of (α, κ)-supporting subpaths
Parameters: k, α, κ

[Figure: a supporting subpath p[i,j] passing through the nodes of C; annotations: gap ≤ κ, ≥ α|C|]

NB: need to also consider reverse-complement support
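The support condition can be sketched in a few lines of Python. This is a minimal reading of the definition above, assuming that "gap ≤ κ" bounds the number of consecutive non-C nodes inside a subpath and that "≥ α|C|" bounds the number of distinct C nodes it visits. The function name `supporting_subpaths` is hypothetical, and reverse-complement support is not handled.

```python
def supporting_subpaths(path, cluster, alpha, kappa):
    """Return (i, j) index pairs of maximal subpaths of `path` that support `cluster`."""
    # Positions in the path that land on a node of the cluster C.
    hits = [i for i, node in enumerate(path) if node in cluster]
    runs, start = [], 0
    for n in range(1, len(hits) + 1):
        # Close the current run at the end of the path, or when the stretch of
        # non-cluster nodes between two hits is longer than kappa.
        if n == len(hits) or hits[n] - hits[n - 1] - 1 > kappa:
            runs.append((hits[start], hits[n - 1]))
            start = n
    # Keep runs that visit at least alpha * |C| distinct cluster nodes.
    need = alpha * len(cluster)
    return [(i, j) for i, j in runs
            if len(set(path[i:j + 1]) & cluster) >= need]
```

With `path = list("xabyyycab")` and `cluster = {"a", "b", "c"}`, the call `supporting_subpaths(path, cluster, 0.6, 1)` finds two supporting subpaths, while raising α to 1.0 keeps only the one that visits all three nodes.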

Our FR Algorithm

• Basic idea: find FRs in a bottom-up, agglomerative fashion:
– Each de Bruijn node starts in its own cluster.
– Repeat: merge best pair of clusters.

[Figure: example graph with nodes a, b, c, d, e, f, g and a candidate pair of clusters marked "Merge?"]

Merge process = maximum weight matching: fast parallel approximation algorithms exist

Running time: O(LV + V² lg V)
– V = # of cDB vertices
– L = total length of all genomic sequences in P

(NB: no dependence on # of sequences)

A. Cleary, T. Ramaraj, I. Kahanda, J. Mudge and B. Mumey, "Exploring Frequented Regions in Pan-Genomic Graphs," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. doi: 10.1109/TCBB.2018.2864564
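The bottom-up idea can be illustrated with a deliberately simplified Python sketch. It merges one best pair per round by brute force, whereas the slide describes merging via maximum weight matching, so it does not achieve the stated running time. The names `support` and `greedy_frs`, the `min_support` threshold, and the use of the supporting-subpath count as the merge score are all assumptions made for illustration.

```python
from itertools import combinations

def support(cluster, paths, alpha, kappa):
    """Count subpaths that visit >= alpha*|C| cluster nodes with gaps <= kappa."""
    count = 0
    for path in paths:
        hits = [i for i, v in enumerate(path) if v in cluster]
        start = 0
        for n in range(1, len(hits) + 1):
            if n == len(hits) or hits[n] - hits[n - 1] - 1 > kappa:
                seen = set(path[hits[start]:hits[n - 1] + 1]) & cluster
                if len(seen) >= alpha * len(cluster):
                    count += 1
                start = n
    return count

def greedy_frs(paths, alpha, kappa, min_support):
    """Start from singleton clusters and repeatedly merge the best-supported pair."""
    clusters = {frozenset([v]) for p in paths for v in p}
    while True:
        best, best_sup = None, min_support - 1
        # Sorted order makes tie-breaking deterministic.
        for a, b in combinations(sorted(clusters, key=sorted), 2):
            sup = support(a | b, paths, alpha, kappa)
            if sup > best_sup:
                best, best_sup = (a, b), sup
        if best is None:  # no pair reaches min_support: stop merging
            return clusters
        clusters -= set(best)
        clusters.add(best[0] | best[1])
```

On three toy paths that all contain the run a, b, c, calling `greedy_frs` with `alpha=1.0`, `kappa=0`, `min_support=3` merges those three nodes into one cluster and leaves the remaining nodes as singletons.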

Uses for FRs

• FRs identify syntenic regions
• We have been exploring the following:
– Visualizing pan-genomic space
– Machine Learning with FRs as features

FRs to visualize pangenomes…

Yeast insertion for alcohol tolerance: 17 kb insertion on chromosome XIV

[Figure: FR visualization of yeast strains; legend: Alcohol (wine, sake, ale, bioethanol), Laboratory, Bakery, Other]

Machine Learning

• We propose to use FRs as features for:
– describing existing genomes
– making inferences on unseen genomes.

Strain  FR1  FR2  …  FRn  Label
A       1    0    …  1    0.6
B       0    1    …  1    0.2
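A feature table of this shape could be assembled as in the sketch below. It assumes binary presence/absence features, matching the 0/1 entries in the table, and a hypothetical input `fr_support` that maps each FR name to the set of strains with at least one supporting subpath. The label column would come from a separate phenotype table.

```python
def fr_feature_matrix(strains, fr_support):
    """Build one 0/1 feature row per strain, with one column per FR."""
    fr_names = sorted(fr_support)  # fixed column order
    rows = {s: [int(s in fr_support[fr]) for fr in fr_names] for s in strains}
    return fr_names, rows

# Toy input mirroring the two strains in the table (FR names are illustrative).
names, rows = fr_feature_matrix(
    ["A", "B"],
    {"FR1": {"A"}, "FR2": {"B"}, "FR3": {"A", "B"}},
)
```

Here `rows["A"]` is `[1, 0, 1]` and `rows["B"]` is `[0, 1, 1]`.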

Back to pangenomics…

Can we use FR content for ML tasks like phenotype regression?

[Figure A: # ortholog groups vs. # sharing accessions (1 to 13); series: accession-specific, HM101, HM058, HM056, HM125, HM129, HM034, HM095, HM060, HM185, HM004, HM050, HM023, HM010]

[Figure B: # ortholog groups vs. # genomes sequenced; series: pan-proteome, core-proteome]

Medicago truncatula (450 Mb genome):
• Model legume
• Whole genome duplications
• High level of rearrangements and gene family expansions

Test case: Phenotype regression with FRs

• 100 yeast genome dataset:
– Strope, Pooja K., et al. “The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen.” Genome Research 25.5 (2015): 762-774.

…studied SNP-phenotype associations for 49 phenotypes

SNPs vs FRs for phenotype regression

• 5-fold cross validation:

need to tune parameters:
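For readers unfamiliar with the protocol, a 5-fold split can be sketched in plain Python as below. The function name `k_fold_splits` and the fixed shuffle seed are illustrative assumptions, not the setup used in the talk. Any parameter tuning would normally use only the training indices of each fold.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 100 strains, each fold holds out 20 and trains on the other 80.
splits = list(k_fold_splits(100))
```

Each strain appears in exactly one test fold, so every phenotype value is predicted once by a model that did not see it during training.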

SNP-based regression

Fold  Slope  R²
1     0.48   0.0943
2     0.76   0.058
3     1.1    0.343
4     0.85   0.206
5     1      0.432

[Figure: "Lithium Chloride Folds", scatter plot of predicted vs. original values, colored by fold (Fold 1 to Fold 5)]

Regression done with a sparse Bayesian mixed model using the GEMMA tool
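The per-fold slope and R² can be computed as in this sketch, which assumes they come from an ordinary least-squares line of predicted against original phenotype values, with R² taken as the squared Pearson correlation. The slide does not state how the values were obtained, and `slope_and_r2` is a hypothetical helper. The inputs would be the held-out phenotypes of one fold and the corresponding model predictions.

```python
def slope_and_r2(original, predicted):
    """Least-squares slope of predicted on original, and squared correlation."""
    n = len(original)
    mx = sum(original) / n
    my = sum(predicted) / n
    sxx = sum((x - mx) ** 2 for x in original)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(original, predicted))
    return sxy / sxx, sxy ** 2 / (sxx * syy)
```

A slope near 1 together with a high R² indicates predictions that track the original values closely. A slope well below 1 suggests the model shrinks its predictions toward the mean.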

FRs show improvement over SNPs:

Conclusions

• FRs capture the notion of frequented regions or junctions in the cDB graph.

• PWAS (pangenome-wide association studies):
– Identifying core and adapted gene sets quickly
– Visualization

FutuRe work

• Scale up to larger plant and human pangenomic data sets.
• Interested in new collaborations!

• Acknowledgements: Supported in part by:
– NSF-ABI award 1542262
– NSF-IOS award 1444806
– NSF-DBI award 1759522
– USDA-ARS project funding for the Legume Information System
– Google Summer of Code

Questions?