Pangenome-wide association studies (PWAS) with frequented … · 2020. 12. 31. · T. Bellerand E. Ohlebusch. 2016. A representation of a compressed de Bruijngraph for pan-genome

Indika Kahanda, Buwani Manuweera, Brendan Mumey

Gianforte School of Computing, Montana State University

Bozeman, MT, USA

Alan Cleary, Joann Mudge, Thiruvarangan Ramaraj

National Center for Genome Resources, Santa Fe, NM, USA

Pangenome-wide association studies (PWAS) with frequented regions

Pangenomic data

• Increasingly common to sequence multiple genomes per species.

• ...creating pangenomic data sets.

Compressed de Bruijn Graphs

Slide from: T. Beller, E. Ohlebusch, “Efficient construction of a compressed de Bruijn graph for pan-genome analysis”

Compressed de Bruijn graph of 9 strains of Bacillus anthracis, k=25

Slide from: S. Marcus, “splitMEM: graphical pan-genome analysis with suffix skips”

• Software:
– SplitMem (uses suffix trees): S. Marcus, H. Lee, and M. C. Schatz. Splitmem: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483, 2014.
– E-SplitMem (uses FM-index): T. Beller and E. Ohlebusch. 2016. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms for Molecular Biology 11, 1 (2016), 20.

Pangenomic graphs

genomic sequences

…can have millions of vertices and edges

The data:
– A cDB graph G
– A set of paths P within G
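To make the two inputs concrete, here is a minimal sketch that derives a graph and a path set from a few sequences. It builds a plain (uncompressed) node-centric de Bruijn graph, which is a simplification: a cDB graph would additionally merge non-branching chains of k-mers into single nodes. The function name `de_bruijn_with_paths` and the toy sequences are illustrative assumptions, not part of the tools cited above.

```python
def de_bruijn_with_paths(sequences, k):
    """Build an uncompressed de Bruijn graph and one path per input sequence.

    sequences: dict mapping a sequence name to its DNA string (hypothetical input format).
    Returns (nodes, edges, paths), where each path is the list of k-mers
    visited by that sequence, in order.
    """
    edges = set()
    paths = {}
    for name, seq in sequences.items():
        # Every length-k window of the sequence is a node on its path.
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        paths[name] = kmers
        # Consecutive k-mers (overlapping by k-1 characters) are joined by an edge.
        edges.update(zip(kmers, kmers[1:]))
    nodes = {kmer for path in paths.values() for kmer in path}
    return nodes, edges, paths


nodes, edges, paths = de_bruijn_with_paths({"s1": "ACGTA", "s2": "ACGTT"}, k=3)
```

In this toy case both sequences share the nodes ACG and CGT and then diverge, which is the kind of shared structure the path set P records.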

A frequented region (FR) is a tuple (C, S):
– C is a set of de Bruijn nodes
– S is a set of (α, κ)-supporting subpaths
Parameters: k, α, κ

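[Figure: a supporting subpath p[i,j] passing through nodes of C; each gap between C nodes is ≤ κ, and the subpath contains ≥ α|C| of the C nodes]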

NB: need to also consider reverse-complement support
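The support condition can be sketched as a small check. This is an illustration under stated assumptions rather than the authors' implementation: gaps are counted in nodes, the subpath is required to start and end on a node of C, and reverse-complement support is not handled. The name `is_supporting_subpath` is hypothetical.

```python
def is_supporting_subpath(path, i, j, cluster, alpha, kappa):
    """Return True if path[i..j] (inclusive) is an (alpha, kappa)-supporting subpath.

    Assumptions in this sketch: a gap is a run of consecutive nodes not in
    `cluster`, measured in nodes; the subpath must begin and end inside the cluster.
    """
    sub = path[i:j + 1]
    if not sub or sub[0] not in cluster or sub[-1] not in cluster:
        return False
    gap = 0
    for node in sub:
        if node in cluster:
            gap = 0          # back inside C, so the current gap ends
        else:
            gap += 1
            if gap > kappa:  # a gap longer than kappa breaks support
                return False
    # The subpath must cover at least alpha * |C| distinct nodes of C.
    covered = len(set(sub) & cluster)
    return covered >= alpha * len(cluster)


path = ["a", "x", "b", "y", "z", "c"]
C = {"a", "b", "c"}
short_ok = is_supporting_subpath(path, 0, 2, C, alpha=0.6, kappa=1)
long_gap = is_supporting_subpath(path, 0, 5, C, alpha=0.6, kappa=1)
```

Here `short_ok` is true (two of three C nodes, one-node gap), while `long_gap` is false because the run y, z exceeds κ = 1.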

Our FR Algorithm

• Basic idea: find FRs in a bottom-up, agglomerative fashion:
– Each de Bruijn node starts in its own cluster.
– Repeat: merge the best pair of clusters.

[Figure: example graph with nodes a, b, c, d, e, f, g and a candidate pair of clusters marked “Merge?”]

Merge process = maximum weight matching: fast parallel approximation algorithms exist.

Running time: O(LV + V² lg V)
– V = # of cDB vertices
– L = total length of all genomic sequences in P

(NB: no dependence on # of sequences)
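The bottom-up merging idea can be sketched as follows. This is a simplified illustration, not the published algorithm: support is approximated by counting paths that contain at least α|C| of a cluster's nodes (gaps and reverse complements are ignored), every pair of clusters is considered rather than only neighbouring ones, and the matching step uses a simple greedy 1/2-approximation in place of a parallel maximum weight matching. All function names and the `min_support` threshold are assumptions made for this example.

```python
from itertools import combinations


def support(cluster, paths, alpha):
    """Count paths containing at least alpha * |cluster| of the cluster's nodes."""
    return sum(1 for p in paths if len(cluster & set(p)) >= alpha * len(cluster))


def greedy_matching(weighted_pairs):
    """Greedy approximate maximum weight matching: take the heaviest pairs first."""
    matched, chosen = set(), []
    for _, a, b in sorted(weighted_pairs, key=lambda t: -t[0]):
        if a not in matched and b not in matched:
            matched.update((a, b))
            chosen.append((a, b))
    return chosen


def merge_round(clusters, paths, alpha, min_support):
    """One round: merge matched pairs whose union still has enough support."""
    pairs = []
    for a, b in combinations(range(len(clusters)), 2):
        weight = support(clusters[a] | clusters[b], paths, alpha)
        if weight >= min_support:
            pairs.append((weight, a, b))
    chosen = greedy_matching(pairs)
    used = {idx for pair in chosen for idx in pair}
    merged = [clusters[a] | clusters[b] for a, b in chosen]
    return merged + [c for idx, c in enumerate(clusters) if idx not in used]


def find_frs(paths, alpha=0.7, min_support=2):
    """Start with one cluster per node and merge until no pair qualifies."""
    clusters = [frozenset([v]) for v in sorted({v for p in paths for v in p})]
    while True:
        merged = merge_round(clusters, paths, alpha, min_support)
        if len(merged) == len(clusters):  # nothing merged this round
            return clusters
        clusters = merged


toy_paths = [["a", "b", "c", "x"], ["a", "b", "c", "y"], ["z", "a", "b", "c"]]
regions = find_frs(toy_paths, alpha=1.0, min_support=3)
```

On the toy paths, the nodes a, b, c are shared by all three paths and end up in one cluster, while x, y, z stay as singletons.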

A. Cleary, T. Ramaraj, I. Kahanda, J. Mudge and B. Mumey, "Exploring Frequented Regions in Pan-Genomic Graphs," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. doi: 10.1109/TCBB.2018.2864564

Uses for FRs

• FRs identify syntenic regions
• We have been exploring the following:
– Visualizing pan-genomic space
– Machine learning with FRs as features

FRs to visualize pangenomes…

Yeast insertion for alcohol tolerance: 17 kb insertion on chromosome XIV

[Figure legend: Alcohol (wine, sake, ale, bioethanol); Laboratory; Bakery; Other]

Machine Learning

• We propose to use FRs as features for:
– describing existing genomes
– making inferences on unseen genomes.

Strain  FR1  FR2  …  FRn  Label
A       1    0    …  1    0.6
B       0    1    …  1    0.2
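A feature table of this shape could be assembled as in the sketch below, assuming each strain comes with the set of FRs it contains and a numeric phenotype label. The input format, the FR names, and the function name `fr_feature_matrix` are assumptions for illustration.

```python
def fr_feature_matrix(strain_frs, labels):
    """Build a binary presence/absence matrix with one row per strain.

    strain_frs: dict mapping strain name -> set of FR identifiers found in it.
    labels:     dict mapping strain name -> phenotype value.
    Returns (strains, fr_names, X, y) with rows and columns in sorted order.
    """
    fr_names = sorted({fr for frs in strain_frs.values() for fr in frs})
    strains = sorted(strain_frs)
    # X[r][c] is 1 if strain r contains FR c, else 0.
    X = [[1 if fr in strain_frs[s] else 0 for fr in fr_names] for s in strains]
    y = [labels[s] for s in strains]
    return strains, fr_names, X, y


strains, fr_names, X, y = fr_feature_matrix(
    {"A": {"FR1", "FR3"}, "B": {"FR2", "FR3"}},
    {"A": 0.6, "B": 0.2},
)
```

The resulting `X` and `y` can then be handed to any regression method that accepts a numeric design matrix.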

Back to pangenomics…

Can we use FR content for an ML task like phenotype regression?

[Figure, panel A: # ortholog groups vs. # sharing accessions (1 to 13); accession-specific groups shown per accession: HM101, HM058, HM056, HM125, HM129, HM034, HM095, HM060, HM185, HM004, HM050, HM023, HM010]

[Figure, panel B: # ortholog groups vs. # genomes sequenced, for the pan-proteome and the core-proteome]

Medicago truncatula (450 Mb genome)
Model legume
Whole genome duplications
High level of rearrangements and gene family expansions

Test case: Phenotype regression with FRs

• 100 yeast genome dataset:
– Strope, Pooja K., et al. “The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen.” Genome Research 25.5 (2015): 762-774.

…studied SNP-phenotype associations for 49 phenotypes

SNPs vs FRs for phenotype regression

• 5-fold cross validation:

need to tune parameters:

SNP-based regression

Fold  Slope  R²
1     0.48   0.0943
2     0.76   0.058
3     1.1    0.343
4     0.85   0.206
5     1      0.432

[Figure: predicted vs. original values for the Lithium Chloride phenotype, one series per fold (Fold 1 to Fold 5)]

Regression done with a sparse Bayesian mixed model using the GEMMA tool
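The evaluation set-up can be sketched independently of the regression model. The code below splits samples into folds and computes a slope and R² between original and predicted values; it assumes the per-fold slope and R² on the slides come from a least-squares fit of predicted against original values, which the slides do not state explicitly. It does not call GEMMA, and the helper names are hypothetical.

```python
def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def slope_and_r2(original, predicted):
    """Least-squares slope of predicted on original, and the squared correlation."""
    n = len(original)
    mean_x = sum(original) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in original)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(original, predicted))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2


splits = list(k_fold_indices(10, k=5))
slope, r2 = slope_and_r2([1, 2, 3, 4], [2, 4, 6, 8])
```

In a full experiment, a model would be fitted on each `train` split and `slope_and_r2` applied to its predictions on the matching `test` split.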

FRs show improvement over SNPs:

Conclusions

• FRs capture the notion of frequented regions or junctions in the cDB graph.

• PWAS (pangenome-wide association studies):
– Identifying core and adapted gene sets quickly
– Visualization

FutuRe work

• Scale up to larger plant and human pangenomic data sets.
• Interested in new collaborations!

• Acknowledgements: Supported in part by:
– NSF-ABI award 1542262
– NSF-IOS award 1444806
– NSF-DBI award 1759522
– USDA-ARS project funding for the Legume Information System
– Google Summer of Code

Questions?
