Indika Kahanda, Buwani Manuweera, Brendan Mumey
Gianforte School of Computing, Montana State University
Bozeman, MT, USA
Alan Cleary, Joann Mudge, Thiruvarangan Ramaraj
National Center for Genome Resources, Santa Fe, NM, USA
Pangenome-wide association studies (PWAS) with frequented regions
Pangenomic data
• Increasingly common to sequence multiple genomes per species.
• ...creating pangenomic data sets.
Compressed de Bruijn graphs
Slide from: T. Beller, E. Ohlebusch, “Efficient construction of a compressed de Bruijn graph for pan-genome analysis”
Compressed de Bruijn graph of 9 strains of Bacillus anthracis, k=25
Slide from: S. Marcus, “splitMEM: graphical pan-genome analysis with suffix skips”
• Software:
– SplitMEM (uses suffix trees): S. Marcus, H. Lee, and M. C. Schatz. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483, 2014.
– E-SplitMem (uses FM-index): T. Beller and E. Ohlebusch. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms for Molecular Biology, 11(1):20, 2016.
Pangenomic graphs
genomic sequences
…can have millions of vertices and edges
The data:
– A cDB graph G
– A set of paths P within G
A frequented region (FR) is a tuple (C, S):
– C is a set of de Bruijn nodes
– S is a set of (α, κ)-supporting subpaths
Parameters: k, α, κ
[Figure: a subpath p[i,j] passing through the nodes of C; each gap ≤ κ, and the subpath contains ≥ α|C| of the C nodes]
NB: need to also consider reverse-complement support
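The support condition can be sketched in code. This is a minimal sketch under one reading of the slide: a subpath supports C if it starts and ends on C nodes, never runs more than κ consecutive nodes outside C, and visits at least α|C| distinct nodes of C. The function name `supports`, measuring gaps in nodes rather than base pairs, and ignoring reverse-complement support are all assumptions made for illustration.

```python
def supports(subpath, cluster, alpha, kappa):
    """Check whether `subpath` is an (alpha, kappa)-supporting subpath of `cluster`.

    subpath : list of node ids (one stretch p[i..j] of a genome path)
    cluster : set of node ids (the set C)
    alpha   : minimum fraction of C that must be visited
    kappa   : maximum number of consecutive non-C nodes allowed (assumed unit)
    """
    # Assumption: a supporting subpath begins and ends on a node of C.
    if not subpath or subpath[0] not in cluster or subpath[-1] not in cluster:
        return False

    seen = set()  # distinct C nodes visited so far
    gap = 0       # length of the current run of non-C nodes
    for node in subpath:
        if node in cluster:
            seen.add(node)
            gap = 0
        else:
            gap += 1
            if gap > kappa:
                return False

    return len(seen) >= alpha * len(cluster)
```

For example, with C = {1, 2, 3, 4}, the subpath [1, 9, 2, 3] visits three of the four C nodes with a single one-node gap, so it supports C for α = 0.75 and κ = 1 but not for α = 1.0 or κ = 0.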
Our FR Algorithm
• Basic idea: find FRs in a bottom-up, agglomerative fashion:
– Each de Bruijn node starts in its own cluster.
– Repeat: merge best pair of clusters.
[Figure: example clusters over nodes a, b, c, d, e, f, g, with a candidate merge marked “Merge?”]
Merge process = maximum weight matching : fast parallel approx. algorithms exist
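One round of merging can be pictured as picking a set of disjoint cluster pairs with high total merge score. The sketch below uses the standard sequential greedy heuristic (take the heaviest remaining pair whose clusters are both still free), which is a well-known 1/2-approximation for maximum weight matching. It is not necessarily the algorithm the authors use, and the `weights` dictionary of merge scores is a hypothetical input for illustration.

```python
def greedy_matching(weights):
    """Greedy 1/2-approximate maximum weight matching.

    weights : dict mapping a pair (u, v) of cluster ids to a merge score
              (hypothetical: higher means a more attractive merge)
    Returns a list of disjoint pairs chosen for merging in this round.
    """
    matched = set()
    pairs = []
    # Visit candidate pairs from heaviest to lightest.
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if w <= 0:
            break  # remaining pairs offer no gain
        if u not in matched and v not in matched:
            pairs.append((u, v))
            matched.update((u, v))
    return pairs
```

Each cluster appears in at most one chosen pair, so all selected merges in a round can be applied independently of one another.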
Running time: O(LV + V² lg V)
– V = # of cDB vertices
– L = total length of all genomic sequences in P
(NB: no dependence on # of sequences)
A. Cleary, T. Ramaraj, I. Kahanda, J. Mudge and B. Mumey, "Exploring Frequented Regions in Pan-Genomic Graphs," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. doi: 10.1109/TCBB.2018.2864564
Uses for FRs
• FRs identify syntenic regions
• We have been exploring the following:
– Visualizing pan-genomic space
– Machine learning with FRs as features
FRs to visualize pangenomes…
Yeast insertion for alcohol tolerance: 17 kb insertion on chromosome XIV
[Figure legend: Alcohol (wine, sake, ale, bioethanol); Laboratory; Bakery; Other]
Machine Learning
• We propose to use FRs as features for:
– describing existing genomes
– making inferences on unseen genomes.
Strain  FR1  FR2  …  FRn  Label
A       1    0    …  1    0.6
B       0    1    …  1    0.2
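A table like this can be built from per-strain FR membership. The following is a minimal sketch: it assumes each strain's FR content is already available as a set of FR identifiers, and the names `fr_feature_matrix`, `strain_frs`, and `fr_ids` are hypothetical.

```python
def fr_feature_matrix(strain_frs, fr_ids):
    """Build binary presence/absence feature rows, one per strain.

    strain_frs : dict mapping strain name -> set of FR ids supported by that strain
    fr_ids     : ordered list of FR ids defining the feature columns
    Returns a dict mapping strain name -> list of 0/1 values, in fr_ids order.
    """
    return {
        strain: [1 if fr in frs else 0 for fr in fr_ids]
        for strain, frs in strain_frs.items()
    }


# Toy input mirroring the first rows of the table (three FR columns only).
features = fr_feature_matrix(
    {"A": {"FR1", "FR3"}, "B": {"FR2", "FR3"}},
    ["FR1", "FR2", "FR3"],
)
```

Each row can then be paired with the strain's phenotype label (0.6 for A, 0.2 for B in the table) to train a regression model.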
Back to pangenomics…
Can we use FR content for an ML task like phenotype regression?
[Figure, panel A: # ortholog groups vs. # sharing accessions (1 to 13), with accession-specific groups shown for HM101, HM058, HM056, HM125, HM129, HM034, HM095, HM060, HM185, HM004, HM050, HM023, HM010]
[Figure, panel B: # ortholog groups vs. # genomes sequenced, for the pan-proteome and the core-proteome]
Medicago truncatula (450 Mb genome)
• Model legume
• Whole genome duplications
• High level of rearrangements and gene family expansions
Test case: phenotype regression with FRs
• 100 yeast genome dataset:
– Strope, Pooja K., et al. “The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen.” Genome Research 25.5 (2015): 762-774.
…studied SNP-phenotype associations for 49 phenotypes
SNPs vs FRs for phenotype regression
• 5-fold cross validation:
need to tune parameters:
SNP-based regression:
Fold    Slope  R²
Fold 1  0.48   0.0943
Fold 2  0.76   0.058
Fold 3  1.1    0.343
Fold 4  0.85   0.206
Fold 5  1      0.432
[Figure: “Lithium Chloride Folds”, Predicted vs. Original, one series per fold (Fold 1 to Fold 5)]
Regression done with a sparse Bayesian mixed model using the GEMMA tool
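The per-fold slope and R² values can be reproduced in spirit with a simple least-squares fit of predicted against original phenotype values. This sketch assumes that is what the slide reports (slope of the fitted line and squared Pearson correlation). The function name `slope_and_r2` is hypothetical, and the sketch is independent of GEMMA itself.

```python
def slope_and_r2(original, predicted):
    """Least-squares slope of predicted vs. original, and squared correlation.

    original  : list of observed phenotype values for one held-out fold
    predicted : list of model predictions for the same strains
    """
    n = len(original)
    if n < 2 or n != len(predicted):
        raise ValueError("need two equally long sequences with at least 2 values")

    mean_x = sum(original) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in original)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(original, predicted))
    if sxx == 0 or syy == 0:
        raise ValueError("values must not be constant")

    slope = sxy / sxx               # slope of the fitted line
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, r2
```

Under this reading, a slope near 1 together with a high R² means the predictions track the original values closely; a slope well below 1 suggests the predictions are shrunk toward the mean.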
FRs show improvement over SNPs:
Conclusions
• FRs capture the notion of frequented regions or junctions in the cDB graph.
• PWAS (pangenome-wide association studies):
– Identifying core and adapted gene sets quickly
– Visualization
FutuRe work
• Scale up to larger plant and human pangenomic data sets.
• Interested in new collaborations!
• Acknowledgements: supported in part by:
– NSF-ABI award 1542262
– NSF-IOS award 1444806
– NSF-DBI award 1759522
– USDA-ARS project funding for the Legume Information System
– Google Summer of Code
Questions?