Identifying gene expression features to distinguish SNPs causative of disease
SHYLEE EZRONI, JENNY CHEN AND AVIV REGEV
Department of Applied Mathematics, Wentworth Institute of Technology and Aviv Regev Lab, Broad Institute of Harvard and MIT
The completion of the Human Genome Project was a tremendous advancement in basic science and molecular medicine. For example, the completed human reference genome has enabled many genome-wide association studies (GWAS), which have identified tens of thousands of single nucleotide polymorphisms (SNPs) that are correlated with the presence of complex disease.
Several features of genes and genomes, including sequence conservation and protein deleteriousness predictions, have been developed for prioritizing likely-causative SNPs for further interrogation. However, efficiently prioritizing SNPs remains an open problem.
Introduction Discussion Results
1. Classify the location of each SNP and the region of a gene it resides in (e.g. UTR3, UTR5, Coding, Intron)
2. Calculate the length of each region
3. Analyze the rate distribution of SNPs in each region and compare between expression-conserved and expression-diverged genes
4. Conduct statistical analysis on the data to identify significantly different SNP frequencies
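Steps 1 to 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the interval layout in `REGIONS`, the toy SNP positions, and the function names are hypothetical and are not taken from the poster's pipeline. The sketch assigns each SNP to a region, sums region lengths, and reports the SNP rate as the count divided by the region length.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical annotation for one gene: non-overlapping, half-open
# [start, end) intervals sorted by start, each with a region label.
REGIONS = [
    (100, 150, "UTR5"),
    (150, 300, "Coding"),
    (300, 800, "Intron"),
    (800, 950, "Coding"),
    (950, 1000, "UTR3"),
]

def classify_snp(pos, regions):
    """Return the label of the region containing pos, or None if outside."""
    starts = [start for start, _, _ in regions]
    i = bisect_right(starts, pos) - 1
    if i >= 0:
        start, end, label = regions[i]
        if start <= pos < end:
            return label
    return None

def region_lengths(regions):
    """Total length in bases of each region type."""
    lengths = Counter()
    for start, end, label in regions:
        lengths[label] += end - start
    return lengths

def snp_rates(snp_positions, regions):
    """SNPs per base for each region type (count normalized by length)."""
    counts = Counter()
    for pos in snp_positions:
        label = classify_snp(pos, regions)
        if label is not None:
            counts[label] += 1
    lengths = region_lengths(regions)
    return {label: counts[label] / lengths[label] for label in lengths}

snps = [120, 160, 310, 450, 799, 820, 2000]  # toy positions, not real data
rates = snp_rates(snps, REGIONS)
```

Computing one rate per gene and region in this way would give the per-group distributions that the later comparisons operate on.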
Contribution of Conserved and Diverged Genes to Disease
Table 1. Fractions of disease-causing genes and of no-phenotype genes that are conserved in expression. Disease-causing genes were downloaded from Online Mendelian Inheritance in Man (OMIM). Fisher's exact test was used to test for significance.
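For readers unfamiliar with the test named in the caption, the following is a minimal, standard-library sketch of a two-sided Fisher's exact test on a 2x2 table (for example, rows for disease-causing vs. no-phenotype genes and columns for conserved vs. diverged). The function name is hypothetical, the counts in the example call are toy values rather than the poster's data, and the poster does not state which implementation was actually used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(q for q in probs if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Toy counts for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

The exact test is a natural choice here because it makes no large-sample assumption, which matters when one cell of the table holds few genes.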
1. Control for length confounder in analysis
2. Identify additional features for analysis: tissue specificity, DNase I hypersensitivity features
3. Build machine learning models to predict causative SNPs based on expression features
4. Implement feature selection analysis to classify prioritization performances of each feature
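One possible way to approach the first item, controlling for length, is to compare only length-matched subsets of the two gene groups. The sketch below is an assumption about how that could be done, not the authors' plan: it bins genes by log10 length and downsamples the larger group within each bin. The function name, the `bins_per_decade` parameter, and the gene names are all hypothetical.

```python
import math
import random
from collections import defaultdict

def length_matched_subsample(conserved, diverged, bins_per_decade=4, seed=0):
    """Subsample two {gene: length} dicts so their length-bin counts match.

    Genes are binned by log10(length), which must be positive. Within each
    bin present in both groups, the larger group is randomly downsampled to
    the size of the smaller one; bins seen in only one group are dropped.
    """
    rng = random.Random(seed)

    def bin_genes(lengths):
        binned = defaultdict(list)
        for gene, length in sorted(lengths.items()):
            key = math.floor(math.log10(length) * bins_per_decade)
            binned[key].append(gene)
        return binned

    cons_bins, div_bins = bin_genes(conserved), bin_genes(diverged)
    keep_cons, keep_div = [], []
    for b in sorted(set(cons_bins) & set(div_bins)):
        k = min(len(cons_bins[b]), len(div_bins[b]))
        keep_cons += rng.sample(cons_bins[b], k)
        keep_div += rng.sample(div_bins[b], k)
    return keep_cons, keep_div

# Toy lengths in bases, for illustration only.
conserved = {"c1": 1200, "c2": 1100, "c3": 50000}
diverged = {"d1": 1050, "d2": 200}
keep_cons, keep_div = length_matched_subsample(conserved, diverged)
```

SNP rates could then be recomputed on the matched subsets only, so that any remaining difference is less likely to be driven by region length. Matching discards data, so a regression that includes length as a covariate would be an alternative worth considering.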
Table 2. P-values of the two-sample Kolmogorov-Smirnov test applied to SNP rates of expression-conserved and expression-diverged genes.
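As a reminder of what the two-sample Kolmogorov-Smirnov test measures, here is a small standard-library sketch. It computes the statistic D (the largest gap between the two empirical cumulative distributions) and an approximate p-value from the asymptotic Kolmogorov distribution. The function name is hypothetical, the approximation is rough for small samples, and the poster does not say which software produced the values in Table 2.

```python
from bisect import bisect_right
from math import exp, sqrt

def ks_two_sample(x, y):
    """Return (D, approximate p-value) for the two-sample KS test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    for v in xs + ys:
        # Empirical CDFs at v: fraction of each sample that is <= v.
        fx = bisect_right(xs, v) / n
        fy = bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    lam = sqrt(n * m / (n + m)) * d
    if lam < 0.1:
        # The series below converges poorly here; the true value is ~1.
        return d, 1.0
    # Alternating series for the asymptotic Kolmogorov survival function.
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

# Toy SNP rates for illustration only.
d_stat, p_approx = ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6])
```

Because the test compares whole distributions rather than means, it can detect differences in spread or outliers between the two gene groups as well as shifts in location.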
Statistical Differences in SNP Distributions Across Conserved and Diverged Genes
We believe the lower rates of disease-associated SNPs in introns of genes conserved in expression may be an informative observation for prioritizing GWAS SNPs. Although our results indicate that conservation of expression can be an effective marker for prioritizing SNPs, we noticed that more outliers existed among the expression-diverged genes. We suspect this may be due to confounding by the different lengths of diverged vs. conserved genes.
Figure 4. Box plot exhibiting comparison of lengths classified by region, in expression conserved and divergent genes. Genes divergent in expression had shorter region lengths.
Figure 1. Comparison of common SNP rates (number of SNPs normalized by length of region), classified by region, in expression conserved and divergent genes.
Figure 2. Comparison of GWAS SNP rates, classified by region, in expression conserved and divergent genes.
We found that:
1. There are slightly lower rates of common SNPs in 5' UTRs and coding regions of expression-conserved genes
2. There are significantly fewer GWAS and ClinVar SNPs in the intronic regions of expression-conserved genes as compared to expression-diverged genes
Given that the majority of disease-causing SNPs are believed to fall in the noncoding genome, and because noncoding sequence is known to influence gene expression, we hypothesize that gene expression can be an informative feature for prioritizing SNPs. We plan to analyze the relationship of disease-causing and non-disease-causing SNPs to the expression conservation of nearby genes.
Source of data: Gene expression data (RNA-seq) of 12 mammals across 7 tissue types, which allows us to characterize evolutionary patterns of gene expression.
The length of each region was, on average, shorter in expression-diverged genes than in expression-conserved genes, which would introduce confounders into comparisons of SNP rates.
Regev Lab: Jenny Chen, Aviv Regev, Dawn Thompson
Summer Research Program in Genomics: Bruce Birren, Francie Latour, Marlina Duncan, Brandon Ogbunu, Cohort
First, we performed a proof of concept by comparing what fraction of known disease causing genes were considered conserved vs. diverged in expression. We found that proportionally more expression conserved genes are related to disease than unconserved genes (p-value