Identifying gene expression features to distinguish SNPs causative of disease

SHYLEE EZRONI, JENNY CHEN AND AVIV REGEV

Department of Applied Mathematics, Wentworth Institute of Technology and Aviv Regev Lab, Broad Institute of Harvard and MIT

The completion of the Human Genome Project was a tremendous advancement in basic science and molecular medicine. For example, the completed human reference genome has enabled many genome-wide association studies (GWAS), which have identified tens of thousands of single nucleotide polymorphisms (SNPs) correlated with complex disease.

Several features of genes and genomes, including sequence conservation and predictions of protein deleteriousness, have been developed for prioritizing likely causative SNPs for further interrogation. However, efficiently prioritizing SNPs remains an open problem.

Introduction Discussion Results

Acknowledgements

Objective

Methods

1.  Classify the location of each SNP and what region of a gene it resides in (e.g. UTR3, UTR5, Coding, Intron)

2.  Calculate the length of each region

3.  Analyze the rate distribution of SNPs in each region and compare between expression-conserved and expression-diverged genes

4.  Conduct statistical analysis on data to determine significantly different SNP frequencies
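Steps 1 to 3 above can be illustrated with a small sketch. This is not the pipeline used for the poster; the function name `snp_rates`, the tuple layout of the regions, the half-open coordinate convention, and the toy coordinates are all assumptions made for illustration. It counts the SNPs that fall in each annotated region and normalizes by region length, which is the rate definition given in the caption of Figure 1.

```python
from bisect import bisect_left
from collections import defaultdict


def snp_rates(regions, snp_positions):
    """Return SNPs per base for each (gene, region_type) pair.

    regions: iterable of (gene, region_type, start, end) tuples on a single
             chromosome, with half-open coordinates [start, end).
             (Hypothetical layout, chosen for this sketch.)
    snp_positions: iterable of integer SNP coordinates on the same chromosome.
    """
    positions = sorted(snp_positions)
    counts = defaultdict(int)
    lengths = defaultdict(int)
    for gene, region_type, start, end in regions:
        # Number of SNPs with start <= position < end, via binary search.
        n_in_region = bisect_left(positions, end) - bisect_left(positions, start)
        counts[(gene, region_type)] += n_in_region
        lengths[(gene, region_type)] += end - start
    # Rate = SNP count normalized by total region length.
    return {key: counts[key] / lengths[key] for key in lengths if lengths[key] > 0}


# Toy example with made-up coordinates.
toy_regions = [
    ("G1", "UTR5", 0, 100),
    ("G1", "Coding", 100, 400),
    ("G1", "Intron", 400, 1400),
]
toy_snps = [50, 150, 250, 500, 1399, 1400]
toy_rates = snp_rates(toy_regions, toy_snps)
```

Summing counts and lengths per key means a gene with several introns gets one pooled intron rate; a per-segment rate would be an equally reasonable choice.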

Results

Contribution of Conserved and Diverged Genes to Disease

Table 1. Number and fraction of disease-causing genes among expression-conserved and expression-diverged genes. Disease-causing genes were downloaded from Online Mendelian Inheritance in Man (OMIM). Fisher’s exact test was used to test for significance.

Future Directions

1. Control for length confounder in analysis

2. Identify additional features for analysis: tissue specificity, DNase I hypersensitivity features

3. Build machine learning models to predict causative SNPs based on expression features

4. Implement feature selection analysis to compare the prioritization performance of each feature
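One possible way to approach the first item, controlling for length, is to compare SNP rates only between genes of similar length. The sketch below is one assumed approach and not a method described on the poster: it ranks genes by length, splits them into equal-sized bins, and groups the rates by expression class within each bin. The function name `stratify_by_length` and the tuple layout are hypothetical.

```python
def stratify_by_length(genes, n_bins):
    """Group SNP rates by length bin and expression class.

    genes: list of (gene_id, group, length, rate) tuples, where group is a
           label such as "conserved" or "diverged" (hypothetical layout).
    n_bins: number of equal-sized, rank-based length bins.
    Returns {bin_index: {group: [rates]}}.
    """
    ordered = sorted(genes, key=lambda g: g[2])  # sort by region length
    bins = {}
    for rank, (gene_id, group, length, rate) in enumerate(ordered):
        # Rank-based binning puts roughly the same number of genes in each bin.
        bin_index = rank * n_bins // len(ordered)
        bins.setdefault(bin_index, {}).setdefault(group, []).append(rate)
    return bins
```

A statistical test can then be run within each bin, so that conserved and diverged genes are only compared against genes of comparable length.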

Table 2. p-values of the two-sample Kolmogorov-Smirnov test comparing SNP rates of expression-conserved and expression-diverged genes.
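For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distributions of the two samples. The sketch below is a standard-library illustration only; it is not the code behind Table 2. The function names are assumptions, and the p-value uses a common large-sample approximation (the alternating Kolmogorov series with a small-sample correction factor) and is not an exact calculation.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_statistic(x, y):
    """Largest absolute difference between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for v in xs + ys:
        fx = bisect_right(xs, v) / n
        fy = bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue_asymptotic(d, n, m):
    """Approximate two-sided p-value for statistic d with sample sizes n, m."""
    effective_n = n * m / (n + m)
    lam = (sqrt(effective_n) + 0.12 + 0.11 / sqrt(effective_n)) * d
    if lam < 0.2:
        # The series converges poorly here and the p-value is essentially 1.
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = 2 * (-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, total))
```

In practice a library routine such as `scipy.stats.ks_2samp` would normally be used; the version above only makes the definition of the statistic concrete.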

Statistical Differences in SNP Distributions Across Conserved and Diverged Genes

We believe the lower rates of disease-associated SNPs in introns of genes conserved in expression may be an informative observation for prioritizing GWAS SNPs. Although our results indicate that conservation of expression can be an effective marker for prioritizing SNPs, we noticed more outliers among the expression-diverged genes. We suspect this may be due to a difference in length between diverged and conserved genes, which would act as a confounder.

Figure 4. Box plot exhibiting comparison of lengths classified by region, in expression conserved and divergent genes. Genes divergent in expression had shorter region lengths.

Figure 1. Comparison of common SNP rates (number of SNPs normalized by length of region), classified by region, in expression conserved and divergent genes.

Figure 2. Comparison of GWAS SNP rates, classified by region, in expression conserved and divergent genes.

We found that:

•  Rates of common SNPs are slightly lower in 5’ UTRs and coding regions of conserved genes

•  There are significantly fewer GWAS and ClinVar SNPs in the intronic regions of expression-conserved genes than in those of expression-diverged genes

Given that the majority of disease-causing SNPs are believed to fall in the noncoding genome, and because noncoding sequence is known to influence gene expression, we hypothesize that gene expression can be an informative feature in prioritizing SNPs. We plan to analyze the relationship of disease-causing and non-disease-causing SNPs to expression conservation of nearby genes.

Source of data: Gene expression data (RNA-seq) of 12 mammals across 7 tissue types, which allows us to characterize evolutionary patterns of gene expression.

The length of each region, on average, was shorter in expression-diverged genes than in expression-conserved genes, which could confound comparisons of SNP rates.

Regev Lab; Jenny Chen, Aviv Regev, Dawn Thompson

Summer Research Program in Genomics; Bruce Birren, Francie Latour, Marlina Duncan, Brandon Ogbunu, Cohort

First, as a proof of concept, we compared the fraction of known disease-causing genes among expression-conserved and expression-diverged genes. We found that proportionally more expression-conserved genes are related to disease than expression-diverged genes (31% vs. 26%, p-value < 0.001), supporting the idea that conservation of expression is an informative feature for prioritizing SNPs.

Then, we analyzed common SNPs in order to verify an even distribution of SNPs in expression-conserved and expression-diverged genes.

                         Conserved Genes   Diverged Genes
Total                    1,096             7,434
# Causative of Disease   344               1,943
% Causative of Disease   31%               26%
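The comparison of disease-causing fractions in the table above can be re-examined with Fisher's exact test on the corresponding 2x2 table. The sketch below is a standard-library illustration and not the authors' code: the function names are assumptions, the non-disease counts are derived by subtracting the disease-causing counts from the totals, and the two-sided p-value sums the probabilities of all tables no more likely than the observed one, which is one common convention.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def log_prob(x):
        # Hypergeometric log-probability of x in the top-left cell.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    log_observed = log_prob(a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p_value = sum(
        exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= log_observed + 1e-7
    )
    return min(1.0, p_value)


# Counts taken from the table: rows are conserved / diverged genes,
# columns are disease-causing / not disease-causing (total minus disease).
conserved_disease, conserved_total = 344, 1096
diverged_disease, diverged_total = 1943, 7434
a, b = conserved_disease, conserved_total - conserved_disease
c, d = diverged_disease, diverged_total - diverged_disease

odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_two_sided(a, b, c, d)
```

With these counts the odds ratio is about 1.29, meaning the odds of being disease-causing are higher for expression-conserved genes.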

RNA-seq

Common SNPs are evenly distributed in expression-conserved and expression-diverged genes

Conservation of expression can indicate genes that are disease causing

Significantly lower GWAS SNP rates exist in intronic regions of expression-conserved vs. expression-diverged genes

Figure 3. Comparison of ClinVar SNP rates, classified by region, in expression conserved and divergent genes.

Significantly lower ClinVar SNP rates exist in intronic regions of expression-conserved vs. expression-diverged genes