Bioinformatics: R for Bioinformatics, PART II
Kristel Van Steen, PhD, ScD ([email protected])
Université de Liège - Institut Montefiore
2008-2009

Simplified Epistasis Testing

We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 differs from (for example, is greater than) the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects.
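As a sketch of what this definition means on the log odds scale, the model below writes the log odds of being a case as a baseline plus one term per locus plus an interaction term. The symbols (mu, alpha, beta, gamma and the indices i, j for the number of copies of allele 2 at locus 3 and locus 5) are notation assumed here for illustration; they do not appear in the original slides.

```latex
\[
\log\frac{P(\text{case}\mid i,j)}{1-P(\text{case}\mid i,j)}
  = \mu + \alpha_i + \beta_j + \gamma_{ij},
\qquad i,j \in \{0,1,2\},
\]
\[
\alpha_0 = \beta_0 = 0, \qquad \gamma_{0j} = \gamma_{i0} = 0 .
\]
\[
H_0 \text{ (no epistasis)}:\quad
\gamma_{11} = \gamma_{12} = \gamma_{21} = \gamma_{22} = 0 .
\]
```

Under these constraints the full model has 1 + 2 + 2 + 4 = 9 free parameters, one per genotype combination, and the null hypothesis removes the 4 interaction parameters, which is where a 4 df test comes from.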

First get rid of the data in the memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5:

    detach(casecon)
    newcasecon <- read.table("newcasecondata.txt", header=TRUE)
    attach(newcasecon)

You can look at the data by typing

    fix(newcasecon)

Cordell practical (see statistical genetics class)

Next create appropriate genotype and case variables:

    case <- affected-1

    g3 <- genotype(loc3_1, loc3_2)
    g5 <- genotype(loc5_1, loc5_2)

The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately:

    gcontrasts(g3) <- "genotype"
    logit(case ~ g3)
    anova(logit(case ~ g3))

    gcontrasts(g5) <- "genotype"
    logit(case ~ g5)
    anova(logit(case ~ g5))

In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotype:

    count3 <- allele.count(g3,2)
    count5 <- allele.count(g5,2)

Check you understand how variables count3 and count5 relate to g3 and g5 by typing

    g3
    count3

    g5
    count5

We then create a variable that codes for the combined effect of locus 3 and 5 as follows:

    combo <- 10*count3+count5

Check you understand how the 'combo' variable relates to g3 and g5 by typing

    g3
    g5
    combo

Now we need to code each of these variables as 'factors', which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning:

    fact3 <- factor(count3)
    fact5 <- factor(count5)
    factcombo <- factor(combo)
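The coding steps above can be sketched in base R on a handful of made-up individuals. The allele values below are hypothetical, and the counts are computed directly from the allele columns rather than with allele.count(), so this is an illustration of the coding scheme only, not the practical's own data or functions.

```r
# Hypothetical allele columns for six individuals (alleles coded 1 or 2)
loc3_1 <- c(1, 1, 2, 1, 2, 2)
loc3_2 <- c(1, 2, 2, 1, 1, 2)
loc5_1 <- c(1, 2, 1, 2, 2, 1)
loc5_2 <- c(1, 2, 2, 2, 1, 1)

# Number of copies of allele 2 at each locus (0, 1 or 2)
count3 <- (loc3_1 == 2) + (loc3_2 == 2)
count5 <- (loc5_1 == 2) + (loc5_2 == 2)

# Two-digit code: tens digit = count at locus 3, units digit = count at locus 5
combo <- 10 * count3 + count5

# Treat the codes as category labels rather than as numbers
fact3 <- factor(count3)
fact5 <- factor(count5)
factcombo <- factor(combo)

print(data.frame(count3, count5, combo))
```

Because each count is at most 2, the two digits never collide, so every one of the 9 possible genotype combinations gets its own combo value (0, 1, 2, 10, 11, 12, 20, 21, 22).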

Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables:

    anova(logit(case ~ fact3))
    anova(logit(case ~ fact5))

Now test whether there is significant epistasis by typing

    anova(logit(case ~ fact3 + fact5 + factcombo))
    1-pchisq(9.59,4)

This first fits the individual locus factors, then adds the extra terms needed for the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the two models. You should get a chi-squared of 9.59 on 4 df with p value 0.048, i.e. there is marginal evidence of epistasis.
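The same comparison can be sketched with base R's glm() on simulated data, as a way of seeing where the 4 df come from. Everything below is an assumption for illustration: the data are randomly generated with no true effects, and glm(..., family = binomial) stands in for the practical's logit() helper, which is not defined in these slides.

```r
set.seed(1)
n <- 1000

# Simulated allele counts and case status (no true genetic effect)
count3 <- rbinom(n, 2, 0.3)
count5 <- rbinom(n, 2, 0.4)
case   <- rbinom(n, 1, 0.4)

fact3     <- factor(count3)
fact5     <- factor(count5)
factcombo <- factor(10 * count3 + count5)

# Model with the two single-locus factors only (5 parameters)
main_fit <- glm(case ~ fact3 + fact5, family = binomial)
# Model with one parameter per genotype combination (9 parameters)
full_fit <- glm(case ~ factcombo, family = binomial)

# Likelihood ratio test for epistasis: drop in deviance and its df
lrt <- deviance(main_fit) - deviance(full_fit)
df  <- df.residual(main_fit) - df.residual(full_fit)
p   <- pchisq(lrt, df, lower.tail = FALSE)
print(c(chisq = lrt, df = df, p = p))

# p value for the chi-squared statistic quoted in the text
p_text <- pchisq(9.59, 4, lower.tail = FALSE)
print(round(p_text, 3))
```

Using lower.tail = FALSE is equivalent to 1-pchisq(...) but avoids loss of precision for very small p values.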

The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between UNLINKED LOCI ONLY is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.

To do this, we can use a chi-squared test to look for correlation (association) between the loci within the case and control groups separately, starting with the cases.

First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases. To do this, we can take advantage of the fact that the data have been ordered in such a way that cases are the first 384 observations. (Check this by typing case or fix(newcasecon).) So we can create genotype vectors just for the cases using the following commands:

    caseg3 <- g3[1:384]
    caseg5 <- g5[1:384]

Take a look at the vectors you have created by typing

    caseg3
    caseg5

Now do a chi-squared test on the genotype variables to see if they are correlated with each other:

    table(caseg3,caseg5)
    chisq.test(caseg3,caseg5)

You should find much more significant evidence of epistasis (p value 0.0018) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population:

    contg3 <- g3[385:1056]
    contg5 <- g5[385:1056]

    contg3
    contg5

    table(contg3,contg5)
    chisq.test(contg3,contg5)

You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci.
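The case-only test and the control check can also be sketched without relying on the row order of the file, by selecting rows with a logical index on the case variable. The data below are simulated under independence and are purely hypothetical (only the group sizes 384 and 672 are taken from the text), and allele counts are used in place of the genotype objects, so the p values will not match those quoted above.

```r
set.seed(2)

# Simulated stand-in for the practical's data: 384 cases, then 672 controls
case   <- c(rep(1, 384), rep(0, 672))
n      <- length(case)
count3 <- rbinom(n, 2, 0.4)
count5 <- rbinom(n, 2, 0.5)

# Select cases and controls by status rather than by hard-coded row ranges
is_case <- case == 1

case_tab <- table(count3[is_case],  count5[is_case])
cont_tab <- table(count3[!is_case], count5[!is_case])

# Case-only test of interaction: association between loci among cases
case_test <- chisq.test(case_tab)
# Check of the independence assumption among controls
cont_test <- chisq.test(cont_tab)

print(case_test$p.value)
print(cont_test$p.value)
```

Indexing by case == 1 gives the same subsets as 1:384 and 385:1056 when the file is sorted as described, but it keeps working if the rows are reordered.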

Running the command lines in R to test for epistasis

Resources for microarray analysis:

http://www.nslij-genetics.org/microarray/

Review Paper: gene expression analysis (Slonim et al 2002)
