Statistical Methods for Rare Variant Association Test Using Summarized Data
Qunyuan Zhang, Ingrid Borecki, Michael A. Province
Division of Statistical Genomics
Motivation
Next generation sequencing => rare variants
Two types of data

Individual level

Subject   V1   V2   V3   Trait
1         0    0    0    case
2         1    0    0    case
3         0    0    0    case
4         0    0    0    control
5         0    0    0    control
6         0    0    1    control
…         …    …    …    …

Summarized level

                          V1    V2    V3
Variant No. in cases      10    8     3
Variant No. in controls   2     0     1
No. of cases              300   300   300
No. of controls           500   500   500

• Pooled DNA sequencing
• Public data (as control)
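The relationship between the two data types can be sketched in a few lines of Python. This is only an illustration: the function name `summarize` and the input layout (one row of 0/1 variant indicators per subject, plus a list of trait labels) are assumptions, not something the slides specify.

```python
def summarize(genotypes, traits):
    """Collapse individual-level data into per-variant summary counts.

    genotypes: list of rows, one per subject, each a list of 0/1 indicators
               (assumed layout; 1 means the subject carries the variant).
    traits:    list of "case" / "control" labels, one per subject.
    """
    k = len(genotypes[0])
    in_cases = [0] * k
    in_controls = [0] * k
    for row, trait in zip(genotypes, traits):
        target = in_cases if trait == "case" else in_controls
        for i, g in enumerate(row):
            target[i] += g
    n_cases = sum(1 for t in traits if t == "case")
    return {
        "variant_in_cases": in_cases,
        "variant_in_controls": in_controls,
        "n_cases": n_cases,
        "n_controls": len(traits) - n_cases,
    }


# The six subjects shown in the individual-level table above.
rows = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
labels = ["case", "case", "case", "control", "control", "control"]
summary = summarize(rows, labels)
```

The summarized form keeps only four numbers per variant, which is all that pooled sequencing or public control data can provide.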
Existing Methods
Method    Description                                        Bi-directional effects   Ref.
EFT       Exclusive Frequency Test:                          ×                        Commonly used in publications,
          testing mutually exclusive allele/carrier freq.                             such as Cohen et al., 2004
TFT       Total Frequency Test:                              ×
          testing total allele/carrier freq.
CAST      Cohort Allele Sum Test:                            ×                        Morgenthaler & Thilly, 2006
          testing total allele/carrier number
C-alpha   testing variance                                   √                        Neale et al., 2011
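To see why a total-count test carries a "×" for bi-directional effects, here is a rough collapsing sketch in Python. It is not the cited TFT or CAST implementation: it simply sums variant counts across variants and applies a two-sided Fisher-type exact test to the totals, assuming every variant shares the same numbers of cases and controls and that summed counts can be treated as distinct carriers. The names `hypergeom_pmf` and `collapsed_exact_test` are hypothetical.

```python
import math


def hypergeom_pmf(x, total, n1, n2):
    """Probability that x of `total` variant carriers fall among the n1 cases."""
    return math.comb(n1, x) * math.comb(n2, total - x) / math.comb(n1 + n2, total)


def collapsed_exact_test(tables):
    """Two-sided exact test on counts summed over variants.

    tables: list of (a1, a2, n1, n2) = (variant count in cases,
            variant count in controls, no. of cases, no. of controls).
    Assumes n1 and n2 are the same for every variant.
    """
    a1 = sum(t[0] for t in tables)
    a2 = sum(t[1] for t in tables)
    n1, n2 = tables[0][2], tables[0][3]
    total = a1 + a2
    p_obs = hypergeom_pmf(a1, total, n1, n2)
    lo, hi = max(0, total - n2), min(total, n1)
    p_value = 0.0
    for x in range(lo, hi + 1):
        p = hypergeom_pmf(x, total, n1, n2)
        if p <= p_obs * (1 + 1e-9):  # tables at least as extreme as observed
            p_value += p
    return p_value


# Two variants with the same effect direction: the totals show a signal.
same_direction = collapsed_exact_test([(10, 0, 500, 500), (10, 0, 500, 500)])
# Two variants with opposite directions: the totals cancel out.
opposite_direction = collapsed_exact_test([(10, 0, 500, 500), (0, 10, 500, 500)])
```

In the second call the summed counts are identical in cases and controls, so the collapsed test sees nothing even though each variant on its own is strongly unbalanced.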
QQ Plots of Existing Methods (under the null)
[Figure: QQ plots, panels: CAST, C-alpha, EFT, TFT]
• EFT and C-alpha: inflated with false positives
• TFT and CAST: no inflation, but assume a single effect direction
• Objective: more general, powerful methods …
Structure of Summarized Data
[Figure: one table for each variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]

Strategy
Instead of testing total freq./number, we test the randomness of all tables.
Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution

$$ P_i = \frac{\binom{n_{1i}}{a_{1i}} \binom{n_{2i}}{a_{2i}}}{\binom{N_i}{n_{Ai}}} $$

2. Calculating the log of the joint probability (L) for all k tables

$$ L = \sum_{i=1}^{k} \log(P_i) $$

3. Enumerating all possible tables and L scores
4. Calculating the p-value: P = Prob.( )
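The four steps can be sketched in Python as follows. Several points are assumptions rather than statements from the slides: the symbols are read as n_1i = no. of cases, n_2i = no. of controls, a_1i and a_2i = variant counts in cases and controls, N_i = n_1i + n_2i, and n_Ai = a_1i + a_2i; counts are treated as carrier counts; and, because the expression inside Prob.( ) is missing from the source, the p-value is taken to be the total probability of all table sets whose L is no larger than the observed L, by analogy with Fisher's exact test. The function names are hypothetical.

```python
import itertools
import math


def table_log_prob(a1, a2, n1, n2):
    """Step 1: log hypergeometric probability of one table with fixed margins."""
    return (math.log(math.comb(n1, a1)) + math.log(math.comb(n2, a2))
            - math.log(math.comb(n1 + n2, a1 + a2)))


def ept(tables, tol=1e-9):
    """Exact probability test over k tables, each given as (a1, a2, n1, n2).

    Returns (observed L, p-value). Full enumeration grows exponentially
    with k, so this sketch only suits a handful of rare variants.
    """
    # Step 2: observed log joint probability.
    l_obs = sum(table_log_prob(*t) for t in tables)

    # Step 3: for each variant, every table compatible with its margins.
    per_variant = []
    for a1, a2, n1, n2 in tables:
        total = a1 + a2
        lo, hi = max(0, total - n2), min(total, n1)
        per_variant.append(
            [table_log_prob(x, total - x, n1, n2) for x in range(lo, hi + 1)]
        )

    # Step 4 (assumed definition): sum the joint probability of every
    # combination of tables whose L does not exceed the observed L.
    p_value = 0.0
    for combo in itertools.product(*per_variant):
        l_score = sum(combo)
        if l_score <= l_obs + tol:
            p_value += math.exp(l_score)
    return l_obs, p_value


# Summarized counts for V1, V2, V3 from the Motivation slide.
slide_tables = [(10, 2, 300, 500), (8, 0, 300, 500), (3, 1, 300, 500)]
l_observed, p = ept(slide_tables)
```

Because each table is scored by its own probability rather than by the direction of its imbalance, an excess in cases and an excess in controls both lower L.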
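Likelihood Ratio Test (LRT)

$$ LR = 2 \log \frac{\Pr(a_1, b_1, a_2, b_2, \ldots : H_1)}{\Pr(a_1, b_1, a_2, b_2, \ldots : H_0)} \sim \chi^2, \quad df = k $$

Both probabilities are computed from the binomial distribution, as products over the k variants (i = 1, …, k).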
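A minimal Python sketch of such a statistic is given below. The per-variant formulas are not recoverable from the slides, so the model here is an assumption: a_i and b_i are taken as binomial counts in cases and controls, H0 gives each variant one shared frequency across cases and controls, and H1 gives it separate frequencies, each estimated by maximum likelihood. The function names are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(count, n, p):
    """Binomial log-likelihood without the binomial coefficient,
    which cancels in the likelihood ratio."""
    return _xlogy(count, p) + _xlogy(n - count, 1.0 - p)


def lrt_statistic(tables):
    """tables: list of (a, b, n1, n2) = (variant count in cases,
    variant count in controls, no. of cases, no. of controls).

    Returns (LR, df) with df = k, the number of variants.
    """
    lr = 0.0
    for a, b, n1, n2 in tables:
        p0 = (a + b) / (n1 + n2)  # shared frequency under H0
        ll0 = binom_loglik(a, n1, p0) + binom_loglik(b, n2, p0)
        ll1 = binom_loglik(a, n1, a / n1) + binom_loglik(b, n2, b / n2)
        lr += 2.0 * (ll1 - ll0)
    return lr, len(tables)


# Summarized counts for V1, V2, V3 from the Motivation slide.
lr, df = lrt_statistic([(10, 2, 300, 500), (8, 0, 300, 500), (3, 1, 300, 500)])
```

The p-value would then be the upper tail of a chi-square distribution with df degrees of freedom, for example `scipy.stats.chi2.sf(lr, df)` if SciPy is available.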
Q-Q Plots of EPT and LRT (under the null)
[Figure: Q-Q plots, panels: EPT N=500, EPT N=3000, LRT N=500, LRT N=3000]
Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 80%, neutral 20%, negative causal 0%
[Figure: three panels, power (y-axis) vs. sample size (x-axis)]
Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 60%, neutral 20%, negative causal 20%
[Figure: power (y-axis) vs. sample size (x-axis)]
Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 40%, neutral 20%, negative causal 40%
[Figure: power (y-axis) vs. sample size (x-axis)]
Power Comparison: individual-level data vs. summarized data
N = 1000, significance level = 0.00001
[Figure: power (y-axis) vs. variant proportion positive : neutral : negative (%) (x-axis)]
Methods shown include CMC (Li & Leal, 2008) and SKAT (Wu et al., 2011).
Application
-log10 p-values of 933 cancer-related genes
Cases: 460 ovarian cancer cases, germline exome data, from TCGA
Controls: ~3500 individuals, exome data, from NHLBI
Conclusions
• EFT and C-alpha produce inflated p-values.
• TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.
• EPT produces correct p-values and maintains power regardless of effect directions, but requires more computer time.
• LRT produces slightly biased p-values for small N, which can be improved by larger N; it has power similar to EPT and needs less computer time, making it a good alternative for large datasets.
• If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.
Acknowledgements
Dr. Li Ding
Charles Lu
Krishna-Latha Kanchi
(for providing the TCGA and NHLBI exome data)