Statistical Methods for Rare Variant Association Test Using Summarized Data

Qunyuan Zhang, Ingrid Borecki, Michael A. Province

Division of Statistical Genomics

http://medschool.wustl.edu/

Motivation

Next generation sequencing => rare variants

Two types of data: individual level and summarized level

Individual level

Subject   V1   V2   V3   Trait
1         0    0    0    case
2         1    0    0    case
3         0    0    0    case
4         0    0    0    control
5         0    0    0    control
6         0    0    1    control
…         …    …    …    …

Summarized level

Variant                   V1    V2    V3
Variant No. in cases      10    8     3
Variant No. in controls   2     0     1
No. of cases              300   300   300
No. of controls           500   500   500

• Pooled DNA sequencing

• Public data (as control)
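As a minimal sketch of how the two data types relate, the snippet below collapses individual-level rows into per-variant summary counts. The function name `summarize` and the dictionary keys are hypothetical, chosen only for illustration, and the input is just the six subjects visible in the individual-level table, so the resulting counts are not the 300/500 sample shown in the summarized table.

```python
from collections import Counter

def summarize(genotypes, traits):
    """Collapse individual-level data into per-variant summary counts.

    genotypes: one row per subject, each a list of 0/1 variant indicators.
    traits:    one "case"/"control" label per subject.
    (Hypothetical helper; names are illustrative only.)
    """
    n_variants = len(genotypes[0])
    group_sizes = Counter(traits)
    counts = {"case": [0] * n_variants, "control": [0] * n_variants}
    for row, trait in zip(genotypes, traits):
        for j, g in enumerate(row):
            counts[trait][j] += g
    return {
        "variant_no_in_cases": counts["case"],
        "variant_no_in_controls": counts["control"],
        "no_of_cases": group_sizes["case"],
        "no_of_controls": group_sizes["control"],
    }

# The six subjects listed in the individual-level table.
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
traits = ["case", "case", "case", "control", "control", "control"]
summary = summarize(genotypes, traits)
print(summary)
```

Once the data are in this form, subject identities are no longer needed, which is what makes pooled sequencing or public control data usable.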

Existing Methods

Method    Description                                                                  Bi-directional effects   Ref.
EFT       Exclusive Frequency Test: testing mutually exclusive allele/carrier freq.   ×                        Commonly used in publications, such as Cohen et al., 2004
TFT       Total Frequency Test: testing total allele/carrier freq.                    ×                        Commonly used in publications, such as Cohen et al., 2004
CAST      Cohort Allele Sum Test: testing total allele/carrier number                 ×                        Morgenthaler & Thilly, 2006
C-alpha   testing variance                                                             √                        Neale et al., 2011

QQ Plots of Existing Methods (under the null)

[Figure: Q-Q plot panels for CAST, C-alpha, EFT, and TFT]

• EFT and C-alpha: inflated with false positives

• TFT and CAST: no inflation, but assuming a single effect direction

• Objective: more general, powerful methods …
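For readers who want to reproduce this kind of check, here is a minimal sketch of how Q-Q plot coordinates can be computed from p-values simulated under the null. The function name `qq_points` is hypothetical, and the plotting position (rank - 0.5)/n is an assumption; the slides do not say which convention was used.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10 p-value pairs for a Q-Q plot.

    Assumes all p-values are strictly positive. The expected quantile
    (rank - 0.5) / n is one common plotting position, used here as an
    assumption rather than the authors' documented choice.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points

# Perfectly uniform p-values fall exactly on the diagonal.
for expected, observed in qq_points([0.875, 0.125, 0.625, 0.375]):
    print(round(expected, 3), round(observed, 3))
```

Points lying above the diagonal indicate observed p-values smaller than expected under the null, i.e. inflation.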

Structure of Summarized Data

[Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]

Strategy

Instead of testing total freq./number, we test the randomness of all tables.

Exact Probability Test (EPT)

1. Calculating the probability of each table based on the hypergeometric distribution

$$P_i = \frac{C_{n_{i1}}^{a_{i1}} \, C_{n_{i2}}^{a_{i2}}}{C_{N_i}^{n_{iA}}}$$

2. Calculating the log-transformed joint probability (L) for all k tables

$$L = \sum_{i=1}^{k} \log(P_i)$$

3. Enumerating all possible tables and L scores

4. Calculating p-value P = Prob.( )
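The four EPT steps can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. `n1` and `n2` are taken to be the numbers of cases and controls, and each variant's total count is held fixed. The p-value is assumed to be the total probability of all table sets whose L is no larger than the observed L, because the argument of Prob.( ) is not shown on the slide.

```python
import math
from itertools import product

def table_prob(a1, n1, n2, n_a):
    """Step 1: hypergeometric probability of one table.

    a1 = variant count in cases, n_a = total variant count (fixed margin).
    """
    return math.comb(n1, a1) * math.comb(n2, n_a - a1) / math.comb(n1 + n2, n_a)

def ept_p_value(cases, controls, n1, n2):
    """Steps 2-4 by exhaustive enumeration (feasible only for small k)."""
    totals = [a + b for a, b in zip(cases, controls)]
    # Step 2: observed joint log-probability L over all k tables.
    l_obs = sum(math.log(table_prob(a, n1, n2, t)) for a, t in zip(cases, totals))
    # Step 3: for each variant, the probability of every table that is
    # possible with its margins fixed.
    per_variant = []
    for t in totals:
        lo, hi = max(0, t - n2), min(n1, t)
        per_variant.append([table_prob(a, n1, n2, t) for a in range(lo, hi + 1)])
    # Step 4 (assumed definition): add up the probability of every
    # combination of tables whose L does not exceed the observed L.
    p_value = 0.0
    for combo in product(*per_variant):
        l_score = sum(math.log(p) for p in combo)
        if l_score <= l_obs + 1e-9:  # small tolerance for floating-point ties
            p_value += math.prod(combo)
    return p_value

# Counts from the summarized-level example (V1, V2, V3; 300 cases, 500 controls).
print(ept_p_value([10, 8, 3], [2, 0, 1], 300, 500))
```

The number of combinations grows multiplicatively with k, which is consistent with the extra computer time attributed to EPT.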

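Likelihood Ratio Test (LRT)

$$LR = 2 \log \frac{\Pr(a_1, b_1, a_2, b_2, \ldots : H_1)}{\Pr(a_1, b_1, a_2, b_2, \ldots : H_0)} \sim \chi^2_{df}$$

where df depends on k, and each probability is a product over the k variants (i = 1, …, k) of binomial distribution terms.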
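A minimal sketch of a binomial likelihood ratio statistic of this kind is given below. Several details are assumptions, not taken from the slides. Under H0 each variant is assumed to have one frequency shared by cases and controls, and under H1 separate frequencies. The degrees of freedom are assumed to be k, one extra parameter per variant, and the slide's own df expression may differ. All function names are hypothetical. The chi-square tail uses the standard recurrence Q(a+1, t) = Q(a, t) + t^a e^{-t} / Γ(a+1), so only the standard library is needed.

```python
import math

def binom_loglik(x, n, p):
    """Binomial log-likelihood without the coefficient (it cancels in the ratio)."""
    ll = 0.0
    if x > 0:
        ll += x * math.log(p)
    if n - x > 0:
        ll += (n - x) * math.log(1.0 - p)
    return ll

def lr_statistic(cases, controls, n1, n2):
    """LR = 2 * (log-likelihood under H1 - log-likelihood under H0), summed over variants."""
    lr = 0.0
    for a, b in zip(cases, controls):
        p0 = (a + b) / (n1 + n2)  # H0: shared frequency
        ll0 = binom_loglik(a, n1, p0) + binom_loglik(b, n2, p0)
        ll1 = binom_loglik(a, n1, a / n1) + binom_loglik(b, n2, b / n2)  # H1
        lr += 2.0 * (ll1 - ll0)
    return lr

def chi2_sf(x, df):
    """Upper-tail chi-square probability for a positive integer df."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)               # Q(1, t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))    # Q(1/2, t)
    while a < df / 2.0:
        q += math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Counts from the summarized-level example; df = k = 3 is an assumption.
cases, controls = [10, 8, 3], [2, 0, 1]
lr = lr_statistic(cases, controls, 300, 500)
print(lr, chi2_sf(lr, df=len(cases)))
```

Unlike the enumeration needed for an exact test, this runs in time linear in k, which matches the lower computing cost noted for LRT.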

Q-Q Plots of EPT and LRT (under the null)

[Figure: Q-Q plot panels for EPT (N=500), EPT (N=3000), LRT (N=500), and LRT (N=3000)]

Power Comparison (significance level = 0.00001)

[Figure: power (y-axis) vs. sample size (x-axis) for each variant-proportion setting]

Variant proportion

Positive causal   Neutral   Negative causal
80%               20%       0%
60%               20%       20%
40%               20%       40%

Power Comparison: individual-level data vs. summarized data

N=1000, significance level=0.00001

[Figure: power (y-axis) vs. variant proportion positive : neutral : negative (%) (x-axis); methods labeled in the figure include CMC (Li & Leal, 2008) and SKAT (Wu et al., 2011)]

Application

[Figure: -log10 p-values of 933 cancer-related genes]

Cases: 460 ovarian cancer cases, germline exome data, from TCGA

Controls: ~3500 individuals, exome data, from NHLBI

Conclusions

• EFT and C-alpha produce inflated p-values.

• TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.

• EPT produces correct p-values and maintains power regardless of effect directions, but requires more computer time.

• LRT produces slightly biased p-values for small N, which improves with larger N; it has power similar to EPT and needs less computer time, making it a good alternative for large datasets.

• If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.

Acknowledgements

Dr. Li Ding

Charles Lu

Krishna-Latha Kanchi

(for providing the TCGA and NHLBI exome data)
