Genome-Wide Association Studies (GWAS)
Epidemiology 243: Molecular Epidemiology of Cancer
Spring 2008
Association Studies of Genetic Factors
1st generation: Very small studies (<100 cases). Usually not an epidemiologic study design; 1-2 SNPs
2nd generation: Small studies (100-500 cases). More epi focus; a few SNPs
3rd generation: Large molecular epi studies (>500 cases). Proper epi design; pathways
4th generation: Consortium-based pooled analyses (>2000 cases). GxE analyses
5th generation: Post-GWS studies
Boffetta, 2007
International Lung Cancer Consortium (ILCCO)
Investigators shown: Goodman, Thun, Benhamou, Chen, Berwick, Schwartz, Le Marchand, Kiyohara, McLaughlin, Zhang, Wiencke, Yang, Stucker, Boffetta, Spitz, Tajima, Risch, Brennan, Wichmann, Wild, Landi, Vineis, Harris, Christiani, Lan, Hong, Lazarus
Study designs:
  3 cohort studies
  17 population-based case-control studies
  13 hospital-based case-control studies
  2 studies with mixed controls
  1 cross-sectional study
Issues in genetic association studies
Many genes
  ~25,000 genes, many can be candidates
Many SNPs
  ~12,000,000 SNPs; ability to predict functional SNPs is limited
Methods to select SNPs:
  Only functional SNPs in a candidate gene
  Systematic screen of SNPs in a candidate gene
  Systematic screen of SNPs in an entire pathway
  Genome-wide screen
  Systematic screen for all coding changes
Introduction
A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease.
Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.
http://www.genome.gov/20019523
Definition of GWAS
A genome-wide association study is defined as any study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (such as blood pressure or weight), or the presence or absence of a disease (such as cancer) or condition.
Potential of GWAS
Whole genome information, when combined with epidemiological, clinical and other phenotype data, offers the potential for increased understanding of basic biological processes affecting human health, improvement in the prediction of disease and patient care, and ultimately the realization of the promise of personalized medicine.
In addition, rapid advances in understanding the patterns of human genetic variation and maturing high-throughput, cost-effective methods for genotyping are providing powerful research tools for identifying genetic variants that contribute to health and disease.
Selection of SNPs (Genome-wide association studies)
Molecular
  Higher requirements: Affymetrix and Illumina
Analytical
  Highest requirements: Data management, automation
Advantages
  No biological assumptions and can identify novel genes/pathways
  Excellent chance to identify risk alleles
  Utility in individual risk assessment
Disadvantages
  High costs
  Concern of multiple tests
SNP Selection
Affymetrix® Genome-Wide Human SNP Array
The new Affymetrix® Genome-Wide Human SNP Array 6.0 features 1.8 million genetic markers, including more than 906,600 single nucleotide polymorphisms (SNPs) and more than 946,000 probes for the detection of copy number variation. The SNP Array 6.0 represents more genetic variation on a single array than any other product, providing maximum panel power and the highest physical coverage of the genome.
The need for GWA
Current understanding of disease etiology is limited
  Therefore, candidate genes or pathways are insufficient
Current understanding of functional variants is limited
  Therefore, focusing on nonsynonymous changes is not sufficient
Results from linkage studies are often inconsistent and broad
  Therefore, the utility of identified linkage regions is limited
GWA studies offer an effective and objective approach
  Better chance to identify disease-associated variants
  Improve understanding of disease etiology
  Improve ability to test gene-gene interaction and predict disease risk
Xu JF, 2007
GWA is promising
Many diseases and traits are influenced by genetic factors
  i.e., they are caused by sequence variants in the genome
Over 12 million SNPs are known in the genome
  i.e., some SNPs will be directly or indirectly associated with causal variants
The cost of SNP genotyping is reduced
  i.e., it is affordable to genotype a large number of SNPs in the genome
Large numbers of cases and controls are available
  i.e., there is statistical power to detect variants with modest effect
When the above conditions are met…
  …associated SNPs will have different frequencies between cases and controls
GWA is challenging
Many diseases and traits are influenced by genetic factors
  But probably due to multiple modest risk variants
  They confer a stronger risk when they interact
  True associated SNPs are not necessarily highly significant
Too many SNPs are evaluated
  False positives due to multiple tests
Single studies tend to be underpowered
  False negatives
Considerable heterogeneity among studies
  Phenotypic and genetic heterogeneity
  False positives due to population stratification
Xu, 2007
Genome coverage
Two major platforms for GWA
  Illumina: HumanHap300, HumanHap550, and HumanHap1M
  Affymetrix: GeneChip 100K, 500K, 1M, and 2.3M
Genome-wide coverage
  The percentage of known SNPs in the genome that are in LD with the genotyped SNPs
  Calculated based on HapMap
  Calculated based on ENCODE
Xu, 2007
Strategies for pre-association analysis
Quality control
  Filter SNPs by genotype call rates
  Filter SNPs by minor allele frequencies
  Filter SNPs by testing for Hardy-Weinberg equilibrium
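The three quality-control filters can be sketched for a single SNP as follows. This is a minimal illustration, not the pipeline used in any particular study: genotypes are assumed to be coded 0/1/2 (copies of one allele) with None for a failed call, the function names are hypothetical, and the thresholds (call rate 0.95, MAF 0.01, HWE p-value 1e-4) are illustrative assumptions rather than values from the slides.

```python
import math

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype (None = failed call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def hwe_p_value(genotypes):
    """1-df chi-square test of Hardy-Weinberg equilibrium on called genotypes."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three filters in order; thresholds are illustrative assumptions."""
    if call_rate(genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(genotypes) < min_maf:
        return False
    return hwe_p_value(genotypes) >= min_hwe_p

in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25
print(passes_qc(in_equilibrium))
```

In case-control data the HWE test is commonly applied to controls only, since a true association can distort genotype proportions among cases.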
Data Analysis
Single SNP analysis using pre-specified genetic models
  2 x 3 table (2-df)
  Additive model (1-df), and test for additivity
  All possible genetic models (recessive, dominant)
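As a rough sketch of the first two tests, the code below computes the 2-df Pearson chi-square on a 2 x 3 genotype table and the 1-df Cochran-Armitage trend test with additive scores (0, 1, 2). The function names and the example counts are made up for illustration, and the code assumes every genotype column has at least one observation.

```python
import math

def genotype_chi2_2df(cases, controls):
    """Pearson chi-square on a 2 x 3 table of genotype counts (2 df)."""
    totals = [a + b for a, b in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    chi2 = 0.0
    for row, row_total in ((cases, n_cases), (controls, n_controls)):
        for observed, col_total in zip(row, totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of a chi-square with 2 df: P(X > x) = exp(-x / 2)
    return chi2, math.exp(-chi2 / 2)

def trend_chi2_1df(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test (1 df) under an additive model."""
    totals = [a + b for a, b in zip(cases, controls)]
    r = sum(cases)
    n = r + sum(controls)
    sum_rt = sum(t * c for t, c in zip(scores, cases))
    sum_nt = sum(t * m for t, m in zip(scores, totals))
    sum_nt2 = sum(t * t * m for t, m in zip(scores, totals))
    chi2 = (n * (n * sum_rt - r * sum_nt) ** 2
            / (r * (n - r) * (n * sum_nt2 - sum_nt ** 2)))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical genotype counts (AA, Aa, aa)
cases, controls = [10, 20, 30], [30, 20, 10]
print(genotype_chi2_2df(cases, controls))
print(trend_chi2_1df(cases, controls))
```

The difference between the 2-df statistic and the trend statistic is often used as a 1-df test of departure from additivity. Dominant or recessive models can be tested by collapsing two of the three genotype columns before running a 2 x 2 test.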
Data Analysis
Haplotype analysis
Gene-gene and gene-environment interactions
  Interaction with main effect
    Logistic regression
  Interaction without main effect: data mining
    Classification and regression trees (CART)
    Multifactor Dimensionality Reduction (MDR)
Sample size needs as a function of genotype prevalence and OR for main effects
Boffetta, 2007
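To give a feel for how sample size depends on genotype prevalence and OR, here is a sketch using the standard two-proportion normal approximation. It is not the calculation behind the cited figure: it assumes an unmatched design with one control per case, a binary carrier/non-carrier exposure, and a two-sided test, and the function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist

def cases_needed(p0, odds_ratio, alpha=0.05, power=0.80):
    """Approximate number of cases (with an equal number of controls).

    p0 is the carrier prevalence among controls; the prevalence among
    cases, p1, is derived from the odds ratio.
    """
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p0 + p1) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)

for prevalence in (0.05, 0.10, 0.30):
    for odds_ratio in (1.2, 1.5, 2.0):
        print(prevalence, odds_ratio, cases_needed(prevalence, odds_ratio))
```

Rare genotypes and modest ORs drive the required number of cases up quickly, and a genome-wide significance level (passed as alpha) raises it further.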
False Positives
False positives: too many dependent tests
Adjust for number of tests
  Bonferroni correction
    Nominal significance level = study-wide significance / number of tests
    Nominal significance level = 0.05/500,000 = 10^-7
  Effective number of tests
    Take LD into account
Permutation procedure
  Permute case-control status
  Mimic the actual analyses
  Obtain empirical distribution of maximum test statistic under null hypothesis
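The Bonferroni threshold and the permutation procedure can be sketched as below. The per-SNP statistic here (absolute difference in mean allele count between cases and controls) is a simple stand-in chosen for brevity, not the test from the slides; in practice the permuted analysis would use the same test as the real one. Function names, the toy data, and the number of permutations are assumptions.

```python
import random

# Bonferroni: study-wide level divided by the number of tests
bonferroni_threshold = 0.05 / 500_000

def max_statistic(genotypes_by_snp, status):
    """Largest per-SNP statistic across all SNPs (status: 1 = case, 0 = control)."""
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    best = 0.0
    for genotypes in genotypes_by_snp:
        case_sum = sum(g for g, s in zip(genotypes, status) if s == 1)
        control_sum = sum(genotypes) - case_sum
        best = max(best, abs(case_sum / n_cases - control_sum / n_controls))
    return best

def permutation_p_value(genotypes_by_snp, status, n_perm=1000, seed=0):
    """Study-wide p-value for the best SNP from the permuted maximum statistic."""
    rng = random.Random(seed)
    observed = max_statistic(genotypes_by_snp, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case-control status, keep genotypes fixed
        if max_statistic(genotypes_by_snp, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 10 cases then 10 controls, one strongly associated SNP and one null SNP
status = [1] * 10 + [0] * 10
snps = [[2] * 10 + [0] * 10, [0, 1] * 10]
print(bonferroni_threshold, permutation_p_value(snps, status, n_perm=200, seed=1))
```

Because genotypes stay attached to individuals while labels are shuffled, the LD between SNPs is preserved in every permutation, which is why this approach handles dependent tests without an explicit estimate of the effective number of tests.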
False Positives
False discovery rate (FDR)
  Expected proportion of false discoveries among all discoveries
  Offers more power than Bonferroni
  Holds under weak dependence of the tests
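One common way to control the FDR is the Benjamini-Hochberg step-up procedure; the slides do not name a specific method, so treat this as one possible instance. The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of tests declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is <= rank * q / m
        if p_values[index] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

With these ten p-values, Bonferroni (0.05/10 = 0.005) would keep only the first test, while the step-up rule also keeps the second, which illustrates the gain in power.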
False Positives
Bayesian approach
  Taking prior probability into account: False-Positive Report Probability (FPRP)
Confirmation in independent study populations
  The approach may limit the number of false positives
  Confirmation is needed to distinguish true from false positives
    Replication: examine the results from the 2nd stage only
    Joint analysis: combining data from the 1st stage with the 2nd stage
    Multiple stages
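For the Bayesian approach, the FPRP is usually written as alpha(1 - prior) / (alpha(1 - prior) + power x prior), where prior is the prior probability that the association is real. The sketch below evaluates that expression; the function name and the input values are illustrative assumptions.

```python
def fprp(alpha, power, prior):
    """False-positive report probability for a 'significant' finding.

    alpha: significance level (or observed p-value) of the test
    power: probability of detecting a true association of the assumed size
    prior: prior probability that the association is real
    """
    false_positive = alpha * (1 - prior)
    true_positive = power * prior
    return false_positive / (false_positive + true_positive)

# Same test result, very different priors (candidate gene vs. random SNP)
print(fprp(alpha=0.05, power=0.8, prior=0.5))
print(fprp(alpha=0.05, power=0.8, prior=0.001))
```

With a low prior, as for an arbitrary SNP in a genome-wide scan, a nominally significant result is far more likely to be false than true, which is the argument for confirmation in independent populations.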
Issues of GWAS
Population stratification
Multiple testing: false positives
Gene-environment interaction
High costs
Kingsmore, 2008
GWAS
Proposed GWAS of Lung Cancer among Non-smokers
Motives and Conceptual Framework For Study of Genetic Susceptibility to Lung Cancer among Non-smokers
About 16% of male smokers and 10% of female smokers will eventually develop lung cancer, which suggests that exposure to other environmental carcinogens and individual genetic susceptibility may play an important role in lung cancer among non-smokers.
It is suggested that 26% of lung cancers are associated with genetic susceptibility (Lichtenstein P, et al. NEJM, 2000).
We hypothesize that variation in genetic susceptibility, in the form of single nucleotide polymorphisms (SNPs) of genes in inflammation, DNA repair, and cell cycle control pathways, may be important in the development of lung cancer among non-smokers.
Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A, Hemminki K. NEJM, 2000
[Figure: Theoretical model of gene-gene/environmental interaction pathway for lung cancer. Diagram labels: exposures (tobacco consumption, occupational exposures, environmental exposure) to environmental carcinogens/procarcinogens (PAHs, xenobiotics, arene, alkine, etc.); active carcinogens and detoxified carcinogens; DNA damage (repaired, or not repaired with a defective DNA repair gene); cell cycle control (G0, G1, S, G2, M; loss of cell cycle control); outcomes (normal cell, programmed cell death, carcinogenesis). Genes shown: CYP1A1, GSTP1, GSTM1, mEH, NQO1, XRCC1, P53, Cyclin D1, P16. Polymorphisms shown: MspI, Ile462Val, Ile105Val, Ala114Val, Tyr113His, His139Arg, Pro187Ser, Arg194Trp, Arg399Gln, Arg280His, Null, Ala146Thr, Arg72Pro, G870A.]
500K SNP Coverage
Median intermarker distance: 3.3 kb
Mean intermarker distance: 5.4 kb
Average heterozygosity: 0.30
Average minor allele frequency: 0.22
SNPs in genes: 196,384
80% of genome within 10 kb of a SNP
Figure 1. The effects of SNPs on the Risk of Lung Cancer among Smokers and Non-smokers
[Bar chart. Y-axis: OR (0 to 8). X-axis: BRCA1, CHEK1, XRCC3, IFNG, IL-10, ALDH2. Series: Smokers, Non-Smokers, ETS Exp, Non ETS Exp.]
Hypothesis
The overall hypothesis is that multiple sequence variants in the genome are associated with the risk of lung cancer among non-smokers. Specifically, we hypothesize that a number of common nonsmoking lung cancer risk-modifying SNPs are in strong LD with the SNPs arrayed on the 500K GeneChip®.
Figure 2. Structure and Governance of ILCCO
[Organization chart: an Executive Committee and five working groups (DNA Repair, Familial Cases, Rare Histology, Young Onset, Nonsmokers), each with a coordinator and members.]
Specific Aims
Aim 1. To perform exploratory tests for association between 500K SNPs across the genome and lung cancer risk among 200 non-smoking lung cancer patients and 200 controls.
Aim 2. To perform first stage of confirmatory association tests between lung cancer risk and more than 1,000 SNPs implicated in Aim 1 among an independent set of 600 pairs of cases and controls.
Specific Aims
Aim 3. To perform second stage of confirmatory association tests between lung cancer risk and more than 500 SNPs that were replicated in Aim 2 among an additional 600 cases and 600 controls. Additional SNPs will also be added from our ongoing pathway specific analyses of DNA repair, cell cycle regulation, inflammation and metabolic pathways based on non-smokers in our lung cancer study.
Aim 4. To perform fine mapping association studies in the flanking regions of each of the 30-100 SNPs confirmed in Aim 3 among the entire 1,400 cases and 1,400 controls. The large number of cases with non-smoking lung cancer in this study population also allows us to identify SNPs that are associated with risk of the disease among nonsmokers.
Specific Aims
Aim 5. To explore the generalizability of the SNPs identified in Specific Aims 1-4 within a Chinese population of 600 nonsmoking lung cancer cases and 600 nonsmoking controls. The relatively homogeneous Chinese population not only allows us to further confirm the associations, but also improves our ability to finely map the SNPs associated with lung cancer risk among non-smokers.
Discussion: Costs
Affy 500K SNP chip: $1,000/case
  2000 x $1000 = $2M
  1000 x $1000 = $1M
  500 x $1000 = $0.5M
Follow-up genotyping at $0.15 per SNP
  500 x 3000 (SNP) x $0.15 = $225,000
  500 x 30 (SNP) x $0.15 = $2,250
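The cost arithmetic above can be reproduced with a few lines. The two unit prices come from the slide; the function and constant names are hypothetical.

```python
CHIP_COST_PER_SAMPLE = 1000   # dollars per sample on the 500K chip (from the slide)
COST_PER_GENOTYPE = 0.15      # dollars per SNP per sample in follow-up (from the slide)

def chip_cost(n_samples):
    """Cost of genome-wide genotyping for n_samples."""
    return n_samples * CHIP_COST_PER_SAMPLE

def followup_cost(n_samples, n_snps):
    """Cost of genotyping a selected set of SNPs in n_samples."""
    return n_samples * n_snps * COST_PER_GENOTYPE

for n in (2000, 1000, 500):
    print(n, "samples on chip:", chip_cost(n))
print("500 samples x 3000 SNPs:", followup_cost(500, 3000))
print("500 samples x 30 SNPs:", followup_cost(500, 30))
```

The comparison shows why staged designs are attractive: genotyping a few thousand selected SNPs in a follow-up set costs far less than running the full chip on the same number of samples.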