Genome-Wide Association Studies (GWAS)
Epidemiology 243: Molecular Epidemiology of Cancer
Spring 2008


Association Studies of Genetic Factors

1st generation: Very small studies (<100 cases). Usually not epidemiologic study design; 1-2 SNPs
2nd generation: Small studies (100-500 cases). More epi focus; a few SNPs
3rd generation: Large molecular epi studies (>500 cases). Proper epi design; pathways
4th generation: Consortium-based pooled analyses (>2000 cases). GxE analyses
5th generation: Post-GWAS studies

Boffetta, 2007

International Lung Cancer Consortium (ILCCO)

Investigators: Goodman, Thun, Benhamou, Chen, Berwick, Schwartz, Le Marchand, Kiyohara, McLaughlin, Zhang, Wiencke, Yang, Stucker, Boffetta, Spitz, Tajima, Risch, Brennan, Wichmann, Wild, Landi, Vineis, Harris, Christiani, Lan, Hong, Lazarus

Study designs:
  3 cohort studies
  17 population-based case-control studies
  13 hospital-based case-control studies
  2 studies with mixed controls
  1 cross-sectional study

Issues in genetic association studies

Many genes
  ~25,000 genes, many can be candidates
Many SNPs
  ~12,000,000 SNPs; ability to predict functional SNPs is limited
Methods to select SNPs:
  Only functional SNPs in a candidate gene
  Systematic screen of SNPs in a candidate gene
  Systematic screen of SNPs in an entire pathway
  Genomewide screen
  Systematic screen for all coding changes

Introduction

A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease.

Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.

http://www.genome.gov/20019523

Definition of GWAS

A genome-wide association study is defined as any study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (such as blood pressure or weight), or the presence or absence of a disease (such as cancer) or condition.

Potential of GWAS

Whole genome information, when combined with epidemiological, clinical and other phenotype data, offers the potential for increased understanding of basic biological processes affecting human health, improvement in the prediction of disease and patient care, and ultimately the realization of the promise of personalized medicine.

In addition, rapid advances in understanding the patterns of human genetic variation and maturing high-throughput, cost-effective methods for genotyping are providing powerful research tools for identifying genetic variants that contribute to health and disease.

Potential of GWAS

Selection of SNPs (genome-wide association studies)
Molecular
  Higher requirements: Affymetrix and Illumina
Analytical
  Highest requirements: Data management, automation
Advantages
  No biological assumptions and can identify novel genes/pathways
  Excellent chance to identify risk alleles
  Utility in individual risk assessment
Disadvantages
  High costs
  Concern of multiple tests

SNP Selection


Affymetrix® Genome-Wide Human SNP Array

The new Affymetrix® Genome-Wide Human SNP Array 6.0 features 1.8 million genetic markers, including more than 906,600 single nucleotide polymorphisms (SNPs) and more than 946,000 probes for the detection of copy number variation. The SNP Array 6.0 represents more genetic variation on a single array than any other product, providing maximum panel power and the highest physical coverage of the genome.

The need for GWA

Current understanding of disease etiology is limited
  Therefore, candidate genes or pathways are insufficient
Current understanding of functional variants is limited
  Therefore, focusing on nonsynonymous changes is not sufficient
Results from linkage studies are often inconsistent and broad
  Therefore, the utility of identified linkage regions is limited
GWA studies offer an effective and objective approach
  Better chance to identify disease-associated variants
  Improve understanding of disease etiology
  Improve ability to test gene-gene interaction and predict disease risk

Xu JF, 2007

GWA is promising

Many diseases and traits are influenced by genetic factors
  i.e., they are caused by sequence variants in the genome
Over 12 million SNPs are known in the genome
  i.e., some SNPs will be directly or indirectly associated with causal variants
The cost of SNP genotyping is reduced
  i.e., it is affordable to genotype a large number of SNPs in the genome
Large numbers of cases and controls are available
  i.e., there is statistical power to detect variants with modest effect
When the above conditions are met…
  …associated SNPs will have different frequencies between cases and controls

GWA is challenging

Many diseases and traits are influenced by genetic factors
  But probably due to multiple modest risk variants
  They confer a stronger risk when they interact
  True associated SNPs are not necessarily highly significant
Too many SNPs are evaluated
  False positives due to multiple tests
Single studies tend to be underpowered
  False negatives
Considerable heterogeneity among studies
  Phenotypic and genetic heterogeneity
  False positives due to population stratification

Xu, 2007

Genome coverage

Two major platforms for GWA
  Illumina: HumanHap300, HumanHap550, and HumanHap1M
  Affymetrix: GeneChip 100K, 500K, 1M, and 2.3M
Genome-wide coverage
  The percentage of known SNPs in the genome that are in LD with the genotyped SNPs
  Calculated based on HapMap
  Calculated based on ENCODE

Xu, 2007

Strategies for pre-association analysis

Quality control
  Filter SNPs by genotype call rates
  Filter SNPs by minor allele frequencies
  Filter SNPs by testing for Hardy-Weinberg Equilibrium
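The three quality-control filters can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in any cited study. It assumes genotypes are coded as 0, 1 or 2 copies of one allele with None for a failed call, and the thresholds (95% call rate, 1% minor allele frequency, HWE p-value of 1e-4) are hypothetical defaults chosen only for the example. The HWE check is a 1-df chi-square goodness-of-fit test; in practice it is often applied to controls only.

```python
import math

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes (coded 0/1/2)."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def hwe_p_value(genotypes):
    """1-df chi-square goodness-of-fit p-value for Hardy-Weinberg proportions."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of the counted allele
    p = 1 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three filters in order; thresholds are illustrative assumptions."""
    if call_rate(genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(genotypes) < min_maf:
        return False
    return hwe_p_value(genotypes) >= min_hwe_p

good_snp = [0] * 25 + [1] * 50 + [2] * 25      # exact Hardy-Weinberg proportions
low_call_snp = [None] * 10 + good_snp[:90]     # 10% missing calls
all_het_snp = [1] * 100                        # extreme excess of heterozygotes
print(passes_qc(good_snp), passes_qc(low_call_snp), passes_qc(all_het_snp))
```

The filters run in a fixed order so that a SNP with no minor allele is rejected before the HWE test is attempted.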

Data Analysis

Single SNP analysis using pre-specified genetic models
  2 x 3 table (2-df)
  Additive model (1-df), and test for additivity
  All possible genetic models (recessive, dominant)
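As an illustration of the single-SNP tests listed above, the sketch below computes the 2-df Pearson chi-square on a 2 x 3 genotype table and a 1-df Cochran-Armitage trend test for the additive model. The slides do not name a specific trend test, so the Cochran-Armitage choice is an assumption, and the genotype counts are made up for the example.

```python
import math

def genotype_chi2(cases, controls):
    """Pearson chi-square on the 2 x 3 table; returns (statistic, p-value).

    The 2-df p-value assumes all three genotype columns are non-empty.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    chi2 = 0.0
    for r, s in zip(cases, controls):
        column = r + s
        if column == 0:
            continue
        for observed, row_total in ((r, n_cases), (s, n_controls)):
            expected = row_total * column / total
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of a chi-square with 2 df.
    return chi2, math.exp(-chi2 / 2)

def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test (1 df); returns (statistic, p-value)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n, n_cases = sum(totals), sum(cases)
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    chi2 = n * (n * sum_wr - n_cases * sum_wn) ** 2 / (
        n_cases * (n - n_cases) * (n * sum_w2n - sum_wn ** 2))
    # Survival function of a chi-square with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical genotype counts (AA, Aa, aa) for illustration only.
example_cases = [30, 50, 20]
example_controls = [50, 40, 10]
print(genotype_chi2(example_cases, example_controls))
print(trend_chi2(example_cases, example_controls))
```

Passing weights=(0, 1, 1) or weights=(0, 0, 1) to trend_chi2 gives the dominant and recessive models for the counted allele. One common way to test additivity, under the same assumptions, is to refer the difference between the 2-df and 1-df statistics to a chi-square with 1 df.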

Data Analysis

Haplotype analysis
Gene-gene and gene-environment interactions
  Interaction with main effect
    Logistic regression
  Interaction without main effect: data mining
    Classification and regression tree (CART)
    Multifactor Dimensionality Reduction (MDR)

Sample size needs as a function of genotype prevalence and OR for main effects

Boffetta, 2007
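To make the dependence of sample size on genotype prevalence and OR concrete, here is a rough sketch using a standard normal-approximation formula for an unmatched case-control study with equal numbers of cases and controls. It is not necessarily the method behind the cited figure. The function names, the two-sided alpha of 0.05 and the 80% power are assumptions for illustration.

```python
import math
from statistics import NormalDist

def prevalence_in_cases(p0, odds_ratio):
    """Genotype prevalence among cases implied by prevalence p0 in controls and the OR."""
    return odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))

def cases_needed(p0, odds_ratio, alpha=0.05, power=0.80):
    """Approximate number of cases (with an equal number of controls).

    Assumes odds_ratio != 1 and a two-sided test at level alpha.
    """
    p1 = prevalence_in_cases(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)

# Rarer genotypes and weaker effects both push the required sample size up.
for prevalence in (0.05, 0.30):
    for odds_ratio in (1.2, 1.5, 2.0):
        print(prevalence, odds_ratio, cases_needed(prevalence, odds_ratio))
```

The loop only tabulates a few hypothetical combinations; a GWAS would also need a far smaller alpha to account for multiple testing, which increases these numbers further.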

False Positives

False positives: too many dependent tests
Adjust for number of tests
  Bonferroni correction
    Nominal significance level = study-wide significance / number of tests
    Nominal significance level = 0.05/500,000 = 10^-7
  Effective number of tests
    Take LD into account
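The Bonferroni arithmetic above is simple enough to check directly. The sketch below reproduces the 0.05/500,000 calculation and shows how a smaller effective number of tests relaxes the threshold. The value of 250,000 effective tests is a hypothetical number used only to illustrate the idea, not an estimate for any real array.

```python
def bonferroni_threshold(study_wide_alpha, n_tests):
    """Nominal per-test significance level under Bonferroni correction."""
    return study_wide_alpha / n_tests

# Value from the slide: 500,000 SNPs tested at a study-wide level of 0.05.
print(f"{bonferroni_threshold(0.05, 500_000):.1e}")  # prints 1.0e-07

# Hypothetical effective number of independent tests after accounting for LD.
print(f"{bonferroni_threshold(0.05, 250_000):.1e}")  # prints 2.0e-07
```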

Permutation procedure
  Permute case-control status
  Mimic the actual analyses
  Obtain empirical distribution of maximum test statistic under null hypothesis
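The permutation procedure can be sketched as follows. This is a toy illustration under stated assumptions: genotypes are coded 0/1/2, case-control status is coded 1/0, the per-SNP statistic is a Cochran-Armitage trend chi-square (any statistic used in the real analysis could be substituted), and the function names and the default of 1,000 permutations are hypothetical.

```python
import random

def trend_chi2(genotypes, labels):
    """Cochran-Armitage trend chi-square (1 df) for per-subject data."""
    n, n_cases = len(labels), sum(labels)
    sum_g = sum(genotypes)
    sum_g2 = sum(g * g for g in genotypes)
    sum_g_cases = sum(g for g, y in zip(genotypes, labels) if y == 1)
    denominator = n_cases * (n - n_cases) * (n * sum_g2 - sum_g ** 2)
    if denominator == 0:
        return 0.0  # monomorphic SNP or a single outcome class
    return n * (n * sum_g_cases - n_cases * sum_g) ** 2 / denominator

def max_statistic_permutation(snps, labels, n_permutations=1000, seed=0):
    """Return observed statistics and multiplicity-adjusted empirical p-values.

    Each permutation shuffles case-control status, repeats the same per-SNP
    analysis, and records the maximum statistic across all SNPs.
    """
    rng = random.Random(seed)
    observed = [trend_chi2(g, labels) for g in snps]
    shuffled = list(labels)
    null_max = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_max.append(max(trend_chi2(g, shuffled) for g in snps))
    adjusted = [(1 + sum(m >= stat for m in null_max)) / (1 + n_permutations)
                for stat in observed]
    return observed, adjusted

# Toy data: 20 cases followed by 20 controls, two SNPs.
toy_labels = [1] * 20 + [0] * 20
toy_snps = [
    [2] * 20 + [0] * 20,   # perfectly associated with status
    [0, 1, 2, 1] * 10,     # identical distribution in cases and controls
]
toy_observed, toy_adjusted = max_statistic_permutation(toy_snps, toy_labels, 200)
print(toy_adjusted)
```

Because the genotype matrix is left intact and only the labels are shuffled, the LD structure among SNPs is preserved in every permuted data set.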

False Positives

False discovery rate (FDR)
  Expected proportion of false discoveries among all discoveries
  Offers more power than Bonferroni
  Holds under weak dependence of the tests
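One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The slides do not say which FDR method was intended, so treating it as Benjamini-Hochberg is an assumption. The sketch below compares it with Bonferroni on ten made-up p-values.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking discoveries at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Return a list of booleans marking tests significant after Bonferroni."""
    return [p <= alpha / len(p_values) for p in p_values]

# Hypothetical p-values for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(benjamini_hochberg(example_p)), sum(bonferroni(example_p)))  # prints 2 1
```

In this example the FDR procedure keeps two findings where Bonferroni keeps one, which is the gain in power mentioned above.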

False Positives

Bayesian approach
  Taking the prior probability into account: False-Positive Report Probability (FPRP)
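The FPRP is usually written as alpha(1 - prior) / (alpha(1 - prior) + power x prior), where alpha is the significance level (or observed p-value), power is the probability of detecting a true association, and prior is the prior probability that the association is real. A minimal sketch with hypothetical input values:

```python
def fprp(alpha, power, prior):
    """False-positive report probability for one significant finding."""
    false_positive_mass = alpha * (1 - prior)
    true_positive_mass = power * prior
    return false_positive_mass / (false_positive_mass + true_positive_mass)

# Hypothetical inputs: the same nominal result under different prior probabilities.
for prior in (0.5, 0.01, 0.0001):
    print(prior, round(fprp(alpha=0.05, power=0.8, prior=prior), 4))

# A genome-wide threshold keeps the FPRP low even with a very small prior.
print(round(fprp(alpha=1e-7, power=0.8, prior=0.0001), 4))
```

The loop shows why a nominal p-value of 0.05 is weak evidence for a randomly chosen SNP: as the prior shrinks, almost every such report is a false positive.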

Confirmation in independent study populations
  The approach may limit the number of false positives
  Confirmation is needed to dissect true from false positives
    Replication: examine the results from the 2nd stage only
    Joint analysis: combining data from 1st stage with 2nd stage
    Multiple stages

Issues of GWAS

Population stratification
Multiple Testing: False Positives
Gene-Environmental Interaction
High Costs

Kingsmore, 2008


GWAS

Proposed GWAS of Lung Cancer among Non-smokers

Motives and Conceptual Framework For Study of Genetic Susceptibility to Lung Cancer among Non-smokers

About 16% of male smokers and 10% of female smokers will eventually develop lung cancer, which suggests that exposure to other environmental carcinogens and individual genetic susceptibility may play an important role in lung cancer among non-smokers.

It is suggested that 26% of lung cancers are associated with genetic susceptibility (Lichtenstein P, et al. NEJM, 2000).

We hypothesize that variation in genetic susceptibility, or single nucleotide polymorphisms (SNPs) of genes in the inflammation, DNA repair, and cell cycle control pathways, may be important in the development of lung cancer among non-smokers.

Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A, Hemminki K. NEJM, 2000

Theoretical model of gene-gene/environmental interaction pathway for lung cancer

[Figure: pathway diagram. Labels: Tobacco consumption; Occupational Exposures; Environmental Exposure; Environmental Carcinogens / Procarcinogens Exposures (PAHs, Xenobiotics, Arene, Alkine, etc); Active carcinogens; Detoxified carcinogens; DNA Damage; DNA damage repaired; If DNA damage not repaired; Defective DNA repair gene; If loose cell cycle control; Normal cell; Programmed cell death; Carcinogenesis; cell cycle phases G0, G1, S, G2, M.
Genes shown: CYP1A1, GSTM1, GSTP1, mEH, NQO1, XRCC1, P53, P16, Cyclin D1.
Polymorphisms shown: MspI, Ile462Val, Null, Ile105Val, Ala114Val, Tyr113His, His139Arg, Pro187Ser, Arg194Trp, Arg399Gln, Arg280His, Ala146Thr, Arg72Pro, G870A.]

500K SNP Coverage
  Median intermarker distance: 3.3 kb
  Mean intermarker distance: 5.4 kb
  Average heterozygosity: 0.30
  Average minor allele frequency: 0.22
  SNPs in genes: 196,384
  80% of genome within 10 kb of a SNP

Figure 1. The effects of SNPs on the Risk of Lung Cancer among Smokers and Non-smokers

[Bar chart. Y-axis: OR (0 to 8). X-axis: BRCA1, CHEK1, XRCC3, INFG, IL-10, ALDH2. Series: Smokers, Non-Smokers, ETS Exp, Non ETS Exp.]

Hypothesis

The overall hypothesis is that multiple sequence variants in the genome are associated with the risk of lung cancer among non-smokers. Specifically, we hypothesize that a number of common nonsmoking lung cancer risk-modifying SNPs are in strong LD with the SNPs arrayed on the 500K GeneChip®.

Figure 2. Structure and Governance of ILCCO

[Organization chart. Executive Committee at the top; five working groups, each with a coordinator and members: DNA Repair, Familial Cases, Rare Histology, Young Onset, and Nonsmokers.]

Specific Aims

Aim 1. To perform exploratory tests for association between 500K SNPs across the genome and lung cancer risk among 200 non-smoking lung cancer patients and 200 controls.

Aim 2. To perform first stage of confirmatory association tests between lung cancer risk and more than 1,000 SNPs implicated in Aim 1 among an independent set of 600 pairs of cases and controls.

Specific Aims

Aim 3. To perform second stage of confirmatory association tests between lung cancer risk and more than 500 SNPs that were replicated in Aim 2 among an additional 600 cases and 600 controls. Additional SNPs will also be added from our ongoing pathway specific analyses of DNA repair, cell cycle regulation, inflammation and metabolic pathways based on non-smokers in our lung cancer study.

Aim 4. To perform fine mapping association studies in the flanking regions of each of the 30-100 SNPs confirmed in Aim 3 among the entire 1,400 cases and 1,400 controls. The large number of cases with non-smoking lung cancer in this study population also allows us to identify SNPs that are associated with risk of the disease among nonsmokers.

Specific Aims

Aim 5. To explore the generalizability of the SNPs identified in Specific Aims 1-4 within a Chinese population of 600 nonsmoking lung cancer cases and 600 nonsmoking controls. The relatively homogeneous Chinese population not only allows us to further confirm the associations, but also improves our ability to finely map the SNPs associated with lung cancer risk among non-smokers.

Discussion: Costs

Affy 500K SNP chip: $1,000/case
  2,000 x $1,000 = $2M
  1,000 x $1,000 = $1M
  500 x $1,000 = $0.5M
500 x 3,000 (SNPs) x $0.15 = $225,000
500 x 30 (SNPs) x $0.15 = $2,250
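The cost figures above follow from two unit prices given on the slide: $1,000 per 500K chip and $0.15 per SNP genotype. The short sketch below reproduces the arithmetic; the function names are made up for the example, and the per-genotype price is held in cents to avoid floating-point rounding.

```python
CHIP_COST_DOLLARS = 1000    # Affymetrix 500K chip, per subject (from the slide)
GENOTYPE_COST_CENTS = 15    # $0.15 per SNP per subject (from the slide)

def chip_cost(n_subjects):
    """Cost in dollars of genome-wide genotyping for n_subjects."""
    return n_subjects * CHIP_COST_DOLLARS

def followup_cost(n_subjects, n_snps):
    """Cost in dollars of genotyping n_snps selected SNPs in n_subjects."""
    return n_subjects * n_snps * GENOTYPE_COST_CENTS // 100

for n_subjects in (2000, 1000, 500):
    print(f"{n_subjects} chips: ${chip_cost(n_subjects):,}")
print(f"500 subjects x 3,000 SNPs: ${followup_cost(500, 3000):,}")
print(f"500 subjects x 30 SNPs: ${followup_cost(500, 30):,}")
```

The comparison shows why staged designs are attractive: following up a few thousand SNPs costs a fraction of running the full chip on every subject.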
