Report- Genome wide association studies

University of Agricultural Sciences, Department of Genetics and Plant Breeding

GKVK, Bangalore-65

PG Seminar: GPB 581(0+1)

On

Genome wide association studies.

Submitted By: Varsha Gayatonde

Sr.MSc, PALB 2235

Dept. of Genetics & Plant Breeding

UAS, GKVK, Bangalore

Submitted To: Dr. R.Nandini

Associate Professor

Dept. of Genetics & Plant Breeding

UAS, GKVK, Bangalore

Department of Genetics and Plant Breeding, University of Agricultural Sciences

GKVK, Bangalore-560065

Contents

SI. No. Title Content

1 Introduction to mapping.

2 Terminologies

3 Brief History

4 Association mapping

5 Comparison of GWAS and Biparental mapping

6 Concept of Linkage disequilibrium(LD)

7 Factors affecting LD and use in plant system.

8 Genome wide association studies.

9 Methodologies.

10 Challenges while conducting GWAS

11 Advantages and disadvantages

12 GWAS studies in Arabidopsis

13 Studies on Rice

14 Maize smut studies

15 GWAS studies on MYB related traits and in other crops.

16 Current association challenges

18 Conclusion

19 References

DEPARTMENT OF GENETICS AND PLANT BREEDING, GKVK, UNIVERSITY OF AGRICULTURAL SCIENCES, BANGALORE – 560 065

First Seminar: GPB 581 (0+1)

GENOME WIDE ASSOCIATION STUDIES

Synopsis

A genome-wide association study (GWAS) is a study design in which many markers spread across the genome are genotyped and tests of statistical association with a phenotype are performed locally along the genome. It is also described as an examination of many common genetic variants in different individuals to see whether any variant is associated with a trait.

The first prospects for whole-genome association studies began in early 2002¹. This LD-based association mapping started in humans and was later applied to Arabidopsis, rice, grapevine, wheat, soybean, maize, tomato and other model organisms. HapMap, the multi-country effort to identify and catalog common human genetic variants, was a milestone that helped extend the approach to other organisms. To make GWAS powerful, SNPs need to be chosen so that they are widely distributed and reflect the genetic variation. Selecting suitable markers enables fine mapping², and genome-wide chips have increased marker coverage, improving association signals. However, this does not necessarily imply increased power to detect associated loci. Other drawbacks are the need for a large population size, the pooling and cost of preparing DNA samples, and limited prior knowledge about the risk of a trait.

To overcome these drawbacks, researchers have recently improved the statistical approaches and the imputation of genotypes, and have developed advanced approaches such as nested association mapping, the candidate gene association approach, and a web application for GWAS (GWAPP in Arabidopsis)³. Despite its drawbacks, GWAS remains popular because of falling genotyping costs, which are likely to drive association studies away from candidate-gene-based studies. Future studies will likely involve whole-genome resequencing of all the individuals in a population, which will allow assessment of the effects of point mutations, insertions, deletions and large structural variations.

Many studies using GWAS as a tool have been conducted worldwide on topics such as the effect of temperature on cobs, agronomic variants, agroclimatic diversity, flowering and grain yield traits, and disease diversity. A major benefit of GWAS is that one-time genotyping combined with repeated phenotyping in different environmental conditions helps to study 'n' number of traits within a short period over a large area. With the rapid development of high-throughput sequencing technology, the choice of population for GWAS will no longer be restricted to current model organisms and will gradually become more focused on which species are most relevant for answering biological questions.

References:

1. Ozoki, K., 2001, A high throughput SNP typing system for GWAS, Springer. 16:1134-1137.

2. Tohn, P, A., 2009, Validating and refining GWAS signals, Nature. 10:318-329.

Name: VARSHA ID:PALB 2235

Date: 29/10/2013 Time: 10:00 AM

Introduction

The level of genetic diversity is pivotal for world food security and the survival of human civilization on earth. Domestication has produced improved cultivars in several crops to supply food for the human diet. Of the roughly 150 plant species presently cultivated in agriculture, twelve provide about 75% of human food and four produce 50% of the human diet. According to an FAO report, ∼800 million people suffer from food deficiency. Attention must therefore be given to improving agricultural production to eliminate or, at least, reduce these feeding problems.

The narrow genetic base of modern crop cultivars is a serious obstacle to sustaining and improving crop productivity, because it leaves crops vulnerable to potentially new biotic and abiotic stresses. Plant germplasm resources, comprising wild plant species, modern cultivars, and their crop wild relatives, are important reservoirs of natural genetic variation that originated from a number of historical genetic events in response to environmental stresses and selection through crop domestication.

• The objective of genetic mapping is to identify simply inherited markers in close proximity to genetic factors affecting quantitative traits (Quantitative trait loci, or QTL).

• This localization relies on processes that create a statistical association between marker and QTL alleles and processes that selectively reduce that association as a function of the marker distance from the QTL.

Why do we need genome mapping?

Gene mapping produces a map of the genes present on our chromosomes. In eukaryotes, genes are packed tightly inside a compact system. We need to know which gene is responsible for the trait of interest. The expression of genotypes gives us phenotypes. We cannot observe genes and genotypes directly, so we have to infer them mathematically. The further extension of mapping technology aims to characterize traits more easily, more cheaply and within a shorter period of time.

Genome

• The genome is all the DNA in a cell: all the DNA on all the chromosomes, including genes, intergenic sequences and repeats. More specifically, the term can also refer to all the DNA in an organelle.

• Eukaryotes can have 2-3 genomes: the nuclear genome, the mitochondrial genome and the plastid genome. If not specified, “genome” usually refers to the nuclear genome.

Terminologies

False negative: the declaration of an outcome as statistically non-significant, when the effect is actually genuine.

False positive: the declaration of an outcome as statistically significant, when there is no true effect.

Linkage: refers to coinheritance of different loci within a genetic distance on the chromosome.

Linkage equilibrium: LE is a random association of alleles at different loci and equals the product of allele frequencies within haplotypes.

Linkage disequilibrium: LD is a non-random association of alleles at different loci, describing the condition with non-equal frequency of haplotypes in a population.

Minor allele frequency (MAF): the frequency of the less common allele of a polymorphic locus. Its value lies between 0 and 0.5 and can vary between populations.

Odds ratio (OR): a measure of association that is commonly used in case-control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with that in controls. If the OR is significantly greater than 1, the genetic variant is associated with the disease.

Association Mapping

Association mapping is a high-resolution method for mapping quantitative trait loci based on linkage disequilibrium. Association refers to the covariance of a marker polymorphism and a trait of interest.

The first association study to attempt a genome scan in plants was conducted in sea beet (Beta vulgaris ssp. maritima), a wild relative of sugar beet (Beta vulgaris ssp. vulgaris). The first association study of a quantitative trait based on a candidate gene was the analysis of flowering time and the dwarf8 (d8) gene in maize. Association mapping is based on the principle of linkage disequilibrium (LD) and uses the entire population.

How does it work?

A group of unrelated individuals normally shows variation for many phenotypic aspects, so several traits can be studied in the same population using the same genotypic data. A higher proportion of molecular markers is likely to be polymorphic, providing better genome coverage than any biparental map. When elite lines are used for the study, multi-year and multi-location phenotypic data may be available at no additional cost.

Goal of association mapping

The goals are identification of susceptibility variants, replication in a different cohort/population, and understanding of genetic function at the cellular level. This can lead to identification of durable targets, development of drugs for prevention, and a better understanding of the cellular processes that are involved in disease treatments.

Association mapping offers many advantages over linkage analysis:

• much higher mapping resolution;

• greater allele number and broader reference population;

• less research time in establishing an association

• Utilizes existing individuals.

• Multi-trial phenotypic data stored in databases can be used.

Limitations

• Resources for phenotyping and statistical issues.

• Population structure results in spurious associations.

Two types of association mapping:

• Genome-wide scanning: markers spanning the whole genome are used; suited to moderate to extensive LD. If LD is high, GWA is useful but gives low-resolution mapping.

• Candidate gene scanning: only candidate genes are sequenced; suited to low LD.

• The success of either method depends on population size and the degree of LD.

Flowchart of a gene association study


Comparison of GWAS and biparental mapping

1) GWAS: no cross required, works with existing germplasm. Biparental mapping: experimental cross required.

2) GWAS: phenotypic data may already be available. Biparental mapping: phenotypes have to be collected.

3) GWAS: high resolution. Biparental mapping: limited mapping resolution.

4) GWAS: more than 2 alleles are tested. Biparental mapping: essentially 2 alleles are tested.

5) GWAS: many loci for a single trait are analyzed concurrently. Biparental mapping: constrained to loci segregating between the parental lines.

6) GWAS: comparatively low detection power. Biparental mapping: high detection power.

What are genome-wide association studies?

Genome-wide association studies are a relatively new way for scientists to identify genes involved in human disease. This method searches the genome for small variations, called single nucleotide polymorphisms or SNPs (pronounced “snips”), that occur more frequently in people with a particular disease than in people without the disease. Each study can look at hundreds or thousands of SNPs at the same time. Researchers use data from this type of study to pinpoint genes that may contribute to a person’s risk of developing a certain disease.

Because genome-wide association studies examine SNPs across the genome, they represent a promising way to study complex, common diseases in which many genetic variations contribute to a person’s risk. This approach has already identified SNPs related to several complex conditions including diabetes, heart abnormalities, Parkinson disease, and Crohn disease. Researchers hope that future genome-wide association studies will identify more SNPs associated with chronic diseases, as well as variations that affect a person’s response to certain drugs and influence interactions between a person’s genes and the environment.

Synonyms

Genome-wide case-control studies; Genome-wide genetic association analysis.

Genome-wide association studies (GWAS) are projects that investigate the statistical association between phenotypes and a dense set of genetic markers that capture a substantial amount of the genetic variation in the genome, using a large number of matched samples.

Phenotypes can be qualitative traits such as disease status or quantitative traits such as blood pressure. Statistical association between disease status and the alleles of a genetic marker is tested by categorical data analysis.

Genetic markers are usually genotyped using microarray chips. Whether a substantial part of the genetic variation in the genome, including common, rare, and structural variations, is captured by the set of markers depends on the number of markers and their chromosomal locations.

The typical number of single nucleotide polymorphism (SNP) markers used in a current GWAS,

societies depends on the exploitation of genetic recombination and allelic diversity for crop improvement, and many of the world’s farmers depend directly on the harvests of the genetic diversity they sow for food and fodder as well as the next season’s seed (Smale et al., 2004).

The considerable genetic diversity of traditional varieties of crops is the most immediately useful and economically valuable part of global biodiversity. Subsistence farmers use landraces as a key component of their cropping systems. Such farmers account for about 60% of agricultural land use and provide approximately 15-20% of the world’s food (Francis, 1986). In addition, landraces are the basic raw materials used by plant breeders for developing modern varieties. Over the last few decades, awareness of the rich diversity of exotic or wild germplasm has increased. This has led to a more intensive use of this germplasm in breeding, and yields of many crops have thereby increased dramatically.

Aim: to identify which regions (or SNPs) in the genome are associated with a disease or a certain phenotype.

Design: identify the population structure; select case subjects (those with the disease); select control subjects (healthy); genotype a million SNPs for each subject; determine which SNPs are associated; encode the data; rank the SNPs.

History of GWAS

A successful GWAS was published in 2005, investigating patients with age-related macular degeneration.

Prior to GWAS, around 2000, inheritance was studied through linkage analysis in families. The revolution then came with HapMap in 2003, which used a variety of sequencing techniques to discover and catalog SNPs in different populations.

Human Genome Project

The Human Genome Project (HGP) was declared complete in April 2003. An initial rough draft of the human genome was available in June 2000, and by February 2001 a working draft had been completed and published, followed by the final sequence mapping of the human genome on April 14, 2003. Although this was reported to cover 99% of the human genome with 99.99% accuracy, a major quality assessment of the human genome sequence published on May 27, 2004 indicated that over 92% of sampling exceeded 99.99% accuracy, which is within the intended goal. Further analyses and papers on the HGP continue to appear.

HapMap

HapMap is a multi-country effort to identify and catalog common human genetic variants. It was developed to better understand and catalogue LD patterns across the genome in several populations. About 4 million SNPs were genotyped on samples of African, East Asian and European ancestry. All genotype data are held in a publicly available database from which they can be downloaded. With these data it is possible to examine LD patterns across the genome, to estimate the approximate coverage of a given SNP chip, and to represent 80-90% of common SNPs with ~300,000 tag SNPs for European or Asian samples and ~500,000 tag SNPs for African samples.

1000 Genomes Project

Another spinoff from the Human Genome Project, the 1000 Genomes Project, was launched in 2008. It was a 3-year project that covered most countries worldwide and mainly targeted African countries. The 1000 Genomes Project is showing researchers how dynamic the human genome really is and why it is so.

Concepts Underlying the Study Design

Single Nucleotide Polymorphisms

The modern unit of genetic variation is the single nucleotide polymorphism, or SNP. SNPs are single base-pair changes in the DNA sequence that occur with high frequency in the human genome. For the purposes of genetic studies, SNPs are typically used as markers of a genomic region, with the large majority of them having a minimal impact on biological systems. SNPs can have functional consequences, however, causing amino acid changes, changes to mRNA transcript stability, and changes to transcription factor binding affinity. SNPs are by far the most abundant form of genetic variation in the human genome. SNPs are notably a type of common genetic variation; many SNPs are present in a large proportion of human populations. SNPs typically have two alleles, meaning that within a population there are two commonly occurring base-pair possibilities for a SNP location. The frequency of a SNP is given in terms of the minor allele frequency, that is, the frequency of the less common allele. For example, a SNP with a minor allele (G) frequency of 0.40 implies that 40% of the chromosomes in a population carry the G allele, versus the more common allele (the major allele), which is found on the remaining 60%.
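As a minimal sketch of the minor allele frequency calculation described above, the function below counts alleles over diploid genotypes. The input format (two-letter genotype strings such as "AG") and the function name are assumptions made for illustration, not part of any particular GWAS package.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for one biallelic SNP.

    `genotypes` is a list of two-letter strings, one per diploid
    individual (a hypothetical input format chosen for this sketch).
    """
    # Each individual contributes two alleles (two chromosomes).
    allele_counts = Counter(allele for g in genotypes for allele in g)
    if len(allele_counts) < 2:
        # Monomorphic locus: there is no minor allele.
        return None, 0.0
    total = sum(allele_counts.values())
    minor_allele, minor_count = min(allele_counts.items(), key=lambda kv: kv[1])
    return minor_allele, minor_count / total


# Ten individuals: 2 GG, 4 AG, 4 AA -> 8 G alleles out of 20 chromosomes.
sample = ["GG"] * 2 + ["AG"] * 4 + ["AA"] * 4
allele, maf = minor_allele_frequency(sample)
print(allele, maf)
```

In this example the G allele is carried on 8 of the 20 chromosomes, which corresponds to the MAF of 0.40 used in the text.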

Linkage Disequilibrium

Linkage disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population. The term linkage disequilibrium was coined by population geneticists in an attempt to mathematically describe changes in genetic variation within a population over time. It is related to the concept of chromosomal linkage, where two markers on a chromosome remain physically joined on a chromosome through generations of a family. Recombination events within a family from generation to generation break apart chromosomal segments. This effect is amplified through generations, and in a population of fixed size undergoing random mating, repeated random recombination events will break apart segments of contiguous chromosome (containing linked alleles) until eventually all alleles in the population are in linkage equilibrium or are independent. Thus, linkage between markers on a population scale is referred to as linkage disequilibrium.

The rate of LD decay is dependent on multiple factors, including the population size, the number of founding chromosomes in the population, and the number of generations for which the population has existed. As such, different human sub-populations have different degrees and patterns of LD. African-descent populations are the most ancestral and have smaller regions of LD due to the accumulation of more recombination events in that group. European-descent and Asian descent populations were created by founder events (a sampling of chromosomes from the African population), which altered the number of founding chromosomes, the population size, and the generational age of the population. These populations on average have larger regions of LD than African-descent groups. Many measures of LD have been proposed, though all are ultimately related to the difference between the observed frequency of co-occurrence for two alleles (i.e. a two-marker haplotype) and the frequency expected if the two markers are independent. The two commonly used measures of linkage disequilibrium are D’ and r².
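The two LD measures named above can be sketched for a pair of biallelic SNPs as follows. This is an illustration using the standard textbook definitions (D as the difference between the observed and expected haplotype frequency, D' as D scaled by its maximum attainable value, and r² as the squared correlation); the function and argument names are assumptions.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab : observed frequency of the haplotype carrying allele A at the
           first locus and allele B at the second locus
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    # Observed haplotype frequency minus the frequency expected
    # if the two loci were independent.
    d = p_ab - p_a * p_b
    # The largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Allele B only ever occurs together with allele A (complete LD by D'),
# but the allele frequencies differ, so r^2 is well below 1.
print(ld_measures(p_ab=0.2, p_a=0.5, p_b=0.2))
```

The example illustrates why both measures are reported: D' can equal 1 whenever one of the four haplotypes is absent, while r² also depends on how well the allele frequencies match.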

LD in animal system

LD in Humans:

LD has been studied extensively in humans (Homo sapiens) - Pritchard & Przeworski

There is tremendous heterogeneity in human LD estimates because of differences in loci, marker types (microsatellites versus SNPs), sample populations, and chromosome type (sex chromosomes versus autosomes).

LD in other animal systems:

LD studies have also been conducted in cattle and fruit flies (Drosophila melanogaster).

Extensive LD has been reported in Dutch black-and-white dairy cattle populations; globalization of semen trading is noted as a contributing factor.

LD in plant systems

MAIZE: In maize (Zea mays ssp. mays), several studies have investigated LD over a wide range of populations and marker types. The patterns of LD vary substantially with the population chosen.

Tenaillon et al. investigated sequence diversity at 21 loci on chromosome 1 in a diverse group of maize germplasm.

ARABIDOPSIS: The LD pattern in Arabidopsis is in sharp contrast to the pattern in maize. During the last five years most studies have described LD in wheat and barley, besides single reports on rice, ryegrass, soybean, sugarcane and sorghum.

The factors, which lead to an increase in LD, include

Inbreeding, Small population size, Genetic isolation between lineages, Population subdivision, Low recombination rate, Population admixture, Natural and artificial selection, Balancing selection, etc.

The factors, which lead to a decrease/disruption in LD, include Outcrossing, High recombination rate, High mutation rate, etc.

Stages of GWAS designs

1) One-stage design: all samples are genotyped on all the markers. It was first used in the HapMap project, where the cost involved was high.

2) Two-stage design: it reduces the genotyping requirements and reduces the false positive rate.

3) Multistage designs: signals from an initial, first-stage GWA are used to define a subset of SNPs that are retyped in additional second-stage samples. Joint analysis of the stages has more power than treating the second stage as a replication, and the p-value threshold in Stage 1 must be liberal. Power can be computed with the CaTS power calculator. The design lowers cost but does not gain power.

Analysis of GWAS

The most common approach is to look at each SNP one at a time, possibly adding in multi-marker information, and then to further investigate or report only the top SNPs, or to perform backwards replication. The trend test is most commonly used. The log-additive model and logistic regression are the foremost methods for analyzing GWAS. Analyses should adjust for potential population stratification.

Basics for GWAS

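Calculate the odds ratio.

• If two events A and B are considered, the odds ratio of A relative to B is

$$\mathrm{OR} = \frac{\mathrm{Odds}(A)}{\mathrm{Odds}(B)} = \frac{\Pr(A)/\Pr(\lnot A)}{\Pr(B)/\Pr(\lnot B)}$$

• Odds ratio of an event D comparing the two groups (G = 1 and G = 0):

$$\mathrm{OR} = \frac{\mathrm{Odds}(D \mid G=1)}{\mathrm{Odds}(D \mid G=0)} = \frac{\Pr(D \mid G=1)/\Pr(\lnot D \mid G=1)}{\Pr(D \mid G=0)/\Pr(\lnot D \mid G=0)} = \frac{\Pr(D \mid G=1)}{\Pr(D \mid G=0)} \times \frac{\Pr(\lnot D \mid G=0)}{\Pr(\lnot D \mid G=1)} = \mathrm{RR} \times \frac{\Pr(\lnot D \mid G=0)}{\Pr(\lnot D \mid G=1)}$$

• Symmetry of the odds ratio:

$$\mathrm{OR} = \frac{\mathrm{Odds}(D \mid G=1)}{\mathrm{Odds}(D \mid G=0)} = \frac{\mathrm{Odds}(G \mid D=1)}{\mathrm{Odds}(G \mid D=0)}$$

Significance is tested with the chi-square test, and SNPs are ranked by p-value (statistical test of association).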
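The odds ratio, chi-square test and ranking step can be sketched for a 2x2 table of carrier status by case/control status. This is a minimal illustration, not a full GWAS pipeline: the SNP names and counts are made up, no continuity correction is applied, and the p-value uses the fact that for one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)).

```python
import math


def odds_ratio_and_chi_square(a, b, c, d):
    """Return (odds ratio, chi-square statistic, p-value) for a 2x2 table.

                     cases   controls
    carriers           a        b
    non-carriers       c        d
    """
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table (1 degree of freedom).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical counts for two SNPs, 50 cases and 50 controls each.
tables = {"snp1": (30, 10, 20, 40), "snp2": (25, 25, 25, 25)}
results = {name: odds_ratio_and_chi_square(*t) for name, t in tables.items()}

# Rank SNPs from the smallest to the largest p-value.
ranked = sorted(results, key=lambda name: results[name][2])
for name in ranked:
    print(name, results[name])
```

A real analysis would handle zero cells and small counts (for example with Fisher's exact test) before ranking.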

Challenges we are going to address while conducting GWAS

Multiple hypothesis testing. In GWAS the number of statistical tests is commonly on the order of 10⁶. At a significance level of 0.01 we would expect 10,000 false positives. Thus individual p-values < 0.01 are no longer significant, and correction for multiple hypothesis testing is critical.
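The arithmetic above, and one common correction, can be sketched as follows. The Bonferroni correction and the family-wise level of 0.05 are assumptions chosen for illustration; the text does not prescribe a particular correction method.

```python
n_tests = 10**6          # number of SNPs tested
alpha_per_test = 0.01    # uncorrected significance level

# Expected number of false positives if every null hypothesis is true.
expected_false_positives = n_tests * alpha_per_test
print(expected_false_positives)

# Bonferroni: divide the desired family-wise error rate by the number of tests.
family_alpha = 0.05      # assumed family-wise level
bonferroni_threshold = family_alpha / n_tests
print(bonferroni_threshold)


def bonferroni_significant(p_values, family_alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = family_alpha / len(p_values)
    return [p <= threshold for p in p_values]


print(bonferroni_significant([0.001, 0.04, 0.5]))
```

With one million tests the corrected threshold is 5 x 10⁻⁸, far stricter than the per-test level of 0.01.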

Population structure. Confounding structure leads to false positives.

Statistical power and resolution. Issues include small samples, a large number of hypotheses, the need for increased power, and the testing of compound hypotheses.

Association Test

Single Locus Analysis

When a well-defined phenotype has been selected for a study population, and genotypes are collected using sound techniques, the statistical analysis of genetic data can begin. The de facto analysis of genome-wide association data is a series of single-locus statistical tests, examining each SNP independently for association to the phenotype. The statistical test conducted depends on a variety of factors, but first and foremost, statistical tests are different for quantitative traits versus case/control studies.

Quantitative traits are generally analyzed using generalized linear model (GLM) approaches, most commonly the Analysis of Variance (ANOVA), which is similar to linear regression with a categorical predictor variable, in this case genotype classes. The null hypothesis of an ANOVA using a single SNP is that there is no difference between the trait means of any genotype group. The assumptions of GLM and ANOVA are 1) the trait is normally distributed; 2) the trait variance within each group is the same (the groups are homoscedastic); 3) the groups are independent.

Dichotomous case/control traits are generally analyzed using either contingency table methods or logistic regression. Contingency table tests examine and measure the deviation from independence that is expected under the null hypothesis that there is no association between the phenotype and genotype classes. The most ubiquitous form of this test is the popular chi-square test (and the related Fisher’s exact test). Logistic regression is an extension of linear regression where the outcome of a linear model is transformed using a logistic function that predicts the probability of having case status given a genotype class. Logistic regression is often the preferred approach because it allows for adjustment for clinical covariates (and other factors), and can provide adjusted odds ratios as a measure of effect size. Logistic regression has been extensively developed, and numerous diagnostic procedures are available to aid interpretation of the model.
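The single-SNP ANOVA for a quantitative trait can be sketched as a one-way F statistic across genotype groups. This is a minimal illustration with made-up trait values; the function name is an assumption, and the p-value (from the F distribution with k-1 and N-k degrees of freedom) is deliberately left out because it needs a statistics library.

```python
def anova_f_statistic(trait_by_genotype):
    """One-way ANOVA F statistic for one SNP.

    `trait_by_genotype` maps a genotype class (e.g. "AA", "Aa", "aa")
    to the list of trait values observed in that class.
    """
    groups = [values for values in trait_by_genotype.values() if values]
    k = len(groups)                          # number of genotype classes
    n_total = sum(len(g) for g in groups)    # number of individuals
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individuals around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical trait values for the three genotype classes of one SNP.
trait = {"AA": [10.0, 12.0], "Aa": [14.0, 16.0], "aa": [18.0, 20.0]}
print(anova_f_statistic(trait))
```

A large F value indicates that the trait means differ between genotype groups more than expected from the within-group variation, which is evidence against the null hypothesis stated above.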

For both quantitative and dichotomous trait analysis (regardless of the analysis method), there are a variety of ways that genotype data can be encoded or shaped for association tests. The choice of data encoding can have implications for the statistical power of a test, as the degrees of freedom for the test may change depending on the number of genotype-based groups that are formed. Allelic association tests examine the association between one allele of the SNP and the phenotype. Genotypic association tests examine the association between genotypes (or genotype classes) and the phenotype.

The genotypes for a SNP can also be grouped into genotype classes or models, such as dominant, recessive, multiplicative, or additive models. Each model makes different assumptions about the genetic effect in the data. Assuming two alleles for a SNP, A and a, a dominant model (for A) assumes that having one or more copies of the A allele increases risk compared to a (i.e. Aa or AA genotypes have higher risk). The recessive model (for A) assumes that two copies of the A allele are required to alter risk, so individuals with the AA genotype are compared to individuals with Aa and aa genotypes. The multiplicative model (for A) assumes that if there is a 3× risk for having a single A allele, there is a 9× risk for having two copies of the A allele: in this case if the risk for Aa is k, the risk for AA is k². The additive model (for A) assumes that there is a uniform, linear increase in risk for each copy of the A allele, so if the risk is 3× for Aa, there is a 6× risk for AA; in this case the risk for Aa is k and the risk for AA is 2k.

A common practice for GWAS is to examine additive models only, as the additive model has reasonable power to detect both additive and dominant effects, but it is important to note that an additive model may be underpowered to detect some recessive effects. Rather than choosing one model a priori, some studies evaluate multiple genetic models coupled with an appropriate correction for multiple testing.
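The genotype encodings and risk models described above can be sketched as follows. The function names and the string format for genotypes are assumptions made for illustration; risks are expressed relative to the aa genotype.

```python
def encode_genotype(genotype, model="additive", risk_allele="A"):
    """Encode a genotype string such as "Aa" as a numeric predictor."""
    copies = genotype.count(risk_allele)   # 0, 1 or 2 copies of the risk allele
    if model == "additive":
        return copies                      # one unit per copy of A
    if model == "dominant":
        return int(copies >= 1)            # Aa and AA grouped together
    if model == "recessive":
        return int(copies == 2)            # only AA differs from the rest
    raise ValueError(f"unknown model: {model}")


def relative_risk(copies, k, model):
    """Risk relative to aa when the risk for Aa is k."""
    if model == "multiplicative":
        return k ** copies                 # 1, k, k^2
    if model == "additive":
        return 1 if copies == 0 else k * copies   # 1, k, 2k
    raise ValueError(f"unknown model: {model}")


for g in ("aa", "Aa", "AA"):
    print(g, [encode_genotype(g, m) for m in ("additive", "dominant", "recessive")])

# With k = 3: multiplicative gives 3x and 9x, additive gives 3x and 6x.
print([relative_risk(c, 3, "multiplicative") for c in (1, 2)])
print([relative_risk(c, 3, "additive") for c in (1, 2)])
```

The encoded value would then be used as the predictor in the regression or contingency-table test for that SNP.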

Multi-Locus Analysis

In addition to single-locus analyses, genome-wide association studies provide an enormous opportunity to examine interactions among genetic variants throughout the genome. Multi-locus analysis, however, is not nearly as straightforward as conducting single-locus tests, and presents numerous computational, statistical, and logistical challenges. Because most GWAS genotype between 500,000 and one million SNPs, examining all pair-wise combinations of SNPs is a computationally intractable approach, even for highly efficient algorithms. One approach to this issue is to reduce or filter the set of genotyped SNPs, eliminating redundant information.
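To give a sense of scale for the pair-wise search described above, the number of SNP pairs can be counted directly. This is a small illustrative calculation; the filtered set size of 1,000 SNPs is an assumed example, not a recommendation.

```python
import math


def n_pairwise_tests(n_snps):
    """Number of distinct SNP pairs among n_snps markers."""
    return math.comb(n_snps, 2)


for n_snps in (500_000, 1_000_000, 1_000):
    print(f"{n_snps:>9,} SNPs -> {n_pairwise_tests(n_snps):,} pairs")
```

Filtering a chip of 500,000 SNPs down to 1,000 candidates reduces the number of pair-wise tests from about 1.25 x 10¹¹ to 499,500.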

A simple and common way to filter SNPs is to select a set of results from a single-SNP analysis based on an arbitrary significance threshold and exhaustively evaluate interactions in that subset. This can be perilous, however, as selecting SNPs to analyze based on main effects will prevent certain multi-locus models from being detected – so called ‘‘purely epistatic’’ models with statistically undetectable marginal effects. With these models, a large component of the heritability is concentrated in the interaction rather than in the main effects. In other words, a specific combination of markers incurs a significant change in disease risk. The benefits of this analysis are that it performs an unbiased analysis for interactions within the selected set of SNPs. It is also far more computationally and statistically tractable than analyzing all possible combinations of markers.

Missing heritability

In many complex diseases numerous genetic variants have been identified. But in many of the recent studies these common variants explain only a small fraction of the increased risk. Most of those that have been identified have no established biological relevance to the disease, and often they are not located inside 'active' genes. From the last years of GWAS it is clear that the common variants fail to explain the majority of the genetic heritability of most human diseases. This suggests that the hypothesis of 'Common disease, common variant' is not as valid as was previously believed. The problem is that the biological reality does not correspond to the study design and assumptions of GWAS, and the solution is not to increase the sample size even further but to improve the study design and statistical methods. One possible explanation for the missing heritability could be some kind of interaction between different genes (epistasis). These interactions could be hard to detect when analyzing one SNP at a time, as the marginal effect of a single SNP will be small. Another explanation is that part of the increased risk can be explained by many rare variants, which are present in less than 1% of the population. This suggests that there could be heterogeneity, where different genetic profiles can cause diseases that are diagnostically the same.

Genetic Interactions

A general definition of genetic interaction (epistasis) is that the effect (penetrance) of one locus varies according to the genotype present at another locus. To detect interactions we need to define how a 'natural' combined effect of two risk loci would be expressed in the organism. The concept of gene-gene interactions is not new, but it is still confusing since the term is used in various ways. Biological interaction or epistasis was first defined by Bateson in 1909. In that example one of the alleles at one locus G prevents the alleles at locus B from being expressed in the organism. This relation does not necessarily have to be symmetric. This definition is similar to the definition biologists use to examine a biological interaction between proteins, where proteins interact to regulate several cellular processes. In statistics the definition of interaction is usually a deviation from a linear model. In 1918 Fisher made a statistical definition of epistasis [28], as deviation from additivity in the effects of the alleles at different loci on a quantitative trait.

This definition is more similar to the classical statistical definition of interaction and does not quite correspond to the biological definition of epistasis. These definitions become troublesome when the trait is binary; in these cases the mathematical modelling often focuses on the penetrances. Hence the definitions of epistasis need to be modified. For binary traits an example could be that both allele A and allele B at two different loci are needed to develop the trait. In this case A is epistatic to B, and B is epistatic to A, hence the epistasis is symmetric, in contrast to the definition by Bateson.

A classic way to represent lack of epistasis has been the heterogeneity model, in which a person gets the trait by possessing (at least) one of the predisposing genotypes. This definition actually falls under Bateson’s definition of epistasis: for example, if a person has both risk variants (situated at different loci), the effect of allele A will be masked by allele B, which is another confusing issue about these genetic interactions. There are two types of genetic heterogeneity. Allelic heterogeneity is when several different mutations at the same locus cause the same disease. Locus heterogeneity means that mutations in several unrelated loci can cause the same disorder.

The above example of locus heterogeneity can be generalized to a situation without full penetrance, that is, $0 < f_{ij} < 1$ for some of the penetrances. Mathematically, locus heterogeneity can be expressed as

$$f_{ij} = \alpha_i + \beta_j - \alpha_i \beta_j$$

where $\alpha_i$ and $\beta_j$ are the penetrance factors for the two genetic variants. Locus heterogeneity is similar to a daisy chain, where it is enough for one of the components to break (caused by having at least one of the risk variants) for the entire system to malfunction, i.e. for the disease to develop. There are two other common two-locus models for binary traits, the multiplicative model and the additive model. The multiplicative model can be expressed as

$$f_{ij} = \alpha_i \beta_j$$

and is often considered epistatic. Both the additive model

$$f_{ij} = \alpha_i + \beta_j$$

and the heterogeneity model are thought of as non-epistatic by most authors. However, some authors consider epistasis to be a departure from the multiplicative model. Further problems appear when considering that both the multiplicative and the heterogeneity models become additive with suitable log transformations ($\log f_{ij}$ for the multiplicative model and $\log(1 - f_{ij})$ for the heterogeneity model). It will be difficult to model the true epistatic interactions in complex diseases, and discovered epistatic effects may contribute little to the understanding of the disease. Still, models that allow for interactions can improve the statistical power to detect genetic risk variants.

The main issue in finding interactions, independent of how epistasis is defined, is how to detect them in complex diseases when millions of genetic markers are analyzed. Assume that the disease is caused by different mutations at different loci in various families, and that these genes have a strong effect in each of the subpopulations. Then the heterogeneous risk genes will probably show a very weak marginal effect when the markers are analyzed one at a time. For epistatic interactions it is computationally very demanding to examine all possible gene-gene interactions, in addition to the issue of correcting for multiple hypothesis tests. One way to handle this is to first test for marginal main effects for each marker in the sample, and hope that the genes involved in interactions will also show at least a modest marginal effect. The results of this analysis are then combined with biological knowledge to suggest a number of candidates for interaction analysis.
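The three two-locus models above can be compared numerically. The following is a minimal sketch, assuming hypothetical penetrance factors (the values in `alpha` and `beta` are illustrative and not taken from any study). It checks the statement that the heterogeneity and multiplicative models become additive after a suitable log transformation by computing a simple interaction contrast, which is zero whenever the transformed penetrance is a sum of one term per locus.

```python
import math

# Hypothetical penetrance factors for genotypes 0, 1, 2 at each locus
# (illustrative values only).
alpha = [0.01, 0.05, 0.20]
beta = [0.02, 0.10, 0.30]

def heterogeneity(a, b):
    """Locus heterogeneity model: f = a + b - a*b."""
    return a + b - a * b

def multiplicative(a, b):
    """Multiplicative model: f = a*b."""
    return a * b

def additive(a, b):
    """Additive model: f = a + b."""
    return a + b

def penetrance_table(model):
    """3x3 table of penetrances f[i][j] for the two-locus genotypes."""
    return [[model(a, b) for b in beta] for a in alpha]

def interaction_contrast(table, transform=lambda f: f):
    """t(f11) - t(f10) - t(f01) + t(f00).

    The contrast is zero when t(f) is additive in the two loci.
    """
    t = transform
    return t(table[1][1]) - t(table[1][0]) - t(table[0][1]) + t(table[0][0])

het = penetrance_table(heterogeneity)
mul = penetrance_table(multiplicative)
add = penetrance_table(additive)

het_raw = interaction_contrast(het)                                # non-zero on the raw scale
het_log = interaction_contrast(het, lambda f: math.log(1.0 - f))   # additive in log(1 - f)
mul_log = interaction_contrast(mul, math.log)                      # additive in log(f)
add_raw = interaction_contrast(add)                                # additive on the raw scale

print(het_raw, het_log, mul_log, add_raw)
```

On the raw scale the heterogeneity model shows a non-zero contrast equal to $-(\alpha_1 - \alpha_0)(\beta_1 - \beta_0)$, which is one reason why the choice of scale decides whether a model is labelled epistatic.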

Data Imputation

To conduct a meta-analysis properly, the effect of the same allele across multiple distinct studies must be assessed. This can prove difficult if different studies use different genotyping platforms (which use different SNP marker sets). As this is often the case, GWAS datasets can be imputed to generate results for a common set of SNPs across all studies. Genotype imputation exploits known LD patterns and haplotype frequencies from the HapMap or 1000 Genomes project to estimate genotypes for SNPs not directly genotyped in the study.

The concept is similar in principle to haplotype phasing algorithms, where the contiguous set of alleles lying on a specific chromosome is estimated. Genotype imputation methods extend this idea to human populations. First, a collection of shared haplotypes within the study sample is computed to estimate haplotype frequencies among the genotyped SNPs. Phased haplotypes from the study sample are then compared to reference haplotypes from a much denser SNP panel, such as the HapMap data. The matched reference haplotypes contain genotypes for surrounding markers that were not genotyped in the study sample. Because the study sample haplotypes may match multiple reference haplotypes, surrounding genotypes may be given a score or probability of a match based on the haplotype overlap. For example, rather than assigning an imputed SNP a single allele A, the probabilities of the possible alleles are reported (0.85 A, 0.12 C, 0.03 T) based on haplotype frequencies. This information can be used in the analysis of imputed data to take into account uncertainty in the genotype estimation process, typically using Bayesian analysis approaches. Popular algorithms for genotype imputation include BimBam, IMPUTE, MaCH, and Beagle.

Much like conducting a meta-analysis, genotype imputation must be conducted with great care. The reference panel (i.e. the 1000 Genomes data or the HapMap project) must contain haplotypes drawn from the same population as the study sample in order to facilitate a proper haplotype match. If a study was conducted using individuals of Asian descent, but only populations of European descent are represented in the reference panel, the genotype imputation quality will be poor because there is a lower probability of a haplotype match. Also, the reference allele for each SNP must be identical in both the study sample and the reference panel. Finally, the analysis of imputed genotypes should account for the uncertainty in genotype state generated by the imputation process.
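The matching step described above can be illustrated with a toy example. This is only a sketch under strong assumptions: the reference panel, its haplotype strings and its frequencies are hypothetical, and matching is done by exact string comparison. Real tools such as IMPUTE, MaCH and Beagle use probabilistic models that tolerate mismatches and recombination, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical reference panel. Each entry is
# (alleles at the typed SNPs, allele at the untyped SNP, haplotype frequency).
reference_panel = [
    ("AGT", "A", 0.40),
    ("AGT", "C", 0.10),
    ("AGC", "A", 0.30),
    ("TGC", "T", 0.20),
]

def impute_allele_probs(study_haplotype, panel):
    """Return P(allele at untyped SNP | alleles observed at typed SNPs).

    Reference haplotypes that match the study haplotype exactly are
    weighted by their frequency; an empty dict means no match was found.
    """
    weights = defaultdict(float)
    for typed, untyped, freq in panel:
        if typed == study_haplotype:
            weights[untyped] += freq
    total = sum(weights.values())
    if total == 0:
        return {}
    return {allele: w / total for allele, w in weights.items()}

probs = impute_allele_probs("AGT", reference_panel)
no_match = impute_allele_probs("GGG", reference_panel)
print(probs, no_match)
```

The empty result for a haplotype that is absent from the panel mirrors the point made above: when the reference panel does not represent the study population, few haplotypes match and imputation quality drops.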

Statistical methods in GWAS

If a genetic marker is associated with a particular disease, then the genotype or allele frequencies will differ between affected and healthy individuals. A commonly used test for associated SNPs in case-control studies is a Pearson $\chi^2$ test applied to a 2-by-2 table of allele counts in the two groups. For complex traits it is commonly assumed that the contribution of each SNP to the genetic effect is roughly additive, i.e. the penetrance for heterozygotes lies somewhere between the penetrances for the two homozygotes. The allelic test is powerful under additive models, which explains its popularity in these studies. Other common tests include a Pearson $\chi^2$ test comparing genotype frequencies instead of allele frequencies, the Cochran-Armitage test for trend in penetrances, and logistic regression.

The Transmission Disequilibrium Test (TDT) is an association test using data from families with at least one affected child. The test was introduced by Spielman et al. and evaluates the transmission of an allele from a heterozygous parent to the offspring. The TDT is based on the assumption that each of the two alleles M1 and M2 at a locus is transmitted to the offspring with equal probability, so for a sample of heterozygous parents we expect approximately half of them to transmit allele M1. If one of the alleles is transmitted more often in families where the children have a genetic disease, we suspect that the allele is associated with the disease. Let $b$ denote the number of heterozygous parents who transmit allele M1 to their offspring, and $c$ the number of heterozygous parents who transmit allele M2. Conditioned on $b + c$, $b$ is binomially distributed, but the test statistic usually takes the form

$$T = \frac{(b - c)^2}{b + c}$$

This statistic asymptotically follows a $\chi^2$ distribution with one degree of freedom and is equivalent to a Pearson $\chi^2$ test.
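Both tests described above reduce to a one-degree-of-freedom chi-square statistic, so they can be sketched with the standard library alone. The counts below are hypothetical, and the function names are chosen for this illustration only. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test on a 2-by-2 table of allele counts.

    Rows are cases and controls, columns are the two alleles.
    Returns (statistic, p-value).
    """
    observed = ((case_a, case_b), (ctrl_a, ctrl_b))
    row = (case_a + case_b, ctrl_a + ctrl_b)
    col = (case_a + ctrl_a, case_b + ctrl_b)
    n = row[0] + row[1]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat, chi2_sf_1df(stat)

def tdt(b, c):
    """TDT statistic T = (b - c)^2 / (b + c) with its 1-df p-value.

    b and c count heterozygous parents transmitting M1 and M2.
    """
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2_sf_1df(stat)

# Hypothetical counts, for illustration only.
allelic_stat, allelic_p = allelic_chi2(60, 40, 40, 60)
tdt_stat, tdt_p = tdt(60, 40)
print(allelic_stat, allelic_p, tdt_stat, tdt_p)
```

In a genome-wide scan the same calculation is repeated for every SNP, which is why the resulting p-values must be corrected for multiple testing.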

Logistic Regression

Generalized Linear Models (GLMs) extend the ordinary regression model to response variables that are not normally distributed. GLMs are applicable if the response variable has a distribution belonging to the natural exponential family. One such distribution is the binomial distribution, and in logistic regression we model the binomial probability $p(x) = P(Y = 1 \mid x)$ as

$$\log \frac{p(x)}{1 - p(x)} = \alpha + \sum_j \beta_j x_j$$

Here $x_j$ denotes the value of the jth element of the predictor $x$. In simple logistic regression with one binary predictor $x$, $\beta$ is equal to the log odds ratio

$$\beta = \log \frac{p(x = 1) / (1 - p(x = 1))}{p(x = 0) / (1 - p(x = 0))}$$
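The link between the regression coefficient and the log odds ratio can be checked numerically. The sketch below fits a logistic regression with one binary predictor by Newton-Raphson on grouped counts and compares the fitted slope with the log odds ratio from the 2-by-2 table. The counts and the function name `fit_logistic_binary` are hypothetical; in practice a statistics package would be used instead of a hand-written fit.

```python
import math

def fit_logistic_binary(groups, iterations=25):
    """Fit logit p(x) = a + b*x by Newton-Raphson.

    groups is a list of (x, n_cases, n_total) tuples.
    Returns the estimates (a, b).
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, cases, total in groups:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            resid = cases - total * p          # score contribution
            w = total * p * (1.0 - p)          # information weight
            g_a += resid
            g_b += x * resid
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Solve the 2x2 system (information matrix) * step = score.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical data: 30 cases among 50 carriers (x = 1),
# 20 cases among 50 non-carriers (x = 0).
groups = [(1, 30, 50), (0, 20, 50)]
a_hat, b_hat = fit_logistic_binary(groups)

odds_carriers = 30 / 20
odds_noncarriers = 20 / 30
log_odds_ratio = math.log(odds_carriers / odds_noncarriers)
print(b_hat, log_odds_ratio)
```

The fitted slope and the table-based log odds ratio agree, and the intercept equals the log odds of disease among non-carriers.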

In retrospective studies (where individuals are sampled based on their affection status) the effect parameter $\beta$ will be the same as in the prospective design (where sampling is based on the predictors), if we assume that the sampling probability is independent of $x$. This is one of the main reasons for using this method in biomedical studies. Another advantage of logistic regression is that it is easy to include several predictors in the analysis and to make inference about interactions between genes and environment, as well as gene-gene interactions. Schaid described a univariate method for case-parent data, modelling genotype relative risks with conditional logistic regression using three pseudo-controls based on the parents' untransmitted alleles. This method can be generalized to two loci. For case-control data, logistic regression can be used to analyse interactions by comparing the saturated model to an additive model of the form given above. The additive logistic model is roughly equivalent to the heterogeneity model if the relative risk (RR) or odds ratio (OR) is of moderate size. However, North et al. show examples of heterogeneity models which are marginally recessive (marginal RR ≈ 150); in this case logistic regression yields non-zero interaction estimates. Hence, to really examine deviations from the heterogeneity model (and not from the multiplicative or logistic model), more advanced methods need to be applied.

GWAS studies in various crops:

1. Arabidopsis

Genome-Wide Association Mapping in Arabidopsis Identifies Previously Known Flowering Time and Pathogen Resistance Genes.

A very large number of spurious genotype–phenotype correlations are found, especially for traits that vary geographically. For example, plants from northern latitudes flower later; however, in addition to sharing genetic variants that make them flower late, they also tend to share variants across the genome, making it difficult to determine which genes are responsible for flowering. This notwithstanding, several previously known genes were successfully identified in this study, and the researchers are optimistic about the prospects for association mapping in this species.

The study examined flowering time and pathogen resistance in a sample of 95 accessions for which genome-wide polymorphism data were available. In spite of an extremely high rate of false positives due to population structure, the authors were able to identify known major genes for all phenotypes tested, thus demonstrating the potential of genome-wide association mapping in A. thaliana and other species with similar patterns of variation. The rate of false positives differed strongly between traits, with more clinal traits showing the highest rate. However, the false positive rates were always substantial regardless of the trait, highlighting the necessity of an appropriate genomic control in association studies.
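One widely used form of genomic control rescales the test statistics by a genomic inflation factor, the ratio of the observed median chi-square statistic to the median expected under the null hypothesis. The sketch below assumes one-degree-of-freedom chi-square statistics; the input values are hypothetical, and the source does not state which correction the Arabidopsis study applied, so this is an illustration of the general idea only.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
MEDIAN_CHI2_1DF = 0.4549364

def genomic_inflation(chi2_stats):
    """Inflation factor: observed median statistic / expected median."""
    return statistics.median(chi2_stats) / MEDIAN_CHI2_1DF

def genomic_control(chi2_stats):
    """Divide every statistic by the inflation factor (never below 1)."""
    lam = max(1.0, genomic_inflation(chi2_stats))
    return [s / lam for s in chi2_stats]

# Hypothetical 1-df chi-square statistics from a structured sample.
stats = [0.2, 0.5, 0.91, 1.4, 3.0, 12.0, 0.8]
lam = genomic_inflation(stats)
corrected = genomic_control(stats)
print(lam, corrected)
```

An inflation factor well above 1, as in this toy input, signals that many statistics are larger than expected by chance, which is the pattern produced by population structure.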

Fig. The columns on the left give the genotype and associated phenotype for four loci, for each of the 95 accessions. The four loci are the flowering time locus FRI (+, wild-type; 1, Ler null allele; 2, Col null allele), for which the associated phenotype is flowering time in long-day conditions without vernalization (late flowering is indicated by height and color of bar), and the three pathogen resistance loci Rps5, Rpm1, and Rps2 (+, wild-type; −, null allele), for which the associated phenotypes are hypersensitive response to the appropriate bacterial avr gene (red indicates resistance, black indicates susceptibility, and missing data are indicated by a missing bar). The tree on the right illustrates the genetic relationships between the accessions.

GWAPP: A Web Application for Genome-Wide Association Mapping in Arabidopsis.

GWAPP is an interactive Web-based application for conducting GWAS in A. thaliana. Using an efficient implementation of a linear mixed model, traits measured for a subset of 1386 publicly available ecotypes can be uploaded and mapped in just a couple of minutes, with the mixed model or with other methods. GWAPP features an extensive, interactive, and user-friendly interface that includes interactive Manhattan plots and linkage disequilibrium plots. It also facilitates exploratory data analysis by implementing features such as the inclusion of candidate polymorphisms in the model as cofactors.

Fig 1. (A) The filter box allows the user to exclude specific accessions as well as change the name and the description of the data set. (B) The data set list displays information for each accession in the data set. In edit mode, the user can use the checkbox to add and remove accessions from the data set. (C) A Google map shows the locations of all accessions in the data set. Clicking on one marker will show a pop-up with information about the name and ID of the selected accession. (D) The geographic distribution map (GeoMap) shows the geographic distribution of the accessions in the data set. Moving the mouse over a country will show the number of accessions located in that region.

Fig 2. The result view displays GWAS plots for each of the five chromosomes. Each GWAS plot itself consists of three panels. The top panel (A) contains a scatterplot. The positions on the chromosome are on the x axis and the score on the y axis. The dots in the scatterplot represent SNPs (E). A horizontal dashed line (H) shows the 5% FDR threshold. At the top of the GWAS results view, a search box for genes is displayed (D). These genes will be displayed as a colored band (red in the figure). The second panel (B) shows the gene annotation and is only shown for a specific zoom range (<1.5 Mb). It will display genes, gene features, and gene names. Moving the mouse over a gene will display additional information in a pop-up (F), and clicking on a gene will open the TAIR page for the specific gene. Panel (C) displays various chromosome-wide statistics. The region highlighted by a yellow band (I) is shown in the scatterplot and in the gene annotation. The gear icon opens a pop-up (G) with the available statistics the user can choose from.
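A false discovery rate (FDR) threshold such as the dashed line in the figure can be derived from the list of p-values. The sketch below uses the Benjamini-Hochberg step-up procedure as an assumption; the source does not say which FDR procedure GWAPP applies, and the p-values are hypothetical.

```python
def bh_threshold(p_values, fdr=0.05):
    """Benjamini-Hochberg p-value threshold at the given FDR level.

    Returns the largest sorted p-value p_(k) with p_(k) <= (k / m) * fdr,
    or None if no p-value passes.
    """
    m = len(p_values)
    threshold = None
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= k / m * fdr:
            threshold = p
    return threshold

# Hypothetical p-values for ten SNPs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
cutoff = bh_threshold(p_values, fdr=0.05)
significant = [p for p in p_values if cutoff is not None and p <= cutoff]
print(cutoff, significant)
```

SNPs with p-values at or below the returned cutoff would lie above the corresponding horizontal line in a Manhattan plot.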

Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa.

Earlier work showed that biparental QTL mapping approaches are not scalable enough to investigate the genetic potential and the tremendous phenotypic variation of the more than 12000 accessions available in public germplasm repositories.

In this study, a global collection of 413 diverse O. sativa varieties from 82 countries was genotyped using a high-quality, custom-designed 44,100-oligonucleotide genotyping array. The accessions were phenotyped for 34 morphological, developmental and agronomic traits over two consecutive field seasons.

This mapping strategy evaluated variation both within and among four of the major subgroups of rice, revealing significant heterogeneity of genetic architecture among groups as well as gene-by-environment effects.

Fig. Phenotypic distribution and genome-wide association scan for plant height. (a) Quantile-quantile plots for both the naïve and the mixed model for plant height in all samples. (b) Boxplot showing the differences in plant height among subpopulations. Box edges represent the upper and lower quantile, with the median value shown as a bold line in the middle of the box. Whiskers represent 1.5 times the quantile of the data. Individuals falling outside the range of the whiskers are shown as open dots. (c) Histogram of plant height in all samples. Dashed black line represents the null distribution. (d) Genome-wide P-values from the mixed model and the naïve method. The x axis shows the SNPs along each chromosome; the y axis is the −log10(P-value) for the association. Dots in (a) and (c) indicate SNPs with P-values < $1 \times 10^{-4}$ in the mixed model and the top 50 SNPs in the naïve method; SNPs within a 200 kb range of known genes are in red; other significant SNPs are in blue. Candidate gene locations are shown as red vertical dashed lines with names on top.

A Genome-Wide Association Study Identifies Genomic Regions for Virulence in the Non-Model Organism Heterobasidion annosum s.s

The dense single nucleotide polymorphism (SNP) panels needed for genome-wide association (GWA) studies have hitherto been expensive to establish and use on non-model organisms. To overcome this, the authors used a next-generation sequencing approach both to establish SNPs and to determine genotypes. They conducted a GWA study on a fungal species, analyzing the virulence of Heterobasidion annosum s.s., a necrotrophic pathogen, on its hosts Picea abies and Pinus sylvestris. From a set of 33,018 SNPs in 23 haploid isolates, twelve SNP markers distributed on seven contigs were associated with virulence (P < 0.0001). Four of the contigs harbour known virulence genes from other fungal pathogens and the remaining three harbour novel candidate genes. Two contigs link closely to virulence regions recognized previously by QTL mapping in the congeneric hybrid H. irregulare × H. occidentale. The study demonstrates the efficiency of GWA studies for dissecting important complex traits of small populations of non-model haploid organisms with small genomes.

Genome-wide association study (GWAS) of resistance to head smut in maize

Head smut, caused by the fungus Sphacelotheca reiliana (Kühn) Clint, is a devastating global disease in maize, leading to severe quality and yield loss each year. The present study is the first to conduct a genome-wide association study (GWAS) of head smut resistance using the Illumina MaizeSNP50 array. Out of 45,868 single nucleotide polymorphisms in a panel of 144 inbred lines, 18 novel candidate genes were associated with head smut resistance in maize. These candidate genes were classified into three groups, namely, resistance genes, disease response genes, and other genes with possible plant disease resistance functions. The data suggested a complicated molecular mechanism of maize resistance against S. reiliana. This study also suggested that GWAS is a useful approach for identifying causal genetic factors for head smut resistance in maize.

Fig. Manhattan plots of a mixed linear model (MLM) for resistance to head smut. Points above the blue horizontal dashed line reach genome-wide significance at a moderately stringent threshold of −log10(1/45,868). Points above the red horizontal dashed line reach genome-wide significance at the stringent threshold of −log10(0.05/45,868). The different colors indicate different chromosomes, in order from chromosome 1 to chromosome 10. Points with a −log10(P) value above 8 are not shown.
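The two thresholds in the caption follow directly from the number of SNPs tested, and a short calculation shows where the lines fall on the −log10(P) axis. The variable names are illustrative; the only input taken from the text is the marker count of 45,868.

```python
import math

n_snps = 45868  # number of SNPs tested, from the study described above

# Moderately stringent threshold: P = 1 / n_snps
moderate = -math.log10(1 / n_snps)

# Stringent (Bonferroni-type) threshold: P = 0.05 / n_snps
stringent = -math.log10(0.05 / n_snps)

def exceeds(p_value, threshold):
    """True if a SNP's -log10(P) lies above the given threshold line."""
    return -math.log10(p_value) > threshold

print(round(moderate, 2), round(stringent, 2))
```

The two lines are always separated by log10(20), about 1.3 units, regardless of the number of markers, because the thresholds differ only by the factor 0.05.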

Advantages of GWAS

1) Biological pathway of the trait does not have to be known.

2) Potential to discover novel candidate genes not identified through other methodological approaches.

3) Encourages the formation of collaborative consortia to recruit a sufficient number of participants for analysis; these consortia tend to continue their collaboration in subsequent analyses.

4) Rules act at specific genetic association.

5) Provides data on the ancestry of each subject, which assists in matching case subjects with control subjects.

6) Provides data on two types of structural variants (sequence and copy number variations), which makes the data more robust.

7) It is large enough to identify mutations explaining a few percent of phenotypic variance.

Disadvantages

1) Results need replication in independent samples from different populations.

2) A large study population is required.

3) GWAS detects association, not causation.

4) It identifies a specific location, not a complete gene. Many variants identified are nowhere near a protein-coding gene, or lie within genes that were not previously believed to be associated with the trait or condition.

5) It focuses on common variants.

6) It detects only variants that are common (>5%) in a population.

7) Typically, for any particular trait, the cumulative effect of multiple SNPs explains only a small fraction of an individual's risk for the trait.

Why is GWAS still popular?

Falling genotyping costs are likely to drive association studies away from candidate genes. Whole-genome resequencing of all the individuals in a population will allow assessment of point mutations, insertions, deletions and large structural variation such as copy number variation, e.g. the resequencing of Arabidopsis lyrata. In future this will also help RNA-seq data to be included in eQTL mapping within GWAS.

The choice of population for GWAS will no longer be restricted to model organisms and will slowly become more focused on the species most relevant to answering biological questions. The accuracy of GWAS depends on one-time genotyping and repeated phenotyping under different environmental conditions.

Output of GWAS

• To ensure the greatest utility of GWAS results in the future, all phenotype and genotype data should be made public and deposited in public databases.

• File formats and minimum information standards should therefore be established, such as those available for sequence data or microarray experiments. Priority should be given to the storage and dissemination of phenotypic and genotypic data.

Future perspectives

Despite the caveats outlined above, it seems that genome-wide association studies of the role of common variants in complex disease will be carried out in the near future. Initial studies will define more accurately the principal factors, which have been summarized above, that can reduce the power of such studies. In these studies, large sample sizes should be used, biases taken into account, multiple-testing issues addressed and replication studies carried out, therefore optimizing experimental design, statistical power and cost efficiency. Close evaluation of the yields of true susceptibility loci in relation to the cost of such rigorously designed studies will determine whether the genome-wide analyses of common SNPs is a worthwhile approach in the continuing dissection of the genetic basis of common disease.

Summary and conclusions

The past year has seen a remarkable shift in our capacity to dissect the genetic basis of common diseases and continuous traits of biomedical significance. The GWA approach has proven itself extremely well-suited to the identification of common SNP-based variants with modest to large effects on phenotype. Careful implementation and appropriate interpretation have resulted in discoveries that have proven more robust than many had anticipated. Growing numbers of novel susceptibility loci have been identified, shedding light on the fundamental mechanisms that influence disease predisposition, and much is being learned about the complex relationships between changes in genome sequence and phenotypic variation.

However, we are far from the end of this particular voyage, and recent discoveries are nothing more than initial forays into the terra incognita of our genomes. We remain unable to explain more than a small proportion of observed familial clustering for most multifactorial traits, a fact that emphasizes the need to extend analysis to a more complete range of potential susceptibility variants, and to support more explicit modelling of the joint effects of genes and environment. Many of the greatest challenges to be faced in the years ahead lie not so much in the identification of the association signals themselves, but in defining the molecular mechanisms through which they influence disease risk and/or phenotypic expression.

References

Altshuler DM, Gibbs RA, Peltonen L, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58. doi: 10.1038/nature09298.

Atwell S, Huang YS, Vilhjalmsson BJ, Willems G, Horton M, et al. (2010) Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465: 627–631.

Connelly CF, Akey JM (2012) On the prospects of whole-genome association mapping in Saccharomyces cerevisiae. Genetics.

Cooper GM, Johnson JA, Langaee TY, Feng H, Stanaway IB, et al. (2008) A genome-wide scan for common genetic variants with a large influence on warfarin maintenance dose. Blood 112: 1022–1027. doi: 10.1182/blood-2008-01-134247.

Cumagun CJR, Bowden RL, Jurgenson JE, Leslie JF, Miedaner T (2004) Genetic mapping of pathogenicity and aggressiveness of Gibberella zeae (Fusarium graminearum) toward wheat. Phytopathology 94: 520–526.

Edwards AO, Ritter R III, Abel KJ, Manning A, Panhuysen C, et al. (2005) Complement factor H polymorphism and age-related macular degeneration. Science 308: 421–424. doi: 10.1126/science.1110189.

Ellison CE, Hall C, Kowbel D, Welch J, Brem RB, et al. (2011) Population genomics and local adaptation in wild isolates of a model microbial eukaryote.

Freedman, M. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).

Freimer, N. & Sabatti, C. The use of pedigree, sib-pair and association studies of common diseases for genetic mapping and epidemiology. Nature Genet. 36, 1045–1051 (2004). A clear and unbiased review of the main current genetic mapping strategies that discusses analyses using extended pedigrees, affected sib-pairs and association.

1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073. doi: 10.1038/nature09534.

Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. (2008) ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 36: D107–D113. doi: 10.1093/nar/gkm967.

Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, et al. (2005) Complement factor H variant increases the risk of age-related macular degeneration. Science 308: 419–421. doi: 10.1126/science.1110359.

Hall D, Tegstrom C, Ingvarsson PK (2010) Using association mapping to dissect the genetic basis of complex traits in plants. Brief Funct Genomics 9: 157–165.

Hawthorne B, Rees-George J, Bowen J, Ball R (1997) A single locus with a large effect on virulence in Nectria haematococca MPI. Fungal Genet Newsl 44: 24–26.

Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367.

IRGSP. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–389. doi: 10.1126/science.1109557.

Lander, E.S. & Schork, N.J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).

Li, Y., Huang, Y.S., Bergelson, J., Nordborg, M., and Borevitz, J.O. (2010). Association mapping of local climate-sensitive quantitative trait loci in Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 107: 21199–21204.

Lind M, Dalman K, Stenlid J, Karlsson B, Olson A (2007) Identification of quantitative trait loci affecting virulence in the basidiomycete Heterobasidion annosum s.l. Curr Genet 52: 35–44.

Lind M, van der Nest M, Olson Å, Brandström-Durling M, Stenlid J (2012) A 2nd generation linkage map of Heterobasidion annosum s.l. based on in silico anchoring of AFLP markers. PLoS One 7: e48347.

Liti G, Carter DM, Moses AM, Warringer J, Parts L, et al. (2009) Population genomics of domestic and wild yeasts. Nature 458: 337–341.

Lohmueller, K. et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33,177–182 (2003).

Muller LAH, Lucas JE, Georgianna DR, McCusker JH (2011) Genome-wide association analysis of clinical vs. nonclinical origin provides insights into Saccharomyces cerevisiae pathogenesis. Mol Ecol 20: 4085–4097.

Neafsey DE, Barker BM, Sharpton TJ, Stajich JE, Park DJ, et al. (2010) Population genomic sequencing of Coccidioides fungi reveals recent hybridization and transposon control. Genome Res 20: 938–946.

Olson A, Stenlid J (2001) Plant pathogens - Mitochondrial control of fungal hybrid virulence. Nature 411: 438–438.

Ozoki, K., 2001, A high throughput SNP typing system for GWAS, Springer. 16:1134-1137.

Pandelova I, Ciuffetti LM (2005) A proteomics-based approach for identification of the Tox D gene. Fungal Genet Newsl 52.

Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 906–909 (2006).

Santoyo F, Gonzalez AE, Terron MC, Ramirez L, Pisabarro AG (2008) Quantitative linkage mapping of lignin-degrading enzymatic activities in Pleurotus ostreatus. Enzyme Microb Technol 43: 137–143.

Tohn, P, A., 2009, Validating and refining GWAS signals, Nature. 10:318-329.

Umit Seren. And Bjarni., 2012, GWAPP: A web application for genome wide association mapping in Arabidopsis, The Plant Cell. 24:4793-4805.

Yamamoto, T., Yonemaru, J. & Yano, M. Towards the understanding of complex traits in rice: substantially or superficially? DNA Res. 16, 141–154 (2009).

Yan J, Shah T, Warburton ML, Buckler ES, McMullen MD, et al. (2009) Genetic characterization and linkage disequilibrium estimation of a global maize collection using SNP markers. PLoS One 4.

Zhang H, Zhao Q, Liu K, Zhang Z, Wang Y, et al. (2009) MgCRZ1, a transcription factor of Magnaporthe grisea, controls growth, development and is involved in full virulence. FEMS Microbiol Lett 293: 160–169.
