QTL mapping

Ades 2008, NHGRI

Simple Mendelian traits are caused by a single locus, and come in the ‘all-or-none’ flavor.

A Quantitative Trait is one in which many loci contribute. The phenotype can therefore vary in a ‘quantitative’ manner.

Modified from Mike White slides, 2010

Goals of QTL mapping

Ades 2008, NHGRI

To identify the loci that contribute to phenotypic variation:

1. Cross two parents with extreme phenotypes

2. Score the progeny for the phenotype

3. Genotype the progeny at markers across the genome

4. Associate the observed phenotypic variation with the underlying genetic variation

5. Ultimate goal: identify causal polymorphisms that explain the phenotypic variation

Backcross

Broman and Sen 2009

Phenotype: drug tolerance (80% vs. 20% viability)

A mapping cross usually has at least 100 individuals

Intercross

Broman and Sen 2009

Phenotype: drug tolerance (80% vs. 20% viability)

Backcross vs. Intercross

• An intercross recovers all three possible genotypes (AA, AB, BB). This allows dominance to be detected for either allele and provides estimates of the degree of dominance.

• A backcross has more power to detect QTL with fewer individuals.

• A backcross may be the only possible scheme when crossing two different species.
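The genotype classes each scheme recovers can be illustrated with a small simulation. This is only a sketch at a single locus; the function name `simulate_cross`, the sample size, and the seed are assumptions chosen for illustration, not part of the original slides.

```python
import random
from collections import Counter

def simulate_cross(n, scheme, seed=0):
    """Simulate genotypes at one locus for n progeny of an F1 (AB) parent.

    scheme="backcross":  F1 (AB) x parent (AA) -> AA or AB
    scheme="intercross": F1 (AB) x F1 (AB)     -> AA, AB, or BB
    """
    if scheme not in ("backcross", "intercross"):
        raise ValueError("scheme must be 'backcross' or 'intercross'")
    rng = random.Random(seed)
    genotypes = []
    for _ in range(n):
        f1_gamete = rng.choice("AB")  # the F1 transmits A or B with equal probability
        # The recurrent AA parent can only transmit A; a second F1 transmits A or B.
        other_gamete = "A" if scheme == "backcross" else rng.choice("AB")
        genotypes.append("".join(sorted(f1_gamete + other_gamete)))
    return Counter(genotypes)

backcross = simulate_cross(1000, "backcross")
intercross = simulate_cross(1000, "intercross")
print("backcross:", dict(backcross))    # roughly 1:1 AA:AB
print("intercross:", dict(intercross))  # roughly 1:2:1 AA:AB:BB
```

Only the intercross produces BB individuals, which is why it can separate additive from dominance effects.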

Genetic map: specific markers spaced across the genome

Markers can be:

• SNPs at particular loci

• Variable-length repeats, e.g. Alu repeats

• All polymorphisms (if whole genomes are available)

Ideally, markers should be spaced every 10-20 cM and span the whole genome

Genotype data: Determine allele at all markers in each F2

Phenotype data

Broman and Sen 2009

1. Missing Data Problem: use marker data to infer the intervening genotypes

2. Model Selection Problem: how do the QTL across the genome combine with the covariates to generate the phenotype?

Test which markers correlate with the phenotype

Marker regression: simple t-test (or ANOVA) at each marker

Marker 1: no QTL. Marker 2: significant QTL (population means are different).
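A minimal sketch of marker regression in a backcross, assuming genotypes are coded "AA"/"AB" and missing calls are `None`. It computes a Welch t statistic at each marker using only the Python standard library; the data, marker names, and function names are hypothetical. Converting the statistic to a p-value would need a t-distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for a difference in means between two groups."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def marker_regression(genotypes, phenotypes):
    """Test each marker separately; no genetic map is needed.

    genotypes: dict mapping marker name -> list of "AA"/"AB"/None per individual.
    Individuals with a missing call (None) at a marker are omitted for that marker.
    """
    results = {}
    for marker, calls in genotypes.items():
        aa = [y for g, y in zip(calls, phenotypes) if g == "AA"]
        ab = [y for g, y in zip(calls, phenotypes) if g == "AB"]
        results[marker] = welch_t(aa, ab)
    return results

# Hypothetical data: percent viability for 8 backcross individuals.
phenotypes = [80, 78, 83, 79, 22, 25, 19, 21]
genotypes = {
    "marker1": ["AA", "AB", "AA", "AB", "AA", "AB", "AA", "AB"],  # unrelated to phenotype
    "marker2": ["AA", "AA", "AA", "AA", "AB", "AB", "AB", "AB"],  # tracks phenotype
}
t_stats = marker_regression(genotypes, phenotypes)
for marker, t in t_stats.items():
    print(f"{marker}: t = {t:.2f}")
```

In this toy data set marker1 gives a t statistic near zero, while marker2 gives a large one because the two genotype classes have very different means.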

Marker regression

Advantages:

• Simple test: standard t-test/ANOVA

• Covariates (e.g. gender, environment) are easy to incorporate

• No genetic map necessary, since the test is done separately on each marker

Disadvantages:

• Any individuals with missing marker data must be omitted from analysis

• Does not effectively consider positions between markers

• Does not test for genetic interactions (e.g. epistasis)

• The apparent effect size of the QTL (and hence the power to detect it) is reduced by incomplete linkage to the marker

• Difficult to pinpoint QTL position, since only the marker positions are considered

Interval mapping

• In addition to examining phenotype-genotype associations at markers, look for associations at positions between markers by inferring the genotype there

• The methods for calculating genotype probabilities between markers typically use hidden Markov models to account for additional factors, such as genotyping errors

• Lander and Botstein 1989
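A simplified sketch of inferring the genotype between markers, assuming a backcross, no crossover interference (Haldane map function), no genotyping errors, and only the two flanking markers. A full hidden Markov model would use all markers and allow for errors; the function names here are hypothetical.

```python
from math import exp

def haldane_r(distance_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - exp(-2.0 * distance_cM / 100.0))

def prob_qtl_AA(left, right, d_left_cM, d_right_cM):
    """Pr(genotype at position Q is AA | flanking marker genotypes) in a backcross.

    left, right: genotype ("AA" or "AB") at the left and right flanking markers.
    d_left_cM, d_right_cM: distance from Q to the left and right marker.
    """
    r_left = haldane_r(d_left_cM)
    r_right = haldane_r(d_right_cM)

    def transition(g_from, g_to, r):
        # Genotype changes between two positions only if a recombination occurred.
        return 1.0 - r if g_from == g_to else r

    weight_AA = transition(left, "AA", r_left) * transition("AA", right, r_right)
    weight_AB = transition(left, "AB", r_left) * transition("AB", right, r_right)
    return weight_AA / (weight_AA + weight_AB)

# Q sits midway between two markers that are 10 cM apart.
print(prob_qtl_AA("AA", "AA", 5, 5))  # close to 1: both flanking markers are AA
print(prob_qtl_AA("AA", "AB", 5, 5))  # 0.5: a recombinant interval is uninformative at its midpoint
print(prob_qtl_AA("AB", "AB", 5, 5))  # close to 0: both flanking markers are AB
```

These probabilities are what allow every individual to contribute at every position, even where no marker was typed.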

Interval mapping

Broman and Sen 2009

Interval mapping

Advantages:

• Takes account of missing genotype information, so all individuals are included

• Can scan for QTL at locations in between markers

• QTL effects are better estimated

Disadvantages:

• More computation time required

• Still only a single-QTL model – cannot separate linked QTL or examine for interactions among QTL

LOD scores

• Measure of the strength of evidence for the presence of a QTL at each marker location

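LOD(λ) = log10 likelihood ratio comparing the hypothesis of a QTL at position λ versus that of no QTL:

$$\mathrm{LOD}(\lambda) = \log_{10}\left\{ \frac{\Pr(y \mid \text{QTL at } \lambda,\ \mu_{AA\lambda},\ \mu_{AB\lambda},\ \sigma_{\lambda})}{\Pr(y \mid \text{no QTL},\ \mu,\ \sigma)} \right\}$$

A LOD of 3 means that the TOP model (QTL at λ) is 10^3 times more likely than the BOTTOM model (no QTL)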
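A sketch of the LOD calculation at a fully typed marker, assuming normally distributed residuals with a common variance. Under that assumption the log10 likelihood ratio reduces to (n/2) log10(RSS0/RSS1), where RSS0 is the residual sum of squares around the overall mean (no QTL) and RSS1 is the residual sum of squares around the genotype-class means (QTL). The data and names are hypothetical.

```python
from math import log10
from statistics import mean

def rss(values):
    """Residual sum of squares around the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def lod_at_marker(calls, phenotypes):
    """LOD score at one marker under a normal model with common variance.

    Individuals with a missing call (None) are dropped.
    Assumes the within-genotype residual sum of squares is not zero.
    """
    groups = {}
    for genotype, y in zip(calls, phenotypes):
        if genotype is not None:
            groups.setdefault(genotype, []).append(y)
    observed = [y for ys in groups.values() for y in ys]
    rss_no_qtl = rss(observed)                        # one mean for everyone
    rss_qtl = sum(rss(ys) for ys in groups.values())  # one mean per genotype class
    return (len(observed) / 2.0) * log10(rss_no_qtl / rss_qtl)

# Hypothetical data: percent viability for 8 backcross individuals.
phenotypes = [80, 78, 83, 79, 22, 25, 19, 21]
genotypes = {
    "marker1": ["AA", "AB", "AA", "AB", "AA", "AB", "AA", "AB"],
    "marker2": ["AA", "AA", "AA", "AA", "AB", "AB", "AB", "AB"],
}
lods = {marker: lod_at_marker(calls, phenotypes) for marker, calls in genotypes.items()}
for marker, lod in lods.items():
    print(f"{marker}: LOD = {lod:.2f}")
```

Interval mapping computes the same kind of ratio at positions between markers, replacing the known genotypes with genotype probabilities.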

LOD curves

How do you know which peaks are really significant?

LOD threshold

Broman and Sen 2009

• Consider the null hypothesis that there are no QTLs genome-wide

(Figure panels: null LOD distribution at one location vs. genome-wide)

1. Randomize the phenotype labels relative to the genotypes

2. Conduct interval mapping and determine the maximum LOD score genome-wide

3. Repeat a large number of times (1,000-10,000) to generate a null distribution of maximum LOD scores
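The permutation procedure can be sketched as follows. To stay short, this example uses a marker-by-marker LOD (normal model) in place of full interval mapping, and simulated, unlinked markers; the function names, sample sizes, effect size, and seeds are all assumptions for illustration.

```python
import random
from math import log10
from statistics import mean

def rss(values):
    """Residual sum of squares around the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def lod_at_marker(calls, phenotypes):
    """LOD at one fully typed marker under a normal model with common variance."""
    groups = {}
    for genotype, y in zip(calls, phenotypes):
        groups.setdefault(genotype, []).append(y)
    rss_qtl = sum(rss(ys) for ys in groups.values())
    return (len(phenotypes) / 2.0) * log10(rss(phenotypes) / rss_qtl)

def max_lod(genotypes, phenotypes):
    """Maximum LOD over all markers (the genome-wide maximum)."""
    return max(lod_at_marker(calls, phenotypes) for calls in genotypes)

def permutation_threshold(genotypes, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """LOD threshold from the null distribution of genome-wide maximum LOD scores."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                              # step 1: randomize phenotypes
        null_max.append(max_lod(genotypes, shuffled))      # step 2: genome-wide maximum
    null_max.sort()                                        # step 3: null distribution
    return null_max[min(n_perm - 1, int((1.0 - alpha) * n_perm))]

# Simulated backcross: 100 individuals, 5 unlinked markers, QTL at the third marker.
rng = random.Random(1)
n_individuals, n_markers = 100, 5
genotypes = [[rng.choice(("AA", "AB")) for _ in range(n_individuals)]
             for _ in range(n_markers)]
phenotypes = [rng.gauss(0.0, 1.0) + (2.0 if g == "AB" else 0.0) for g in genotypes[2]]

observed = max_lod(genotypes, phenotypes)
threshold = permutation_threshold(genotypes, phenotypes, n_perm=200, alpha=0.05)
print(f"observed max LOD = {observed:.2f}, 5% genome-wide threshold = {threshold:.2f}")
```

Because only the phenotypes are shuffled, the correlation structure among markers is preserved in every permutation, which is what makes the threshold genome-wide.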

LOD threshold

• 1000 permutations

10% ‘Genome-wide Error Rate’ (GWER) = LOD 3.19 (means that, if there were no QTL, some peak would exceed this LOD cutoff by random chance in 10% of experiments)

5% GWER = LOD 3.52

• The boundary of the peak is often taken as the points where the LOD curve crosses (Max LOD - 1.5) (or Max LOD - 1.8 for an intercross)

• Often these regions are very large and encompass many (hundreds of) genes
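A sketch of the "Max LOD minus 1.5" rule for peak boundaries, assuming a single chromosome with LOD scores evaluated at a grid of positions. The function name and the example LOD curve are hypothetical. This simple version returns the outermost contiguous grid positions that stay at or above the cutoff; implementations often extend the interval outward to the next position for a more conservative boundary.

```python
def lod_support_interval(positions_cM, lods, drop=1.5):
    """Return (left, peak, right) positions of the LOD support interval.

    drop: how far below the maximum LOD the boundary is placed
          (1.5 is often used for a backcross, 1.8 for an intercross).
    """
    peak_index = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak_index] - drop

    left = peak_index
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak_index
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions_cM[left], positions_cM[peak_index], positions_cM[right]

# Hypothetical LOD curve on one chromosome, evaluated every 10 cM.
positions = [0, 10, 20, 30, 40, 50, 60]
lods = [0.5, 1.2, 3.1, 4.8, 4.0, 2.1, 0.7]

interval_backcross = lod_support_interval(positions, lods, drop=1.5)
interval_intercross = lod_support_interval(positions, lods, drop=1.8)
print(interval_backcross)
print(interval_intercross)
```

With markers every 10 cM, even a sharp peak maps to an interval of tens of cM, which is why these regions usually contain many genes.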

Lessons from QTL mapping studies about Genetic Architecture

* Often have a few big-effect QTL and many small modifier QTL with small effects on the phenotype

Need lots of power (good phenotypic measurements and many individuals) to detect QTLs with small effects

* Recombination in F2s can reveal negative effects segregating in the parents, e.g. can find a resistant-parent allele associated with sensitivity

Mackay review: often have loci with complementary effects found nearby

* Effects of an allele can be context dependent

Environment-specific effects: Gene x Environment (GxE) interactions

Genomic context: epistatic (i.e. gene-gene) interactions are likely very common … but difficult to detect

An alternative approach: Genome Wide Association Studies (GWAS)

Here the phenotypes and genotypes come from many different individuals from a population

Identify SNPs that are significantly associated with the trait across a bunch of individuals

An alternative approach: Genome Wide Association Studies (GWAS) across many individuals

(Figure: genotypes for 65 strains and phenotypes for 65 strains; axis: Strains)

Typically use a mixed linear model to test for significance:

y = μ + a + other stuff + Error

y: phenotypic variance

μ: phenotypic mean

a: additive genetic effects across all involved genes

other stuff: population structure, phylogenetic relatedness

Error: random error
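One common way to write such a model explicitly, given here as a sketch rather than the exact model on the slide: the symbols x, β, u, K, σ_g and σ_e are assumptions introduced for illustration. Here x is the genotype at the SNP being tested, β is its additive effect, and u is a random effect whose covariance follows a kinship (relatedness) matrix K, which is how population structure and relatedness enter the model.

```latex
y = \mu + x\beta + u + \varepsilon,
\qquad u \sim N\!\left(0,\ \sigma_g^{2} K\right),
\qquad \varepsilon \sim N\!\left(0,\ \sigma_e^{2} I\right)
```

A SNP is called significant when the estimate of β differs from zero after the relatedness term u has absorbed the shared background.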

(Figure: phenotype plotted against genotype, AA vs. TT)

Identify SNPs that are significantly associated with the trait

A very important control for both types of mapping: controlling for covariates

Sometimes a SNP can appear correlated with phenotypic variation … but it can be due to some other feature that co-varies with the SNP and the phenotype

The clearest example: population structure

Other examples:

- gender of the individuals

- shared environments for subgroups

- an example from our yeast studies: ploidy differences when some F2s are haploid and some are diploid
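A toy illustration of confounding by population structure. The numbers are made up: the SNP has no effect within either population, but allele frequency and mean phenotype both differ between populations, so a pooled comparison shows a spurious effect. The population labels and the function name `genotype_effect` are hypothetical.

```python
from statistics import mean

def genotype_effect(records):
    """Difference in mean phenotype between TT and AA individuals."""
    aa = [r["phenotype"] for r in records if r["genotype"] == "AA"]
    tt = [r["phenotype"] for r in records if r["genotype"] == "TT"]
    return mean(tt) - mean(aa)

def make_records(population, genotype, phenotypes):
    return [{"population": population, "genotype": genotype, "phenotype": p}
            for p in phenotypes]

# Made-up data: "oak" strains are mostly AA with low phenotype values,
# "vineyard" strains are mostly TT with high phenotype values.
records = (
    make_records("oak", "AA", [10, 11, 9, 10])
    + make_records("oak", "TT", [10])
    + make_records("vineyard", "AA", [20])
    + make_records("vineyard", "TT", [20, 21, 19, 20])
)

pooled = genotype_effect(records)
within = {
    population: genotype_effect([r for r in records if r["population"] == population])
    for population in ("oak", "vineyard")
}
print("pooled TT - AA difference:", pooled)
print("within-population differences:", within)
```

Including population as a covariate (or a relatedness term in a mixed model) amounts to asking the within-population question rather than the pooled one.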

Example: S. cerevisiae strains (Liti et al. 2009)

(Figure: phenotype plotted against genotype, AA vs. TT; groups labeled oak strains and vineyard strains)

Mixed linear model identifies SNPs with a significant p-value. Often plot the -log10(p) across the genome (Manhattan plot)

Again, the p-value cutoff comes from permutations (randomize the strain-phenotype labels and perform mapping on randomized data 10,000 times)

How to find the causative SNP/polymorphism in giant regions?

Often very challenging to find which SNP(s) or polymorphisms (copy-number differences, rearrangements, etc.) are causal

Some strategies people use:

- Look at what’s known about the genes in the peak

CAUTION: very easy to get led by what ‘seems likely’

- Look at signatures of selection within the population, e.g. differences in FST

- Look for derived alleles

- Look for coding changes, or genes in the region with severe expression differences

- Combine with other data, e.g. other mapping studies (QTL + GWAS), genomic datasets
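For the FST strategy, a minimal sketch of Wright's FST at a single biallelic SNP, assuming two equally sized subpopulations and known allele frequencies. Real analyses use sample-size-corrected estimators and compare FST across many SNPs; the function name and the example frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Wright's F_ST for one biallelic SNP in two equally sized populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    F_ST = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the
    pooled population and H_S is the mean expected heterozygosity within populations.
    """
    p_mean = (p1 + p2) / 2.0
    h_total = 2.0 * p_mean * (1.0 - p_mean)
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # SNP is fixed for the same allele everywhere; F_ST is undefined, 0 by convention here
    return (h_total - h_within) / h_total

print(fst_two_populations(0.5, 0.5))  # identical frequencies: no differentiation
print(fst_two_populations(0.1, 0.9))  # strongly differentiated SNP
print(fst_two_populations(0.0, 1.0))  # fixed for different alleles
```

A SNP in the peak whose FST stands out from the rest of the genome is a candidate for having been under selection, but it is still only a lead to be tested.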
