R/qtl workshop

R/qtl workshop

(part 1)

Karl BromanBiostatistics and Medical InformaticsUniversity of Wisconsin – Madison

kbroman.org

github.com/kbroman

@kwbroman

Backcross

P1 P2

F1P1

BC

2

Intercross

P1 P2

F1 F1

F2

3

Phenotype data

Body weight

30 35 40 45

Heart weight

80 100 120 140 160 180 200

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●●

● ●

●

●

●●●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

30 35 40 45

80

100

120

140

160

180

200

Body weight

Hea

rt w

eigh

t

Sugiyama et al. (2002) Physiol Genomics 10:5–124

Genetic map

80

60

40

20

0

Chromosome

Loca

tion

(cM

)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

5

Genotype data

20 40 60 80

50

100

150

Markers

Indi

vidu

als

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

6

Goals

• Identify quantitative trait loci (QTL)(and interactions among QTL)

• Interval estimates of QTL location

• Estimated QTL effects

7

Statistical structure

QTL

Markers Phenotype

Covariates

The missing data problem: Markers←→ QTL

The model selection problem: QTL, covariates −→ phenotype

8

ANOVA at marker loci

• Also known as markerregression.

• Split mice into groupsaccording to genotype at amarker.

• Do a t-test / ANOVA.

• Repeat for each marker.

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

30

35

40

45

Genotype at D15Mit184B

ody

wei

ght

CC CB BB

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

Genotype at D6Mit259

CC CB BB

10

ANOVA at marker loci

Advantages• Simple.

• Easily incorporates covariates.

• Easily extended to more complexmodels.

• Doesn’t require a genetic map.

Disadvantages• Must exclude individuals with

missing genotype data.

• Imperfect information about QTLlocation.

• Suffers in low density scans.

• Only considers one QTL at a time.

11

Interval mapping

Lander & Botstein (1989)

• Assume a single QTL model.

• Each position in the genome, one at a time, is posited as the putative QTL.

• Let q = 1/0 if the (unobserved) QTL genotype is BB/AB.

(Or 2/1/0 if the QTL genotype is BB/AB/AA in an intercross.)

Assume y|q ∼ N(µq, σ)

• Given genotypes at linked markers, y ∼ mixture of normal dist’ns withmixing proportions Pr(q | marker data):

12

Genotype probabilities

A B − B A A

?

Calculate Pr(q | marker data), assuming

• No crossover interference

• No genotyping errors

Or use the hidden Markov model (HMM) technology

• To allow for genotyping errors

• To incorporate dominant markers

• (Still assume no crossover interference.)

13

The normal mixtures

M1 M2Q

7 cM 13 cM

• Two markers separated by 20 cM,with the QTL closer to the leftmarker.

• The figure at right shows thedistributions of the phenotypeconditional on the genotypes atthe two markers.

• The dashed curves correspond tothe components of the mixtures. 20 40 60 80 100

Phenotype

BB/BB

BB/AB

AB/BB

AB/AB

µµ0

µµ0

µµ0

µµ0

µµ1

µµ1

µµ1

µµ1

14

Interval mapping

Let pij = Pr(qi = j|marker data)

yi|qi ∼ N(µqi, σ2)

Pr(yi|marker data, µ0, µ1, σ) =∑

j pij f(yi;µj, σ)

where f(y;µ, σ) = exp[−(y − µ)2/(2σ2)]/√2πσ2

Log likelihood: l(µ0, µ1, σ) =∑

i log Pr(yi|marker data, µ0, µ1, σ)

Maximum likelihood estimates (MLEs) of µ0, µ1, σ:

values for which l(µ0, µ1, σ) is maximized.

15

EM algorithm

Dempster et al. (1977)

E step:

Let w(k)ij = Pr(qi = j|yi,marker data, µ̂(k−1)0 , µ̂

(k−1)1 , σ̂(k−1))

=pij f(yi;µ̂

(k−1)j ,σ̂(k−1))

∑j pij f(yi;µ̂

(k−1)j ,σ̂(k−1))

M step:

Let µ̂(k)j =

∑i yiw

(k)ij /∑

iw(k)ij

σ̂(k) =√∑

i

∑jw

(k)ij (yi − µ̂

(k)j )2/n

The algorithm:

Start with w(1)ij = pij; iterate the E & M steps until convergence.

16

LOD scores

The LOD score is a measure of the strength of evidence for the presence of a QTLat a particular location.

LOD(λ) = log10 likelihood ratio comparing the hypothesis of aQTL at position λ versus that of no QTL

= log10

{Pr(y|QTL at λ,µ̂0λ,µ̂1λ,σ̂λ)

Pr(y|no QTL,µ̂,σ̂)

}

µ̂0λ, µ̂1λ, σ̂λ are the MLEs, assuming a single QTL at position λ.

No QTL model: The phenotypes are independent and identically distributed(iid) N(µ, σ2).

18

LOD curves

0

1

2

3

4

5

6

7

Chromosome

LOD

sco

re

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Body weight

Heart weight

19

Interval mapping

Advantages• Takes proper account of missing

data.

• Allows examination of positionsbetween markers.

• Gives improved estimates of QTLeffects.

• Provides pretty graphs.

Disadvantages• Increased computation time.

• Requires specialized software.

• Difficult to generalize.

• Only considers one QTL at a time.

21

Haley-Knott regression

A quick approximation to Interval Mapping.

E(yi|qi) = µq

E(yi|Mi) = E[ E(yi|qi) |Mi] =∑

j Pr(q = j|Mi)µj

=∑

j pijµj

Regress y on pi, pretending the residual variation is normally distributed (withconstant variance).

LOD =n

2log10

(RSS0

RSS1

)

22

The normal mixtures

M1 M2Q

7 cM 13 cM

• Two markers separated by 20 cM,with the QTL closer to the leftmarker.

• The figure at right shows thedistributions of the phenotypeconditional on the genotypes atthe two markers.

• The dashed curves correspond tothe components of the mixtures. 20 40 60 80 100

Phenotype

BB/BB

BB/AB

AB/BB

AB/AB

µµ0

µµ0

µµ0

µµ0

µµ1

µµ1

µµ1

µµ1

23

Haley-Knott results

0

2

4

6

8

Chromosome

LOD

sco

re

2 7 15 16

IM

HK

24

H-K with selectivegenotyping

90

100

110

120

Genotype

phen

otyp

e

● ●●●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●●●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

AA AB

90

100

110

120

Genotype

phen

otyp

e

● ●●●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●●●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

AA AB

25

LOD thresholds

Large LOD scores indicate evidence for the presence of a QTL

Question: How large is large?

LOD threshold = 95 %ile of distr’n of max LOD, genome-wide, if there areno QTLs anywhere

Derivation: • Analytical calculations (L & B 1989)

• Simulations (L & B 1989)

• Permutation tests (Churchill & Doerge 1994)

27

Null distribution of theLOD score

• Null distribution derived bycomputer simulation of backcrosswith genome of typical size.

• Dashed curve: distribution of LODscore at any one point.

• Solid curve: distribution ofmaximum LOD score,genome-wide.

0 1 2 3 4 5

LOD score

28

Permutation test

genotypedata

markersin

divi

dual

s

phen

otyp

esLOD scores

maximumLOD score

29

Permutation results

0 1 2 3 4 5 6

Genome−wide maximum LOD score

30

LOD support intervals

0 10 20 30 40 50

0

1

2

3

4

5

6

7

Map position (cM)

LOD

sco

re

1.5

1.5−LOD support interval

34

Modelling multiple QTL

• Reduce residual variation =⇒ increased power

• Separate linked QTL

• Identify interactions among QTL

36

Epistasis in BC

0

20

40

60

80

100

Ave

. phe

noty

peAdditive

●

●

●

●

A H

QTL 1

A

H

QTL 2

0

20

40

60

80

100

Ave

. phe

noty

pe

Epistatic

●

●

●

●

A H

QTL 1

A

H

QTL 2

37

Epistasis in F2

0

20

40

60

80

100

Ave

. phe

noty

pe

Additive

●

●

●

●

●

●

●

●

●

A H B

QTL 1

A

H

B

QTL 2

0

20

40

60

80

100

Ave

. phe

noty

pe

Epistatic

●

●

●

●

● ●

●

●

●

A H B

QTL 1

A

H

B

QTL 2

38

References

• Broman KW (2001) Review of statistical methods for QTL mapping in experimentalcrosses. Lab Animal 30:44–52A review for non-statisticians.

• Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer As-sociates, Sunderland, MA, chapter 15Chapter on QTL mapping.

• Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitativetraits using RFLP linkage maps. Genetics 121:185–199The seminal paper.

• Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative traitmapping. Genetics 138:963–971LOD thresholds by permutation tests.

• Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter 11.An old but excellent general genetics textbook with a very interesting discussion ofepistasis.

39

URLs

http://kbroman.org

http://kbroman.org/pages/teaching.html

http://www.biostat.wisc.edu/~kbroman/D3/em_alg

http://www.biostat.wisc.edu/~kbroman/D3/lod_and_effect

http://www.biostat.wisc.edu/~kbroman/D3/lod_random

40

Documents

R/qtl workshop