Upload
others
View
18
Download
0
Embed Size (px)
Citation preview
R/qtl workshop
(part 1)
Karl BromanBiostatistics and Medical InformaticsUniversity of Wisconsin – Madison
kbroman.org
github.com/kbroman
@kwbroman
Backcross
P1 P2
F1P1
BC
2
Intercross
P1 P2
F1 F1
F2
3
Phenotype data
Body weight
30 35 40 45
Heart weight
80 100 120 140 160 180 200
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
● ●
●
●
●●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
30 35 40 45
80
100
120
140
160
180
200
Body weight
Hea
rt w
eigh
t
Sugiyama et al. (2002) Physiol Genomics 10:5–124
Genetic map
80
60
40
20
0
Chromosome
Loca
tion
(cM
)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
5
Genotype data
20 40 60 80
50
100
150
Markers
Indi
vidu
als
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
6
Goals
• Identify quantitative trait loci (QTL)(and interactions among QTL)
• Interval estimates of QTL location
• Estimated QTL effects
7
Statistical structure
QTL
Markers Phenotype
Covariates
The missing data problem: Markers←→ QTL
The model selection problem: QTL, covariates −→ phenotype
8
ANOVA at marker loci
• Also known as markerregression.
• Split mice into groupsaccording to genotype at amarker.
• Do a t-test / ANOVA.
• Repeat for each marker.
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
30
35
40
45
Genotype at D15Mit184B
ody
wei
ght
CC CB BB
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Genotype at D6Mit259
CC CB BB
10
ANOVA at marker loci
Advantages• Simple.
• Easily incorporates covariates.
• Easily extended to more complexmodels.
• Doesn’t require a genetic map.
Disadvantages• Must exclude individuals with
missing genotype data.
• Imperfect information about QTLlocation.
• Suffers in low density scans.
• Only considers one QTL at a time.
11
Interval mapping
Lander & Botstein (1989)
• Assume a single QTL model.
• Each position in the genome, one at a time, is posited as the putative QTL.
• Let q = 1/0 if the (unobserved) QTL genotype is BB/AB.
(Or 2/1/0 if the QTL genotype is BB/AB/AA in an intercross.)
Assume y|q ∼ N(µq, σ)
• Given genotypes at linked markers, y ∼ mixture of normal dist’ns withmixing proportions Pr(q | marker data):
12
Genotype probabilities
A B − B A A
?
Calculate Pr(q | marker data), assuming
• No crossover interference
• No genotyping errors
Or use the hidden Markov model (HMM) technology
• To allow for genotyping errors
• To incorporate dominant markers
• (Still assume no crossover interference.)
13
The normal mixtures
M1 M2Q
7 cM 13 cM
• Two markers separated by 20 cM,with the QTL closer to the leftmarker.
• The figure at right shows thedistributions of the phenotypeconditional on the genotypes atthe two markers.
• The dashed curves correspond tothe components of the mixtures. 20 40 60 80 100
Phenotype
BB/BB
BB/AB
AB/BB
AB/AB
µµ0
µµ0
µµ0
µµ0
µµ1
µµ1
µµ1
µµ1
14
Interval mapping
Let pij = Pr(qi = j|marker data)
yi|qi ∼ N(µqi, σ2)
Pr(yi|marker data, µ0, µ1, σ) =∑
j pij f(yi;µj, σ)
where f(y;µ, σ) = exp[−(y − µ)2/(2σ2)]/√2πσ2
Log likelihood: l(µ0, µ1, σ) =∑
i log Pr(yi|marker data, µ0, µ1, σ)
Maximum likelihood estimates (MLEs) of µ0, µ1, σ:
values for which l(µ0, µ1, σ) is maximized.
15
EM algorithm
Dempster et al. (1977)
E step:
Let w(k)ij = Pr(qi = j|yi,marker data, µ̂(k−1)0 , µ̂
(k−1)1 , σ̂(k−1))
=pij f(yi;µ̂
(k−1)j ,σ̂(k−1))
∑j pij f(yi;µ̂
(k−1)j ,σ̂(k−1))
M step:
Let µ̂(k)j =
∑i yiw
(k)ij /∑
iw(k)ij
σ̂(k) =√∑
i
∑jw
(k)ij (yi − µ̂
(k)j )2/n
The algorithm:
Start with w(1)ij = pij; iterate the E & M steps until convergence.
16
LOD scores
The LOD score is a measure of the strength of evidence for the presence of a QTLat a particular location.
LOD(λ) = log10 likelihood ratio comparing the hypothesis of aQTL at position λ versus that of no QTL
= log10
{Pr(y|QTL at λ,µ̂0λ,µ̂1λ,σ̂λ)
Pr(y|no QTL,µ̂,σ̂)
}
µ̂0λ, µ̂1λ, σ̂λ are the MLEs, assuming a single QTL at position λ.
No QTL model: The phenotypes are independent and identically distributed(iid) N(µ, σ2).
18
LOD curves
0
1
2
3
4
5
6
7
Chromosome
LOD
sco
re
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Body weight
Heart weight
19
Interval mapping
Advantages• Takes proper account of missing
data.
• Allows examination of positionsbetween markers.
• Gives improved estimates of QTLeffects.
• Provides pretty graphs.
Disadvantages• Increased computation time.
• Requires specialized software.
• Difficult to generalize.
• Only considers one QTL at a time.
21
Haley-Knott regression
A quick approximation to Interval Mapping.
E(yi|qi) = µq
E(yi|Mi) = E[ E(yi|qi) |Mi] =∑
j Pr(q = j|Mi)µj
=∑
j pijµj
Regress y on pi, pretending the residual variation is normally distributed (withconstant variance).
LOD =n
2log10
(RSS0
RSS1
)
22
The normal mixtures
M1 M2Q
7 cM 13 cM
• Two markers separated by 20 cM,with the QTL closer to the leftmarker.
• The figure at right shows thedistributions of the phenotypeconditional on the genotypes atthe two markers.
• The dashed curves correspond tothe components of the mixtures. 20 40 60 80 100
Phenotype
BB/BB
BB/AB
AB/BB
AB/AB
µµ0
µµ0
µµ0
µµ0
µµ1
µµ1
µµ1
µµ1
23
Haley-Knott results
0
2
4
6
8
Chromosome
LOD
sco
re
2 7 15 16
IM
HK
24
H-K with selectivegenotyping
90
100
110
120
Genotype
phen
otyp
e
● ●●●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
AA AB
90
100
110
120
Genotype
phen
otyp
e
● ●●●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
AA AB
25
LOD thresholds
Large LOD scores indicate evidence for the presence of a QTL
Question: How large is large?
LOD threshold = 95 %ile of distr’n of max LOD, genome-wide, if there areno QTLs anywhere
Derivation: • Analytical calculations (L & B 1989)
• Simulations (L & B 1989)
• Permutation tests (Churchill & Doerge 1994)
27
Null distribution of theLOD score
• Null distribution derived bycomputer simulation of backcrosswith genome of typical size.
• Dashed curve: distribution of LODscore at any one point.
• Solid curve: distribution ofmaximum LOD score,genome-wide.
0 1 2 3 4 5
LOD score
28
Permutation test
genotypedata
markersin
divi
dual
s
phen
otyp
esLOD scores
maximumLOD score
29
Permutation results
0 1 2 3 4 5 6
Genome−wide maximum LOD score
30
LOD support intervals
0 10 20 30 40 50
0
1
2
3
4
5
6
7
Map position (cM)
LOD
sco
re
1.5
1.5−LOD support interval
34
Modelling multiple QTL
• Reduce residual variation =⇒ increased power
• Separate linked QTL
• Identify interactions among QTL
36
Epistasis in BC
0
20
40
60
80
100
Ave
. phe
noty
peAdditive
●
●
●
●
A H
QTL 1
A
H
QTL 2
0
20
40
60
80
100
Ave
. phe
noty
pe
Epistatic
●
●
●
●
A H
QTL 1
A
H
QTL 2
37
Epistasis in F2
0
20
40
60
80
100
Ave
. phe
noty
pe
Additive
●
●
●
●
●
●
●
●
●
A H B
QTL 1
A
H
B
QTL 2
0
20
40
60
80
100
Ave
. phe
noty
pe
Epistatic
●
●
●
●
● ●
●
●
●
A H B
QTL 1
A
H
B
QTL 2
38
References
• Broman KW (2001) Review of statistical methods for QTL mapping in experimentalcrosses. Lab Animal 30:44–52A review for non-statisticians.
• Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer As-sociates, Sunderland, MA, chapter 15Chapter on QTL mapping.
• Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitativetraits using RFLP linkage maps. Genetics 121:185–199The seminal paper.
• Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative traitmapping. Genetics 138:963–971LOD thresholds by permutation tests.
• Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter 11.An old but excellent general genetics textbook with a very interesting discussion ofepistasis.
39
URLs
http://kbroman.org
http://kbroman.org/pages/teaching.html
http://www.biostat.wisc.edu/~kbroman/D3/em_alg
http://www.biostat.wisc.edu/~kbroman/D3/lod_and_effect
http://www.biostat.wisc.edu/~kbroman/D3/lod_random
40