1
Probability models for short DNA sequences
BioInfoSummer@ANU, December 2 2003
Terry Speed & Xiaoyue Zhao
University of California at Berkeley & WEHI Melbourne
2
Synopsis
Some biology
Background & previous work
Models and modeling
Results and discussion
Future work
Acknowledgements
3
4
A look at gene structure
5
Beginning of the splicing process
splice donor splice acceptor
6
Protein-DNA interactions
usually involve some degree of sequence specificity
7
Examples of 5’ splice (donor) sites
exon  intron
TCG   GTGAGT
TGG   GTGTGT
CCG   GTCCGT
ATG   GTAAGA
TCT   GTAAGT
CAG   GTAGGA
CAG   GTAGGG
AAG   GTAAGG
AGG   GTATGG
TGG   GTAAGG
GAG   GTTAGT
CAT   GTGAGT
8
Probability models for short DNA motifs
Short: ~6-20 base pairs
DNA motifs: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...
Why probability models?
• to characterize the motifs
• to help identify them
• for incorporation into larger models
9
Our aim and some notation
$L$: length of the DNA sequence motif.
$X_i$: discrete random variable at position $i$, taking values in the alphabet $\mathcal{X} = \{A, C, G, T\}$.
Given a number of instances of a DNA sequence motif $x = (x_1, \ldots, x_L)$ of length $L$, we want a model for the probability $P(x)$ of $x$.
We denote by $x_i^j$ ($i < j$) the sequence $(x_j, x_{j-1}, \ldots, x_i)$ in reverse time order.
10
Weight matrix models, Staden (1984)
Base   -3   -2   -1    0   +1   +2   +3   +4   +5
A      33   61   10    0    0   53   71    7   16
C      37   13    3    0    0    3    8    6   16
G      18   12   80  100    0   42   12   81   22
T      12   14    7    0  100    2    9    6   46
A weight matrix for donor sites. Entries are percentages.
Essentially a mutual independence model.
An improvement over the consensus CAGGTAAGT.
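A minimal sketch of how the weight matrix above assigns a probability to a 9-mer under mutual independence. The function names are illustrative and not from the talk; the numbers are the percentages in the table.

```python
# Donor-site weight matrix from the table (percentages, positions -3..+5).
POSITIONS = [-3, -2, -1, 0, 1, 2, 3, 4, 5]
WEIGHTS = {
    "A": [33, 61, 10, 0, 0, 53, 71, 7, 16],
    "C": [37, 13, 3, 0, 0, 3, 8, 6, 16],
    "G": [18, 12, 80, 100, 0, 42, 12, 81, 22],
    "T": [12, 14, 7, 0, 100, 2, 9, 6, 46],
}


def wmm_probability(seq):
    """P(x) under mutual independence: product of per-position frequencies."""
    if len(seq) != len(POSITIONS):
        raise ValueError("expected a sequence of length 9")
    p = 1.0
    for i, base in enumerate(seq):
        p *= WEIGHTS[base][i] / 100.0
    return p


def consensus():
    """Most frequent base at each position."""
    return "".join(
        max("ACGT", key=lambda b: WEIGHTS[b][i]) for i in range(len(POSITIONS))
    )


print(consensus(), wmm_probability(consensus()))
```

Unlike the consensus, the matrix grades every 9-mer: any sequence without GT at positions 0 and +1 gets probability zero, and near-consensus sequences get intermediate values.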
11
Beyond independence
Weight array matrices (WAM), Zhang & Marr (1993), consider dependencies between adjacent positions, i.e. (non-stationary) first-order Markov models. Extending this to full higher-order Markov models makes the number of parameters grow exponentially with the order.
Variable length Markov models, Rissanen (1986), Bühlmann & Wyner (1999), help us get around this problem. In the last few years many variants have appeared; all make use of trees.
[The interpolated Markov models of Salzberg et al (1998) address the same problem.]
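To make the parameter growth concrete, here is a small sketch counting free parameters of a non-stationary, full order-k Markov model for a motif of length L. It assumes that position l conditions on min(k, l-1) predecessors and that each context contributes 3 free probabilities; the function name is illustrative.

```python
def free_parameters(L, k, alphabet_size=4):
    """Free parameters of a non-stationary full order-k Markov model."""
    total = 0
    for l in range(1, L + 1):
        order = min(k, l - 1)  # early positions have fewer predecessors
        total += (alphabet_size - 1) * alphabet_size ** order
    return total


for k in range(4):
    print(k, free_parameters(9, k))
```

For L = 9 the count roughly quadruples with each extra order, which is what motivates pruning contexts.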
12
Variable Length Markov Models
Factorize P(x) in the usual telescopic way:
$$P(x) = P(X_1 = x_1) \prod_{l=2}^{L} P(X_l = x_l \mid X_1^{l-1} = x_1^{l-1}),$$
then simplify this using context functions $c_l$, $l = 2, \ldots, L$, to
$$P(X_1 = x_1) \prod_{l=2}^{L} P(X_l = x_l \mid c_l(X_1^{l-1}) = c_l(x_1^{l-1})),$$
where $c_l : x_1^{l-1} \mapsto x_{l-m}^{l-1}$ is suitably defined on $(l-1)$-tuples.
13
VLMM, cont.
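Here $c_l : \mathcal{X}^{l-1} \to \bigcup_{i=0}^{l-1} \mathcal{X}^{i}$, where $\mathcal{X} = \{A, C, G, T\}$, and $m = m_l$ is given by
$$m_l(x_1^{l-1}) = \min\{\, r : P(X_l = x \mid X_1^{l-1} = x_1^{l-1}) = P(X_l = x \mid X_{l-r}^{l-1} = x_{l-r}^{l-1}) \ \text{for all } x \,\}.$$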
The function cl defines the sequence-specific context, and ml defines the sequence-specific memory or order of the Markov property for position l.
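A minimal sketch of a context function and the resulting VLMM probability. It assumes each position's tree is stored as a dictionary from context strings (oldest base first) to a distribution over the next base; the trees and all probabilities below are made up for illustration.

```python
def context(history, leaves):
    """Return the leaf of the context tree that is a suffix of `history`.

    Contexts are written oldest base first, so the most recent base is last.
    """
    for m in range(len(history) + 1):
        suffix = history[len(history) - m:]
        if suffix in leaves:
            return suffix
    raise KeyError(f"no context matches history {history!r}")


def vlmm_probability(seq, initial, trees):
    """P(x_1) times the product over l of P(x_l | c_l(x_1 .. x_{l-1})).

    `initial` maps a base to P(X_1 = base); `trees[l]` maps each context for
    the base at 0-based index l to a distribution over that base.
    """
    p = initial[seq[0]]
    for l in range(1, len(seq)):
        tree = trees[l]
        p *= tree[context(seq[:l], tree)][seq[l]]
    return p


BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}

# Hypothetical trees for a motif of length 3 (numbers are made up).
# X2 is first order; X3 forgets X1 whenever X2 = C.
tree_x2 = {b: dict(UNIFORM) for b in BASES}
tree_x3 = {a + b: dict(UNIFORM) for a in BASES for b in "AGT"}
tree_x3["C"] = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
trees = {1: tree_x2, 2: tree_x3}

print(context("GC", tree_x3), context("GA", tree_x3))
print(vlmm_probability("GCA", UNIFORM, trees))
```

The length of the returned context plays the role of the memory m_l: it is 1 after a C and 2 otherwise in this toy tree.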
14
VLMM: an illustrative example
A full set of 16 contexts of order 2.
Pruned set of 12 contexts: P(X3|X2=C,X1=G) = P(X3|X2=C), etc.
15
VLMM cont.
A VLMM for a DNA sequence motif of length L is specified by
• a distribution for X1, and,
• for l = 2,…,L, a constrained distribution for Xl given Xl-1,…,X1.
That is, we need L-1 context functions, or trees.
But, there is a difficulty here.
16
Sequence dependencies (interactions) are not always local
The methods outlined so far all fail to incorporate long-range (≥ 4 bp) interactions. New model types are needed.
3-dimensional folding; DNA, RNA & protein interactions
17
Modeling “long-range” dependency
The principal work in this area is Burge & Karlin’s (1997) maximal dependence decomposition (MDD).
More recently, Cai et al (2000) and Barash et al (2003) used Bayes networks (BN).
Ellrott et al (2002) optimized the sequence order in which a stationary Markov chain models the motif.
We have adapted this last idea to give permuted variable length Markov models (PVLMM).
Potamianos & Jelinek (1998) have related work on decision trees.
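A small sketch of the permutation step only: positions are reordered before a (variable length) Markov model is applied, so that bases far apart in the motif can become neighbours in the model. The order used is the one quoted on the slides for the splice donor model; the function name is illustrative, and the conserved positions 0 and +1 are left out of the permuted sequence.

```python
POSITIONS = [-3, -2, -1, 0, 1, 2, 3, 4, 5]


def permute(seq, order):
    """Reorder the bases of `seq` so they appear in the given position order."""
    index = {pos: i for i, pos in enumerate(POSITIONS)}
    return "".join(seq[index[pos]] for pos in order)


# Sequence order quoted for the splice donor model: +2 +5 -1 +4 -2 +3 -3
DONOR_ORDER = [2, 5, -1, 4, -2, 3, -3]
print(permute("CAGGTAAGT", DONOR_ORDER))
```

A Markov model fitted to the permuted sequences then sees, for example, position +5 directly after position +2.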
18
Part of a context (decision) tree for position -2 of a splice donor PVLMM
Sequence order: +2(A/G) +5(T) -1(G) +4(G) -2(A) +3(A) -3(A)
Node #s: counts; edge #s: split variables.
19
Maximal Dependence Decomposition
MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used.
The details can vary according to the splitting criterion (Burge & Karlin used χ²), the actual splits (binary, etc.), and the stopping rule.
However, the result is always a single tree.
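A sketch of one way to pick the "most dependent position" for a split. It assumes a consensus-indicator criterion: for each position i, the χ² statistics between "consensus or not at i" and the base at every other position j are summed, and the position with the largest sum is chosen. The slides leave the criterion as "suitably defined", so this is an illustration and not the authors' exact procedure.

```python
from collections import Counter


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            expected = rt * ct / n
            if expected > 0:  # skip empty rows or columns
                stat += (obs - expected) ** 2 / expected
    return stat


def most_dependent_position(seqs):
    """Return (index of best split position, list of summed chi-square scores)."""
    length = len(seqs[0])
    scores = []
    for i in range(length):
        consensus = Counter(s[i] for s in seqs).most_common(1)[0][0]
        score = 0.0
        for j in range(length):
            if j == i:
                continue
            # 2 x 4 table: consensus / non-consensus at i versus base at j
            table = [[0] * 4 for _ in range(2)]
            for s in seqs:
                row = 0 if s[i] == consensus else 1
                table[row]["ACGT".index(s[j])] += 1
            score += chi_square(table)
        scores.append(score)
    best = max(range(length), key=lambda i: scores[i])
    return best, scores


print(most_dependent_position(["AAA", "ACA", "CAC", "CCC"]))
```

After the split, the same procedure would be applied recursively to each subset until a stopping rule is met.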
20
Parts of MDD trees for splice donors
In each case, splits are into the most frequent nt vs the others.
21
GTCTAAAGCGT
aattAAAGTAA
GACGAAAGCAA
aattAAAGTGC
GTCTAAAGCga
GCGAAAAGCGA
GCGTAAAGCAG
TAGAAAAGGCG
aattAAAGTAC
CACAAAAGCCC
GCCCAAAgatc
tGACAAAGCGT
GCGGAAAgatc
aattAAAGCAA
CAAAAAAGGCG
tAAAAAAGGCT
CAGCAAAGACg
GGAAAAAGCAA
AGCAAAAGTGC
GCAGAAAGTCA
20 instances of P$DOF3_01
22
Modelling P$DOF3_01
[Figure: Sn = Sp for P$DOF3_01 by model: WMM .15, PVLMM(D) .55, MDD .60]
23
Issues in modeling short DNA motifs
In any study of this kind, essential items are:
• the model class (e.g. WMM, PVLMM, or MDD)
• the way we search through the model class (e.g. by forward selection or MCMC)
• the way we compare models when searching (e.g. by χ², AIC, BIC, NML), and finally,
• the way we assess the final model in relation to our aims (e.g. by cross-validation).
We always need interesting, high-quality datasets.
24
Splice site dataset
Human splice donor sequences from SpliceDB, Burset et al (2001):
• 15,155 canonical donor sites of length 9, with GT conserved at positions 0 and 1
• 47,495 false donor sites from the set of all sequences which lie within 40 bp on both sides of the characteristic donor dinucleotide GT.
25
Model classes and model search
For illustrative purposes, we will compare VLMM, PVLMM (direct and decision), and MDD for splice donor recognition.
Here we search using a simple procedure: recursively choosing the best extension of a current model, or forward selection. Our slower alternative moves through the models using reversible jump Markov chain Monte Carlo (RJMCMC).
26
Model comparison
We fit the models using maximum likelihood, and compare fitted models using both AIC and BIC, standard penalties for model complexity.
Better than either of these two is approximate normalized maximum likelihood (NML), Barron et al (1998). We use mixture models for the data with Jeffreys’ (Dirichlet) priors.
27
Model assessment:Stand-alone splice motif recognition
M: motif model B: background model
Given a sequence x = (x1, …, xL), we predict x to be a motif (here splice donor) if
log {P(x | M) / P(x | B)} > c,
for a suitably chosen threshold value c.
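A sketch of the decision rule above. The motif and background models here are toy stand-ins (a two-position independence model and a uniform background), chosen only so the example runs; any of the models in this talk could be plugged in as `p_motif`.

```python
import math


def log_odds(seq, p_motif, p_background):
    """log { P(x | M) / P(x | B) }, with -inf when the motif model gives 0."""
    pm, pb = p_motif(seq), p_background(seq)
    if pb <= 0.0:
        raise ValueError("background model assigns zero probability")
    if pm <= 0.0:
        return float("-inf")
    return math.log(pm / pb)


def is_motif(seq, p_motif, p_background, c):
    """Predict a motif when the log-odds score exceeds the threshold c."""
    return log_odds(seq, p_motif, p_background) > c


def uniform_background(seq):
    """Toy background: independent, equiprobable bases (an assumption)."""
    return 0.25 ** len(seq)


def toy_motif(seq):
    """Toy motif model with made-up frequencies favouring G then T."""
    freqs = [
        {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
        {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    ]
    p = 1.0
    for f, base in zip(freqs, seq):
        p *= f[base]
    return p


print(is_motif("GT", toy_motif, uniform_background, c=0.0))
print(is_motif("AC", toy_motif, uniform_background, c=0.0))
```

Raising c trades sensitivity for specificity, which is how a curve of one against the other is traced out.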
28
Model assessment: terms
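TP: true positives, FP: false positives
TN: true negatives, FN: false negatives
Sensitivity (sn) and specificity (sp) are given by
sn = TP / [TP + FN], sp = TP / [TP + FP].
Note that sp as defined here is the proportion of predictions that are correct (the positive predictive value), as is conventional in gene finding.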
A 5-fold cross-validation is used in assessing performance.
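A minimal sketch of the two measures and of a k-fold split. The interleaved fold assignment is an assumption made for brevity; in practice the data would normally be shuffled first.

```python
def sensitivity(tp, fn):
    """sn = TP / (TP + FN): fraction of true sites that are found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """sp = TP / (TP + FP): fraction of predicted sites that are true."""
    return tp / (tp + fp)


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k interleaved folds of n items."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        yield train, held_out


for train, test in k_fold_indices(10, 5):
    print(len(train), test)
```

Each model is fitted on the training indices and sn and sp are computed on the held-out fold, then averaged over folds.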
29
Comparison of PVLMM decision tree (NML, ord = 5), MDD (chi-square), WAM and WMM.
[Figure: Sp vs Sn]
Optimal permutation: +2 +5 -1 +4 -2 +3 -3
30
Model assessment: Integrated recognition
For this assessment, we integrate the splice donor models into SLAM, Pachter et al (2002), a eukaryotic cross-species gene finder.
The training data consists of 3,735 aligned human and mouse gene sequences. The resulting SLAM model is then tested on the Rosetta set of 117 single human gene sequences.
31
Results at the nucleotide level
              VLMM      MDD       PVLMM
Sensitivity   94.5      94.2      94.3
Specificity   99.7      99.7      99.7

Results at the exon level
              VLMM(D)   MDD       PVLMM(D)
Correct       362/470   365/465   371/465
Partial       92/470    83/465    79/465
Wrong         18/470    17/465    16/465
Missing       23/465    23/465    23/465
Sensitivity   362/465   365/465   371/465
Specificity   360/470   365/465   371/465
32
Interpretation of PVLMM model selected
We use sequence logos to provide simple interpretations of our selected PVLMM, including the optimal permutation.
33
Beginning of the splicing process
splice donor splice acceptor
34
Base       -3   -2   -1    0   +1   +2   +3   +4   +5
A          33   61   10    0    0   53   71    7   16
C          37   13    3    0    0    3    8    6   16
G          18   12   80  100    0   42   12   81   22
T          12   14    7    0  100    2    9    6   46
U1 snRNA    G    U    C    C    A    U    U    C    A
Sequence logo for human splice donor sites
35
U1 snRNA G U C C A U U C A
Optimal permutation: +2 +3 -1 +4 -2 +5 -3
“Long-range” dependence in the chosen model
36
The context tree for +5
+5
37
Transcription factor binding sites
These are of great interest, but their signals are very weak and we typically have only a few instances.
We have studied 43 TFBS with effective length ≤ 9 and ≥ 20 instances.
In 17/43 cases we are able to improve upon WMM, the current standard; in 26/43, we cannot.
38
Transcription factor binding site (TFBS) recognition
We extracted all known TFBS from the TRANSFAC database with a) length ≤ 9, and b) ≥ 20 known sites. In all, this gave 1,419 sites corresponding to 43 TFs.
Next we randomly inserted each site into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes.
Finally, we used the PVLMM, MDD and WMM to scan these sequences within a 10-fold cross-validation framework, selecting a number of top-scoring sequences as putative binding sites. We always made this number equal to the true number of sites in the sequences, so that FP = FN and hence sn = sp.
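A sketch of the simulation step described above: draw a background of length 1,000 from a 3rd-order Markov chain and place a site in it at a random position. The transition table here is uniform and purely hypothetical (the talk estimated it from human upstream sequence), the first three bases are drawn uniformly instead of from the stationary distribution, and the site overwrites a window so that the length stays 1,000; all three are assumptions.

```python
import itertools
import random

ALPHABET = "ACGT"


def simulate_background(length, transitions, rng, order=3):
    """Draw a sequence from an order-`order` Markov chain."""
    seq = [rng.choice(ALPHABET) for _ in range(order)]  # uniform start
    while len(seq) < length:
        ctx = "".join(seq[-order:])
        probs = transitions[ctx]
        weights = [probs[b] for b in ALPHABET]
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq[:length])


def insert_site(background, site, rng):
    """Overwrite a random window with `site`; return (sequence, start index)."""
    start = rng.randrange(len(background) - len(site) + 1)
    return background[:start] + site + background[start + len(site):], start


# Hypothetical transition table: every 3-base context gives uniform next base.
transitions = {
    "".join(ctx): {b: 0.25 for b in ALPHABET}
    for ctx in itertools.product(ALPHABET, repeat=3)
}

rng = random.Random(0)
background = simulate_background(1000, transitions, rng)
sequence, start = insert_site(background, "GTCTAAAGCGT", rng)
print(start, sequence[start:start + 11])
```

Scanning then means scoring every window of the site's length in `sequence` and keeping the top-scoring ones.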
39
Three transcription factor binding sites
40
TFBS: three results
         P$DOF3.01   V$CIZ.01   P$EMBP1.Q2
WMM      .15         .46        .77
PVLMM    .55         .64        .91
MDD      .60         .62        .77
Entry: sensitivity/specificity
Of the 43 TF, our dependence methods led to no improvement in 27 cases.
41
TFBS: results for 16/43
42
Some future work
• More fully elucidate the relationships between all the different tree-based model classes
• Comprehensively test alternative strategies for searching and comparing models
• Find the best combination for splice donors and several other motif detection problems
• Joint modelling of human and mouse sites
• Joint modelling of multiple motifs in one species
43
Acknowledgements
Xiaoyue Zhao, UCB
Sourav Chatterji, UCB
The SLAM team:
Simon Cawley, Affymetrix
Lior Pachter, UCB
Marina Alexandersson, FCC
44
References
R Durbin, S Eddy, A Krogh and G Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.
P Baldi and S Brunak, Bioinformatics: The machine learning approach, The MIT Press, 1998.
M Kanehisa, Post-Genome Informatics, Oxford University Press, 2000.