1 Probability models for short DNA sequences BioInfoSummer@ANU, December 2 2003 Terry Speed & Xiaoyue Zhao University of California at Berkeley & WEHI Melbourne

1

Probability models for short DNA sequences

BioInfoSummer@ANU, December 2 2003

Terry Speed & Xiaoyue Zhao

University of California at Berkeley & WEHI Melbourne

2

Synopsis

Some biology

Background & previous work

Models and modeling

Results and discussion

Future work

Acknowledgements

4

A look at gene structure

5

Beginning of the splicing process

splice donor splice acceptor

6

Protein-DNA interactions

usually involve some degree of sequence specificity

7

Examples of 5’ splice (donor) sites

exon intron
TCG GTGAGT
TGG GTGTGT
CCG GTCCGT
ATG GTAAGA
TCT GTAAGT
CAG GTAGGA
CAG GTAGGG
AAG GTAAGG
AGG GTATGG
TGG GTAAGG
GAG GTTAGT
CAT GTGAGT

8

Probability models for short DNA motifs

Short: ~6-20 base pairs

DNA motifs: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...

Why probability models?
• to characterize the motifs
• to help identify them
• for incorporation into larger models

9

Our aim and some notation

$L$: length of the DNA sequence motif.

$X_i$: discrete random variable at position $i$, taking values from the set $\mathcal{X} = \{A, C, G, T\}$.

Given a number of instances of a DNA sequence motif $x = (x_1, \ldots, x_L)$ of length $L$, we want a model for the probability $P(x)$ of $x$.

We denote by $x_i^j$ ($i < j$) the sequence $(x_j, x_{j-1}, \ldots, x_i)$ in reverse time order.

10

Weight matrix models, Staden (1984)

Base   -3   -2   -1    0   +1   +2   +3   +4   +5
A      33   61   10    0    0   53   71    7   16
C      37   13    3    0    0    3    8    6   16
G      18   12   80  100    0   42   12   81   22
T      12   14    7    0  100    2    9    6   46

A weight matrix for donor sites. Entries are percentages.

Essentially a mutual independence model.

An improvement over the consensus CAGGTAAGT.
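As a minimal sketch of what "mutual independence" means in practice, the following Python snippet scores a 9-mer as the product of the per-position frequencies in the donor-site matrix above, and recovers the consensus as the most frequent base at each position. The function names are illustrative, not taken from the talk, and no smoothing of zero entries is attempted.

```python
POSITIONS = [-3, -2, -1, 0, 1, 2, 3, 4, 5]

# Percentages copied from the donor-site weight matrix above.
WMM_PERCENT = {
    "A": [33, 61, 10, 0, 0, 53, 71, 7, 16],
    "C": [37, 13, 3, 0, 0, 3, 8, 6, 16],
    "G": [18, 12, 80, 100, 0, 42, 12, 81, 22],
    "T": [12, 14, 7, 0, 100, 2, 9, 6, 46],
}


def wmm_probability(seq):
    """P(seq) under mutual independence: product of per-position frequencies."""
    if len(seq) != len(POSITIONS):
        raise ValueError("expected a 9-mer covering positions -3..+5")
    p = 1.0
    for i, base in enumerate(seq):
        p *= WMM_PERCENT[base][i] / 100.0
    return p


def consensus():
    """Most frequent base at each position of the matrix."""
    return "".join(
        max("ACGT", key=lambda b: WMM_PERCENT[b][i])
        for i in range(len(POSITIONS))
    )


print(consensus())                   # consensus string read off the matrix
print(wmm_probability("CAGGTAAGT"))  # probability of the consensus 9-mer
print(wmm_probability("CAGATAAGT"))  # 0.0: position 0 must be G in this matrix
```

Because the matrix assigns probability zero to any non-GT dinucleotide at positions 0 and +1, a practical scorer would usually add pseudocounts or treat those positions separately; that choice is left open here.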

11

Beyond independence

Weight array matrices, Zhang & Marr (1993) consider dependencies between adjacent positions, i.e. (non-stationary) first-order Markov models. The number of parameters increases exponentially if we restrict to full higher-order Markovian models.

Variable length Markov models, Rissanen (1986), Buhlmann & Wyner (1999), help us get over this problem. In the last few years, many variants have appeared: all make use of trees.

[The interpolated Markov models of Salzberg et al (1998) address the same problem.]
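To make the parameter growth concrete, here is a small sketch that counts free parameters of a full (non-stationary) Markov model of a given order on a motif of length L. It assumes a four-letter alphabet and that position l conditions on min(order, l-1) preceding positions; the function name is illustrative.

```python
def free_parameters(L, order, alphabet=4):
    """Free parameters of a full non-stationary Markov model of given order.

    Position l (1-based) conditions on min(order, l - 1) preceding positions,
    and each conditioning context carries alphabet - 1 free probabilities.
    """
    return sum(
        (alphabet - 1) * alphabet ** min(order, l - 1)
        for l in range(1, L + 1)
    )


for k in range(5):
    print(k, free_parameters(9, k))
```

Order 0 corresponds to the weight matrix model and order 1 to the weight array model; each further increase in order multiplies the per-position count by roughly four.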

12

Variable Length Markov Models

Factorize $P(x)$ in the usual telescopic way:

$$P(x) = P(X_1 = x_1) \prod_{l=2}^{L} P(X_l = x_l \mid X_1^{l-1} = x_1^{l-1}),$$

then simplify this using context functions $c_l$, $l = 2, \ldots, L$, to

$$P(x) = P(X_1 = x_1) \prod_{l=2}^{L} P(X_l = x_l \mid c_l(X_1^{l-1}) = c_l(x_1^{l-1})),$$

where $c_l : x_1^{l-1} \mapsto x_{l-m}^{l-1}$ is suitably defined on $(l-1)$-tuples.

13

VLMM, cont.

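Here $c_l : \mathcal{X}^{l-1} \to \bigcup_{i=0}^{l-1} \mathcal{X}^i$, with $\mathcal{X}$ the nucleotide alphabet, and $m = m_l$ is given by

$$m_l(x_1^{l-1}) = \min \{ r : P(X_l = x \mid X_1^{l-1} = x_1^{l-1}) = P(X_l = x \mid X_{l-r}^{l-1} = x_{l-r}^{l-1}) \text{ for all } x \}.$$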

The function cl defines the sequence-specific context, and ml defines the sequence-specific memory or order of the Markov property for position l.

14

VLMM: an illustrative example

A full set of 16 contexts of order 2.

Pruned set of 12 contexts: P(X3|X2=C,X1=G) = P(X3|X2=C), etc.
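One way to picture a context function is as a lookup of the longest retained suffix of the history. The sketch below assumes contexts are stored as strings in sequence order, and uses a hypothetical pruning in which every order-2 context ending in C is collapsed to the order-1 context "C". This is for illustration only and is not the pruned tree drawn on the slide.

```python
def context(history, retained):
    """Return the longest suffix of `history` that is a retained context.

    `history` is x_1 ... x_{l-1} in sequence order. The empty string plays
    the role of the order-0 context and is the fallback.
    """
    for start in range(len(history)):
        suffix = history[start:]
        if suffix in retained:
            return suffix
    return ""


# Hypothetical pruned context set for position 3: all order-2 contexts are
# kept except those ending in C, which share the single context "C".
retained = {a + b for a in "ACGT" for b in "AGT"} | {"C"}

print(context("GC", retained))  # history ends in C: memory of length 1
print(context("GA", retained))  # full order-2 context kept
```

The length of the returned string is the sequence-specific memory for that history.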

15

VLMM cont.

A VLMM for a DNA sequence motif of length L is specified by
• a distribution for X1, and, for l = 2, …, L,
• a constrained distribution for Xl given Xl-1, …, X1.

That is, we need L-1 context functions, or trees.

But, there is a difficulty here.

16

Sequence dependencies (interactions) are not always local

The methods outlined so far all fail to incorporate long-range (≥ 4 bp) interactions. New model types are needed.

3-dimensional folding; DNA, RNA & protein interactions

17

Modeling “long-range” dependency

The principal work in this area is Burge & Karlin’s (1997) maximal dependence decomposition (MDD).

More recently, Cai et al (2000) and Barash et al (2003) used Bayes networks (BN).

Ellrott et al (2002) optimized the sequence order in which a stationary Markov chain models the motif.

We have adapted this last idea, to give permuted variable length Markov models (PVLMM).

Potamianos & Jelinek (1998) have related work on decision trees.

18

Part of a context (decision) tree for position -2 of a splice donor PVLMM

Sequence order: +2(A/G) +5(T) -1(G) +4(G) -2(A) +3(A) -3(A)

Node #s:counts; Edge #s:split variables.
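The permutation step can be sketched as a simple reordering of the bases before a (variable length) Markov model is applied. The snippet below assumes the 9-mer covers positions -3..+5 and that the conserved positions 0 and +1 are left out of the ordering, as in the sequence order quoted on this slide; the helper name is illustrative.

```python
POSITIONS = [-3, -2, -1, 0, 1, 2, 3, 4, 5]


def permute(seq, order):
    """Reorder the bases of a donor 9-mer according to a list of positions."""
    index = {pos: i for i, pos in enumerate(POSITIONS)}
    return "".join(seq[index[pos]] for pos in order)


# Sequence order from the slide: +2 +5 -1 +4 -2 +3 -3
order = [2, 5, -1, 4, -2, 3, -3]
print(permute("CAGGTAAGT", order))  # ATGGAAC
```

A model that is local in the permuted order can then capture dependencies between positions that are far apart in the original sequence, such as +2 and +5.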

19

Maximal Dependence Decomposition

MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used.

The details can vary according to the splitting criterion (Burge & Karlin used χ²), the actual splits (binary, etc), and the stopping rule.

However, the result is always a single tree.
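A rough sketch of the "most dependent position" step follows. It assumes a Pearson chi-square statistic on the full table of observed base pairs for each pair of positions, summed over partners. That is a simplification chosen for illustration and may differ from the statistic used in the original MDD work; the function names are hypothetical.

```python
from collections import Counter


def chi_square(seqs, i, j):
    """Pearson chi-square statistic for independence of positions i and j."""
    n = len(seqs)
    ci = Counter(s[i] for s in seqs)
    cj = Counter(s[j] for s in seqs)
    cij = Counter((s[i], s[j]) for s in seqs)
    stat = 0.0
    for a in ci:
        for b in cj:
            expected = ci[a] * cj[b] / n
            stat += (cij[(a, b)] - expected) ** 2 / expected
    return stat


def most_dependent_position(seqs):
    """Position whose summed chi-square against all other positions is largest."""
    length = len(seqs[0])
    totals = [
        sum(chi_square(seqs, i, j) for j in range(length) if j != i)
        for i in range(length)
    ]
    return max(range(length), key=totals.__getitem__)


toy = ["AAC", "AAG", "TTC", "TTG"]  # positions 0 and 1 move together
print(chi_square(toy, 0, 1), chi_square(toy, 0, 2))
print(most_dependent_position(toy))
```

In a full MDD procedure the data would then be split on the chosen position and the step repeated on each subset until a stopping rule applies.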

20

Parts of MDD trees for splice donors

In each case, splits are into the most frequent nt vs the others.

21

GTCTAAAGCGT
aattAAAGTAA
GACGAAAGCAA
aattAAAGTGC
GTCTAAAGCga
GCGAAAAGCGA
GCGTAAAGCAG
TAGAAAAGGCG
aattAAAGTAC
CACAAAAGCCC
GCCCAAAgatc
tGACAAAGCGT
GCGGAAAgatc
aattAAAGCAA
CAAAAAAGGCG
tAAAAAAGGCT
CAGCAAAGACg
GGAAAAAGCAA
AGCAAAAGTGC
GCAGAAAGTCA

20 instances of P$DOF3_01

22

Modelling P$DOF3_01

[Figure: Sn/Sp for models of P$DOF3_01: WMM .15, MDD .60, PVLMM(D) .55]

23

Issues in modeling short DNA motifs

In any study of this kind, essential items are:

• the model class (e.g. WMM, PVLMM, or MDD)
• the way we search through the model class (e.g. by forward selection or MCMC)
• the way we compare models when searching (e.g. by χ², AIC, BIC, NML), and finally,
• the way we assess the final model in relation to our aims (e.g. by cross-validation).

We always need interesting, high-quality datasets.

24

Splice site dataset

Human splice donor sequences from SpliceDB, Burset et al (2001):

• 15,155 canonical donor sites of length 9, with GT conserved at positions 0 and 1

• 47,495 false donor sites from the set of all sequences which lie within 40 bp on both sides of the characteristic donor dinucleotide GT.

25

Model classes and model search

For illustrative purposes, we will compare VLMM, PVLMM (direct and decision), and MDD for splice donor recognition.

Here we search using a simple procedure: recursively choosing the best extension of a current model, or forward selection. Our slower alternative moves through the models using reversible jump Markov chain Monte Carlo (RJMCMC).

26

Model comparison

We fit the models using maximum likelihood, and compare fitted models using both AIC and BIC, standard penalties for model complexity.

Better than either of these two is approximate normalized maximum likelihood (NML), Barron et al (1998). We use mixture models for the data with Jeffreys’ (Dirichlet) priors.
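For reference, the two standard penalized criteria can be written down directly. The sketch below uses the usual definitions AIC = -2 log L + 2k and BIC = -2 log L + k log n, where k is the number of free parameters and n the number of training sequences; the numbers in the example are made up and not results from the talk. NML is not covered here.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion; smaller is better."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """Bayesian information criterion; smaller is better."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical fitted models: (maximized log-likelihood, free parameters).
models = {"simple": (-1050.0, 27), "rich": (-1000.0, 99)}
n = 500  # hypothetical number of training sequences

for name, (ll, k) in models.items():
    print(name, aic(ll, k), bic(ll, k, n))
```

Since log n exceeds 2 once n is larger than about 7, BIC penalizes extra parameters more heavily than AIC on any realistically sized dataset.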

27

Model assessment: Stand-alone splice motif recognition

M: motif model
B: background model

Given a sequence x = (x1, …, xL), we predict x to be a motif (here splice donor) if

log {P(x | M) / P(x | B)} > c,

for a suitably chosen threshold value c.
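The decision rule above can be sketched in a few lines. The motif model here is a hypothetical two-position independence model and the background is assumed uniform; both are placeholders chosen only so the example runs, not models from the talk.

```python
import math

# Hypothetical motif model: two independent positions favouring G then T.
MOTIF = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]


def motif_logp(seq):
    """log P(seq | M) under the toy motif model."""
    return sum(math.log(MOTIF[i][base]) for i, base in enumerate(seq))


def background_logp(seq):
    """log P(seq | B), assuming a uniform independent background."""
    return len(seq) * math.log(0.25)


def log_odds(seq):
    return motif_logp(seq) - background_logp(seq)


def is_motif(seq, c):
    """Predict a motif when the log-odds score exceeds the threshold c."""
    return log_odds(seq) > c


print(log_odds("GT"), is_motif("GT", 0.0))
print(log_odds("AC"), is_motif("AC", 0.0))
```

Raising c trades sensitivity for fewer false positives, which is how a curve of Sp against Sn is traced out.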

28

Model assessment: terms

TP: true positives, FP: false positives
TN: true negatives, FN: false negatives

Sensitivity (sn) and specificity (sp) are given by

sn = TP / [TP + FN], sp = TP / [TP + FP].

A 5-fold cross-validation is used in assessing performance.
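A small sketch of these two quantities, using the definitions on this slide, together with a simple way of forming five folds. Note that sp as defined here, TP / (TP + FP), is what is elsewhere called precision or positive predictive value. The interleaved fold assignment is an assumption made for illustration; the talk does not say how folds were formed.

```python
def sn_sp(truth, predicted):
    """Sensitivity TP/(TP+FN) and specificity TP/(TP+FP) from boolean labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp


def k_folds(n_items, k=5):
    """Split indices 0..n_items-1 into k interleaved folds."""
    return [list(range(i, n_items, k)) for i in range(k)]


truth = [True, True, True, False, False]
predicted = [True, True, False, True, False]
print(sn_sp(truth, predicted))
print(k_folds(7))
```

In cross-validation each fold is held out in turn, the model is fitted on the remaining folds, and sn and sp are computed on the held-out sequences.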

29

Comparison of PVLMM decision tree (NML, ord = 5), MDD (chi-square), WAM and WMM.

[Figure: Sp vs Sn]

Optimal permutation: +2 +5 -1 +4 -2 +3 -3

30

Model assessment: Integrated recognition

For this assessment, we integrate the splice donor models into SLAM, Pachter et al (2002), a eukaryotic cross-species gene finder.

The training data consists of 3,735 aligned human and mouse gene sequences. The resulting SLAM model is then tested on the Rosetta set of 117 single human gene sequences.

31

Results at the nucleotide level

              VLMM    MDD     PVLMM
Sensitivity   94.5    94.2    94.3
Specificity   99.7    99.7    99.7

Results at the exon level

              VLMM(D)   MDD       PVLMM(D)
Correct       362/470   365/465   371/465
Partial       92/470    83/465    79/465
Wrong         18/470    17/465    16/465
Missing       23/465    23/465    23/465
Sensitivity   362/465   365/465   371/465
Specificity   360/470   365/465   371/465

32

Interpretation of PVLMM model selected

We use sequence logos to provide simple interpretations of our selected PVLMM, including the optimal permutation.

33

Beginning of the splicing process

splice donor splice acceptor

34

Base       -3   -2   -1    0   +1   +2   +3   +4   +5
A          33   61   10    0    0   53   71    7   16
C          37   13    3    0    0    3    8    6   16
G          18   12   80  100    0   42   12   81   22
T          12   14    7    0  100    2    9    6   46
U1 snRNA    G    U    C    C    A    U    U    C    A

Sequence logo for human splice donor sites

35

U1 snRNA G U C C A U U C A

Optimal permutation: +2 +3 -1 +4 -2 +5 -3

“Long-range” dependence in the chosen model

36

The context tree for +5

[Figure: context tree for position +5]

37

Transcription factor binding sites

These are of great interest, their signals are very weak, and we typically have only a few instances.

We have studied 43 TFBS with effective length ≤ 9 and ≥ 20 instances.

In 17/43 cases we are able to improve upon WMM, the current standard; in 26/43, we cannot.

38

Transcription factor binding site (TFBS) recognition

We extracted all known TFBS from the TRANSFAC database with a) length ≤9, and b) ≥ 20 known sites. In all this gave 1,419 sites corresponding to 43 TF.

Next we randomly inserted each site into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes.

Finally, we used the PVLMM, MDD and WMM to scan these sequences within a 10-fold cross-validation framework, to select a number of top-scoring sequences as putative binding sites. We always made this number equal to the true number in the sequences, so that TP + FP = TP + FN and hence sn = sp.
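The simulation step might look roughly like the following sketch. It assumes the site is spliced into the background at a uniformly random position (so the result is longer than 1,000 by the site length), starts the chain from three uniformly drawn bases rather than from the stationary distribution, and uses a placeholder uniform transition function in place of parameters estimated from human upstream sequence. All names are hypothetical.

```python
import random


def simulate_background(length, transition, rng, order=3):
    """Simulate a sequence from a Markov chain of the given order.

    `transition(context)` returns weights for A, C, G, T given the last
    `order` bases. The first `order` bases are drawn uniformly (assumption).
    """
    seq = [rng.choice("ACGT") for _ in range(order)]
    while len(seq) < length:
        ctx = "".join(seq[-order:])
        seq.append(rng.choices("ACGT", weights=transition(ctx))[0])
    return "".join(seq)


def insert_site(background, site, rng):
    """Splice `site` into `background` at a random position; return both."""
    pos = rng.randrange(len(background) + 1)
    return background[:pos] + site + background[pos:], pos


def uniform_transition(ctx):
    # Placeholder: a fitted model would return context-specific weights.
    return [1, 1, 1, 1]


rng = random.Random(0)
background = simulate_background(1000, uniform_transition, rng)
sequence, pos = insert_site(background, "GTCTAAAGCGT", rng)
print(len(background), len(sequence), sequence[pos:pos + 11])
```

Each motif model is then slid along `sequence`, and the top-scoring windows are compared with the known insertion position.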

39

Three transcription factor binding sites

40

TFBS: three results

TF           wmm   pvlmm   mdd
P$DOF3.01    .15   .55     .60
V$CIZ.01     .46   .64     .62
P$EMBP1.Q2   .77   .91     .77

Entry: sensitivity/specificity

Of the 43 TF, our dependence methods led to no improvement in 27 cases.

41

TFBS: results for 16/43

42

Some future work

• More fully elucidate the relationships between all the different tree-based model classes

• Comprehensively test alternative strategies for searching and comparing models

• Find the best combination for splice donors and several other motif detection problems

• Joint modelling of human and mouse sites

• Joint modelling of multiple motifs in one species

43

Acknowledgements

Xiaoyue Zhao, UCB

Sourav Chatterji, UCB

The SLAM team:

Simon Cawley, Affymetrix

Lior Pachter, UCB

Marina Alexandersson, FCC
