1 Probability models for short DNA sequences BioInfoSummer@ANU, December 2 2003 Terry Speed & Xiaoyue Zhao University of California at Berkeley & WEHI

1

Probability models for short DNA sequences

BioInfoSummer@ANU, December 2 2003

Terry Speed & Xiaoyue Zhao

University of California at Berkeley & WEHI Melbourne

2

Synopsis

Some biology

Background & previous work

Models and modeling

Results and discussion

Future work

Acknowledgements

3

4

A look at gene structure

5

Beginning of the splicing process

splice donor splice acceptor

6

Protein-DNA interactions

usually involve some degree of sequence specificity

7

Examples of 5’ splice (donor) sites

exon  intron
TCG   GTGAGT
TGG   GTGTGT
CCG   GTCCGT
ATG   GTAAGA
TCT   GTAAGT
CAG   GTAGGA
CAG   GTAGGG
AAG   GTAAGG
AGG   GTATGG
TGG   GTAAGG
GAG   GTTAGT
CAT   GTGAGT

8

Probability models for short DNA motifs

Short: ~6-20 base pairs

DNA motifs: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...

Why probability models?

• to characterize the motifs

• to help identify them

• for incorporation into larger models

9

Our aim and some notation

$L$: length of the DNA sequence motif.

$X_i$: discrete random variable at position $i$, taking values from the set $\mathcal{A} = \{A, C, G, T\}$.

Given a number of instances of a DNA sequence motif $x = (x_1, \dots, x_L)$ of length $L$, we want a model for the probability $P(x)$ of $x$.

We denote by $x_i^j$ ($i < j$) the sequence $(x_j, x_{j-1}, \dots, x_i)$ in reverse time order.

10

Weight matrix models, Staden (1984)

Base   -3   -2   -1    0   +1   +2   +3   +4   +5

A      33   61   10    0    0   53   71    7   16

C      37   13    3    0    0    3    8    6   16

G      18   12   80  100    0   42   12   81   22

T      12   14    7    0  100    2    9    6   46

A weight matrix for donor sites. Entries are percentages.

Essentially a mutual independence model.

An improvement over the consensus CAGGTAAGT.
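As a rough illustration of what "mutual independence" means here, the sketch below scores a 9-mer by multiplying the per-position frequencies from the table above. The function name `wmm_probability` and the dictionary layout are assumptions made for this example, not part of the original slides.

```python
# Donor-site weight matrix from the table above (percentages), positions -3..+5.
WMM = {
    "A": [33, 61, 10, 0, 0, 53, 71, 7, 16],
    "C": [37, 13, 3, 0, 0, 3, 8, 6, 16],
    "G": [18, 12, 80, 100, 0, 42, 12, 81, 22],
    "T": [12, 14, 7, 0, 100, 2, 9, 6, 46],
}
MOTIF_LENGTH = 9


def wmm_probability(seq):
    """P(x) under mutual independence: the product of per-position frequencies."""
    if len(seq) != MOTIF_LENGTH:
        raise ValueError("expected a sequence of length 9")
    p = 1.0
    for i, base in enumerate(seq):
        p *= WMM[base][i] / 100.0
    return p


# The consensus is the most probable 9-mer under this model.
print(wmm_probability("CAGGTAAGT"))
# Any sequence without G at position 0 has probability zero.
print(wmm_probability("CAGATAAGT"))
```

Unlike the consensus, which only says whether a sequence matches, the matrix gives every 9-mer a graded score.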

11

Beyond independence

Weight array matrices (WAM), Zhang & Marr (1993), consider dependencies between adjacent positions, i.e. (non-stationary) first-order Markov models. With full higher-order Markov models, the number of parameters grows exponentially with the order.

Variable length Markov models, Rissanen (1986), Bühlmann & Wyner (1999), help us get over this problem. In the last few years, many variants have appeared: all make use of trees.

[The interpolated Markov models of Salzberg et al (1998) address the same problem.]
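To make the exponential growth concrete, here is a small sketch that counts free parameters for a full non-stationary Markov model of a given order. It assumes each position has its own conditional distribution and that no position is constrained (for example, the conserved GT is not treated specially); `free_parameters` is a name chosen for this example.

```python
def free_parameters(length, order, alphabet_size=4):
    """Count free parameters of a full non-stationary Markov model.

    Position l conditions on the previous min(l - 1, order) bases, and each
    context needs alphabet_size - 1 free probabilities.
    """
    total = 0
    for position in range(1, length + 1):
        context_length = min(position - 1, order)
        total += (alphabet_size - 1) * alphabet_size ** context_length
    return total


for order in range(4):
    print(order, free_parameters(9, order))
```

For a motif of length 9, order 0 (a weight matrix) needs 27 free parameters, order 1 needs 99, and order 2 needs 351.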

12

Variable Length Markov Models

Factorize $P(x)$ in the usual telescopic way:

$$P(x) = P(X_1 = x_1) \prod_{l=2}^{L} P(X_l = x_l \mid X_1^{l-1} = x_1^{l-1}),$$

then simplify this using context functions $c_l$, $l = 2, \dots, L$, to

$$P(x) = P(X_1 = x_1) \prod_{l=2}^{L} P(X_l = x_l \mid c_l(X_1^{l-1}) = c_l(x_1^{l-1})),$$

where $c_l : x_1^{l-1} \mapsto x_{l-m}^{l-1}$ is suitably defined on $(l-1)$-tuples.

13

VLMM, cont.

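Here $c_l : \mathcal{A}^{l-1} \to \bigcup_{i=0}^{l-1} \mathcal{A}^{i}$, where $\mathcal{A}$ is the nucleotide alphabet, and $m = m_l$ is given by

$$m_l(x_1^{l-1}) = \min \{ r : P(X_l = x \mid X_1^{l-1} = x_1^{l-1}) = P(X_l = x \mid X_{l-r}^{l-1} = x_{l-r}^{l-1}) \text{ for all } x \}.$$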

The function cl defines the sequence-specific context, and ml defines the sequence-specific memory or order of the Markov property for position l.

14

VLMM: an illustrative example

A full set of 16 contexts of order 2.

Pruned set of 12 contexts: P(X3|X2=C,X1=G) = P(X3|X2=C), etc.
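The pruning idea can be sketched as a longest-suffix lookup. The example below is an assumption-laden toy: it prunes only the branch with X2 = C, which leaves 13 contexts rather than the 12 of the slide's example, and it writes histories in natural left-to-right order rather than the reverse-time convention used above. The names `FULL`, `PRUNED` and `context_of` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"

# Full order-2 model for X3: one context per pair (x1, x2), written as "x1x2".
FULL = {a + b for a, b in product(ALPHABET, repeat=2)}

# Hypothetical pruning: when X2 = C, X1 adds no information about X3,
# so the four contexts AC, CC, GC, TC collapse into the single context C.
PRUNED = {ctx for ctx in FULL if ctx[1] != "C"} | {"C"}


def context_of(history, contexts):
    """Return the longest suffix of history (its most recent bases) in contexts."""
    for r in range(len(history), -1, -1):
        suffix = history[len(history) - r:]
        if suffix in contexts:
            return suffix
    raise KeyError(f"no context covers history {history!r}")


print(context_of("GC", PRUNED))  # memory of length 1
print(context_of("GA", PRUNED))  # memory of length 2
```

The length of the returned context plays the role of the sequence-specific memory: it is 1 for histories ending in C and 2 otherwise.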

15

VLMM cont.

A VLMM for a DNA sequence motif of length L is specified by

• a distribution for X1, and, for l = 2,…,L,

• a constrained distribution for Xl given Xl-1,…,X1.

That is, we need L-1 context functions, or trees.

But, there is a difficulty here.

16

Sequence dependencies (interactions) are not always local

The methods outlined so far all fail to incorporate long-range (≥ 4 bp) interactions. New model types are needed.

3-dimensional folding; DNA, RNA & protein interactions

17

Modeling “long-range” dependency

The principal work in this area is Burge & Karlin’s (1997) maximal dependence decomposition (MDD).

More recently, Cai et al (2000) and Barash et al (2003) used Bayes networks (BN).

Ellrott et al (2002) optimized the sequence order in which a stationary Markov chain models the motif.

We have adapted this last idea, to give permuted variable length Markov models (PVLMM).

Potamianos & Jelinek (1998) have related work on decision trees.

18

Part of a context (decision) tree for position -2 of a splice donor PVLMM

Sequence order: +2(A/G) +5(T) -1(G) +4(G) -2(A) +3(A) -3(A)

Node #s: counts; edge #s: split variables.
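The sequence order in the caption can be read as a permutation applied to each site before a variable length Markov model is fitted along the reordered positions. A minimal sketch, assuming the 9-mer covers positions -3 to +5 and that the conserved GT at positions 0 and +1 is left out; the names `ORDER` and `permute` are chosen for this example.

```python
# Sequence order from the caption (positions relative to the GT at 0 and +1).
ORDER = [+2, +5, -1, +4, -2, +3, -3]
FIRST_POSITION = -3  # the 9-mer covers positions -3..+5


def permute(seq, order=ORDER, first=FIRST_POSITION):
    """Reorder the variable positions of a donor 9-mer according to `order`."""
    return "".join(seq[position - first] for position in order)


# The consensus CAGGTAAGT becomes ATGGAAC in the permuted order.
print(permute("CAGGTAAGT"))
```

Positions that are far apart in the original sequence, such as +2 and +5, become neighbours after the permutation, so a short-memory model can capture their dependence.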

19

Maximal Dependence Decomposition

MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used.

The details can vary according to the splitting criterion (Burge & Karlin used χ²), the actual splits (binary, etc), and the stopping rule.

However, the result is always a single tree.
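One simple way to make "most dependent position" concrete is a χ² statistic of independence between pairs of positions, summed over all partners. The sketch below is an illustration under that assumption and is not Burge & Karlin's exact procedure; `chi_square` and `most_dependent_position` are hypothetical names.

```python
from collections import Counter


def chi_square(seqs, i, j, alphabet="ACGT"):
    """Pearson chi-square statistic for independence of positions i and j."""
    n = len(seqs)
    row = Counter(s[i] for s in seqs)
    col = Counter(s[j] for s in seqs)
    joint = Counter((s[i], s[j]) for s in seqs)
    stat = 0.0
    for a in alphabet:
        for b in alphabet:
            expected = row[a] * col[b] / n
            if expected > 0:
                stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


def most_dependent_position(seqs):
    """Position whose summed chi-square against all other positions is largest."""
    length = len(seqs[0])
    totals = [
        sum(chi_square(seqs, i, j) for j in range(length) if j != i)
        for i in range(length)
    ]
    return max(range(length), key=totals.__getitem__)


toy = ["AAA", "AAC", "CCA", "CCC"]  # positions 0 and 1 always agree
print(chi_square(toy, 0, 1), chi_square(toy, 0, 2))
print(most_dependent_position(toy))
```

In a tree-building loop, the data would then be split on the chosen position and the same computation repeated within each subset.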

20

Parts of MDD trees for splice donors

In each case, splits are into the most frequent nt vs the others.

21

GTCTAAAGCGT
aattAAAGTAA
GACGAAAGCAA
aattAAAGTGC
GTCTAAAGCga
GCGAAAAGCGA
GCGTAAAGCAG
TAGAAAAGGCG
aattAAAGTAC
CACAAAAGCCC
GCCCAAAgatc
tGACAAAGCGT
GCGGAAAgatc
aattAAAGCAA
CAAAAAAGGCG
tAAAAAAGGCT
CAGCAAAGACg
GGAAAAAGCAA
AGCAAAAGTGC
GCAGAAAGTCA

20 instances of P$DOF3_01

22

Modelling P$DOF3_01

Model      Sn/Sp

WMM        .15

PVLMM(D)   .55

MDD        .60

23

Issues in modeling short DNA motifs

In any study of this kind, essential items are:

• the model class (e.g. WMM, PVLMM, or MDD)

• the way we search through the model class (e.g. by forward selection or MCMC)

• the way we compare models when searching (e.g. by χ², AIC, BIC, NML), and finally,

• the way we assess the final model in relation to our aims (e.g. by cross-validation).

We always need interesting, high-quality datasets.

24

Splice site dataset

Human splice donor sequences from SpliceDB, Burset et al (2001):

• 15,155 canonical donor sites of length 9, with GT conserved at positions 0 and 1

• 47,495 false donor sites from the set of all sequences which lie within 40 bp on both sides of the characteristic donor dinucleotide GT.

25

Model classes and model search

For illustrative purposes, we will compare VLMM, PVLMM (direct and decision), and MDD for splice donor recognition.

Here we search using a simple procedure: recursively choosing the best extension of a current model, or forward selection. Our slower alternative moves through the models using reversible jump Markov chain Monte Carlo (RJMCMC).

26

Model comparison

We fit the models using maximum likelihood, and compare fitted models using both AIC and BIC, standard penalties for model complexity.

Better than either of these two is approximate normalized maximum likelihood (NML), Barron et al (1998). We use mixture models for the data with Jeffreys’ (Dirichlet) priors.
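For reference, the two standard penalties can be written down directly. The sketch uses the usual definitions AIC = -2 log L + 2k and BIC = -2 log L + k log n; the log-likelihood values are made-up numbers for illustration only, while n = 15,155 matches the number of canonical donor sites in the dataset.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: smaller is better."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)


# Hypothetical fitted log-likelihoods for a 27- and a 99-parameter model.
n = 15155
for name, log_lik, k in [("order 0", -120000.0, 27), ("order 1", -118500.0, 99)]:
    print(name, aic(log_lik, k), bic(log_lik, k, n))
```

With this many sequences, log n is about 9.6, so BIC charges each extra parameter several times more than AIC does.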

27

Model assessment: Stand-alone splice motif recognition

M: motif model

B: background model

Given a sequence x = (x1, …, xL), we predict x to be a motif (here splice donor) if

log {P(x | M) / P(x | B)} > c,

for a suitably chosen threshold value c.
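The decision rule above can be sketched with any pair of models that return P(x | M) and P(x | B). Both models below are deliberately crude stand-ins invented for this example (a uniform background and a motif model that only requires GT at the conserved positions); they are not the fitted models discussed in these slides.

```python
import math


def uniform_background(seq):
    """Hypothetical background model B: independent, equiprobable bases."""
    return 0.25 ** len(seq)


def toy_donor_model(seq):
    """Hypothetical motif model M: GT required at indices 3-4, other bases uniform."""
    if seq[3:5] != "GT":
        return 0.0
    return 0.25 ** (len(seq) - 2)


def predict_motif(seq, p_motif, p_background, c):
    """Predict a motif when log{P(x | M) / P(x | B)} exceeds the threshold c."""
    pm, pb = p_motif(seq), p_background(seq)
    if pm == 0.0:
        return False  # the log ratio is minus infinity
    return math.log(pm / pb) > c


print(predict_motif("CAGGTAAGT", toy_donor_model, uniform_background, c=0.0))
print(predict_motif("CAGATAAGT", toy_donor_model, uniform_background, c=0.0))
```

Raising c trades sensitivity for specificity; sweeping it over a range of values traces out a curve of one against the other.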

28

Model assessment: terms

TP: true positives, FP: false positives

TN: true negatives, FN: false negatives

Sensitivity (sn) and specificity (sp) are given by

sn = TP / [TP + FN]

sp = TP / [TP + FP].

A 5-fold cross-validation is used in assessing performance.
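The two measures on this slide can be computed from paired truth and prediction labels as in the sketch below. Note that sp as defined here, TP / [TP + FP], is the quantity that other fields call precision or positive predictive value. The function names and the toy labels are assumptions for illustration.

```python
def confusion_counts(truth, predicted):
    """Count true positives, false positives and false negatives."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    return tp, fp, fn


def sensitivity(tp, fn):
    """sn = TP / (TP + FN): fraction of true sites that are found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """sp = TP / (TP + FP): fraction of predicted sites that are true."""
    return tp / (tp + fp)


truth = [True, True, True, False, False]
predicted = [True, False, False, True, False]
tp, fp, fn = confusion_counts(truth, predicted)
print(sensitivity(tp, fn), specificity(tp, fp))
```

In a cross-validation setting, these counts would be accumulated over the held-out folds before the ratios are taken.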

29

Comparison of PVLMM decision tree (NML, ord = 5), MDD (chi-square), WAM and WMM.

Sp vs Sn

Optimal permutation: +2 +5 -1 +4 -2 +3 -3

30

Model assessment: Integrated recognition

For this assessment, we integrate the splice donor models into SLAM, Pachter et al (2002), a eukaryotic cross-species gene finder.

The training data consists of 3,735 aligned human and mouse gene sequences. The resulting SLAM model is then tested on the Rosetta set of 117 single human gene sequences.

31

Results at the nucleotide level

              VLMM   MDD    PVLMM
Sensitivity   94.5   94.2   94.3
Specificity   99.7   99.7   99.7

Results at the exon level

              VLMM(D)   MDD       PVLMM(D)
Correct       362/470   365/465   371/465
Partial       92/470    83/465    79/465
Wrong         18/470    17/465    16/465
Missing       23/465    23/465    23/465
Sensitivity   362/465   365/465   371/465
Specificity   360/470   365/465   371/465

32

Interpretation of PVLMM model selected

We use sequence logos to provide simple interpretations of our selected PVLMM, including the optimal permutation.

33

Beginning of the splicing process

splice donor splice acceptor

34

Base       -3   -2   -1    0   +1   +2   +3   +4   +5

A          33   61   10    0    0   53   71    7   16

C          37   13    3    0    0    3    8    6   16

G          18   12   80  100    0   42   12   81   22

T          12   14    7    0  100    2    9    6   46

U1 snRNA    G    U    C    C    A    U    U    C    A

Sequence logo for human splice donor sites

35

U1 snRNA G U C C A U U C A

Optimal permutation: +2 +3 -1 +4 -2 +5 -3

“Long-range” dependence in the chosen model

36

The context tree for +5

+5

37

Transcription factor binding sites

These are of great interest, their signals are very weak, and we typically have only a few instances.

We have studied 43 TFBS with effective length ≤ 9 and ≥ 20 instances.

In 17/43 cases we are able to improve upon WMM, the current standard; in 26/43, we cannot.

38

Transcription factor binding site (TFBS) recognition

We extracted all known TFBS from the TRANSFAC database with a) length ≤9, and b) ≥ 20 known sites. In all this gave 1,419 sites corresponding to 43 TF.

Next we randomly inserted each site into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes.

Finally, we used the PVLMM, MDD and WMM to scan these sequences within a 10-fold cross-validation framework, to select a number of top-scoring sequences as putative binding sites. We always made this number equal to the true number in the sequences, and so sn = sp.
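The claim that sn = sp follows from the definitions used earlier. As a short check, write K for the true number of sites in the scanned sequences (K is notation introduced here). Predicting exactly K sites means TP + FP = K, and by definition TP + FN = K, so FP = FN and

$$\mathrm{sn} = \frac{TP}{TP + FN} = \frac{TP}{K} = \frac{TP}{TP + FP} = \mathrm{sp}.$$

A single number per model therefore summarizes performance in this experiment.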

39

Three transcription factor binding sites

40

TFBS: three results

         P$DOF3.01   V$CIZ.01   P$EMBP1.Q2

wmm      .15         .46        .77

pvlmm    .55         .64        .91

mdd      .60         .62        .77

Entry: sensitivity/specificity

Of the 43 TF, our dependence methods led to no improvement in 27 cases.

41

TFBS: results for 16/43

42

Some future work

• More fully elucidate the relationships between all the different tree-based model classes

• Comprehensively test alternative strategies for searching and comparing models

• Find the best combination for splice donors and several other motif detection problems

• Joint modelling of human and mouse sites

• Joint modelling of multiple motifs in one species

43

Acknowledgements

Xiaoyue Zhao, UCB

Sourav Chatterji, UCB

The SLAM team:

Simon Cawley, Affymetrix

Lior Pachter, UCB

Marina Alexandersson, FCC

44

References

Biological Sequence Analysis
R Durbin, S Eddy, A Krogh and G Mitchison
Cambridge University Press, 1998

Bioinformatics: The machine learning approach
P Baldi and S Brunak
The MIT Press, 1998

Post-Genome Informatics
M Kanehisa
Oxford University Press, 2000
