Biological Sequence Pattern Analysis

Liangjiang (LJ) Wang

[email protected]

March 8, 2005

PLPTH 890 Introduction to Genomic Bioinformatics

Lecture 16

Outline

• Basic concepts and biological problems.

• Regular expression for:

– Pattern matching (sequence motifs),

– Pattern discovery (promoter elements).

• Position Weight Matrix (PWM) for:

– Pattern matching (TransFac, TESS, etc),

– Pattern discovery (MEME, Gibbs sampling).

• Hidden Markov Models (HMMs) for protein domain analysis (next lecture).

Biological Sequence Patterns

• In nucleotide sequences:

– Transcription start and termination sites,

– Promoter cis regulatory elements,

– Intron/exon splice sites,

– Translation start and stop sites,

– mRNA cis regulatory elements.

• In protein sequences:

– Functional motifs such as signal peptides,

– Conserved protein domains.

Promoter cis Regulatory Elements

• Cells respond to various stimuli by regulating the expression of particular genes.

• Transcription factors regulate gene expression by binding to specific DNA sequence motifs.

• Transcription factor binding sites are often short (5 – 25 bases) and degenerate DNA motifs.

• Co-regulated genes may have common regulatory motifs in their promoters.

[Figure: MyoD HLH dimer bound to DNA at the site CAACTGAC; each monomer is labeled with helix H1, loop L and helix H2.]

How to Represent a Sequence Pattern?

• Regular expressions:

– A pattern is represented by a string of characters such as TATAAAA (the TATA box).

– Ambiguous characters, wild-cards and gaps are allowed, but no position-specific information.

• Position Weight Matrices (PWM):

– Also called Position-Specific Score Matrix (PSSM).

– Often an ungapped pattern specified by a table.

• Stochastic models:

– Hidden Markov Models (HMM), neural nets, etc.

– Based on probability / machine learning theory.

Pattern Matching vs. Pattern Discovery

• Pattern matching:

– Scanning a nucleotide or protein sequence for matches to a known pattern.

– The major consideration is how to achieve better sensitivity and specificity.

• Pattern discovery:

– Given a set of sequences, discovering a pattern that is shared by the sequences. The pattern is not known in advance.

– Using search or learning approaches.

– A much harder problem than pattern matching.

Pattern Matching with RegExp

• Regular Expression (RegExp) can represent:

– Ambiguous character: e.g., [AG] or R.

– Wild-card: e.g., X for any amino acid.

– Gap: e.g., x(i, j) in PROSITE patterns.

• Pattern matching with regular expressions is straightforward, but sometimes very useful.

• For example, find all the Arabidopsis proteins which contain the following motif:

[RK][LVI]X{5}[QH][LA]

(These proteins may be targeted to the peroxisome.)

Patmatch at TAIR (http://www.arabidopsis.org/)
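The motif search above can be sketched with Python's standard `re` module. This is only an illustration, not how Patmatch works internally; the function name `find_motif` and the two toy sequences are hypothetical. The PROSITE-style wild-card X is written as `.` in Python regex syntax.

```python
import re

# PROSITE-style "X" (any residue) is written "." in Python's regex syntax.
# The lookahead (?=...) lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=([RK][LVI].{5}[QH][LA]))")

def find_motif(proteins):
    """Return {name: [(0-based start, matched text), ...]} for proteins with a hit."""
    hits = {}
    for name, seq in proteins.items():
        found = [(m.start(), m.group(1)) for m in MOTIF.finditer(seq)]
        if found:
            hits[name] = found
    return hits

# Hypothetical toy sequences, not real Arabidopsis proteins.
toy = {"protA": "MSRLAAAAAHLGG", "protB": "MKTAYIAKQR"}
print(find_motif(toy))
```

In practice the dictionary would be filled from a FASTA file of the proteome; only proteins with at least one match are returned.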

Pattern Discovery Using RegExp

Enumerate all the possible regular expression patterns with ambiguous characters.

Count the occurrences of all the patterns in the input sequences (word counting).

Compute statistical significance based on the background distribution.

(The method works for simple patterns such as short nucleotide motifs, but not for long and/or complex patterns)

e.g., CWTNC, CRTGTW, YCGGAYRRAWG, …… over {A, C, G, T, R, Y, S, W, M, K, V, H, D, B, N}

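e.g., z-score:

$$ z = \frac{N - E(X)}{\sigma(X)} $$

where N is the observed number of occurrences of a pattern, and E(X) and σ(X) are the expected value and standard deviation of its count X under the background distribution.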
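The word-counting and z-score steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the YMF implementation: the background treats bases as independent, and the count X is approximated as binomial (overlap correlations between neighboring positions are ignored). The function name `z_score` is hypothetical.

```python
import math
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}

def z_score(pattern, sequences, background=None):
    """Return (observed N, expected E(X), z) for an IUPAC pattern.

    Assumption: independent bases and a binomial approximation for X.
    """
    background = background or {b: 0.25 for b in "ACGT"}
    # Translate the IUPAC pattern to a regex; lookahead counts overlapping hits.
    regex = re.compile("(?=" + "".join("[" + IUPAC[c] + "]" for c in pattern) + ")")
    observed = sum(len(regex.findall(s)) for s in sequences)
    positions = sum(max(len(s) - len(pattern) + 1, 0) for s in sequences)
    # Probability that a random position matches the pattern.
    p = math.prod(sum(background[b] for b in IUPAC[c]) for c in pattern)
    expected = positions * p
    sigma = math.sqrt(positions * p * (1 - p))
    if sigma == 0:
        raise ValueError("z-score undefined: no variance under the background")
    return observed, expected, (observed - expected) / sigma

print(z_score("ACG", ["ACGTACGT", "CCTGCAAA"]))
```

A large positive z suggests the pattern occurs more often than the background predicts; a full enumeration would apply this to every pattern in the search space.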

Applications to Promoter Analysis

• The RegExp pattern enumeration method has been used to find cis regulatory motifs that are statistically overrepresented in a given promoter sequence dataset:

– Sinha and Tompa, 2002. Discovery of novel transcription factor binding sites by statistical overrepresentation. NAR, 30:5549-5560.

– YMF is available at http://wingless.cs.washington.edu/YMF/YMFWeb/YMFInput.pl.

• Complete search: all motifs in the search space are enumerated and tested for statistical overrepresentation.

Problems with RegExp

• Do not specify the relative frequencies of nucleotides at a position.

• Cannot express the relative importance of a position for the pattern.

• Cannot capture a possible relationship between two positions.

Aligned sites:

AGTCC
AGTCC
AGTAC
AGTAC
AGTGG
AGTGG
AACTT
AAGTT

RegExp (IUPAC) consensus: ARBNB
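To make the information loss concrete, the sketch below collapses the eight aligned sites into an IUPAC consensus by recording only which bases occur in each column. The helper name `iupac_consensus` is hypothetical; the result discards how often each base occurs, which is exactly the first problem listed above.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}
# Reverse lookup: set of observed bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

SITES = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]

def iupac_consensus(sites):
    """Collapse equal-length aligned sites into one IUPAC string, column by column."""
    return "".join(CODE_FOR[frozenset(column)] for column in zip(*sites))

print(iupac_consensus(SITES))  # prints ARBNB
```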

PWM Representation of a Motif

• A motif is assumed to have a fixed width, W.

• In the PWM, p_nk is the probability (relative frequency) of nucleotide n in column k.

• Background probability: p_n0 is the probability of n in the background (i.e., outside the motif). Equal distribution: p_A0 = p_C0 = p_G0 = p_T0 = ¼.

Aligned sites: AGTCC, AGTCC, AGTAC, AGTAC, AGTGG, AGTGG, AACTT, AAGTT

       1      2      3      4      5
A      1      0.25   0      0.25   0
C      0      0      0.125  0.25   0.5
G      0      0.75   0.125  0.25   0.25
T      0      0      0.75   0.25   0.25

Have we lost information here?
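The PWM on this slide can be rebuilt from the eight aligned sites by counting each nucleotide per column and dividing by the number of sites. A minimal sketch; the name `build_pwm` and the dictionary layout (base mapped to a list of column probabilities) are illustrative choices, not part of the lecture.

```python
from collections import Counter

SITES = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]

def build_pwm(sites, alphabet="ACGT"):
    """Return {base: [p_n1, ..., p_nW]} of relative frequencies per column."""
    n = len(sites)
    columns = [Counter(column) for column in zip(*sites)]
    return {base: [counts[base] / n for counts in columns] for base in alphabet}

pwm = build_pwm(SITES)
for base in "ACGT":
    print(base, pwm[base])
```

Each column sums to 1. Note that the matrix treats columns independently, so any dependence between two positions in the sites is not represented.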

Visualization of PWM Patterns

• The pattern captured by an MSA or PWM may be visualized using a sequence logo.

• Information Content (IC) of the nucleotide PWM at position k is:

$$ IC(k) = 2 + \sum_{n \in \{A,C,G,T\}} p_{nk} \log_2 p_{nk} = 2 - \mathrm{Entropy}(k) $$

where p_nk is the probability of n at position k, with 0 log2 0 taken as 0.

Assuming equal background probability for A, C, G and T (1/4), IC(k) ∈ [0, 2].

Information Content (IC)

• IC is a measure of a site’s tolerance for substitution: high IC, low tolerance.

• If p_A1 = 1, p_C1 = 0, p_G1 = 0, p_T1 = 0:

$$ IC(1) = 2 + \sum_{n \in \{A,C,G,T\}} p_{n1} \log_2 p_{n1} = 2 + 0 = 2 $$

• If p_A4 = ¼, p_C4 = ¼, p_G4 = ¼, p_T4 = ¼:

$$ IC(4) = 2 + 4 \times \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 - 2 = 0 $$

(Columns 1 and 4 are those of the example PWM built from the sites AGTCC, AGTCC, AGTAC, AGTAC, AGTGG, AGTGG, AACTT, AAGTT.)
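The two worked cases can be checked numerically, and the same formula applied to the remaining columns of the example PWM. A minimal sketch assuming the equal background of 1/4 used in the formula; `information_content` is a hypothetical helper name.

```python
import math

# Example PWM from the slides: base -> probabilities for columns 1..5.
PWM = {"A": [1, 0.25, 0, 0.25, 0],
       "C": [0, 0, 0.125, 0.25, 0.5],
       "G": [0, 0.75, 0.125, 0.25, 0.25],
       "T": [0, 0, 0.75, 0.25, 0.25]}

def information_content(column_probs):
    """IC = 2 - entropy (in bits); terms with p = 0 contribute 0."""
    entropy = -sum(p * math.log2(p) for p in column_probs if p > 0)
    return 2.0 - entropy

ic = [information_content([PWM[b][k] for b in "ACGT"]) for k in range(5)]
print([round(x, 3) for x in ic])
```

Column 1 (always A) reaches the maximum of 2 bits, column 4 (uniform) gives 0, and the other columns fall in between; these values are the stack heights in a sequence logo.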

Pattern Matching with PWM

• Given a Position Weight Matrix (PWM) of a pattern, find all the occurrences of the pattern on the input sequence.

• Sliding window analysis:

[Figure: a window of width W slides along the sequence; each window is matched with the PWM.]

• How to score a match?

$$ \mathrm{Score} = \prod_{k=1}^{W} \frac{p_{ck}}{q_c} $$

p_ck is the PWM entry at position k corresponding to character c of the sequence, and q_c is the background probability of c.

(The log-odds score, i.e. the logarithm of this ratio, is often used.)
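A sliding-window scan with log-odds scoring might look like the sketch below. One assumption goes beyond the slide: a small pseudocount is added to every PWM entry so that zero probabilities do not produce log(0). The function name `log_odds_scan`, the pseudocount value and the test sequence are all illustrative.

```python
import math

# Example PWM from the slides: base -> probabilities for columns 1..5.
PWM = {"A": [1, 0.25, 0, 0.25, 0],
       "C": [0, 0, 0.125, 0.25, 0.5],
       "G": [0, 0.75, 0.125, 0.25, 0.25],
       "T": [0, 0, 0.75, 0.25, 0.25]}

def log_odds_scan(sequence, pwm, background=None, pseudo=0.01):
    """Return [(0-based start, window, log2-odds score)] for every window."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(next(iter(pwm.values())))
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = 0.0
        for k, c in enumerate(window):
            # Pseudocount-smoothed PWM entry (assumption), renormalised per column.
            p_ck = (pwm[c][k] + pseudo) / (1 + 4 * pseudo)
            score += math.log2(p_ck / background[c])
        results.append((start, window, score))
    return results

hits = log_odds_scan("TTAGTCCTT", PWM)
best = max(hits, key=lambda h: h[2])
print(best)
```

In real use a score threshold decides which windows are reported; choosing it is where the sensitivity/specificity trade-off enters.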

Resources for Promoter Analysis

• TransFac (http://www.gene-regulation.com/):

– A database on eukaryotic transcription factors (TF) and their DNA binding sites (PWMs).

– Provides TF classification and search options.

• TESS (Transcription Element Search System at http://www.cbil.upenn.edu/cgi-bin/tess/tess?RQ=WELCOME):

– A web tool for predicting TF binding sites.

– Uses PWMs from TransFac and others.

• SCPD (http://cgsigma.cshl.org/jian/):

– The promoter database of Saccharomyces cerevisiae.

– Tools for site prediction and promoter retrieval.

Pattern Discovery Using PWM

• The Problem:

– Given a set of unaligned sequences, discover a PWM pattern shared by the sequences.

– The pattern locations on the sequences are also unknown in advance.

• Two sets of parameters to estimate (or learn):

– PWM of a potential pattern.

– Pattern offset matrix.

• Algorithmic approaches:

– Expectation Maximization.

– Gibbs sampling.

[Figure: a motif occurring at different positions in a set of sequences.]

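Pattern Offset Matrix

• The element Z_ij of the pattern offset matrix Z is the probability that the pattern (given in p) starts at position j of sequence i (X_i):

$$ Z_{ij} = \Pr(Z_{ij} = 1 \mid X_i, p) $$

       1     2     3     4     5
X1     0.1   0.1   0.5   0.2   0.1
X2     0.4   0.2   0.1   0.1   0.2
X3     0.1   0.6   0.1   0.1   0.1
X4     0.2   0.1   0.2   0.3   0.2

• The probability of a sequence X_i with the pattern starting at j is:

$$ \Pr(X_i \mid Z_{ij} = 1, p) = \underbrace{\prod_{k=1}^{j-1} p_{n_k,0}}_{\text{before motif}} \; \underbrace{\prod_{k=j}^{j+W-1} p_{n_k,\,k-j+1}}_{\text{motif}} \; \underbrace{\prod_{k=j+W}^{L} p_{n_k,0}}_{\text{after motif}} $$

where n_k is the character at position k of X_i and L is the length of X_i.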
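The sequence probability and one row of Z can be computed as in the sketch below. Two assumptions are made beyond the slide: positions are 0-based in the code, and the row of Z is obtained by normalising the likelihoods over all start positions, which corresponds to a uniform prior on the start and exactly one occurrence per sequence. The function names are hypothetical.

```python
# Example PWM from the slides: base -> probabilities for columns 1..5.
PWM = {"A": [1, 0.25, 0, 0.25, 0],
       "C": [0, 0, 0.125, 0.25, 0.5],
       "G": [0, 0.75, 0.125, 0.25, 0.25],
       "T": [0, 0, 0.75, 0.25, 0.25]}
BACKGROUND = {b: 0.25 for b in "ACGT"}

def sequence_likelihood(seq, start, pwm, background):
    """Pr(X_i | Z_ij = 1, p) with a 0-based motif start."""
    width = len(next(iter(pwm.values())))
    prob = 1.0
    for k, c in enumerate(seq):
        if start <= k < start + width:
            prob *= pwm[c][k - start]   # inside the motif: PWM column
        else:
            prob *= background[c]       # before/after the motif: background
    return prob

def offset_row(seq, pwm, background):
    """One row of Z: likelihoods normalised over all start positions (assumption)."""
    width = len(next(iter(pwm.values())))
    likes = [sequence_likelihood(seq, j, pwm, background)
             for j in range(len(seq) - width + 1)]
    total = sum(likes)
    if total == 0:
        raise ValueError("no start position has non-zero likelihood")
    return [x / total for x in likes]

print(offset_row("TTAGTCCTT", PWM, BACKGROUND))
```

With the example PWM, only a window starting with A has non-zero probability, so all of the weight in this toy row falls on the start of AGTCC.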

Expectation Maximization (EM)

Given: length W, sequence dataset

set initial values for p

do {
    re-estimate Z from p (E-step)
    re-estimate p from Z (M-step)
} until (change in p < ε)

return p, Z

[Figure: the E-step computes Z from p and the M-step computes p from Z; the two steps alternate.]
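The loop above can be sketched in plain Python. This is a simplified illustration, not MEME: it assumes exactly one motif occurrence per sequence, a uniform background, and a pseudocount in the M-step to keep probabilities non-zero. The helper `init_from_word`, the pseudocount and the toy sequences are hypothetical.

```python
ALPHABET = "ACGT"

def init_from_word(word, weight=0.7):
    """Starting PWM favouring one subsequence (illustrative choice)."""
    rest = (1 - weight) / 3
    return {c: [weight if c == w else rest for w in word] for c in ALPHABET}

def e_step(seqs, p, bg, W):
    """Re-estimate Z from p: normalised motif/background ratio per start."""
    Z = []
    for s in seqs:
        ratios = []
        for j in range(len(s) - W + 1):
            r = 1.0
            for k in range(W):
                r *= p[s[j + k]][k] / bg[s[j + k]]
            ratios.append(r)  # background outside the window cancels out
        total = sum(ratios)
        Z.append([r / total for r in ratios])
    return Z

def m_step(seqs, Z, W, pseudo=0.1):
    """Re-estimate p from Z: expected counts per column, then normalise."""
    counts = {c: [pseudo] * W for c in ALPHABET}
    for s, z in zip(seqs, Z):
        for j, weight in enumerate(z):
            for k in range(W):
                counts[s[j + k]][k] += weight
    col_totals = [sum(counts[c][k] for c in ALPHABET) for k in range(W)]
    return {c: [counts[c][k] / col_totals[k] for k in range(W)] for c in ALPHABET}

def em(seqs, W, p, eps=1e-6, max_iter=200):
    bg = {c: 0.25 for c in ALPHABET}
    for _ in range(max_iter):
        Z = e_step(seqs, p, bg, W)
        new_p = m_step(seqs, Z, W)
        change = max(abs(new_p[c][k] - p[c][k]) for c in ALPHABET for k in range(W))
        p = new_p
        if change < eps:
            break
    return p, e_step(seqs, p, bg, W)

# Toy data with AGTCC planted at 0-based offsets 2, 1, 4 and 0.
seqs = ["TTAGTCCAT", "CAGTCCGGA", "GGTTAGTCC", "AGTCCTATG"]
p, Z = em(seqs, 5, init_from_word("AGTCC"))
print([max(range(len(z)), key=z.__getitem__) for z in Z])
```

Starting from a different word can lead to a different result, which is why the choice of initial p matters.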

More about the EM Algorithm

• EM is a heuristic algorithm for discovering PWM motifs shared by a set of sequences.

• EM converges to a local maximum in the likelihood of the data given the model p:

$$ \prod_i \Pr(X_i \mid p) $$

• EM usually converges in a small number of iterations.

• EM is sensitive to the starting point (i.e., the initial values in p).

MEME

• MEME (Multiple EM for Motif Elicitation) is widely used for motif discovery.

• MEME is based on the EM algorithm with several extensions.

• MEME is available at http://meme.sdsc.edu/meme/website/meme.html.

• The dataset contains 30 yeast promoters from a co-regulated gene cluster. These genes are mostly involved in respiration, and are co-regulated in various stress conditions.

• What is the TF binding site in the shared motif?

The MEME Algorithm

MEME (dataset, W, NSITES, PASSES) {
    for i = 1 to PASSES {
        for each subsequence in dataset {
            run EM for 1 iteration with starting point derived from this subsequence
        }
        choose a motif model with the highest likelihood
        run EM to convergence from the starting point which generated that model
        print converged model of the shared motif
        erase appearances of the motif from the dataset
    }
}

MEME Enhancements to the Basic EM Approach

• Trying many starting points by using every distinct subsequence of length W in the dataset.

• Not assuming that there is exactly one motif occurrence in every sequence.

• Allowing multiple motifs to be learned.

Gibbs Sampling

• For motif discovery, Gibbs sampling can be viewed as a stochastic analog of EM:

– In the EM algorithm, we maintain a distribution Z_i over the possible motif starting positions for each sequence;

– In the Gibbs sampling approach, we maintain a specific starting position for each sequence, but keep re-sampling the starting positions.

• Gibbs sampling may be less susceptible to local maxima than EM.

A Gibbs Sampling Algorithm

Given: length W, sequence dataset

choose random motif positions a (one position a_i per sequence)

do {
    pick a sequence X_i
    estimate p using the motif positions in a for all sequences but X_i (update step)
    sample a new motif position a_i for X_i (sampling step)
} until (change in p < ε)

return p, a
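A minimal site sampler following these steps is sketched below. It is not the Gibbs Motif Sampler or AlignACE: it assumes one site per sequence, a uniform background, pseudocounts in the update step, and a fixed number of iterations in place of the convergence test, since the sampled p keeps fluctuating. All names and parameter values are illustrative.

```python
import random

ALPHABET = "ACGT"

def estimate_pwm(seqs, positions, W, skip=None, pseudo=0.5):
    """Update step: PWM from the current sites, leaving out sequence `skip`."""
    counts = {c: [pseudo] * W for c in ALPHABET}
    used = 0
    for i, (s, a) in enumerate(zip(seqs, positions)):
        if i == skip:
            continue
        used += 1
        for k in range(W):
            counts[s[a + k]][k] += 1
    total = used + 4 * pseudo
    return {c: [x / total for x in counts[c]] for c in ALPHABET}

def gibbs(seqs, W, iterations=500, seed=0):
    rng = random.Random(seed)
    bg = {c: 0.25 for c in ALPHABET}
    # Choose random motif positions a.
    a = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))            # pick a sequence X_i
        p = estimate_pwm(seqs, a, W, skip=i)    # update step
        s = seqs[i]
        weights = []
        for j in range(len(s) - W + 1):
            w = 1.0
            for k in range(W):
                w *= p[s[j + k]][k] / bg[s[j + k]]
            weights.append(w)
        # Sampling step: draw a_i in proportion to the motif/background ratio.
        a[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return estimate_pwm(seqs, a, W), a

seqs = ["TTAGTCCAT", "CAGTCCGGA", "GGTTAGTCC", "AGTCCTATG"]
p, a = gibbs(seqs, 5)
print(a)
```

Because positions are sampled rather than chosen greedily, different seeds can give different answers; running several seeds and comparing the resulting motifs is a common precaution.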

Gibbs Motif Sampler and AlignACE

• Gibbs Motif Sampler:

– Based on the work by Lawrence et al., 1993. Science, 262:208-214.

– Available at http://bayesweb.wadsworth.org/gibbs/gibbs.html.

• AlignACE:

– Based on the Gibbs sampling algorithm with several extensions.

– Available at http://atlas.med.harvard.edu/.

Summary

• For simple sequence patterns, regular expression is a useful tool.

• For some complex sequence patterns, position weight matrix (PWM) is preferred.

• Expectation Maximization (EM) and Gibbs sampling are two useful approaches for sequence pattern discovery.

• Next: protein domain analysis using HMM

Reading

• (Optional) Lawrence et al., 1993. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214.

• Eddy, 2004. What is a hidden Markov model? Nature Biotechnology, 22:1315-1316.

• Eddy, 1998. Multiple alignment and multiple sequence based searches. Trends Guide to Bioinformatics, 15-18.

For This Week’s Lab

• Collect a set of promoter sequences (10-500 sequences in FASTA format) from co-regulated or related genes. The promoter sequences should be the 500-1500 nucleotides upstream of the transcription start sites.

• Collect a set of protein sequences (10-50 sequences in FASTA format) from a gene family or superfamily.
