Statistical modeling and classification in Biological Sequence Space


April 26, 04; 9.520 Gene Yeo

Poggio, Burge @MIT

Framework/Issues

• “Build” models around known biology
– In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
– prediction accuracy
– experimental validation
– higher-level traits of predictions
– conservation in other genomes

Biological sequences

• DNA, RNA and proteins: macromolecules built up from smaller units.

• DNA: units are the nucleotide residues A, C, G and T

• RNA: units are the nucleotide residues A, C, G and U

• Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.

• To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

• Statistical models can be descriptive and/or predictive.

• Given a known biological signal: describe the signal with statistical modeling and find unknown examples of the same signal
– Gene-finding (protein-coding genes)
– Noncoding RNA genes
– Protein domains

• Warning: although successful, models are not to be taken literally.

• Most important: biological confirmation of predictions is almost always necessary.

Different models

[Figure: modeling approaches placed along a complexity axis, across DNA, RNA and protein]

• Splice site motif (WMM, MM, SVM, NN)
• Protein gene (HMM, NN)
• RNA gene (Covariation, SCFG, NN, SVM)
• Protein structure (a variety of methods)

With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.

A case study in computational biology: modeling signals in genes

What is a (protein-coding) gene?

DNA:     CCTGAGCCAACTATTGATGAA
         ↓ transcription
mRNA:    CCUGAGCCAACUAUUGAUGAA
         ↓ translation
Protein: PEPTIDE
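The DNA, mRNA and protein strings on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the DNA shown is the coding strand (so transcription amounts to replacing T with U) and using the standard genetic code; the function names `transcribe` and `translate` are illustrative, not from the lecture.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from position 0 and stop at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

Real genes add complications this sketch ignores: the reading frame must be found, introns must be spliced out, and translation starts at a start codon rather than at position 0.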

Some facts about human genes

Comprise about 3% of the genome

Average gene length: ~8,000 bp

Average of 5-6 exons/gene

Average exon length: ~200 bp

Average intron length: ~2,000 bp

~8% of genes have a single exon

The idea behind an HMM gene finder

• States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5’UTR, 3’UTR, poly-A, ...).

• Observations embody state-dependent statistics, such as base composition, dependence, and signal features.
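To make the state/observation idea concrete, here is a toy three-state HMM decoded with the Viterbi algorithm. It is a sketch only: the states are reduced to intergenic, exon and intron, the observations are single bases, and every probability below is an invented illustrative value rather than a trained parameter or anything taken from a real gene finder.

```python
import math

STATES = ("intergenic", "exon", "intron")

# Made-up illustrative parameters (assumptions, not trained values).
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron":     {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
# State-dependent base composition: exons GC-rich, introns T-rich (assumed).
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.25, "C": 0.15, "G": 0.15, "T": 0.45},
}


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # best[s] = log-probability of the best path ending in state s so far
    best = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        pointers, updated = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[r] + log(TRANS[r][s]))
            pointers[s] = prev
            updated[s] = best[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
        backpointers.append(pointers)
        best = updated
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATGCGCGCGCGCGCGCTTTTTTTTTTTT"))
```

A practical gene finder needs far more than this: separate states for exon phase and strand, explicit length distributions, and signal models for splice sites and other features.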

[Figure: HMM state diagram with exon states (E0, E1, E2 and further exon states), intron states (I0, I1, I2), 5'UTR, 3'UTR, promoter, poly-A and intergenic region, drawn for the forward (+) strand and the reverse (-) strand]

62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC

62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC

62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA

62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC

62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG

62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC

62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC

62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC

62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC

62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA

62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC

62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA

62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT

62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG

62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC

62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA

62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC

62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG

62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT

62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC

63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT

63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT

63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC

63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC

63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT

63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT

63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT

63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG

63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT

63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG

63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG

63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA

63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT

63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA

63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT

GENSCAN (Burge & Karlin)

Splice sites can be an important signal

Regular expressions can be limiting

(C/A)AG GT(A/G)AGT    5’ splice junction in eukaryotes

(T/C)≥11 N (C/T)AG    3’ splice junction

Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.
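A short sketch of why a consensus pattern is brittle: the 5’ splice junction consensus (C/A)AGGT(A/G)AGT is written as a regular expression and applied to two 9-mers. The second 9-mer, CAGGTTGGT, occurs in the sequence listing above (near position 62,276); treating it as a genuine splice junction is an assumption made for illustration, and the helper name `matches_consensus` is hypothetical.

```python
import re

# 5' splice junction consensus (C/A)AGGT(A/G)AGT as a regular expression.
DONOR_CONSENSUS = re.compile(r"[CA]AGGT[AG]AGT")


def matches_consensus(site):
    """True only if the 9-mer agrees with the consensus at every position."""
    return DONOR_CONSENSUS.fullmatch(site) is not None


print(matches_consensus("CAGGTAAGT"))  # agrees with the consensus everywhere
print(matches_consensus("CAGGTTGGT"))  # differs at two positions, so rejected
```

The match is all-or-nothing: a site that deviates at one position is treated the same as a random 9-mer, which is the limitation a position-specific model addresses.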

Position-specific distributions came to represent the variability in motif composition.

Position-specific scoring matrix (PSSM)

S = S1 S2 S3 S4 S5 S6 S7 S8 S9

Odds ratio:

$$R = \frac{P(S|+)}{P(S|-)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)}$$

Score:

$$s = \log_2 R$$

Pos   -3   -2   -1   +1   +2   +3   +4   +5   +6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5
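The score s = log2 R can be computed directly from the matrix above. This sketch assumes a uniform background (P_bg = 0.25 for every base), which the slide does not specify, and returns negative infinity when a base has probability 0 at some position; a real model would usually add pseudocounts instead. The name `pssm_score` is illustrative.

```python
import math

# Position-specific probabilities for positions -3..-1, +1..+6 (from the table).
PSSM = {
    "A": (0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1),
    "C": (0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2),
    "G": (0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2),
    "T": (0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5),
}
# Assumption: uniform background composition.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def pssm_score(site):
    """Log2 odds score of a 9-mer; -inf if any position has probability 0."""
    if len(site) != 9:
        raise ValueError("expected a 9-mer covering positions -3 to +6")
    score = 0.0
    for i, base in enumerate(site):
        p = PSSM[base][i]
        if p == 0.0:
            return float("-inf")
        score += math.log2(p / BACKGROUND[base])
    return score


for site in ("CAGGTAAGT", "CAGGTTGGT", "CAGATAAGT"):
    print(site, pssm_score(site))
```

Unlike an exact consensus match, the score degrades gradually: CAGGTTGGT deviates from the consensus yet still scores above zero, while CAGATAAGT lacks the invariant G at +1 and is excluded.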

Ok, so we got the genes

• Here’s another catch: there isn’t just one version of each gene, but sometimes several.

• molecular biology (transcription, splicing)

• signals are modeled as states (HMM) or separately, e.g. with PSSMs

E.g. alternative splicing: CD44

Zhu et al Science… (2003)

Human chromosome 11p…

Alternative splicing

• is a major determinant of protein diversity (Lander 2001, Zavolan 2003)

• 30-50% of human diseases involve alt. splicing

Defining constitutive and alternative exons

Constitutive exon

Skipped exon

3’ alternative exon

5’ alternative exon

Intron retention

Mutually exclusive exons

[Figure: Fragile X Related Gene, FXR1]

Conserved alternative, skipped exon - FXR1

[Figure: Myotonic Dystrophy-containing WD Repeat, DMWD]

Another example of a gene containing a CSE: DMWD

Predicting new alternatively spliced exons

1. The problem is ‘ill-posed’

2. High-dimensional space

3. Avoid overfitting the data

4. Simple feature selection

5. Unbalanced data set sizes

6. Labels are more “flexible”

Eg. of experimentally validated

Biological sequence space: challenges

• Models that “represent” as much of the biology as possible.

• Biologically motivated features are important

• Validating attributes:

– Conservation of events is key in computational biology

– Higher-level consistency with known biology

• Experimental validation of predictions is essential

Framework/Issues

• “Build” models around known biology
– In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
– prediction accuracy
– experimental validation
– higher-level traits of predictions
– conservation in other genomes

Modeling higher order interactions: Yeast Phe tRNA

[Figure: secondary structure and tertiary structure]

If time permits

The Hammerhead Ribozyme

Secondary structure Tertiary structure

Seq1: A C G A A A G U

Seq2: U A G U A A U A

Seq3: A G G U G A C U

Seq4: C G G C A A U G

Seq5: G U G G G A A C

Method of Covariation / Compensatory changes

One example of how to model and predict RNA 2° structure

• Covariation (using comparative genomics)

Mutual information statistic for a pair of columns i, j in a multiple alignment:

$$M_{ij} = \sum_{x,y} f_{x,y}^{(i,j)} \log_2 \frac{f_{x,y}^{(i,j)}}{f_x^{(i)}\, f_y^{(j)}}$$

where
– $f_{x,y}^{(i,j)}$ = fraction of seqs w/ nt. x in col. i, nt. y in col. j
– $f_x^{(i)}$ = fraction of seqs w/ nt. x in col. i
– the sum is over x, y = A, C, G, U

$M_{ij}$ is maximal (2 bits) if x and y individually appear at random (A, C, G, U equally likely), but are perfectly correlated (e.g., always complementary).
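As a worked example, the mutual information statistic can be evaluated on the five aligned sequences Seq1 to Seq5 shown earlier. This is a sketch with 0-based column indices and an illustrative function name (`mutual_information`); with only five sequences the estimates are very noisy, so the numbers serve only to show the calculation.

```python
import math
from collections import Counter

# Seq1..Seq5 from the covariation slide, with spaces removed.
ALIGNMENT = ["ACGAAAGU", "UAGUAAUA", "AGGUGACU", "CGGCAAUG", "GUGGGAAC"]


def mutual_information(alignment, i, j):
    """M_ij in bits for columns i and j (0-based) of a gap-free alignment."""
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)             # single-column counts
    f_j = Counter(seq[j] for seq in alignment)
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)  # joint counts
    total = 0.0
    for (x, y), count in f_ij.items():
        f_xy = count / n
        total += f_xy * math.log2(f_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return total


# Columns 1 and 8 (indices 0 and 7) always hold complementary bases.
print(round(mutual_information(ALIGNMENT, 0, 7), 4))  # 1.9219
# Columns 3 and 6 (indices 2 and 5) are invariant, so they carry no information.
print(mutual_information(ALIGNMENT, 2, 5))            # 0.0
```

The first pair scores close to the 2-bit maximum because the bases vary between sequences but always remain complementary; an invariant column scores zero even if it is base-paired, which is why covariation needs sequence diversity.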

Inferring 2° Structure from Covariation

Stochastic Context-Free Grammars (SCFGs)

• A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA)

Ref: Durbin et al., “Biological Sequence Analysis”, 1998

An SCFG Model of RNA 2° Structure

“Production Rules”:

• P → aWb (“pair”)
• L → aW (“left bulge/loop”)
• R → Wa (“right bulge/loop”)
• B → SS (“bifurcation”)
• S → W (“start”)
• E → ε (“end”)
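The production rules can be read as a recipe for generating RNA strings. The sketch below samples one string by expanding nonterminals recursively, where W stands for any of P, L, R, B or E. The rule weights, the restriction of pair emissions to Watson-Crick pairs, and the depth cap that forces termination are all assumptions for illustration; a fitted SCFG would learn emission and transition probabilities from data.

```python
import random

BASES = "ACGU"
PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]

# Assumed weights for which state W expands to (illustrative only).
W_STATES = ["P", "L", "R", "B", "E"]
W_WEIGHTS = [0.50, 0.15, 0.15, 0.05, 0.15]


def choose_w(rng):
    """Pick the state that the nonterminal W stands for."""
    return rng.choices(W_STATES, weights=W_WEIGHTS)[0]


def emit(state, rng, depth=0, max_depth=6):
    """Expand one nonterminal into an RNA string by sampling production rules."""
    if state == "E" or depth >= max_depth:   # E -> epsilon (or forced stop)
        return ""
    if state == "S":                         # S -> W
        return emit(choose_w(rng), rng, depth + 1, max_depth)
    if state == "P":                         # P -> a W b, a and b base-paired
        a, b = rng.choice(PAIRS)
        return a + emit(choose_w(rng), rng, depth + 1, max_depth) + b
    if state == "L":                         # L -> a W
        return rng.choice(BASES) + emit(choose_w(rng), rng, depth + 1, max_depth)
    if state == "R":                         # R -> W a
        return emit(choose_w(rng), rng, depth + 1, max_depth) + rng.choice(BASES)
    if state == "B":                         # B -> S S
        left = emit("S", rng, depth + 1, max_depth)
        return left + emit("S", rng, depth + 1, max_depth)
    raise ValueError(f"unknown state: {state}")


rng = random.Random(0)
print(emit("S", rng))
```

Because P emits its two bases on opposite sides of whatever W produces, nested pair rules generate stems, which is the non-local dependency that position-by-position models cannot express.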

last page

• Some of the slides were obtained from various places:
– slides available online (primarily from lectures by Terry Speed)
– slides from Chris Burge and Dirk Holste
