Statistical modeling and classification in Biological Sequence Space April 26, 04; 9.520 Gene Yeo Poggio, Burge @MIT

Page 1: Statistical modeling and classification in  Biological Sequence Space

Statistical modeling and classification in Biological Sequence Space

April 26, 04; 9.520 Gene Yeo

Poggio, Burge @MIT

Page 2: Statistical modeling and classification in  Biological Sequence Space

Framework/Issues

• “Build” models around known biology
  – In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
  – prediction accuracy
  – experimental validation
  – higher-level traits of predictions
  – conservation in other genomes

Page 3: Statistical modeling and classification in  Biological Sequence Space

Biological sequences

• DNA, RNA and proteins: macromolecules built up from smaller units.
• DNA: units are the nucleotide residues A, C, G and T
• RNA: units are the nucleotide residues A, C, G and U
• Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
• To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

Page 4: Statistical modeling and classification in  Biological Sequence Space

• Statistical models can be descriptive and/or predictive.
• Given a known biological signal -> describe the signal with statistical modeling & find unknown examples of the same signal
  – Gene-finding (protein-coding genes)
  – Noncoding RNA genes
  – Protein domains

• Warning: although successful, models are not to be taken literally.

• Most important: biological confirmation of predictions is almost always necessary.

Page 5: Statistical modeling and classification in  Biological Sequence Space

Different models

[Figure: models arranged by complexity (vertical axis) across DNA, RNA and Protein (horizontal axis)]

• Splice site motif (WMM, MM, SVM, NN)
• Protein gene (HMM, NN)
• RNA gene (Covariation, SCFG, NN, SVM)
• Protein structure (a variety of methods)

Page 6: Statistical modeling and classification in  Biological Sequence Space

With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.

A case study in computational biology: modeling signals in genes

Page 7: Statistical modeling and classification in  Biological Sequence Space

What is a (protein-coding) gene?

DNA:     CCTGAGCCAACTATTGATGAA
           | transcription
mRNA:    CCUGAGCCAACUAUUGAUGAA
           | translation
Protein: PEPTIDE
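The DNA -> mRNA -> protein example on this slide can be reproduced in a few lines. The sketch below assumes the standard genetic code and a reading frame starting at the first base; the function names `transcribe` and `translate` are illustrative and do not come from the slides.

```python
# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA in frame 0 into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[pos:pos + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```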

Page 8: Statistical modeling and classification in  Biological Sequence Space
Page 9: Statistical modeling and classification in  Biological Sequence Space

Some facts about human genes

Comprise about 3% of the genome
Average gene length: ~8,000 bp
Average of 5-6 exons/gene
Average exon length: ~200 bp
Average intron length: ~2,000 bp
~8% of genes have a single exon

Page 10: Statistical modeling and classification in  Biological Sequence Space

The idea behind a HMM genefinder

• States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5’UTR, 3’UTR, poly-A, ...).

• Observations embody state-dependent statistics, such as base composition, dependence, and signal features.
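To make the state/observation idea concrete, here is a minimal Viterbi decoder for a toy two-state model. Everything numeric in it is an assumption made for illustration: the two states, the AT-rich versus GC-rich emission tables, and the 0.9/0.1 transition probabilities are invented, and a real gene finder uses many more states and richer, state-dependent statistics.

```python
import math

# Toy parameters (assumed for illustration only, not taken from any real gene finder).
STATES = ["intergenic", "exon"]
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},  # AT-rich
    "exon": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},        # GC-rich
}

def viterbi(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for seq, computed in log space."""
    scores = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    backpointers = []
    for base in seq[1:]:
        column, pointer = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans[p][s]))
            column[s] = scores[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][base])
            pointer[s] = prev
        scores.append(column)
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path

labels = viterbi("ATATATATGCGCGCGCATATATAT")
print("".join("E" if s == "exon" else "-" for s in labels))
```

With these toy numbers the GC-rich block in the middle is labelled as exon, because paying two state switches costs less than emitting eight unlikely bases from the AT-rich state.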

Page 11: Statistical modeling and classification in  Biological Sequence Space

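[Figure: GENSCAN state diagram, drawn for the forward (+) strand and mirrored for the reverse (-) strand. State labels: intergenic region; promoter; 5'UTR; exons E0, E1, E2 and Es, plus further exon states whose subscripts were garbled in extraction ("tEi"); introns I0, I1, I2; 3'UTR; poly-A.]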

62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC

62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC

62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA

62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC

62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG

62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC

62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC

62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC

62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC

62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA

62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC

62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA

62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT

62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG

62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC

62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA

62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC

62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG

62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT

62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC

63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT

63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT

63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC

63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC

63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT

63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT

63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT

63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG

63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT

63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG

63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG

63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA

63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT

63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA

63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT

GENSCAN (Burge & Karlin)

Page 12: Statistical modeling and classification in  Biological Sequence Space

Splice sites can be an important signal

Page 13: Statistical modeling and classification in  Biological Sequence Space

Regular expressions can be limiting

5’ splice junction in eukaryotes: (C/A)AG | GT(A/G)AGT

3’ splice junction: (T/C)≥11 N (C/T)AG

Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.

Position-specific distributions came to represent the variability in motif composition.
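A small sketch of why a consensus pattern is limiting. It assumes the 5' splice consensus reads (C/A)AG|GT(A/G)AGT, written here as a regular expression; the pattern name and helper function are illustrative. The second test string, CAGGTTGGT, occurs in the sequence listing on the GENSCAN slide and is assumed here to be a genuine donor site.

```python
import re

# Assumed 5' splice consensus (C/A)AG|GT(A/G)AGT as a regular expression.
DONOR_CONSENSUS = re.compile(r"[CA]AGGT[AG]AGT")

def matches_consensus(site):
    """True only if the 9-mer matches the consensus at every position."""
    return DONOR_CONSENSUS.fullmatch(site) is not None

print(matches_consensus("CAGGTAAGT"))  # perfect consensus site: True
print(matches_consensus("CAGGTTGGT"))  # deviates at +3 and +4: False
```

The regular expression gives a yes/no answer, so a site that deviates at one or two positions is rejected outright, whereas a position-specific distribution can still give it a graded score.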

Page 14: Statistical modeling and classification in  Biological Sequence Space

Position-specific scoring matrix (PSSM)

S = S1 S2 S3 S4 S5 S6 S7 S8 S9

Odds ratio:

\[ R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)} \]

Score:

\[ s = \log_2 R \]

Pos   -3   -2   -1   +1   +2   +3   +4   +5   +6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5
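The score s = log2 R can be computed directly from the position-specific frequencies on this slide. The sketch below uses the slide's matrix for positions -3 to +6; the uniform background of 0.25 per base and the choice to return minus infinity when a base has frequency 0 are assumptions made for the example.

```python
import math

# Frequencies from the slide, listed for positions -3, -2, -1, +1, +2, +3, +4, +5, +6.
PSSM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background

def pssm_score(site):
    """Log2 odds ratio of a 9-mer under the motif model versus the background."""
    score = 0.0
    for pos, base in enumerate(site):
        p = PSSM[base][pos]
        if p == 0.0:
            return float("-inf")  # base never observed at this position
        score += math.log2(p / BACKGROUND[base])
    return score

print(round(pssm_score("CAGGTAAGT"), 2))  # consensus-like site
print(round(pssm_score("CAGGTTGGT"), 2))  # non-consensus site, still scores above 0
print(pssm_score("CAGCTAAGT"))            # C at +1 has frequency 0
```

In practice, pseudocounts are often added to the frequency table so that a single unobserved base does not force the score to minus infinity; that step is omitted here to stay close to the slide.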

Page 15: Statistical modeling and classification in  Biological Sequence Space

Ok, so we got the genes

• Here’s another catch: there isn’t always just one version of each gene, but sometimes several.
• molecular biology (transcription, splicing)
• signals are modeled as states (HMM) or separately, e.g. as PSSMs

Page 16: Statistical modeling and classification in  Biological Sequence Space

Eg. alternative splicing - CD44

Zhu et al Science… (2003)

Human chromosome 11p…

Page 17: Statistical modeling and classification in  Biological Sequence Space

Alternative splicing

• is a major determinant of protein diversity (Lander 2001, Zavolan 2003)

• 30-50% of human diseases involve alt. splicing

Page 18: Statistical modeling and classification in  Biological Sequence Space

Defining constitutive and alternative exons

Constitutive exon

Skipped exon

3’ alternative exon

5’ alternative exon

Intron retention

Mutually exclusive exons

Page 19: Statistical modeling and classification in  Biological Sequence Space

[Figure: Fragile X Related Gene, FXR1]

Conserved alternative, skipped exon - FXR1

Page 20: Statistical modeling and classification in  Biological Sequence Space

[Figure: Myotonic Dystrophy-containing WD Repeat, DMWD]

Another example of genes containing CSE: DMWD

Page 21: Statistical modeling and classification in  Biological Sequence Space

Predicting new alternatively spliced exons

1. The problem is ‘ill-posed’

2. High-dimensional space

3. Must not overfit the data

4. Simple feature selection

5. Unbalanced data set sizes

6. Labels are more “flexible”

Page 22: Statistical modeling and classification in  Biological Sequence Space

Eg. of experimentally validated

Page 23: Statistical modeling and classification in  Biological Sequence Space

Biological sequence space: challenges

• Models that “represent” as much of the biology as possible.
• Biologically motivated features are important
• Validating attributes:
  – Conservation of events is key in computational biology
  – Higher-level consistency with known biology
• Experimental validation of predictions is essential

Page 24: Statistical modeling and classification in  Biological Sequence Space

Framework/Issues

• “Build” models around known biology
  – In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
  – prediction accuracy
  – experimental validation
  – higher-level traits of predictions
  – conservation in other genomes

Page 25: Statistical modeling and classification in  Biological Sequence Space

Modeling higher order interactions: Yeast Phe tRNA

(If time permits)

[Figure panels: Secondary Structure; Tertiary Structure]

Page 26: Statistical modeling and classification in  Biological Sequence Space

The Hammerhead Ribozyme

Secondary structure Tertiary structure

Page 27: Statistical modeling and classification in  Biological Sequence Space

Seq1: A C G A A A G U

Seq2: U A G U A A U A

Seq3: A G G U G A C U

Seq4: C G G C A A U G

Seq5: G U G G G A A C

Method of Covariation / Compensatory changes

One example of how to model and predict RNA 2° structure

• Covariation (using comparative genomics)

Page 28: Statistical modeling and classification in  Biological Sequence Space

Mutual information statistic for pair of columns in a multiple alignment

\[ M_{ij} = \sum_{x,y} f_{x,y}^{(i,j)} \log_2 \frac{f_{x,y}^{(i,j)}}{f_x^{(i)}\, f_y^{(j)}} \]

where

f_{x,y}^{(i,j)} = fraction of seqs w/ nt. x in col. i, nt. y in col. j

f_x^{(i)} = fraction of seqs w/ nt. x in col. i

sum over x, y = A, C, G, U

M_{ij} is maximal (2 bits) if x and y individually appear at random (A,C,G,U equally likely), but are perfectly correlated (e.g., always complementary)
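The mutual information statistic can be evaluated on the five-sequence alignment from the covariation slide. The sketch below uses 0-based column indices and plain observed frequencies with no pseudocounts or small-sample correction, both of which are choices made for this example; the function name is illustrative.

```python
import math
from collections import Counter

# Alignment from the covariation slide (Seq1 to Seq5).
ALIGNMENT = ["ACGAAAGU", "UAGUAAUA", "AGGUGACU", "CGGCAAUG", "GUGGGAAC"]

def mutual_information(alignment, i, j):
    """M_ij in bits for columns i and j (0-based) of a gap-free alignment."""
    n = len(alignment)
    f_i = Counter(seq[i] for seq in alignment)
    f_j = Counter(seq[j] for seq in alignment)
    f_ij = Counter((seq[i], seq[j]) for seq in alignment)
    total = 0.0
    for (x, y), count in f_ij.items():
        f_xy = count / n
        total += f_xy * math.log2(f_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return total

print(round(mutual_information(ALIGNMENT, 0, 7), 3))  # first and last columns: 1.922
print(mutual_information(ALIGNMENT, 2, 5))            # two invariant columns: 0.0
```

The first and last columns are always complementary (A-U, U-A, A-U, C-G, G-C), so their mutual information equals the entropy of either column, about 1.92 bits. It falls short of the 2-bit maximum only because the four bases are not equally frequent in five sequences. Invariant columns carry no covariation signal and score 0.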

Page 29: Statistical modeling and classification in  Biological Sequence Space

Inferring 2° Structure from Covariation

Page 30: Statistical modeling and classification in  Biological Sequence Space

Stochastic Context-Free Grammars (SCFGs)

• A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA)

Ref:

Durbin et al. “Biological Sequence Analysis” 1998

Page 31: Statistical modeling and classification in  Biological Sequence Space

An SCFG Model of RNA 2° Structure

“Production Rules”:

• P → aWb (“pair”)
• L → aW (“left bulge/loop”)
• R → Wa (“right bulge/loop”)
• B → SS (“bifurcation”)
• S → W (“start”)
• E → ε (“end”)
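The production rules map onto a dynamic-programming recursion over subsequences. The sketch below is not an SCFG: it is the simpler, non-probabilistic Nussinov base-pair maximization, shown because its cases mirror the rules (pairing the two ends corresponds to "pair", leaving an end unpaired to the bulge/loop rules, and splitting the interval to "bifurcation"). The minimum hairpin loop of 3 and the restriction to Watson-Crick pairs are assumptions made for the example. A full SCFG would attach probabilities to each rule and use a CYK-style algorithm instead.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def max_base_pairs(seq, min_loop=3):
    """Nussinov recursion: maximum number of nested base pairs in seq."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j],   # i left unpaired (like L -> aW)
                        best[i][j - 1])   # j left unpaired (like R -> Wa)
            if (seq[i], seq[j]) in WATSON_CRICK:
                score = max(score, best[i + 1][j - 1] + 1)  # i pairs with j (like P -> aWb)
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])  # split (like B -> SS)
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))  # 3: a three-pair stem closing an AAA loop
print(max_base_pairs("AAAA"))       # 0: nothing can pair
```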

Page 32: Statistical modeling and classification in  Biological Sequence Space

last page

• some of the slides were obtained from various places:
  – available online slides on the web (primarily from lectures by Terry Speed).
  – slides from Chris Burge, Dirk Holste