67
1 (c) Mark Gerstein, 2005, Yale, bioinfo.mbb.yale.edu BIOINFORMATICS Sequences #2 Mark Gerstein, Yale University GersteinLab.org/courses/452 (last edit in fall '05)

BIOINFORMATICS Sequences #2

  • Upload
    whitney

  • View
    44

  • Download
    0

Embed Size (px)

DESCRIPTION

BIOINFORMATICS Sequences #2. Mark Gerstein, Yale University GersteinLab.org/courses/452 (last edit in fall '05). Basic Alignment via Dynamic Programming Suboptimal Alignment Gap Penalties Similarity (PAM) Matrices Multiple Alignment Profiles, Motifs, HMMs Local Alignment - PowerPoint PPT Presentation

Citation preview

Page 1: BIOINFORMATICS Sequences #2

1

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

BIOINFORMATICSSequences #2

Mark Gerstein, Yale UniversityGersteinLab.org/courses/452

(last edit in fall '05)

Page 2: BIOINFORMATICS Sequences #2

2

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Sequence Topics (Contents)

• Basic Alignment via Dynamic Programming

• Suboptimal Alignment• Gap Penalties• Similarity (PAM) Matrices• Multiple Alignment• Profiles, Motifs, HMMs• Local Alignment• Probabilistic Scoring Schemes• Rapid Similarity Search: Fasta• Rapid Similarity Search: Blast

• Practical Suggestions on Sequence Searching

• Transmembrane helix predictions

• Secondary Structure Prediction: Basic GOR

• Secondary Structure Prediction: Other Methods

• Assessing Secondary Structure Prediction

Page 3: BIOINFORMATICS Sequences #2

3

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Computational Complexity

Page 4: BIOINFORMATICS Sequences #2

4

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Computational Complexity

• Basic NW Algorithm is O(n2) (in speed) M x N squares to fill At each square need to

look back (M’+N’) “black” squares to find max in block

M x N x (M’+N’) -> O(n3) However, max values in

block can be cached, so algorithm is really only O(n2)

• O(n2) in memory too!• Improvements can

(effectively) reduce sequence comparison to O(n) in both

A B C N Y R Q C L C R P M

A 1

Y 1

C 1 1 1

Y 1

N 1

R 5 4 3 3 2 2 0 0

C 3 3 4 3 3 3 3 4 3 3 1 0 0

K 3 3 3 3 3 3 3 3 3 2 1 0 0

C 2 2 3 2 2 2 2 3 2 3 1 0 0

R 2 1 1 1 1 2 1 1 1 1 2 0 0

B 1 2 1 1 1 1 1 1 1 1 1 0 0

P 0 0 0 0 0 0 0 0 0 0 0 1 0

N

M

N’

M’

Core

Page 5: BIOINFORMATICS Sequences #2

5

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

How normal dynamic programming without caching has O(N3) complexity

Page 6: BIOINFORMATICS Sequences #2

6

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Rapid Similarity Search: FASTA and BLAST

Page 7: BIOINFORMATICS Sequences #2

7

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

FASTA• Hash table of short words in the query sequence• Go through DB and look for matches in the query

hash (linear in size of DB)• perl: $where{“ACT”} = 1,45,67,23....• K-tuple determines word size (k-tup 1 is single aa)• by Bill Pearson & D Lipman

VLICTAVLMVLICTAAAVLICTMSDFFD

VLICT = _

Core

Page 8: BIOINFORMATICS Sequences #2

8

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

A More Interesting Dot Matrix

(adapted from R Altman)

Page 9: BIOINFORMATICS Sequences #2

9

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Join together query lookups into

diagonals and then a full alignment

(Adapted from D Brutlag)

Page 10: BIOINFORMATICS Sequences #2

10

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

End of class 2005,11.07 (m-7)

Page 11: BIOINFORMATICS Sequences #2

11

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Basic Blast

• Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410

• Indexes query (also tried indexing DB [blat])

• Starts with all overlapping words from query

• Calculates “neighborhood” of each word using PAM matrix and probability threshold matrix and probability threshold

• Looks up all words and neighbors from query in database index

• Extends High Scoring Pairs (HSPs) left and right to maximal length [Pure hashing type of approach]

• Finds Maximal Segment Pairs (MSPs) between query and database

• Blast 1 does not permit gaps in alignments

Core

Page 12: BIOINFORMATICS Sequences #2

12

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Blast: Extension of Hash Hits

• Extend hash hits into High Scoring Segment Pairs (HSPs)

• Stop extension when total score doesn’t increase

• Extension is O(N). This takes most of the time in Blast

Core

Page 13: BIOINFORMATICS Sequences #2

13

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Blasting against the DB

• In simplest Blast algorithm, find best scoring segment in each DB sequence

• Statistics of these scores determine significance

Number of hash hits is proportional to O(N*M*D), where N is the query size, M is the average DB seq. size, and D is the size of the DB

Page 14: BIOINFORMATICS Sequences #2

14

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Analytic Score Formalism for BlastKarlin-Altschul statistics for occurrence of high-scoring segments (HSPs) in random sequences Extra

Page 15: BIOINFORMATICS Sequences #2

15

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Blast2: Gapped

Blast

Core

Page 16: BIOINFORMATICS Sequences #2

16

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Blast2: Gapped Blast

• Gapped Extension on Diagonals with two Hash Hits

• Statistics of Gapped Alignments follows EVD empirically

Core

Page 17: BIOINFORMATICS Sequences #2

17

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

-Blast• Automatically builds profile

and then searches with this• Also PHI-blast

Parameters: overall threshold, inclusion threshold, interations

Page 18: BIOINFORMATICS Sequences #2

18

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

PSI-Blast

Iteration Scheme

BlastFASTASmith-

WatermanPSI-BlastProfilesHMMs

Spe

ed

Sen

sitiv

ity

Core

Page 19: BIOINFORMATICS Sequences #2

19

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Sequence Searching Issues

Page 20: BIOINFORMATICS Sequences #2

20

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Practical Issues on

DNA Searching

• Examine results with exp. between 0.05 and 10

• Reevaluate results of borderline significance using limited query

• Beware of hits on long sequences

• Limit query length to 1,000 bases

• Segment query if more than 1,000 bases

• Search both strands

• Protein search is more sensitive than nuc. acid, Translate ORFs(codons & subst. matrices)

• BLAST for infinite gap penalty

• Smith-Waterman for cDNA/genome comparisons

• cDNA =>Zero gap-Transition matrices Consider transition matrices

• Ensure that expected value of score is negative

(graphic and some text adapted from D Brutlag)

Page 21: BIOINFORMATICS Sequences #2

21

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

General Protein Search Principles

• Choose between local or global search algorithms

• Use most sensitive search algorithm available

• Original BLAST for no gaps• Smith-Waterman for most

sensitivity• FASTA with k-tuple 1 is a

good compromise• Gapped BLAST for well

delimited regions• PSI-BLAST for families• Initially BLOSUM62 and

default gap penalties

• If no significant results, use BLOSUM30 and lower gap penalties

• FASTA e-value cutoff of .01• Blast p-value cutoff of .0001• Examine results between exp. 0.05

and 10 for biological significance• Ensure expected score is negative• Beware of hits on long sequences or

hits with unusual aa composition• Reevaluate results of borderline

significance using limited query region

• Segment long queries 300 amino acids

• Segment around known motifs

(some text adapted from D Brutlag)

Page 22: BIOINFORMATICS Sequences #2

22

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Significance Dependson Database Size

• The Significance of Similarity Scores Decreases with Database Growth The score between any pair of sequence pair is constant The number of database entries grows exponentially The number of nonhomologous entries >> homologous entries Greater sensitivity is required to detect homologies

Greater s

• Score of 100 might rank as best in database of 1000 but only in top-100 of database of 1000000

DB-1 DB-2

DB-3

QUERY

Page 23: BIOINFORMATICS Sequences #2

23

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Low-Complexity Regions

• Low Complexity Regions Different Statistics for matching

AAATTTAAATTTAAATTTAAATTTAAATTTthanACSQRPLRVSHRSENCVASNKPQLVKLMTHVKDFCV

Automatic Programs Screen These Out (SEG)

Identify through computation of sequence entropy in a window of a

given size

H = f(a) log2 f(a)

• Also, Compositional Bias Matching A-rich query to A-rich DB vs. A-poor DB

LLLLLLLLLLLLL

Core

Page 24: BIOINFORMATICS Sequences #2

24

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Sequence Searching History

Extra

HMMs

(threading)

Page 25: BIOINFORMATICS Sequences #2

25

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Secondary Structure Prediction Overview

• Why interesting? Not tremendous success, but many methods brought to bear. What does difficulty tell about protein structure?

• Start with TM Prediction (Simpler)• Basic GOR Sec. Struc. Prediction• Better GOR

GOR III, IV, semi-parametric improvements, DSC

• Other Methods NN, nearest nbr.

Page 26: BIOINFORMATICS Sequences #2

26

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

What secondary structure

prediction tries to accomplish?

• Not Same as Tertiary Structure Prediction -- no coordinates

• Need torsion angles of terms + slight diff. in torsions of sec. str.

Credits: Rost et al. 1993; Fasman & Gilbert, 1990

Sequence RPDFCLEPPYTGPCKARIIRYFYNAKAGLVQTFVYGGCRAKRNNFKSAEDAMRTCGGAStructure CCGGGGCCCCCCCCCCCEEEEEEETTTTEEEEEEECCCCCTTTTBTTHHHHHHHHHCC

Page 27: BIOINFORMATICS Sequences #2

27

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

TM Helix Identification: the Problem

Page 28: BIOINFORMATICS Sequences #2

28

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

How to use GES to predict proteins

• Transmembrane segments can be identified by using the GES hydrophobicity scale (Engelman et al., 1986). The values from the scale for amino acids in a window of size 20 (the typical size of a transmembrane helix) were averaged and then compared against a cutoff of -1 kcal/mole. A value under this cutoff was taken to indicate the existence of a transmembrane helix.

• H-19(i) = [ H(i-9)+H(i-8)+...+H(i) + H(i+1) + H(i+2) + . . . + H(i+9) ] / 19

Core

Page 29: BIOINFORMATICS Sequences #2

29

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Some TM scales: GES KD I 4.5

V 4.2L 3.8F 2.8C 2.5M 1.9A 1.8G -0.4T -0.7W -0.9S -0.8Y -1.3P -1.6H -3.2E -3.5Q -3.5D -3.5N -3.5 K -3.9R -4.5

F -3.7M -3.4I -3.1L -2.8V -2.6C -2.0W -1.9A -1.6T -1.2G -1.0S -0.6P +0.2Y +0.7H +3.0Q +4.1N +4.8E +8.2K +8.8D +9.2R +12.3

For instance, G from transfer of a Phe amino acid from water to hexane

Goldman, Engleman, SteitzKD – Kyte Dolittle

Page 30: BIOINFORMATICS Sequences #2

30

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Graph showing Peaks in scales

Illustrations Adapted From: von Heijne, 1992; Smith notes, 1997

Core

Page 31: BIOINFORMATICS Sequences #2

31

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Basic Ideas

• f(aa x,obs [in TM helix]) / f(aa x, exp [from DB])

• Substitution matrices

• Pseudo E = Log obs/exp

• Window averaging with log obs/exp instead of dG (GES)

• Refinements:F(x, obs in TM helix at the n-term position)

Page 32: BIOINFORMATICS Sequences #2

32

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

End of class 2005,11.09 (m-8)

Page 33: BIOINFORMATICS Sequences #2

33

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Statistics Based Methods:Persson & Argos

• Propensity P(A) for amino acid A to be in the middle of a TM helix or near the edge of a TM helix

A

A

AnAn

AnAn

AP

)everywhere,()everywhere,(

)TM,()TM,(

)(

Illustration Credits: Persson & Argos, 1994

P(A) = fTM(A)/fSwissProt(A)

Core

Page 34: BIOINFORMATICS Sequences #2

34

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Scale DetailExtra

Page 35: BIOINFORMATICS Sequences #2

35

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Refinements:MaxH

• How to train to find right threshold? Not that many TM helices

• Marginal TM helices are not that hydrophobic but 1/3 of TM's are very hydrophobic, so focus on these.

0.0%

0.5%

1.0%

1.5%

2.0%

2.5%

3.0%

-3.0

0

-2.7

5

-2.5

0

-2.2

5

-2.0

0

-1.7

5

-1.5

0

-1.2

5

-1.0

0

-0.7

5

-0.5

0

-0.2

5

0.0

0

0.2

5

0.5

0

Min H value

Fre

q.

in w

orm

gen

om

e

TM Marginal

Thresholds

Soluble

0

500

1000

1500

2000

2500

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Number of TM helices per ORF

Nu

mb

er

of

Wo

rm O

RF

s

marginal

sure

•Sosui, Klein & Delisi, Boyd•Discriminant analysis: set threshold to be best partition of dataset

Solb

TM

Extra

Page 36: BIOINFORMATICS Sequences #2

36

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Secondary Struc. Prediction: GOR

Page 37: BIOINFORMATICS Sequences #2

37

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

GOR: Simplifications

• For independent events just add up the information

• I(Sj ; R1, R2, R3,...Rlast) = Information that first through last residue of protein has on the conformation of residue j (Sj) Could get this just from sequence sim. or if same struc. in DB

(homology best way to predict sec. struc.!)

• Simplify using a 17 residue window: I(Sj=H ; R[j-8], R[j-7], ...., R[j], .... R[j+8])

• Difference of information for residue to be in helix relative to not: I(dSj;y) = I(Sj=H;y)-I(Sj=~H;y) odds ratio: I(dSj;y)= ln P(Sj;y)/P(~Sj;y) I determined by observing counts in the DB, essentially a lod value

Core

Page 38: BIOINFORMATICS Sequences #2

38

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Basic GOR

• Pain & Robson, 1971; Garnier, Osguthorpe, Robson, 1978

• I ~ sum of I(Sj,R[j+m]) over 17 residue window centered on j and indexed by m I(Sj,R[j+m]) = information that residue at position m in

window has about conformation of protein at position j 1020 bins=17*20*3

• In Words Secondary structure prediction can be done using the

GOR program (Garnier et al., 1996; Garnier et al., 1978; Gibrat et al., 1987). This is a well-established and commonly used method. It is statistically based so that the prediction for a particular residue (say Ala) to be in a given state (i.e. helix) is directly based on the frequency that this residue (and taking into account neighbors at +/- 1, +/- 2, and so forth) occurs in this state in a database of solved structures. Specifically, for version II of the GOR program (Garnier et al., 1978), the prediction for residue i is based on a window from i-8 to i+8 around i, and within this window, the 17 individual residue frequencies (singlets).

f(H,+3)/f(~H,+3)

-8 +80 3

Page 39: BIOINFORMATICS Sequences #2

39

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Overview of variables in simple GOR

Page 40: BIOINFORMATICS Sequences #2

40

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Directional Information

helix

strand

coil

Credits: King & Sternberg, 1996

OBS LOD= ln ------- EXP

Core

Page 41: BIOINFORMATICS Sequences #2

41

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Types of Residues

• Group I favorable residues and Group II unfavorable one: • A, E, L -> H; V, I, Y, W, C -> E; G, N, D, S -> C• P complex; largest effect on proceeding residue• Some residues favorable at only one terminus (K)

Credits: King & Sternberg, 1996

Core

Page 42: BIOINFORMATICS Sequences #2

42

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

GOR IV

• I(Sj; R[j+m], R[j+n]) = the frequencies of all 136 (=16*17/2) possible di-residue pairs (doublets) in the window. 20*20*3*16*17/2=163200 pairs

• Parameter Explosion Problem: 1000 dom. struc. * 100 res./dom. = 100k counts, over how many bins

• Dummy counts for low values (Bayes)

All Pairs

All Singletons in 17 residue window

Core

Page 43: BIOINFORMATICS Sequences #2

43

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

How to calculate an entry in the GOR I tables and a comparison to GOR IV

Core

Page 44: BIOINFORMATICS Sequences #2

44

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Spectrum of calculations

Simple - 20 values at position i

Gor I - ~1000 values within 17 res window at i

Gor IV ~ 160K, all pairs within the window

(bin = how many times do I have a helix at i with A at position m=5 and V at position n=-4)

GOR-2010 -- bigger window, triplets!!

GOR-5000 -- all 15mer words, 20^15!!

Page 45: BIOINFORMATICS Sequences #2

45

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

An example of mini-GOR

Also, why secondary structure

prediction is so hard

Core

Page 46: BIOINFORMATICS Sequences #2

46

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

How to set up assessment

•AASDTLVVIPWERE Input Seq•HHHHHEEEECCCHH Pred.•Hhhheeeeeeeech Gold Std.

Page 47: BIOINFORMATICS Sequences #2

47

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Assessment

• Q3 + other assess, 3x3• Q3 = total number of

residues predicted correctly over total number of residues (PPV)

• GOR gets 65% sum of diagonal over total number

of residue -- (14K+5K+21K)/ 64K

• Under predict strands & to a lesser degree, helices: 5.9 v 4.1, 10.9 v 10.6

Credits: Garnier et al., 1996

PPV = number get right / number things you predict

Roc curve = number right / number of known positives

Page 48: BIOINFORMATICS Sequences #2

48

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Training and Testing Set

• Cross Validation: Leave one out, seven-fold

Credits: Munson, 1995; Garnier et al., 1996

4-fold

Page 49: BIOINFORMATICS Sequences #2

49

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Types of Secondary Structure Prediction Methods

• Parametric Statistical struc. = explicit numerical func. of the data (GOR)

• Non-parametric struc. = NON- explicit numerical func. of the data generalize Neural Net, seq patterns, nearest nbr, &c.

• Semi-parametric: combine both• single sequence• multi sequence

with or without multiple-alignment

Core

Page 50: BIOINFORMATICS Sequences #2

50

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

GOR Semi-parametric

Improvements

• Filtering GOR to regularize

Illustration Credits: King & Sternberg, 1996

Core

Page 51: BIOINFORMATICS Sequences #2

51

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Multiple Sequence Methods

• Average GOR over multiple seq. Alignment

• The GOR method only uses single sequence information and because of this achieves lower accuracy (65 versus >71 %) than the current "state-of-the-art" methods that incorporate multiple sequence information (e.g. King & Sternberg, 1996; Rost, 1996; Rost & Sander, 1993).

Illustration Credits: Livingston & Barton, 1996

Page 52: BIOINFORMATICS Sequences #2

52

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

DSC -- an improvement on

GOR

• GOR parms

• + simple linear discriminant analysis on: dist from C-term, N-term insertions/deletes overall composition hydrophobic moments autocorrelate: helices conservation moment

Illustration Credits: King & Sternberg, 1996

Page 53: BIOINFORMATICS Sequences #2

53

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Conservation, k-nn

Patterns of Conservation

k-nearest neighbors

Extraoutside

Inside (conserved)

Page 54: BIOINFORMATICS Sequences #2

54

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Yet more methods….

• Neural networks• struc class predict

Vect dist. between composition vectors

• threading via pair pot• Distant seq comparison• ab initio from md• ab initio from pair pot.

Extra

Page 55: BIOINFORMATICS Sequences #2

55

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Threading

Page 56: BIOINFORMATICS Sequences #2

56

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Fold recognition

Query sequence Library of known folds Best-fit fold

Page 57: BIOINFORMATICS Sequences #2

57

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Why fold recognition?

• Structure prediction made easier by sampling 1,000~10,000 folds, rather than >4100 possible conformations

• Practical importance: fold assignment in genomes• Fold recognition can be done using sequence-based

(BLAST, HMM, profile alignment) or structure-based methods (threading)

Page 58: BIOINFORMATICS Sequences #2

58

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Fold recognition by threading

• Input: A query sequence, a fold library• For each fold template in the library:

Generate alignments between the query sequence and the fold template

Evaluate alignments; choose the best one

• Do this for all folds, choose the best fold

Page 59: BIOINFORMATICS Sequences #2

59

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

What is threading

• Query sequence:

• Thread the sequence onto the fold template

• Use structural properties to evaluate the fit Environment Pairwise interactions

Page 60: BIOINFORMATICS Sequences #2

60

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

End of class 2005,11.14 (m-9)

[lectures continue in structures sections]

Page 61: BIOINFORMATICS Sequences #2

61

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Align sequence to fold: an example

● Align: RVLGFIPTWFALSKY RVLGFIPTWFALSKY to:

Many possible alignments:

12

43

56

7 8 9

10

11

12 13

14

15

16

123456789012--3456

RVLGF--IPTWFALSKY-

1234567890123456

RVLGFIPTWFALSKY-

1234567890123456

-RVLGFIPTWFALSKYDeletion Insertion

RV

GL

FI

P T W

F

A

L S

K

Y

R

LV

GF

I P T

W

F

A L

S

K

Y

V

GL

F

I P

T

W

F S

K

Y

R

A L

Page 62: BIOINFORMATICS Sequences #2

62

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Evaluate alignments using threading energy function

• Etotal = Eenv + Epair + Egap

• Eenv: Total environment energies. Measures compatibility of a residue and its corresponding 3D environment (secondary structure, solvent accessibility)

• Epair: Total pairwise energies. Measures interaction between spatially close residues

• Egap: Gap opening and extension penalities

• Etot represents how compatible a sequence in a given alignment is to structural model represented in terms of environments and pairpot.

Page 63: BIOINFORMATICS Sequences #2

63

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Pairwise and Environment Energies

• Optimum E tot is arrived through a sort of dyn. programming against the model…

Page 64: BIOINFORMATICS Sequences #2

64

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Generalized Substitution Matrices

Generalized Substitution matrix

M(env=buried,aa=Met)=+2 (complementary)

Page 65: BIOINFORMATICS Sequences #2

65

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Structural Alignment (1b) Make a Similarity Matrix

(Generalized Similarity Matrix)

• PAM(A,V) = 0.5 Applies at every position

• S(aa @ i, aa @ J) Specific Matrix for each pair of

residues i in protein 1 and J in protein 2

Example is Y near N-term. matches any C-term. residue (Y at J=2)

• S(i,J) Doesn’t need to depend on a.a.

identities at all! Just need to make up a score for

matching residue i in protein 1 with residue J in protein 2

1 2 3 4 5 6 7 8 9 10 11 12 13

A B C N Y R Q C L C R P M1 A 12 Y 1 5 5 5 5 5 53 C 1 1 14 Y 15 N 16 R 1 17 C 1 1 18 K9 C 1 1 1

10 R 1 111 B 112 P 1

J

i

Page 66: BIOINFORMATICS Sequences #2

66

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Using Dynamic Programming in Threading

Page 67: BIOINFORMATICS Sequences #2

67

(c)

Mar

k G

erst

ein

, 20

05,

Yal

e, b

ioin

fo.m

bb

.yal

e.ed

u

Find the best alignment

• NP-hard problem; needs approximation

• Dynamic programming and the

“frozen approximation” Approximately calculate amino acid preferences for each residue position by

fixing the interaction partners at that position Find best alignment using dynamic programming Update interaction partners for each position; repeat till convergence

• Other optimization techniques Simulated annealing Branch-and-bound, etc.