SnapDRAGON: protein 3D prediction-based. DOMAINATION: based on PSI-BLAST. Two methods to predict domain boundary sequence positions from sequence information alone. An example of two different bioinformatics approaches to the same problem.

Page 1: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON: protein 3D prediction-based

DOMAINATION: based on PSI-BLAST

Two methods to predict domain boundary sequence positions from sequence information alone

An example of two different bioinformatics approaches to the same problem

Page 2: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

Richard A. George

Jaap Heringa

George, R.A. & Heringa, J. (2002) J. Mol. Biol. 316, 839-851.

Combining protein secondary and tertiary structure prediction to predict structural domains in sequence data

Page 3: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Protein structure evolution

Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Page 4: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Flavodoxin family - TOPS diagrams (Flores et al., 1994)

[Figure: TOPS diagrams, with elements labelled 1 to 5]

Page 5: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Protein structure evolution

Insertion/deletion of structural domains can ‘easily’ be done at loop sites

[Figure: structure with the N- and C-termini labelled]

Page 6: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

A domain is a:

• Compact, semi-independent unit (Richardson, 1981).

• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).

• Recurring functional and evolutionary module (Bork, 1992).

“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

Page 7: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

The DEATH Domain

• Present in a variety of eukaryotic proteins involved with cell death.

• Six helices enclose a tightly packed hydrophobic core.

• Some DEATH domains form homotypic and heterotypic dimers.

http://www.mshri.on.ca/pawson

Page 8: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Delineating domains is essential for:

• Obtaining high resolution structures (x-ray, NMR)

• Sequence analysis

• Multiple sequence alignment methods

• Prediction algorithms (SS, Class, secondary/tertiary structure)

• Fold recognition and threading

• Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)

• Structural/functional genomics

• Cross genome comparative analysis

Page 9: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Pyruvate kinase

Phosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

1 continuous + 2 discontinuous domains

Structural domain organisation can be nasty…

Page 10: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

QUATERNARY STRUCTURE

Page 14: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Domain prediction using DRAGON

DRAGON: Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994)

• Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together.

• First constructs a random high-dimensional Cα distance matrix.

• Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
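The distance geometry step can be illustrated with a minimal sketch of classical metric-matrix embedding, which recovers 3D coordinates from a matrix of pairwise distances. This is a generic textbook procedure and not DRAGON's actual implementation; the function name `embed_3d`, the power-iteration eigen-solver and all parameters are assumptions chosen to keep the example dependency-free.

```python
import math
import random


def embed_3d(dist, iters=500, seed=0):
    """Embed a symmetric distance matrix into 3D (classical metric-matrix embedding).

    Hypothetical helper for illustration; DRAGON's own embedding is more elaborate.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring converts squared distances into a Gram (inner product) matrix.
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * 3 for _ in range(n)]
    for axis in range(3):
        # Power iteration: converges to the dominant eigenvector of the current matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        gv = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * gv[i] for i in range(n))
        # Negative eigenvalues (non-Euclidean input) are clamped to zero.
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so that the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= eigval * v[i] * v[j]
    return coords
```

For a target matrix that is exactly realisable in 3D, the returned coordinates reproduce the input distances up to rotation and reflection; for inconsistent target distances the result is only an approximation.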

Page 15: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

The DRAGON target matrix is inferred from:

• A multiple sequence alignment of a protein (old)
  – Conserved hydrophobicity

• Secondary structure information (SnapDRAGON)
  – predicted by PREDATOR (Frishman & Argos, 1996).
  – strands are entered as distance constraints from the N-terminal Cα to the C-terminal Cα

Page 16: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

• The Cα distance matrix is divided into smaller clusters.

• Separately, each cluster is embedded into a local centroid.

• The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.

[Figure: input data (multiple alignment, predicted secondary structure, e.g. CCHHHCCEEE) → target matrix and 100 randomised initial Cα distance matrices → 100 predictions]

Page 17: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

[Figure: SnapDRAGON pipeline. Multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE) → folds generated by DRAGON → boundary recognition → summed and smoothed boundaries]

Page 18: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

1. Domains in structures assigned using method by Taylor (1997)

2. Domain boundary positions of each model against sequence

3. Summed and Smoothed Boundaries (Biased window protocol)

Page 19: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Prediction assessment

• Test set of 414 multiple alignments: 183 single-domain and 231 multi-domain proteins.

• Sequence searches using PSI-BLAST (Altschul et al., 1997), followed by redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999).

• Boundary predictions are compared to the region of the protein connecting two domains (min 10 residues).

Page 20: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Average prediction results per protein

Method | Measure | Continuous set | Discontinuous set | Full set
SnapDRAGON | Coverage | 63.9 (± 43.0) | 35.4 (± 25.0) | 51.8 (± 39.1)
SnapDRAGON | Success | 46.8 (± 36.4) | 44.4 (± 33.9) | 45.8 (± 35.4)
Baseline 1 | Coverage | 43.6 (± 45.3) | 20.5 (± 27.1) | 34.7 (± 40.8)
Baseline 1 | Success | 34.3 (± 39.6) | 22.2 (± 29.5) | 29.6 (± 36.6)
Baseline 2 | Coverage | 45.3 (± 46.9) | 22.7 (± 27.3) | 35.7 (± 41.3)
Baseline 2 | Success | 37.1 (± 42.0) | 23.1 (± 29.6) | 31.2 (± 37.9)

Coverage is the % of linkers predicted: TP/(TP+FN).

Success is the % of correct predictions made: TP/(TP+FP).
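The two measures can be sketched in a few lines. This is an illustrative reading of the definitions, assuming that a prediction counts as correct when it falls inside a linker region, that coverage counts linkers hit at least once, and that success counts correct predictions; the function name `assess` and the inclusive `(start, end)` linker representation are hypothetical.

```python
def assess(predicted, linkers):
    """Return (coverage %, success %) for one protein.

    predicted: list of predicted boundary positions.
    linkers:   list of (start, end) linker regions, inclusive (assumed format).
    """
    hit_linkers = set()
    correct_predictions = 0
    for pos in predicted:
        matched = [k for k, (start, end) in enumerate(linkers) if start <= pos <= end]
        if matched:
            correct_predictions += 1
            hit_linkers.update(matched)
    # Coverage: fraction of true linkers found, TP/(TP+FN).
    coverage = 100.0 * len(hit_linkers) / len(linkers) if linkers else 0.0
    # Success: fraction of predictions that are correct, TP/(TP+FP).
    success = 100.0 * correct_predictions / len(predicted) if predicted else 0.0
    return coverage, success
```

For example, three predictions of which one falls in one of two linkers give a coverage of 50% and a success of about 33%.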

Page 21: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

• Is very slow (can take hours for proteins > 400 aa) – cluster computing implementation

• Uses consistency in the absence of standard of truth

• Goes from primary+secondary to tertiary structure to ‘just’ chop protein sequences

• SnapDRAGON webserver is underway

Page 22: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

DOMAINATION

Richard A. George

Protein domain identification and improved sequence searching using PSI-BLAST

(George & Heringa, Prot. Struct. Func. Genet., in press; 2002)

Integrating protein sequence database searching and domain recognition

Page 23: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Domaination

• Current iterative homology search methods do not take into account that:
  – Domains may have different ‘rates of evolution’.
  – Common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types
  – Premature convergence (false negatives)
  – Matrix migration / Profile wander (false positives).

Page 24: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

PSI-BLAST

• Query sequence is first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment.

• Initially operates on a single query sequence by performing a gapped BLAST search.

• Then takes the significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position-specific scoring matrix (PSSM) from this alignment.

• Rescans the database in a subsequent round to find more homologous sequences. Iteration continues until the user decides to stop or the search converges.
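To make the PSSM step concrete, here is a deliberately simplified sketch of deriving log-odds column scores from a multiple alignment. It assumes a uniform background distribution and a flat pseudocount, which is not how PSI-BLAST itself weights sequences or estimates frequencies; `build_pssm` and `score` are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]  # gaps are skipped
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the column scores of an ungapped segment of the same length as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, segment))
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, so a rescan with the PSSM rewards database segments that resemble the whole aligned family rather than the single query.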

Page 25: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

PSI-BLAST iteration

[Figure: PSI-BLAST iteration. Query sequence (Q) → gapped BLAST search → database hits → PSSM → gapped BLAST search with the PSSM → database hits]

Page 26: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

DOMAINATION

Chop and Join Domains

Page 27: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Post-processing low complexity

Remove local fragments with > 15% low complexity (LC)
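A minimal sketch of this filter, assuming that low-complexity residues in each local fragment have already been masked with the character `X` (the masking convention and both function names are assumptions, not taken from the slides):

```python
def low_complexity_fraction(masked_fragment, mask_char="X"):
    """Fraction of residues in a fragment that are masked as low complexity."""
    if not masked_fragment:
        return 0.0
    return masked_fragment.count(mask_char) / len(masked_fragment)


def keep_fragments(fragments, max_lc=0.15):
    """Keep only fragments whose low-complexity fraction does not exceed max_lc."""
    return [f for f in fragments if low_complexity_fraction(f) <= max_lc]
```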

Page 28: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying domain boundaries

Sum N- and C-termini of gapped local alignments

True N- and C- termini are counted twice (within 10 residues)

Boundaries are smoothed using two windows (15 residues long)

Combine scores using biased protocol:

S_i = N_i + C_i, if N_i × C_i = 0

S_i = N_i + C_i + (N_i × C_i)/(N_i + C_i), otherwise
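The scoring steps above can be sketched as follows. The biased combination follows the formula on the slide; the moving-average smoothing, the 0-based inclusive alignment coordinates and all function names are assumptions, and the double counting of true termini is omitted for brevity.

```python
def termini_counts(length, alignments):
    """Count alignment N-termini and C-termini per query position.

    alignments: list of (start, end) query positions, 0-based inclusive (assumed).
    """
    n_counts = [0.0] * length
    c_counts = [0.0] * length
    for start, end in alignments:
        n_counts[start] += 1
        c_counts[end] += 1
    return n_counts, c_counts


def smooth(values, window=15):
    """Moving average over a centred window (an assumed smoothing scheme)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def biased_score(n, c):
    """Biased protocol: positions with both N and C signal get a bonus."""
    if n * c == 0:
        return n + c  # also avoids 0/0 when both counts are zero
    return n + c + (n * c) / (n + c)


def boundary_profile(length, alignments, window=15):
    """Per-residue boundary score S_i from smoothed N- and C-terminal counts."""
    n_counts, c_counts = termini_counts(length, alignments)
    n_smooth, c_smooth = smooth(n_counts, window), smooth(c_counts, window)
    return [biased_score(n, c) for n, c in zip(n_smooth, c_smooth)]
```

The bonus term is what makes the protocol "biased": a position where alignments both end and start scores higher than one with the same total count of only one kind of terminus.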

Page 29: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying domain deletions

• Deletions in the query (or insertions in the DB sequences) are identified when two adjacent segments in the query align to the same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query. (remove N- and C- termini)

[Figure: DB sequence aligned against the query]
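One way to read the deletion rule is sketched below for a pair of local alignments against the same DB sequence. The `Hsp` record, the `max_query_gap` tolerance used to decide that two query segments are adjacent, and the omission of the 70% overlap check are all assumptions made for illustration.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """A local alignment: query range and DB range, 1-based inclusive (assumed)."""
    q_start: int
    q_end: int
    db_start: int
    db_end: int


def is_query_deletion(first, second, min_insert=35, max_query_gap=10):
    """True if two adjacent query segments flank an unaligned DB region > min_insert."""
    a, b = sorted([first, second], key=lambda h: h.q_start)
    if b.db_start <= a.db_end:
        return False  # the two alignments are not co-linear in the DB sequence
    query_gap = b.q_start - a.q_end - 1
    db_gap = b.db_start - a.db_end - 1
    return query_gap <= max_query_gap and db_gap > min_insert
```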

Page 30: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying domain permutations

• A domain shuffling event is declared when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order.

[Figure: DB and query sequences carrying segments a and b in opposite order]
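The permutation rule can be sketched in the same spirit. The `Hsp` record and the function name are hypothetical, and the 70% overlap criterion is again left out, so this only checks length, separation in the query and reversed order in the DB sequence.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """A local alignment: query range and DB range, 1-based inclusive (assumed)."""
    q_start: int
    q_end: int
    db_start: int
    db_end: int


def is_permutation(first, second, min_length=35):
    """True if two long alignments hit separate query segments in swapped DB order."""
    for h in (first, second):
        if h.q_end - h.q_start + 1 <= min_length:
            return False  # both local alignments must exceed min_length residues
    a, b = sorted([first, second], key=lambda h: h.q_start)
    if b.q_start <= a.q_end:
        return False  # the query segments must be separate
    # a precedes b in the query; the order is permuted if a follows b in the DB.
    return a.db_start > b.db_end
```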

Page 31: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying continuous and discontinuous domains

• Each segment is assigned an independence score (In). If In > 10% the segment is assigned as a continuous domain.

• An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered as discontinuous domains and joined.
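The slides do not give the exact formula for the association score, so the sketch below assumes one plausible definition: the percentage of database hits shared by two segments, relative to the smaller hit set. Both function names and the normalisation are assumptions.

```python
def association_score(hits_a, hits_b):
    """Percentage of shared database hits, relative to the smaller hit set (assumed)."""
    a, b = set(hits_a), set(hits_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def should_join(hits_a, hits_b, threshold=50.0):
    """Join two non-adjacent segments into one discontinuous domain above threshold."""
    return association_score(hits_a, hits_b) > threshold
```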

Page 32: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Create domain profiles

• A representative set of the database sequence fragments that overlap a putative domain is selected for alignment using OBSTRUCT (Heringa et al., 1992): > 20% and < 60% sequence identity (including the query sequence).

• A multiple sequence alignment is generated using PRALINE (Heringa 1999).

• Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al 1997).

• The whole process is iterated until no new domains are identified.

Page 33: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Domain boundary prediction accuracy

• Set of 452 multidomain proteins

• 56% of proteins were correctly predicted to have more than one domain

• 42% of predictions are within 20 residues of a true boundary

• 49.9% (44.6%) correct boundary predictions per protein

Page 34: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

• 23.3% of all linkers found in 452 multidomain proteins. Not a surprise since:
  – Structural domain boundaries will not always coincide with sequence domain boundaries
  – Proteins must have some domain shuffling

• For discontinuous proteins 34.2% of linkers were identified

• 30% of discontinuous domains were successfully joined

Page 35: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Change in domain prediction accuracy using various PSI-BLAST E-value cut-offs

Page 36: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Benchmarking versus PSI-BLAST

• A set of 452 non-homologous multidomain protein structures.

• Each protein was delineated into its structural domains. Database searches of the individual domains were used as a standard of truth.

• We then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.

Page 37: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Two sets based on individual domain searches:

• Reference set 1: consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full length query.

• Reference set 2: consists of database sequences found by searching with one or more of the domain sequences

• Therefore set 2 contains many more sequences than set 1

[Figure: query and DB sequences illustrating Ref set 1 and Ref set 2]

Page 38: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Sequences found over Reference sets 1 and 2

Measure | PSI-BLAST vs Ref set 1 | DOMAINATION vs Ref set 1 | PSI-BLAST vs Ref set 2 | DOMAINATION vs Ref set 2
Seq's found | 28581 | 28921 | 67300 | 73274
Seq's missed | 618 | 278 | 13542 | 7568
% missed | 2.12 | 0.95 | 16.8 | 9.36
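The "% missed" row follows from the other two rows as missed / (found + missed). A short check, using only the numbers in the table (the dictionary layout and function name are illustrative):

```python
def percent_missed(found, missed):
    """Percentage of reference sequences that a search failed to retrieve."""
    return 100.0 * missed / (found + missed)


# (found, missed) pairs copied from the table above.
RESULTS = {
    "PSI-BLAST vs Ref set 1": (28581, 618),
    "DOMAINATION vs Ref set 1": (28921, 278),
    "PSI-BLAST vs Ref set 2": (67300, 13542),
    "DOMAINATION vs Ref set 2": (73274, 7568),
}

PERCENTAGES = {label: round(percent_missed(f, m), 2) for label, (f, m) in RESULTS.items()}
```

Found plus missed is 29199 for both Reference set 1 columns and 80842 for both Reference set 2 columns, which confirms that each pair of methods is scored against the same reference set.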

Page 39: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Reference 1

• PSI-BLAST finds 97.9% of sequences

• Domaination finds 99.1% of sequences

Reference 2

• PSI-BLAST finds 83.2% of sequences

• Domaination finds 90.6% of sequences

Page 40: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Sequences found over Reference sets 1 and 2 from 15 Smart sequences

Measure | PSI-BLAST vs Ref set 1 | DOMAINATION vs Ref set 1 | PSI-BLAST vs Ref set 2 | DOMAINATION vs Ref set 2
Seq's found | 323 | 347 | 3672 | 5902
Seq's missed | 24 | 0 | 3438 | 1202
% missed | 6.9 | 0 | 48.4 | 17.0

Page 41: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SSEARCH significance test

• Verify the statistical significance of database sequences found by relating them to the original query sequence.

• SSEARCH (Pearson & Lipman 1988). Calculates an E-value for each generated local alignment.

• This filter will lose distant homologies.

• Use the 452 proteins with known structure.

Page 42: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Significant sequences found in database searches

At an E-value cut-off of 0.1, the performance of DOMAINATION searches with the full-length proteins is 15% better than PSI-BLAST.

Page 43: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Summary

Domains are recurring evolutionary units: by collecting the N- and C- termini of local alignments we can identify domain boundaries.

By finding domains we can significantly improve database search methods

SnapDRAGON is more sensitive than DOMAINATION but at high computational cost