Genome Annotation of Protein Function using Structural Data: Catalytic Residue Information Janet Thornton European Bioinformatics Institute ISMB/ECCB 2004 Glasgow


Genome Annotation of Protein Function using Structural Data:

Catalytic Residue Information

Janet Thornton

European Bioinformatics Institute

ISMB/ECCB 2004

Glasgow


From Structure to Functional Annotation


From Structure To Biochemical Function

Gene → Protein → 3D Structure → Function

Given a protein structure:
- Where is the functional site?
- What is the multimeric state of the protein?
- Which ligands bind to the protein?
- What is the biochemical function?


Automated Structure Comparison

The most powerful method for assigning function from structure is global or partial 3D structure comparison (e.g. Dali, SSAP, SSM)

Hidden Markov Models derived from structural domains can often recognise distant relatives from sequence


Predicting Binding Site

Binding-site analysis: cutA

Most likely binding site

Surface clefts

Residue conservation

Conserved surface patches


Identifying Binding Site Function Using Motifs

- 3D enzyme active site structural motifs (Craig Porter)

- Catalytic Site Atlas - Identification of catalytic residues (Gail Bartlett, Alex Gutteridge)

- Metal binding sites (Malcolm MacArthur)

- Binding site features (Gareth Stockwell)

- Automatically generated templates of ligand-binding and

- DNA binding motifs (Sue Jones, Hugh Shanahan)

- “Reverse” templates (Roman Laskowski)

JESS – fast template search algorithm (Jonathan Barker)


Using information on Catalytic Residues derived from Structures

Catalytic Site Atlas

Using info for annotation of enzymes in genomes

3D Templates


The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data.

Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton Nucl. Acids. Res. 2004 32: D129-D133.

http://www.ebi.ac.uk/thornton-srv/databases/CSA


Catalytic Site Information

Enzyme reports from primary literature information

β-lactamase Class A
EC: 3.5.2.6
PDB: 1btl
Reaction: β-lactam + H2O → β-amino acid
Active site residues: S70, K73, S130, E166
Plausible mechanism:

[Figure: plausible reaction mechanism scheme involving the active-site Ser, Lys, Ser and Glu residues]


- Annotates catalytic residues in the PDB
- Based on a dataset of 514 enzyme families
- Representative catalytic site for each family
- Homologues assigned by PSI-BLAST
- Limited substitution allowed
- Homologues updated monthly
- Literature references
- Data also available via MSDsite

http://www.ebi.ac.uk/thornton-srv/databases/CSA
http://www.ebi.ac.uk/msd-srv/msdsite


CSA Coverage

512 Representative Sites
9075 PDB Files
20001 Catalytic Sites

Class                            In CSA / In PDB
E.C. 1.-.-.- Oxidoreductases     194 / 271
E.C. 2.-.-.- Transferases        151 / 280
E.C. 3.-.-.- Hydrolases          221 / 421
E.C. 4.-.-.- Lyases               96 / 122
E.C. 5.-.-.- Isomerases           44 / 63
E.C. 6.-.-.- Ligases              33 / 58
Total                            739 / 1215

(Current 512 Enzyme Dataset)


Metal Site Atlas

- Annotates metal sites in the PDB
- Similar to CSA database
- Searchable by: PDB code, Swiss-Prot code, homologues
- Dataset includes: copper, zinc, calcium, iron (excl. hemes), cobalt, magnesium, manganese, molybdenum, nickel and tungsten


Metal Site Atlas Contents

Templates:

Metal   Templates
Cu       46
Zn      195
Ca      270
Fe       83
Co        6
Mg       86
Mn       45
Mo       10
Ni        7
W         4
Total   752

Sites in MSA:
6301 PDB Files
25374 Metal Binding Sites


Comparison of CSA v1.0 with Swiss-Prot and PDB Site Annotations


CSA v1.0 - Literature

EC Wheels

CSA v1.0 – plus homologues


iCSA: Using Functional Residue Conservation to Improve Function Annotation

- Starting with over 500 enzymes from the CSA, with EC numbers and high-quality catalytic site information
- Retrieve homologues from Biopendium™
- Align homologues with query enzyme, using:
  - PSI-BLAST profiles
  - CLUSTAL W multiple alignments
  - Smith-Waterman pairwise alignments
- Check for conservation of catalytic residues
- If all residues are conserved, assign EC from annotated enzyme to homologue
- Also deals with mutation, etc. if necessary
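The conservation check in the last steps can be sketched as follows. This is a minimal illustration, not the iCSA implementation: it assumes a gapped pairwise alignment given as two equal-length strings, catalytic residues given as 1-based positions in the ungapped query, and strict identity (the limited substitution mentioned elsewhere in the talk is not modelled). All function names and the toy sequences are hypothetical.

```python
def aligned_residue(query_aln, hit_aln, query_pos):
    """Return the hit residue aligned to a 1-based ungapped query position, or None."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("alignment rows must have equal length")
    seen = 0  # number of non-gap query residues passed so far
    for q, h in zip(query_aln, hit_aln):
        if q != "-":
            seen += 1
            if seen == query_pos:
                return None if h == "-" else h
    return None  # position lies beyond the aligned query


def catalytic_site_conserved(query_aln, hit_aln, catalytic_positions):
    """True if every catalytic residue of the query is identical in the hit."""
    ungapped = query_aln.replace("-", "")
    for pos in catalytic_positions:
        if not 1 <= pos <= len(ungapped):
            raise ValueError(f"position {pos} is outside the query sequence")
        if aligned_residue(query_aln, hit_aln, pos) != ungapped[pos - 1]:
            return False
    return True


def transfer_ec(query_ec, query_aln, hit_aln, catalytic_positions):
    """Assign the query EC to the hit only if the catalytic site is conserved."""
    if catalytic_site_conserved(query_aln, hit_aln, catalytic_positions):
        return query_ec
    return None


# Toy alignment (hypothetical): catalytic residues at query positions 2, 3 and 5.
print(transfer_ec("3.4.21.4", "MS-HKD", "MSAHKD", [2, 3, 5]))
print(transfer_ec("3.4.21.4", "MS-HKD", "MSAHKN", [2, 3, 5]))
```

A homologue that aligns well overall but has lost one catalytic residue, as in the second call, receives no EC number.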


Testing the iCSA Method

- Searches with 517 CSA sites retrieved over 30700 Swiss-Prot sequences within four iterations of PSI-BLAST
- These were assigned three-digit EC numbers using the iCSA method
- The assigned EC numbers were then compared with the EC annotation given in the Swiss-Prot database
- The accuracy of EC assignment was compared with the accuracy achieved using sequence homology (i.e. PSI-BLAST)

[Figure: workflow. CSA query enzyme → homology search → Swiss-Prot homologues → function assignment by homology. Swiss-Prot homologues → iCSA filter → iCSA filtered homologues → function assignment using CSA.]
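Comparing an assigned EC number with a database annotation at the three-digit level can be illustrated with a short sketch. It assumes EC numbers are dotted strings such as "3.5.2.6" and that an unspecified level is written "-". The helper names are hypothetical and this is not the code used in the study.

```python
def ec_prefix(ec, digits=3):
    """Return the first `digits` levels of an EC number as a tuple of strings."""
    parts = ec.strip().split(".")
    if len(parts) < digits:
        raise ValueError(f"EC number {ec!r} has fewer than {digits} levels")
    return tuple(parts[:digits])


def same_ec(assigned, annotated, digits=3):
    """True if two EC numbers agree at the first `digits` levels.

    A level written as '-' (unspecified) never counts as agreement.
    """
    a = ec_prefix(assigned, digits)
    b = ec_prefix(annotated, digits)
    return "-" not in a + b and a == b


print(same_ec("3.5.2.6", "3.5.2.8"))  # same sub-subclass 3.5.2
print(same_ec("3.5.2.6", "3.5.1.6"))  # differs at the third level
```

Treating "-" as a mismatch is a conservative design choice made for this sketch; an incomplete annotation is then never counted as a correct assignment.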


EC Assignment Accuracy

[Figure: bar chart of % ECs assigned correctly (y-axis, 0 to 100) against PSI-BLAST iteration (x-axis: 1, 2, 3, 4, all). Legend labels: Sequence homology; Described method; CSA; Correct EC assigned; An EC assigned.]


Improvement in EC Assignment Accuracy, Compared with Homology Alone

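[Figure: bar chart of % improvement (y-axis, 0 to 100) against PSI-BLAST iteration (x-axis: 1, 2, 3, 4). Annotation: 48% overall.]

$$\%\,\text{improvement} = 100 \times \frac{\text{Accuracy}_{\text{iCSA}} - \text{Accuracy}_{\text{Homology}}}{\text{Accuracy}_{\text{Homology}}}$$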


iCSA vs. Sequence Homology Alone

- The accuracy of EC assignment is improved by using iCSA
- The improvement in accuracy is more pronounced with more distant homologues: from 7% at iteration 1 to 88% at iteration 4
- Overall, EC assignment accuracy is improved by 48%
- Overall, EC assignment accuracy using iCSA is 86% (vs. 58% using sequence homology alone)
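As a quick check of these figures, the 48% improvement can be recomputed from the two overall accuracies, using the ratio given on the improvement chart (iCSA accuracy minus homology accuracy, divided by homology accuracy). The function name is hypothetical.

```python
def relative_improvement(acc_icsa, acc_homology):
    """Percentage improvement of iCSA accuracy relative to homology accuracy."""
    if acc_homology <= 0:
        raise ValueError("homology accuracy must be positive")
    return 100.0 * (acc_icsa - acc_homology) / acc_homology


overall = relative_improvement(86.0, 58.0)
print(f"{overall:.0f}% overall")  # prints "48% overall"
```

The improvement is therefore relative (28 percentage points gained on a baseline of 58), not an absolute difference in accuracy.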


iCSA EC Coverage

[Figure: bar chart of % coverage (y-axis, 0 to 100) against PSI-BLAST iteration (x-axis: 1, 2, 3, 4, all). Legend labels: Correct EC assigned; Homologues with correct EC.]


iCSA vs. Sequence Homology Alone

- iCSA coverage is 78% overall
- The iCSA is right to reject many of these homologues even though they have the same EC as the CSA site used as the query:
  - EC covered by more than one specific catalytic site
  - Incorrect EC assignment in Swiss-Prot
- But misaligned sequences are also possible, especially with more distant homologues
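The difference between accuracy and coverage can be made concrete with a small sketch. The two ratios are inferred from the chart legends ("Correct EC assigned" over "An EC assigned" for accuracy, "Correct EC assigned" over "Homologues with correct EC" for coverage), so treat them as an assumption rather than the exact definitions used in the study. The function name and the toy records are hypothetical.

```python
def accuracy_and_coverage(records):
    """Compute (accuracy %, coverage %) from per-homologue records.

    Each record is a pair (assigned, same_as_query):
      assigned       True if the catalytic-site filter passed and an EC was assigned
      same_as_query  True if the database annotation matches the query EC
    """
    records = list(records)
    n_assigned = sum(1 for assigned, _ in records if assigned)
    n_same = sum(1 for _, same in records if same)
    n_correct = sum(1 for assigned, same in records if assigned and same)
    if n_assigned == 0 or n_same == 0:
        raise ValueError("need at least one assignment and one homologue with the query EC")
    accuracy = 100.0 * n_correct / n_assigned  # correct EC assigned / an EC assigned
    coverage = 100.0 * n_correct / n_same      # correct EC assigned / homologues with correct EC
    return accuracy, coverage


# Hypothetical toy data, not results from the talk.
toy = [(True, True), (True, True), (False, True), (False, True), (False, False)]
print(accuracy_and_coverage(toy))  # prints (100.0, 50.0)
```

In the toy data every assignment is correct, yet half of the homologues sharing the query EC are rejected, which is the trade-off discussed on this slide.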


iCSA Correctly Rejects Homologues

- The iCSA accuracy with the CSA trypsin site is 100%
- The benefits of the iCSA method can be seen in the homologues not assigned the trypsin EC
- Trypsin homologues that do not pass the catalytic residue checks in iCSA include several haptoglobin proteins
- Haptoglobin is closely related to trypsin, but is a known non-enzyme
- Sequence homology alone would assign these haptoglobin sequences the trypsin EC, but iCSA can correctly identify that the residues for catalysis are not present


Human Genome Annotation

- We applied iCSA to the human ENSEMBL sequence database
- The iCSA directly annotated 2064 sequences with an EC
  - Only 64% of these have an equivalent Swiss-Prot protein (at least 90% pairwise sequence identity and a difference in length of less than 10% of the shorter sequence)
  - So 743 sequence annotations have been efficiently expanded
- A further 2257 homologues did not have a conserved site and an EC was not assigned
  - 73% of the equivalent Swiss-Prot sequences had an alternative EC number to the iCSA query
  - Homology-based functional assignments in these cases could prove incorrect
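The equivalence criterion quoted on this slide (at least 90% pairwise identity and a length difference below 10% of the shorter sequence) can be written down directly. The sketch below is only an illustration with a hypothetical function name. It also rechecks the 743 figure as the 36% of 2064 annotated sequences without an equivalent.

```python
def is_equivalent(identity_pct, len_a, len_b):
    """Equivalence test as stated on the slide.

    identity_pct  pairwise sequence identity in percent
    len_a, len_b  sequence lengths in residues
    """
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("sequence lengths must be positive")
    length_ok = abs(len_a - len_b) < 0.10 * shorter
    return identity_pct >= 90.0 and length_ok


# 64% of the 2064 annotated sequences had an equivalent, so 36% were new.
new_annotations = round(2064 * (1 - 0.64))
print(new_annotations)  # prints 743
```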


Summary

- iCSA methodology developed
- Database currently contains:
  - 7013 PDBs (11710 chains)
  - 18033 Swiss-Prot sequences
  - 4321 Human ENSEMBL sequences
  - 4227 Mouse ENSEMBL sequences


Poster E-37, Session 1 (Sunday)


3D Templates to Characterise Functional Sites

Template searches

(189 enzyme active site templates)

(~600 Metal binding site templates)


GARTfase

Cholesterol oxidase

IIAglc histidine kinase

Carbamoylsarcosine amidohydrolase

Dihydrofolate reductase

Ser-His-Asp catalytic triad

Database of enzyme active site templates

189 templates


An example: MCSG structure

BioH – unknown function, involved in biotin synthesis in E. coli

Structure: Rossmann fold, hence many structural homologues

Expected to be an enzyme

Sequence contains two Gly-X-Ser-X-Gly motifs typical of acyltransferases and thioesterases
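Scanning a sequence for the Gly-X-Ser-X-Gly motif is a simple pattern search. The sketch below uses Python's standard `re` module. The function name is hypothetical and the sequence fragment is invented for illustration, not taken from BioH.

```python
import re

# Gly-X-Ser-X-Gly, where X is any residue; the lookahead allows overlapping hits.
GXSXG = re.compile(r"(?=(G.S.G))")


def find_gxsxg(sequence):
    """Return (1-based start, matched pentapeptide) for every G-X-S-X-G hit."""
    return [(m.start() + 1, m.group(1)) for m in GXSXG.finditer(sequence.upper())]


# Hypothetical sequence fragment for illustration only (not the BioH sequence).
toy = "MKLVLVHGWSMGAEAWRPLAGHSLGGLVA"
print(find_gxsxg(toy))  # prints [(8, 'GWSMG'), (21, 'GHSLG')]
```

A sequence motif of this kind only suggests a family of enzymes; the template search on the next slide is what pins down the catalytic triad geometry.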


CSA template search: one very strong hit

Ser-His-Asp catalytic triad of the lipases with RMSD = 0.28 Å (template cut-off is 1.2 Å)

Experimentally confirmed by hydrolase assays

Novel carboxylesterase acting on short acyl chain substrates
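The comparison of a match RMSD with the template cut-off can be sketched as below. This is a simplification: it assumes the template and candidate atoms are already paired and superposed in the same coordinate frame, whereas a real template search also has to find the pairing and the optimal superposition. Function names and coordinates are hypothetical; only the 1.2 Å cut-off comes from the slide.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z).

    Assumes both sets are already superposed in the same frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def is_template_hit(template, candidate, cutoff=1.2):
    """True if the candidate atoms match the template within the RMSD cut-off (in Å)."""
    return rmsd(template, candidate) <= cutoff


# Hypothetical coordinates in Å: the candidate is the template shifted by 0.1 Å in x.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
candidate = [(x + 0.1, y, z) for x, y, z in template]
print(is_template_hit(template, candidate))
```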


Generation of 3D Active Site Templates for Enzymes in the Catalytic Site Atlas

Gail J Bartlett*, James W Torrance, Craig T Porter, Jonathan A Barker, Alex Gutteridge, Malcolm W MacArthur, Janet M Thornton

EMBL Outstation - European Bioinformatics Institute (EBI), Hinxton, Cambridge CB10 1SD, UK
* Centre For Bioinformatics, Biochemistry Building, Imperial College London, South Kensington Campus, London SW7 2AZ, UK

1. Introduction

Structural templates can be used to search protein structures for particular patterns of residues, such as catalytic sites. Structural templates are thus a tool for predicting protein function. There are many methods that employ structural templates, but no reliable template libraries. The Catalytic Site Atlas [1] is a database of catalytic residues within proteins of known structure. This information can be used to create a template library. We hope to use this library to uncover cases of convergent evolution and to predict function from structure.

2. Objectives

• To use the Catalytic Site Atlas to create a library of structural templates representing catalytic sites

• To assess the effectiveness of these templates for identifying proteins with a particular catalytic function

3. Methods

Template generation and analysis of active site geometry

Two types of template were created (atoms used are highlighted in ball form):

[Figure: the two template types. Panel labels: Cα and Cβ atoms; three “functional” atoms.]

Templates within the same homologous enzyme family were superposed and the distribution of RMSDs examined.

Assessing template effectiveness

The Jess template-matching method [2] was used to query all the templates against a non-redundant subset of the PDB. Hits were scored using both RMSD and a statistical significance measure. The effectiveness of hits was measured by comparing scores of hits between relatives with scores from random hits identified in the PDB.

4. Results

• No correlation between RMSD of template atoms and percentage pairwise sequence identity found within homologous enzyme families

• Majority of RMSD values between templates from homologous family members were below 1 Å

• Templates distinguish related enzymes well in most families, with > 75% of relatives having RMSDs better than that of any random match.

• Some families showed wide variation of catalytic residue geometry, making prediction difficult.

• Templates based on Cα / Cβ atoms performed slightly better than those which used functional atoms.

5. A “good” template - aldolase A

Aldolase A relatives superpose well (below right) and there is a clear separation between these and random hits to PDB (below left).

[Figure panels: Distribution of RMSDs of hits to aldolase template (based on PDB 1ald); Superposition of homologous family templates.]

6. A “bad” template - fructose 1,6-bisphosphatase

It is difficult to construct a sensitive template for fructose 1,6-bisphosphatase because one catalytic residue is on a flexible loop that moves when AMP binds at an allosteric site.

[Figure panels: Structures of open form; Structures of closed form. Labels: Open form; Loop closed; Catalytic residues; Flexible loop; AMP.]

Poster Number I76 - Monday
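The poster's measure of template effectiveness (the share of relatives whose RMSD is better than that of any random match) can be sketched as follows. The function name and the RMSD values are hypothetical, and the statistical significance score also mentioned in the poster is not modelled.

```python
def fraction_better_than_random(relative_rmsds, random_rmsds):
    """Percentage of relatives whose RMSD is lower than every random-hit RMSD."""
    if not relative_rmsds or not random_rmsds:
        raise ValueError("both RMSD lists must be non-empty")
    best_random = min(random_rmsds)
    better = sum(1 for r in relative_rmsds if r < best_random)
    return 100.0 * better / len(relative_rmsds)


# Hypothetical RMSD values in Å, for illustration only.
relatives = [0.21, 0.35, 0.48, 0.90]
random_hits = [0.75, 1.10, 1.30]
print(fraction_better_than_random(relatives, random_hits))  # prints 75.0
```

A family with a flexible catalytic loop, like the fructose 1,6-bisphosphatase case, would show several relatives with RMSDs above the best random hit and so score poorly on this measure.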


Template databases

HAND CURATED
- Enzyme active sites (PROCAT) – 189 templates (currently being extended)
- Metal-binding sites – 600 templates

AUTOMATED
- Ligand-binding sites – 10,000 templates
- DNA-binding sites – 800 templates


ProFunc – function from 3D structure

Homologous sequences of known function

Binding site identification and analysis

Homologous structures of known function

Functional sequence motifs, e.g. Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]

Enzyme active site 3D-templates

HTH-motifs Electrostatics Surface comparison

… etc

DNA-, ligand- binding and “reverse” templates

Residue conservation analysis
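The functional sequence motif shown on this slide is written in PROSITE-style pattern syntax. As a minimal sketch, such a pattern can be translated into a regular expression for scanning sequences. The converter below is hypothetical and covers only a subset of the syntax (x, allowed and forbidden residue sets, repeat counts). It is not how ProFunc itself performs motif searches.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Handles: x (any residue), [ABC] (allowed), {ABC} (forbidden), and
    repeat counts such as x(3) or x(2,4). Anchors (< and >) are not handled.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]")
print(regex)  # prints Q.{3}[GE].C[YW].{2}[STAGC]
```

The resulting string can be passed to `re.search` to test whether a protein sequence contains the motif.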


Acknowledgements

CSA: Craig Porter, Gail Bartlett, Alex Gutteridge, Malcolm MacArthur (EBI), Neera Borkakoti

Genome Annotation: Ruth Spriggs, Richard George, Mark Swindells, B. Al-Lazikani (Inpharmatica)

ProFunc: Roman Laskowski; James Watson (EBI)