Structure-based Analysis of Protein Function
PTPs and Serine Hydrolases
Jacquelyn S. Fetrow
Reynolds Professor of Computational Biophysics
Departments of Physics and Computer Science
Wake Forest University
Need for Improved Proteome Analyses
• Powerful genomics and proteomics methods identify large numbers of protein sequences
• Need to identify biochemical function and functional state accurately
• Need to increase quality of annotations: decrease false positive and false negative identifications
• Except in model organisms, over 50% of all proteins identified by large-scale sequencing projects are annotated as “function unknown”
• Annotations are inadequate and do not adequately describe functional complexity of proteins
• Annotation transfer methods can assign incorrect function in a significant number of cases
Knowing the Sequence is Not Enough to Determine the Function
S. cerevisiae Fsh3p
S. pombe DYR_SCHPO
S. cerevisiae DHFR
Structural proteomics approach to function annotation
• Most common method:
– structural superposition
– function annotation transfer based on structural similarity
COX-1 (1cqe) COX-2 (1cx2)
But, Knowing the Structure is Not Enough to Predict the Function
See also:
Martin, et al. 1998, Structure 6:875-884
Hegyi and Gerstein, 1999, J. Mol. Biol. 288:147-164.
[Figure: structures divided into four classes: similar structure, similar function; similar structure, different function; different structure, different function; different structure, similar function. Percentage labels: 48%, 27%, 23%, 1.5%]
Koppensteiner, W., Lackner, P., Wiederstein, M., & Sippl, M. J. (2000) J. Mol. Biol. 296:1139.
Analysis of high resolution structures released in 1998 compared to pre-1998 PDB structures
But, then, what do we really mean by function?
• Two isoforms of human cyclooxygenase, COX-1 & COX-2
• COX-1 is expressed in healthy tissues; COX-2 is induced in inflammatory response
• COX-1 and COX-2 have ~60% sequence identity, very similar overall structures, and identical catalytic residues
COX-1 (1cqe) COX-2 (1cx2)
• Aspirin/NSAIDs inhibit both isoforms; COX-1 inhibition can lead to gastrointestinal side effects
• Newer COX-2 selective inhibitors (Vioxx™, Celebrex™) have anti-inflammatory and pain killing benefits of NSAIDs with reduced side effects
Goal: accurate identification of active sites and their similarities and differences
COX-1 (1cqe) COX-2 (1cx2)
1cqe: P RLVLTVRSNLI AQ TF -EFNQLYHWH -R FGM Y- GESMIEMGAPFSLK
1cx2: P -YVLTSRSYLI AQ TF SEFNTLYHWH YR FSL YL GETMVELGAPFSLK
Fuzzy Functional Forms and Active Site Profiling
• Advantage: computational method based on structure
– Use of structural (not just sequence) information
– Identification of key functional features (not annotation transfer via global sequence alignment)
– Fast; can be globally applied to protein sequences
• Disadvantage: computational method
– Scoring function cutoffs
– False positive and negative rates
– Size of FFF library
Fetrow & Skolnick. J. Mol. Biol. (1998) 282: 949-968.
Cammer, Hoffman, Speir, Canady, Nelson, Knutson, Gallina, Baxter, Fetrow. J. Mol. Biol. (2003) 334:387-401.
Geometric definition of an FFF
• Defined by three metrics
– Key residues (and their identity) involved in active site chemistry
– Geometric constraints (distances between alpha carbons)
– Allowed variability for geometric constraints
• Training
– Against all PDB structures
– Relax constraints to identify all true positive structures, but no false positives
– Cross validation
Fetrow & Skolnick. J. Mol. Biol. (1998) 282: 949-968.
Fetrow, Godzik & Skolnick. (1998) J. Mol. Biol. 282:703-711.
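The three metrics above (residue identities, alpha-carbon distances, allowed variability) can be illustrated with a minimal sketch. The names `FFF` and `matches`, the data layout, and every distance and tolerance value are hypothetical and chosen only for illustration; this is not the published implementation or any real FFF.

```python
from dataclasses import dataclass
from itertools import product
from math import dist


@dataclass
class FFF:
    # Allowed one-letter residue identities at each key position (illustrative).
    residues: list
    # Target alpha-carbon distance in angstroms for each pair of key positions.
    distances: dict
    # Allowed deviation from each target distance (the "allowed variability").
    tolerance: float


def matches(fff, structure):
    """Return index tuples of residues in `structure` that satisfy the FFF.

    `structure` is a list of (one_letter_code, (x, y, z)) alpha-carbon records.
    """
    # Candidate residues for each key position, chosen by identity alone.
    candidates = [
        [i for i, (aa, _) in enumerate(structure) if aa in allowed]
        for allowed in fff.residues
    ]
    hits = []
    for combo in product(*candidates):
        if len(set(combo)) < len(combo):
            continue  # one residue cannot fill two key positions
        within_tolerance = all(
            abs(dist(structure[combo[a]][1], structure[combo[b]][1]) - target)
            <= fff.tolerance
            for (a, b), target in fff.distances.items()
        )
        if within_tolerance:
            hits.append(combo)
    return hits


# Toy structure: four alpha carbons with made-up coordinates.
toy_structure = [
    ("C", (0.0, 0.0, 0.0)),
    ("A", (1.0, 0.0, 0.0)),
    ("C", (3.0, 0.0, 0.0)),
    ("P", (3.0, 4.0, 0.0)),
]
toy_fff = FFF(
    residues=[{"C"}, {"C"}, {"P"}],
    distances={(0, 1): 3.0, (0, 2): 5.0, (1, 2): 4.0},
    tolerance=0.1,
)
toy_hits = matches(toy_fff, toy_structure)
```

Widening `tolerance` in this sketch corresponds to the "relax constraints" step of training: the value would be increased until all true positives are found and stopped before the first false positive appears.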
Advantages of the FFF approach
• Use of structural information enables:
– Function annotation farther into “twilight zone”
– Identification of similar functional sites in proteins of different structure
• Functional complexity
– Identification of multiple chemistries within a single functional site
– Identification of multiple functions within a protein domain
Serine-threonine phosphatase
FFF 1=metal binding site
FFF 2=metal binding site
FFF 3=phosphatase catalytic residues
FFF for redox regulatory site
Fetrow, Siew, Skolnick. FASEB J (1999) 13:1866-74
Comparison of putative redox active site residues
PP1             PP2A            PP2B
P30366 C C P    P48726 V Y V    P48456 A C F
P48487 C C P    P32838 I Y F    Q27889 A C F
P48486 C C P    P32345 V Y K    P16299 A S F
P48483 C C P    P23595 L Y I    P20651 A S F
P48488 C C P    P48580 L Y I    P48453 A S F
P48481 C C P    P23696 L Y L    Q08209 A C F
P48480 C C P    P11493 L Y L    P20652 A C F
P48484 C C P    P11611 L Y L    P48452 A C F
P48489 C C P    P11082 L Y L    P48455 T C F
P22198 C C P    P48463 L Y L    P48454 T C F
P48485 T C P    P13353 L Y L    Q12705 S A F
P48482 S C P    P05323 L Y L    O42773 S N F
P23880 C C P    P48577 L F I    P48457 S A F
Q05547 C C P    P48579 L Y I    Q05681 S C F
P12982 C C P    Q07099 L Y I    P23287 S V F
P48461 C C P    Q07098 L Y I    P14747 S N F
P36874 C C P    P23778 L Y V
P36873 C C P    Q06009 L Y V
P37139 C C P    Q07100 L Y V
P08128 C C P    P48578 L Y V
P08129 C C P    P23635 L Y V
P48462 C C P    P23636 L Y V
P37140 C C P    P23594 L Y I
P13681 C C P
P32598 C C P
P20654 C C P
P23777 C C P
P48490 C C P
P23733 T C P
P23734 T C P
P20604 V F A
Cluster analysis of PP1, PP2A, and PP2B subfamilies
[Figure: cluster analysis with groups labeled PP1, PP2A, and PP2B]
Comparison of putative redox active site residues
PP1             PP2A            PP2B
P30366 C C P    P48726 V Y V    P48456 A C F
P48487 C C P    P32838 I Y F    Q27889 A C F
P48486 C C P    P32345 V Y K    P16299 A S F
P48483 C C P    P20604 V F A    P20651 A S F
P48488 C C P    P48580 L Y I    P48453 A S F
P48481 C C P    P23696 L Y L    Q08209 A C F
P48480 C C P    P11493 L Y L    P20652 A C F
P48484 C C P    P11611 L Y L    P48452 A C F
P48489 C C P    P11082 L Y L    P48455 T C F
P22198 C C P    P48463 L Y L    P48454 T C F
P48485 T C P    P13353 L Y L    Q12705 S A F
P48482 S C P    P05323 L Y L    O42773 S N F
P23880 C C P    P48577 L F I    P48457 S A F
Q05547 C C P    P48579 L Y I    Q05681 S C F
P12982 C C P    Q07099 L Y I    P23287 S V F
P48461 C C P    Q07098 L Y I    P14747 S N F
P36874 C C P    P23778 L Y V
P36873 C C P    Q06009 L Y V
P37139 C C P    Q07100 L Y V
P08128 C C P    P48578 L Y V
P08129 C C P    P23635 L Y V
P48462 C C P    P23636 L Y V
P37140 C C P    P23594 L Y I
P13681 C C P    P23595 L Y I
P32598 C C P
P20654 C C P
P23777 C C P
P48490 C C P
P23733 T C P
P23734 T C P
Limitations of the FFF Approach
• FFFs use the identities of only three residues
– Leads to false positive identifications
• FFF hit is only yes/no
– Does not have a score or confidence associated with it
• FFFs only identify key residues
– Do not identify specificity (substrate or small molecule specificity)
Active site signature: first step in active site profiling
• Use FFF to identify key functional residues
• Extract fragments in structural proximity to FFF residues
• Arrange fragments to form a linear sequence—active site signature
Cammer, Hoffman, Speir, Canady, Nelson, Knutson, Gallina, Baxter, Fetrow. J. Mol. Biol. (2003) 334:387-401.
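The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published procedure: the function name `active_site_signature`, the use of alpha-carbon distances, the `radius` cutoff value, and the choice to join fragments in sequence order are all assumptions made for the example.

```python
from math import dist


def active_site_signature(structure, key_indices, radius=10.0):
    """Concatenate sequence fragments that lie near the key residues.

    `structure` is a list of (one_letter_code, (x, y, z)) alpha-carbon records
    in sequence order. `key_indices` are the positions found by an FFF.
    `radius` (angstroms) is an illustrative cutoff, not a published value.
    """
    # Step 1: residues within `radius` of any key residue.
    near = [
        i
        for i, (_, xyz) in enumerate(structure)
        if any(dist(xyz, structure[k][1]) <= radius for k in key_indices)
    ]
    # Step 2: group consecutive sequence positions into fragments.
    fragments = []
    for i in near:
        if fragments and i == fragments[-1][-1] + 1:
            fragments[-1].append(i)
        else:
            fragments.append([i])
    # Step 3: join the fragments (here in sequence order) into one string.
    signature = "".join(structure[i][0] for frag in fragments for i in frag)
    return signature, fragments


# Toy chain along the x-axis with made-up coordinates; residue 1 is the key residue.
toy_chain = [
    ("A", (0.0, 0.0, 0.0)),
    ("C", (3.0, 0.0, 0.0)),
    ("D", (20.0, 0.0, 0.0)),
    ("E", (40.0, 0.0, 0.0)),
    ("F", (5.0, 0.0, 0.0)),
    ("G", (8.0, 0.0, 0.0)),
    ("H", (60.0, 0.0, 0.0)),
    ("I", (2.0, 0.0, 0.0)),
]
toy_signature, toy_fragments = active_site_signature(toy_chain, [1], radius=5.0)
```

Residues that are far apart in sequence but close in space end up adjacent in the signature, which is the point of working from structure rather than from the linear sequence alone.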
1mucA_A RHRVFKLKIGA-ASIFALKIAKNGGPVTA--GLYGGTMLEGSIGTLASAHAF--LTWGTELFGPLLL
2mucA_B RHRVFKLKIGA-ASIFALKIAKNGGPVTA--GLYGGTMLEGSIGTLASAHAF--LTWGTELIGPLLL
1bkhC_C RHRVFKLKIGA-ASIFALKIAKNGGPVTA--GLYGGTMLEGSIGTLASAHAF--LTWGTELFGPLLL
1bkhB_D RHRVFKLKIGA-ASIFALKIAKNGGPVTA--GLYGGTMLEGSIGTLASAHAF--LTWGTELFGPLLL
3mucA_E RHRVFKLKIG--ASIFALKIAKNGGPVTA--GLYGGTMLEGSIGTLASAHAF--LTWGTELFGPLLL
1chrA_F RHNRFKVKLGF-VDVFSLKLCNMG-VTIA--ASYGGTMLDGSIGTLASAHAF-SLPFGCELIGPFVL
2chr__G RHNRFKVKLGF-VDVFSLKLCNMGGVTIA--ASYGGTMLDSTIGTSVALQLYS-LPFGCELIGPFVL
Profile segments for 7 enzymes identified by one FFF
Align signatures to create active site profile
Examples of residues identical across family
Examples of residues different between family members—possible specificity determinants?
Active Site Profile Score
• Empirically derived function takes into account sequence similarity
• Enables approaches based on active site information
– Clustering of functional families (profile score)
– Novel sequence family and subfamily assignment (pairwise score)
Weights: Identity 1.0, Strong 0.2, Weak 0.1
1cozA_1 GTFDLLHWGHIKLLEAYRTISTTKIKEE
1cozB_1 GTFDLLHWGHIKLLEAYRTISTTKIKEE
BS002557__1cozA GTFDPPHNGHLLMANDYREVSSTMIRER
**** * **: : : ** :*:* *:*.
$$\mathrm{Score} = \frac{1}{N}\left(\sum_{n=1} I + \sum_{m=1} S + \sum_{k=1} W + \sum_{l=1} g\right)$$
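A rough sketch of how a score of this kind might be computed from an alignment of signatures. Only the weights 1.0, 0.2, and 0.1 come from the slide. Everything else is an assumption made for illustration: the function names, the use of the standard Clustal strong and weak conservation groups to classify columns, the normalization by the number of alignment columns, and the `gap_weight` parameter, whose value is not given here and defaults to 0.0.

```python
# Standard Clustal "strong" and "weak" conservation groups (assumed to be
# the groups behind the ':' and '.' marks in the alignments shown).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_class(column):
    """Classify one alignment column as identity, strong, weak, gap, or none."""
    residues = set(column)
    if "-" in residues:
        return "gap"
    if len(residues) == 1:
        return "identity"
    if any(residues <= set(group) for group in STRONG):
        return "strong"
    if any(residues <= set(group) for group in WEAK):
        return "weak"
    return "none"


def profile_score(alignment, gap_weight=0.0):
    """Average per-column weight over an alignment of equal-length strings.

    Identity, strong, and weak weights follow the slide (1.0, 0.2, 0.1).
    `gap_weight` is a hypothetical parameter; its true value is not stated.
    """
    weights = {"identity": 1.0, "strong": 0.2, "weak": 0.1,
               "gap": gap_weight, "none": 0.0}
    columns = list(zip(*alignment))
    total = sum(weights[column_class(col)] for col in columns)
    return total / len(columns)


# Toy alignment: two identical columns, one strong column, one gapped column.
toy_alignment = ["ACDS", "ACET", "ACD-"]
toy_score = profile_score(toy_alignment)
```

Passing two sequences gives a pairwise score and passing a whole family gives a profile score, which mirrors the two uses listed above.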
Validation of Active Site Profile Score
• 193 real functional families
– 193 FFFs applied to known structures from PDB to identify functional families
– For each protein in each family, extract active site signature
– Align all signatures in a given family to create profile
– Calculate profile score
• 193 decoy functional families
– Geometric criteria “relaxed” slightly to identify first “false positive”
– (Automatically identified as part of training procedure)
– Extract signatures, align to create profile, calculate score
Validation of the active site profile score
Active site profile for serine carboxypeptidases
Profile score=0.42
1ac5_ LNGGPC-GESYAGQY-IGNGWI-----NMYNFN-NGDKDLICNN-NASHMVPFD
1ivyA LNGGP--GESYAGIYIVGNGLSLFNIYNLY--N-NGDVDMACNF-GAGHMVPTD
1ysc_ LNGGP--GESYAGHY-IGNGLTMAGE-NVYDIRKAGDKDFICNWLNGGHMVPFD
1ivyB LNGGP--GESYAGIYIVGNGLSLFNIYNLYA-N-NGDVDMACNF-GAGHMVPTD
1cpy_ LNGGP-AGASYAGHYIIGNGLTMAG--NVYDIR-AGDKDFICNWLNGGHMVPFD
      ***** * **** * :*** *:* . ** *: ** ...**** *
Validation of Active Site Profile Score
Serine carboxypeptidase profile
Score=0.42
1ac5_ LNGGPC-GESYAGQY-IGNGWI-----NMYNFN-NGDKDLICNN-NASHMVPFD
1ivyA LNGGP--GESYAGIYIVGNGLSLFNIYNLY--N-NGDVDMACNF-GAGHMVPTD
1ysc_ LNGGP--GESYAGHY-IGNGLTMAGE-NVYDIRKAGDKDFICNWLNGGHMVPFD
1ivyB LNGGP--GESYAGIYIVGNGLSLFNIYNLYA-N-NGDVDMACNF-GAGHMVPTD
1cpy_ LNGGP-AGASYAGHYIIGNGLTMAG--NVYDIR-AGDKDFICNWLNGGHMVPFD
      ***** * **** * :*** *:* . ** *: ** ...**** *
Serine carboxypeptidase decoy profile
Score=0.14
1ac5_ LNGGPC-GESYAGQY--IGNGWI-----NMYNFN-NGDKDLICNN---NASHMVPFD
1ivyA LNGGP--GESYAGIYI-VGNGLSLFNIYNLY--N-NGDVDMACNF---GAGHMVPTD
1ysc_ LNGGP--GESYAGHY--IGNGLTMAGE-NVYDIRKAGDKDFICNWL--NGGHMVPFD
1ivyB LNGGP--GESYAGIYI-VGNGLSLFNIYNLYA-N-NGDVDMACNF---GAGHMVPTD
1cpy_ LNGGP-AGASYAGHYI-IGNGLTMAG--NVYDIR-AGDKDFICNWL--NGGHMVPFD
1c4xA LHGAG--GNSMGGAVTLMGSVG-----SFVY----HGRQDRIVPLTLDRCGHWAQLE
      *:*. * * .* :*. :* * * .* . :
Validation of Active Site Profile Score
• Profile score compared to decoy profile score shows clear separation for most families
• Separation less distinct when decoy is functionally related to FFF family
• Profile score ≥0.25 considered significant
[Figure: profile score (y-axis) for each FFF functional family (x-axis), panels A and B; legend: true profiles, decoy profiles]
Prospective validation of the method
• Human protein tyrosine phosphatases (PTPs)
– PTPs are important signal transduction proteins
– Analysis demonstrates accuracy and throughput
• Yeast serine hydrolases
– Serine hydrolases are crucial for many cellular processes
– Analysis demonstrates experimental validation of sensitivity and accuracy of function annotations
– Performance compared to other tools
Method for genome analysis
• Download protein sequences encoded by human or yeast genome
• Run Prospector (Skolnick, et al.) fold recognition program
• For any protein sequence that aligns with structure used to create FFF:
– Take top 20 alignments (top five hits for four scoring functions)
– Determine if FFF residues conserved
• If yes:
– Predict FFF function
– Identify active site signature
– Align and calculate pairwise profile score
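The conservation check in this workflow can be sketched as below. This is an illustration only: it does not call Prospector or any real threading program, and the data layout (each alignment as a mapping from template position to aligned query residue), the function names, and the rule that a single conserving alignment among the top 20 is enough are all assumptions.

```python
def fff_conserved(alignment, key_residues):
    """True if every key template position is aligned to an allowed residue.

    `alignment` maps template position -> query residue (one-letter code);
    unaligned positions are simply absent. `key_residues` maps template
    position -> set of allowed residue identities for that FFF position.
    """
    return all(
        alignment.get(position) in allowed
        for position, allowed in key_residues.items()
    )


def predict_function(alignments, key_residues, function_name, top_n=20):
    """Return (function_name, rank) for the first conserving alignment, else None.

    `alignments` is assumed to be ordered best-first; only the first `top_n`
    are examined, mirroring the "top 20 alignments" step.
    """
    for rank, alignment in enumerate(alignments[:top_n]):
        if fff_conserved(alignment, key_residues):
            return function_name, rank
    return None


# Hypothetical FFF positions on a template and two made-up alignments.
toy_key_residues = {10: {"C"}, 15: {"R"}, 40: {"D", "E"}}
toy_alignments = [
    {10: "C", 15: "K", 40: "D"},  # position 15 not conserved
    {10: "C", 15: "R", 40: "E"},  # all three key positions conserved
]
toy_prediction = predict_function(toy_alignments, toy_key_residues, "PTP")
```

A sequence that passes this check would then go on to signature extraction and pairwise profile scoring, as listed in the final bullet group.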
PTP Functional Family
• Catalytic site is found in multiple protein structures
• Active site structure is conserved
2hnp, a classical PTP; 1vhr, a dual specificity PTP; 1phr, a low molecular weight PTP
Annotation of human genome sequences for PTP function
• Identified over 150 human PTPs
– Comparison to experimentally-verified PTPs shows that over 95% of known PTPs identified: false negative rate < 5%
• Over 40 unique PTPs identified
– Sequences that are not recognized as PTPs by any other method (including BLAST, Blocks, Prints and Pfam)
[Figure: percentage of identified PTPs unique to FFF versus found by FFF + other tools]
How good are these function assignments?
Functional Characterization of PTP Proteins
• Clone, express, and purify
• Test PTPs for biochemical function
• Progress (before termination of project)
– 49 soluble PTP domains purified
– 37 PTPs active in vitro
– Four active PTPs that were not previously recognized by other methods (including no recognizable similarity to any PTP in the public databases)
[Figure: Hydrolysis of pNPP by PTP #1; V (A405nm/sec) versus pNPP (mM)]
[Figure: number of proteins in each target set (total, structure-based novels, clean novels) at the target set, soluble protein, and active in vitro stages; percentage labels: 30%, 15%, 65%, 35%, 75%, 66%]
• False positive rate cannot be absolutely determined; PTP project shows:
– Total PTP proteins: 49 soluble proteins, with 37 active in pNPP hydrolysis assay (~25% not validated in assay)
– PTP proteins unrecognized by other methods: 6 soluble proteins, with 4 active in pNPP hydrolysis assay (~33% not validated in assay)
– Maximum false positive rate: ~25-33%
• Why a maximum?
– Only one substrate and assay condition tested
– Small sample set
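The arithmetic behind the ~25-33% bound can be reproduced directly from the counts above. The function name `unvalidated_fraction` is introduced here only for illustration.

```python
def unvalidated_fraction(soluble, active):
    """Fraction of soluble proteins that showed no activity in the assay."""
    return (soluble - active) / soluble


# Counts taken from the bullets above.
all_ptps = unvalidated_fraction(soluble=49, active=37)   # 12 of 49
novel_ptps = unvalidated_fraction(soluble=6, active=4)   # 2 of 6

print(f"All PTPs: {all_ptps:.1%} not validated")      # 24.5%, shown as ~25%
print(f"Novel PTPs: {novel_ptps:.1%} not validated")  # 33.3%, shown as ~33%
```

These fractions are upper bounds on the false positive rate because a protein that is inactive against pNPP under one condition may still be a genuine PTP.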
Functional Characterization of PTP Proteins
• Identified over 150 human PTPs
• Identify active site signature from each PTP sequence
• Align to create active site profile for PTP family
• Cluster to identify subfamilies of PTPs
Active Site Profiling of Human PTPs: Identification of Sub-families
– Novel PTP#5
– BLAST (global sequence similarity) indicates that PTP#5 is a dual specificity PTP
– Clustering of active site profile indicates “PTP#5” falls into class 1
[Figure: clustering of all PTPs (classical PTPs; dual specificity PTPs and PTEN; low molecular weight PTPs) into subfamilies 1-8]
Summary of human PTP annotation project
• 150 PTPs identified in human genome
– Over 95% of previously annotated PTPs identified (false negative rate <5%)
– Of those tested in our lab, 75% exhibited PTP function
• 40 proteins not identified by other methods (BLAST, Blocks, Pfam)
– Of those tested, 66% exhibited PTP function
• Maximum false positive rate: 25-33%
• Active site profiling subclassifies proteins differently than global sequence alignment
FFFs for Serine Hydrolases
• 35 serine hydrolase FFFs describing 25 EC-defined functions
– Nucleophilic serine in active site
– Protease, lipase, esterase, amidase or transacylase function (FAD-independent-S-hydroxynitrile lyase, too)
– Several “family” FFFs, including hydrolase “family” FFF
• 35 FFFs cover approximately 63% of known structural space and 23% of potential functional space
[Figure: coverage (0% to 100%) by function: (S) hydroxynitrile lyases, serine transacylases, serine amidases, serine esterases, serine lipases, serine proteases, total structural space]
Identification of Yeast Serine Hydrolases by FFFs and Profiling
• 6946 yeast protein sequences (NCBI and SGD)
• Threading with PROSPECTOR against PDB structures
• Analysis of top 20 threads (top five scores, four scoring functions) with serine hydrolase FFFs
• If thread is “hit” by FFF, sequence is identified as a serine hydrolase (yes or no)
• Active site profile scoring provides rank ordering of identified serine hydrolases; ≥0.25 is considered significant
Skolnick & Kihara. (2001) Proteins 42:319-331.
DiGennaro, Siew, Hoffman, Zhang, Skolnick, Neilson, Fetrow. (2001) J. Struct. Biol. 134:232-245.
Fetrow, Godzik & Skolnick. (1998) J. Mol. Biol. 282:703-711.
Annotation of yeast genome for serine hydrolase functions
• 147 proteins identified by combination of PROSPECTOR and serine hydrolase FFFs
• 52 of 147 proteins identified by more than one serine hydrolase FFF
• 55 of 147 proteins identified with significant active site profile score (≥0.25)
• 7 proteins were previously identified* as serine hydrolases (“knowns”)
– Profile score≥0.25: Dap2, Kex1, Prb1, Prc1, Ste13, and Yjl068c
– Profile score=0.23: Ppe1
*Previously identified in SGD (http://genome-www.stanford.edu/Saccharomyces/)
How good are these function assignments?
Activity-based Probe Technology
• Advantage: probe chemistry
– Identifies functional proteins in complex mixtures
– Fractionates proteome on basis of chemical reactivity (not protein abundance)
• Disadvantage: probe chemistry
– Specific for serine hydrolases?
[Figure: biological samples, activity probes, high throughput screening]
Patricelli, Giang, Stamp, Burbaum. (2001) Proteomics 1:1067-1071.
Kidd, Liu & Cravatt. (2001) Biochemistry 40:4005-4015.
Cravatt & Sorenson. (2000) Curr. Opin. Chem. Biol. 4:663-668.
Identification of Serine Hydrolases by ABPs
• Yeast grown under four culture conditions
• Cultures lysed, centrifuged, fractions labeled with ABP
• Affinity chromatography; separation of labeled proteins by 1D PAGE
• In-gel tryptic digest and LC-MS identification of peptides
• High quality identifications: More than one peptide identified for a given protein
Results of ABP labeling experiments
• 80 proteins uniquely labeled by ABP
• 23 of 80 proteins identified with high quality mass spec data
– 8 of 23 proteins were previously identified* as serine hydrolases (“knowns”): Dap2, Kex1, Ppe1, Prb1, Prc1, Ste13, Yjl068c and Amd2
– “unknowns”: Ygl039w, Ygl157w, Yml059c, Fas2, Ydr428c, Ynl123w, Yor084w, Eht1, Yju3, Ybr139w, Ybr204c, Yhr049w, Ylr118c, Ymr222c, and Yor280c
*Previously identified in Saccharomyces Genome Database (SGD) (http://genome-www.stanford.edu/Saccharomyces/)
Comparison of computational and experimental results
• Chemical proteomics: 23 high quality identifications
• Computational/structural proteomics: 55 proteins identified with significant active site profile score (≥0.25)
• 15 proteins identified by both methods (high quality identifications by both methods)
How well did the FFFs identify ABP-labeled proteins?
• If all 23 proteins identified by ABP labeling are correct, then:
– FFF identification: 15/23=65%
– FFF coverage of structure space (“the best we could expect to do”): 65%
– FFF coverage of biological function space (“the worst we could expect to do”): 23%
• But, are all the ABP identifications actually serine hydrolases?
What did the FFFs miss?
• 8 proteins identified by high quality ABP data, but not serine hydrolase FFFs
– Amd2 (“8th known”) identified by ABP, but not FFF because no amidase FFF had been constructed
– 3 proteins identified by dehydrogenase FFFs, not serine hydrolase FFFs (discussed subsequently)
– 3 proteins with significant threading scores, no FFF hit
  • Yor084w (1a8uA): chloroperoxidase T (known serine hydrolase)
  • Fas2 (1kas): 3-oxo-ACP-reductase/synthase
  • Ynl123w (1pysB): tRNA synthetase
– 1 protein (Ydr428c) yields no computational results
Advantages of Combining Methods: Clarification of ABP identifications
• 3 proteins identified by high quality ABP data, but not serine hydrolase FFFs
– Ygl039w, Ygl157w, and Yml059c
– All three labeled by another family of FFFs (UDP-galactose-4-epimerase, estradiol-17-beta dehydrogenase, and 3-alpha, 20-beta-hydroxysteroid dehydrogenase)
– Proteins in this family all have active site serine and tyrosine: possible site of ABP labeling
• If these protein functions are correctly identified by the FFFs AND if other five ABP identifications are correct, then:
– FFF identification: 18/23=78% (better than expected)
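The 65% and 78% figures can be checked with simple set arithmetic over the protein names listed on these slides. The variable and function names are introduced only for this illustration, and the spellings Yjl068c and Yhr049w follow the forms used on the FFF slides.

```python
# 23 high quality ABP identifications: 8 "knowns" plus 15 "unknowns".
abp_high_quality = {
    "Dap2", "Kex1", "Ppe1", "Prb1", "Prc1", "Ste13", "Yjl068c", "Amd2",
    "Ygl039w", "Ygl157w", "Yml059c", "Fas2", "Ydr428c", "Ynl123w", "Yor084w",
    "Eht1", "Yju3", "Ybr139w", "Ybr204c", "Yhr049w", "Ylr118c", "Ymr222c",
    "Yor280c",
}

# Proteins also identified by serine hydrolase FFFs: 7 knowns plus 8 novel.
serine_hydrolase_fff = {
    "Dap2", "Kex1", "Ppe1", "Prb1", "Prc1", "Ste13", "Yjl068c",
    "Eht1", "Yju3", "Ybr139w", "Ybr204c", "Yhr049w", "Ylr118c", "Ymr222c",
    "Yor280c",
}

# ABP-labeled proteins assigned instead by the dehydrogenase family of FFFs.
dehydrogenase_fff = {"Ygl039w", "Ygl157w", "Yml059c"}


def coverage(identified, reference):
    """Fraction of the reference set that also appears in `identified`."""
    return len(identified & reference) / len(reference)


serine_only = coverage(serine_hydrolase_fff, abp_high_quality)
with_dehydrogenases = coverage(serine_hydrolase_fff | dehydrogenase_fff,
                               abp_high_quality)

print(f"{serine_only:.0%}")          # 15/23
print(f"{with_dehydrogenases:.0%}")  # 18/23
```

The second figure counts the three dehydrogenase-family proteins as correct FFF function assignments, which is the condition stated in the last bullet.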
What about the “unknowns”?
• 15 proteins identified by both methods
– 7 of 8 “knowns” identified by both methods (Dap2, Kex1, Ppe1, Prb1, Prc1, Ste13, and Yjl068c)
– 8 novel annotations of proteins as serine hydrolases (Eht1, Yju3, Ybr139w, Ybr204c, Yhr049w, Ylr118c, Ymr222c, and Yor280c)
• All 8 annotated as “function unknown” or “hypothetical protein” in SGD
• High confidence in novel annotations (two independently applied methods)
New Family of Eukaryotic Serine Hydrolases (FSH)
• 3 yeast proteins (Yhr049w, Ymr222c, and Yor280c) identified by both ABP and FFFs
• 3 sequences related by sequence similarity
• All annotated as “function unknown” at SGD
• None annotated with confidence by other computational methods (Prints, Pfam or Blocks)
New Family of Eukaryotic Serine Hydrolases (FSH)
• These 3 proteins related to proteins from other eukaryotic proteomes (human, mouse, worm, fruit fly, mosquito, plant)
• No NCBI biochemical annotations for any of these proteins (except one—see next slide)
Cautionary Tale for Annotation Transfer
• One FSH protein, DYR_SCHPO, from S. pombe was annotated as a dihydrofolate reductase (DHFR)
• Sequence analysis indicates a multidomain protein: contains both DHFR and serine hydrolase function
– Possible biological connection between serine hydrolase and DHFR functions?
• Annotation transfer methods would have assigned incorrect function to FSH family of proteins
S. cerevisiae Fsh3p
S. pombe DYR_SCHPO
S. cerevisiae Fsh2p
S. cerevisiae Fsh1p
S. cerevisiae DHFR
Comparison to other computational methods: How much information does structure add?
• ABPs identified 23 proteins with high confidence
• FFFs identified 15 (65%) as serine hydrolases
• Pfam identified 10 (43%) as serine hydrolases
[Figure: number of proteins (total, FFF, Pfam) for all experimental hits and for experimental hits with SGD "molecular function unknown"]
Summary of yeast serine hydrolase annotation project
• 15 serine hydrolase sequences identified by both methods
– 7 of 8 known serine hydrolases identified by both methods (all eight identified by ABP labeling)
– 8 new serine hydrolases identified (formerly annotated as “function unknown”)
– New family of eukaryotic serine hydrolases (FSH)
• FFF annotation clarifies molecular function of the three proteins identified by ABP labeling
• More accurately identify limits of FFF and active site profiling accuracy
– If 23 ABP identifications are correct, FFF correctly identifies function of 78%
Baxter, et al. (2004) Mol. Cell Prot.
Structure-based annotation of protein function
• Prospective experimental validation of predictions demonstrates accuracies (and limitations) of current methods
• Mis-annotation of function continues to be a problem—found in all databases
• Results suggest that a significant number of proteins will exhibit well-studied functions, but are not identified by current computational methods
• Profiling of sequences around functional site provides additional information on function and specificity
Acknowledgements
– Susan Baxter (NCGR)
– Melanie Nelson (SAIC)
– Stephen Cammer (SDSC)
– Brian Hoffman (Scitegic)
– Jen Montimurro (Wadsworth Ctr)
– Stacy Knutson (Wake Forest)
– Jeff Speir (Scripps)
– Jeannine DiGennaro (GeneVault)
– Steve Betz (Neurocrine)
– Marijo Galina
– Susan Okuley
– Chris Scott
ActivX
– Jonathan Burbaum
– Jonathan Rosenblum
– Dan Giang
(now Cengent Therapeutics)