Analysis of Haloacid Dehalogenase superfamily members and functional assignment of
Structural Genomics proteins
by Mong Mary Touch
B.S. in Chemistry, University of Massachusetts Lowell
A thesis submitted to
The Faculty of
the College of Science of
Northeastern University
in partial fulfillment of the requirements
for the degree of Master of Science
December 12, 2014
Thesis directed by Mary Jo Ondrechen
Professor of Chemistry and Chemical Biology
Acknowledgments
I would like to thank my advisor, Dr. Mary Jo Ondrechen, who allowed me to join her
laboratory and work on this project, and who provided valuable guidance. Special thanks
to my mentee and friend, Eve Mozur, who worked very hard on this project with me. I
would also like to thank my committee members, Dr. Penny Beuning and Dr. Carla Mattos,
for their assistance and their time in reviewing my thesis. I want to thank previous
and current ORG members: Dr. Joslynn Lee, Dr. Ramya Parasuram, Lisa Ngu, Caitlyn
Mills, Timothy Coulther, Zhen Liu, and Jenifer Winters for their support and discussions.
My project was financially supported by the National Science Foundation under grant
CHE-1305655.
Abstract of Thesis
Dehalogenases are enzymes that can degrade certain types of organic pollutants,
particularly halogenated hydrocarbons, into non-toxic compounds. In this research, the
members of the haloacid dehalogenase (HAD) superfamily are analyzed to predict their
biochemical function, with the ultimate goal of finding enzymes for possible application
to the bioremediation of environmental contaminants. The HAD superfamily consists of
mainly phosphatases, dehalogenases, and a large number of protein structures of
unknown or uncertain function from Structural Genomics (SG). To study the HAD
superfamily, the computational methods Partial Order Optimum Likelihood (POOL) and
Structurally Aligned Local Sites of Activity (SALSA) are utilized. From this study, the
SG protein RSc1362 from Ralstonia solanacearum (PDB ID 3UMB) is predicted to
function as an L-2 haloacid dehalogenase while HAD/COF-like hydrolase from
Plasmodium vivax (PDB ID 2B30), a hypothetical protein from Geobacillus kaustophilus
(PDB ID 2PQ0), a putative phosphatase from Eubacterium rectale (PDB ID 3DAO), and
haloacid dehalogenase-like hydrolase from Bacteroides thetaiotaomicron (PDB ID
3NIW) are predicted to function as sugar phosphatases.
Table of Contents
Acknowledgments
Abstract of Thesis
Table of Contents
List of Figures
List of Tables
Introduction
Methods
Results
Conclusions
References
List of Figures
Figure 1 Capless HAD enzyme, deoxy-D-mannose-octulosonate 8-phosphate phosphatase
Figure 2 Catalytic mechanism of phosphatases and dehalogenases of the HAD superfamily
Figure 3 Structural alignment and functional residue alignment of L-2 haloacid dehalogenases
Figure 4 Structural alignment and functional residue alignment of sugar phosphatases
Figure 5 Structural alignment and functional residue alignment of C-terminal domain phosphatases
List of Tables
Table 1 Molecule name and source organism of proteins of known function in HAD superfamily
Table 2 Names and source organism of the studied SG proteins
Table 3 Catalytic residues of representative proteins of the HAD superfamily obtained from the literature
Table 4 Consensus signatures of proteins of three subgroups of HAD superfamily
Table 5 Alignment of the predicted residues for the SG proteins with the consensus signatures for three HAD subgroups
Table 6 SALSA table
Table 7 Functional residue alignments of high scoring SG proteins to the consensus signatures
I. Introduction
Halogenated hydrocarbons are widely used in industry, agriculture, and household
items [1]. They can be found in silicones, pesticides, disinfectants, air fresheners,
and rug cleaners. Halogenated hydrocarbons such as trichloroethane (TCA),
trichloroethene (TCE), and perchloroethylene (PCE) are known to be the most common
environmental pollutants found in soil and groundwater in the United States [2].
According to the United States Environmental Protection Agency, TCE, a volatile and
colorless organic compound, is a priority contaminant due to its wide usage and
potential carcinogenicity [3]. Contamination caused by halogenated hydrocarbons can be
detoxified by dehalogenation. Reductive dehalogenation can degrade TCE to yield ethane
and hydrochloric acid [4]. Because of the importance of dehalogenation in
bioremediation, the haloacid dehalogenase (HAD) superfamily is studied here.
The members of the HAD superfamily exist in all three superkingdoms of life [5]. These
enzymes are approximately 200 to 250 amino acids long. They adopt a Rossmannoid fold, a
three-layered α/β sandwich with a repeating α/β unit [5], shown in Figure 1. The α/β
sequence of the core domain is highly conserved throughout the superfamily. Some HAD
enzymes contain a cap domain, a sequence insertion into the core domain that functions
as a dynamic lid and determines the accessibility of the active site [6]. The modified
core catalytic domain creates loop regions that consist of four conserved motifs. Loop I
has the conserved nucleophilic Asp and loop II has the conserved phosphate-binding
residue, either Ser or Thr [7]. Loop III has a Lys or an Arg that interacts with the
phosphoryl group of the substrate, and loop IV binds a magnesium cofactor. Some proteins
of the HAD superfamily are capless and catalyze reactions with large substrates such as
proteins and DNA. An example of a capless HAD enzyme, deoxy-D-mannose-octulosonate
8-phosphate phosphatase (PDB ID 1K1E) [8], is shown in Figure 1.
Figure 1: A capless HAD enzyme, deoxy-D-mannose-octulosonate 8-phosphate
phosphatase, depicting the Rossmannoid fold (PDB ID 1K1E) [8]. The image was built using
YASARA software [9].
The enzymes that belong to the HAD superfamily include the P-type ATPases,
phosphatases, epoxide hydrolases, and L-2 haloacid dehalogenases; the latter catalyze
dehalogenation reactions on a wide variety of substrates [10]. Most of the HAD enzymes
are phosphatases. Phosphatases catalyze the hydrolysis of a phosphoryl group of the
substrate in two steps, as shown in Figure 2. First, a conserved nucleophilic Asp of the
enzyme attacks the electrophilic phosphorus center [5], forming a phosphoryl
intermediate. Second, a water molecule hydrolyzes the intermediate to regenerate the
enzyme [5]. The nucleophilic attack by Asp is promoted by coordination of the metal ion.
The positive charge on the metal ion neutralizes the negative charges on the Asp and the
phosphate group. It also assists in stabilizing the enzyme structure. Similarly, the
dehalogenases require nucleophilic attack by Asp, which displaces the halide ion and
forms an aspartyl ester intermediate that is then hydrolyzed by a water molecule, as
also shown in Figure 2.
Figure 2: Catalytic mechanism of phosphatases and dehalogenases of the HAD
superfamily [5]. Panels: Phosphatases; Dehalogenases.
Computational methods for analyzing the HAD superfamily
HAD is a large superfamily that consists of many enzymes, some of which have
unknown or putative biochemical functions. These proteins of unknown or putative
function are primarily Structural Genomics (SG) proteins, i.e., protein structures
determined by the Protein Structure Initiative (PSI) or other high-throughput structure
determination projects. The functions of SG proteins are often assigned based on
sequence or structure similarity to one of the proteins of known function belonging to
the HAD superfamily. These functional assignments are typically obtained using sequence
comparison methods such as PSI-BLAST [11].
The study of SG proteins can help to increase understanding of the relationship
between the structure and function of proteins. However, the use of three-dimensional
structures to determine the biological function of proteins has proved to be much more
difficult than was originally envisioned when the PSI was first proposed [12]. To
determine the function of SG proteins, the computational methods Partial Order Optimum
Likelihood (POOL) and Structurally Aligned Local Sites of Activity (SALSA) are
utilized. POOL is used to predict amino acid residues in a protein structure that
participate in the biochemical function, and SALSA is used to assign function according
to the local spatial arrangement of predicted residues at the active site. The
development, application, and verification of computational methods that can predict
protein function reliably from the three-dimensional structure will add tremendous value
to Structural Genomics data. This thesis represents an important step toward that goal.
POOL is a monotonicity-constrained maximum likelihood approach that uses the
three-dimensional structure of proteins to predict important residues involved in ligand
recognition or catalysis [13]. POOL is a machine learning method that uses input
features such as metrics from theoretical microscopic anomalous titration curve shapes
(THEMATICS) [14], ConCavity pocket features (ConCavity) [15], and INformation-theoretic
TREe traversal for Protein functional site IDentification (INTREPID) [16] to make the
predictions. The three methodologies used to generate POOL input features,
THEMATICS, ConCavity, and INTREPID, are now reviewed briefly.
THEMATICS is a computational method that predicts ionizable residues (Arg,
Lys, Asp, Glu, His, Cys and Tyr, plus the N- and C-termini) associated with the
Brønsted acid-base chemistry in the active site of enzymes [14]. Using only the
three-dimensional structure of a protein, the ionizable residues in the active site can
be identified because the shapes of their theoretical titration curves deviate from the
usual Henderson-Hasselbalch curves [14]. THEMATICS obtains the theoretical titration
curves from an approximate solution of the Poisson-Boltzmann equations [17] and then
from a Monte Carlo sampling by HYBRID [18]. Statistical and machine learning methods are
then used to calculate which residues are functionally significant in or near the active
site of an enzyme.
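As a point of reference for the curve shapes mentioned above, the following minimal sketch evaluates the ideal Henderson-Hasselbalch curve for a single, non-interacting acidic group. The function name and the pKa value are illustrative assumptions and are not part of THEMATICS, which works from titration curves computed for the whole protein.

```python
def protonated_fraction(ph: float, pka: float) -> float:
    """Ideal Henderson-Hasselbalch fraction of an acidic group that is protonated."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed, textbook-style pKa for an isolated Asp side chain (illustration only).
ASP_PKA = 3.9

for ph in (2.0, 3.9, 6.0, 8.0):
    fraction = protonated_fraction(ph, ASP_PKA)
    print(f"pH {ph:4.1f}: protonated fraction = {fraction:.3f}")
```

An ideal curve of this kind drops from nearly fully protonated to nearly fully deprotonated over a narrow pH range centered on the pKa; the residues of interest are those whose computed curves depart from this shape.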
The structure-only version of ConCavity is an algorithm that uses the surface
topology of the protein structure to evaluate residues based on their location in
protein surface cavities [15]. ConCavity assigns scores to residues in the protein based
on their likelihood of ligand binding to predict the location of the ligand-binding
atoms in space [15]. INTREPID is a method that uses the sequence information, the
phylogenetic tree, and the Jensen-Shannon (JS) divergence [19] to predict the conserved
catalytic positions of the query proteins [16]. The positional conservation scores
obtained from the calculation of JS divergence are adjusted to consider the scores of
other positions [16]. Thus, the combination of THEMATICS, ConCavity, and INTREPID
enhances the performance of POOL in the prediction of residues important in catalysis.
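To make the JS divergence concrete, the sketch below scores a single alignment column against a background distribution. It covers only the per-column divergence, not the phylogenetic tree traversal that INTREPID performs. The uniform background, the function names, and the two example columns are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column: str) -> list[float]:
    """Relative frequency of each amino acid in one alignment column (gaps ignored)."""
    counts = Counter(column)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence: mean KL divergence of p and q from their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Assumed uniform background; an empirical amino acid background is also possible.
BACKGROUND = [1 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)

conserved = js_divergence(column_distribution("DDDDDDDD"), BACKGROUND)
variable = js_divergence(column_distribution("DEKTSAGL"), BACKGROUND)
print(f"conserved column: {conserved:.3f}, variable column: {variable:.3f}")
```

A fully conserved column diverges more from the background than a variable one, which is why a high JS divergence is read as a sign of positional conservation.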
Table 1: Molecule name and source organism of proteins of known function in the HAD
superfamily, obtained from the RCSB Protein Data Bank [20].
Subgroup                        PDB ID   Molecule name   Organism
L-2 Haloacid Dehalogenase       1JUD     L-DEX YL        Pseudomonas sp. YL
L-2 Haloacid Dehalogenase       1AQ6     DhlB            Xanthobacter autotrophicus
L-2 Haloacid Dehalogenase       2NO4     DehIVa          Burkholderia cepacia
Sugar Phosphatase               1YMQ     BT4131          Bacteroides thetaiotaomicron
Sugar Phosphatase               1U02     T6PP            Thermoplasma acidophilum
Sugar Phosphatase               1TJ3     SPP             Synechocystis sp. PCC 6803
C-terminal Domain Phosphatase   3EF0     Fcp1            Schizosaccharomyces pombe
C-terminal Domain Phosphatase   2GHQ     Scp1            Homo sapiens
Table 2: Names and source organisms of the SG proteins that are studied here.
PDB ID  Molecule name                                   Organism
1PW5    Putative nagD protein                           Thermotoga maritima
1QYI    Q8NW41                                          Staphylococcus aureus subsp. aureus MW2
1YV9    Hypothetical protein                            Enterococcus faecalis
1ZJJ    Hypothetical protein PH1952                     Pyrococcus horikoshii
2B30    HAD/COF-like hydrolase                          Plasmodium vivax
2HI0    Putative phosphoglycolate phosphatase           Gallus gallus
2HOQ    Probable haloacid dehalogenase                  Pyrococcus horikoshii
2HSZ    Novel predicted phosphatase                     Haemophilus somnus
2PQ0    Hypothetical protein                            Geobacillus kaustophilus
2PR7    Haloacid dehalogenase/epoxide hydrolase family  Corynebacterium glutamicum
2YBD    Hydrolase, haloacid dehalogenase-like family    Pseudomonas fluorescens Pf-5
3DAO    Putative phosphatase                            Eubacterium rectale
3EPR    Hydrolase, haloacid dehalogenase-like family    Streptococcus agalactiae serogroup V
3FVV    Uncharacterized protein                         Bordetella pertussis
3M9L    Hydrolase, haloacid dehalogenase-like family    Pseudomonas protegens Pf-5
3NIW    Haloacid dehalogenase-like hydrolase            Bacteroides thetaiotaomicron
3QNM    Haloacid dehalogenase-like hydrolase            Bacteroides thetaiotaomicron
3R09    Hydrolase, haloacid dehalogenase-like family    Pseudomonas fluorescens Pf-5
3UMB    RSc1362                                         Ralstonia solanacearum
4EEK    Beta-phosphoglucomutase-related protein         Deinococcus radiodurans
4GXT    A conserved functionally unknown protein        Anaerococcus prevotii
For each protein structure, the predicted set of residues involved in catalysis
consists of a spatially localized arrangement of specific types of amino acids. The
challenge of matching residues of the SG proteins to those of the proteins of known
function can be overcome using a matching and scoring method, SALSA. First, SALSA
establishes consensus signatures, i.e., local spatial arrangements of catalytically
important residues, based on POOL predictions for proteins of common known function.
SALSA then predicts the biochemical function of proteins of unknown function by matching
residues at the predicted catalytic site with the consensus signatures for the known
functional types. Finally, SALSA calculates a score that measures the quality of the
local structural match [21]. From the scores, the annotated function of SG proteins may
be confirmed or challenged; in cases where the original functional assignment is found
to be incorrect, a more likely function may be assigned. Functional assignments made by
POOL and SALSA for the SG proteins will need experimental verification to confirm the
annotation. In this study, the focus is on HAD superfamily proteins of known function
and selected SG proteins; the HAD proteins of known function used to obtain the
consensus signatures are listed in Table 1. The SG proteins evaluated here are listed in
Table 2.
II. Methods
2.1. Analysis of the proteins of known function of HAD superfamily
The members of the HAD superfamily are classified into subgroups according to
their biochemical reactions and functions. Because of time constraints, only three
subgroups of the HAD superfamily were analyzed: the L-2 haloacid dehalogenases, the
sugar phosphatases, and the C-terminal domain phosphatases. For each of these subgroups,
a few representative proteins of known function were selected. Proteins with low
sequence identity to one another within a subgroup were preferred, to avoid bias in the
determination of spatially overlapped residues. As shown in Table 1, a total of eight
proteins of the HAD superfamily were chosen: three L-2 haloacid dehalogenases, three
sugar phosphatases, and two C-terminal domain phosphatases. The reaction mechanisms of
these dehalogenases and phosphatases are shown in Figure 2. The previously reported
functionally important residues for each protein were obtained from literature reports
and the PDBsum database [22] and are listed in Table 3.
Table 3: The catalytic residues of representative proteins of the HAD superfamily
obtained from the literature.
PDB ID  Catalytic/Important residues
1JUD D10, Y12, T14, R41, S118, K151, Y157, S175, N177, D180
1AQ6 D8, Y10, T12, R39, L113, S114, N115, G116, K147, Y153, S171, N173, D176
2NO4 D11, T15, R42, S119, K152, Y158, S176, N178, D181
1YMQ D8, D10, T43, K188, D211, N214, D215
1U02 D7, D9, T45, R47, K161, D179, D180, D183
1TJ3 D9, D11, T41, G42, K163, N189
3EF0 D170, L171, D172, T174, K280, D297, D298
2GHQ D96, D98, Y158, R178, K190, D206
Often, not all of the important residues located in or near the binding pocket have
been tested and reported in the literature. An alternative way to find most or all of
the important residues in enzymes is the POOL method, which predicts the residues
involved in ligand recognition and catalysis. Using the set of protein structures of
known function, a Consensus Signature (CS) is identified for each HAD subgroup in order
to provide a means to recognize that subgroup. A CS is defined by the spatial alignment
of POOL-predicted amino acid residues of the same type for all or most of the proteins
of known function within that functional subgroup. The alignments of proteins within
each subgroup are made using the PDBeFold service [23] and the Chimera software [24].
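The definition of a consensus signature can be illustrated with a small sketch. The rows are the sugar phosphatase entries listed in Table 4, with upper case marking POOL-predicted residues. The sketch assumes that the twelve entries of each row occupy the same twelve aligned positions, and it reads "all or most" as a strict majority; both the function name and that majority rule are assumptions, not the SALSA implementation.

```python
from collections import Counter

# Sugar phosphatase rows from Table 4; upper case = POOL-predicted residue.
SUGAR_PHOSPHATASES = {
    "1YMQ": "D8 D10 G11 T12 T43 G44 R45 K188 D211 G212 N214 D215".split(),
    "1U02": "D7 D9 G10 T11 T45 G46 R47 K161 D179 D180 T182 D183".split(),
    "1TJ3": "D9 D11 n12 t13 T41 G42 r43 K163 D186 S187 N189 D190".split(),
}


def consensus_signature(rows, min_fraction=0.5):
    """Return one residue type per aligned position, or None where no type dominates.

    A position is kept when a single POOL-predicted (upper-case) residue type
    occurs in more than `min_fraction` of the proteins (assumed majority rule).
    """
    signature = []
    for column in zip(*rows):
        predicted = [entry[0] for entry in column if entry[0].isupper()]
        if not predicted:
            signature.append(None)
            continue
        residue, count = Counter(predicted).most_common(1)[0]
        signature.append(residue if count / len(column) > min_fraction else None)
    return signature


signature = consensus_signature(list(SUGAR_PHOSPHATASES.values()))
print(signature)
```

Under these assumptions, a position such as the tenth one, where the three proteins carry Gly, Asp, and Ser, drops out of the signature, while positions with two or three matching predicted residues are retained.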
Important residues in the active site of proteins are obtained using the POOL
method. Minimally, the three-dimensional structure of the protein is required for the
POOL method; if a sufficient number of homologues exists, then INTREPID can be used
to improve the accuracy of the functional residue predictions. The three input types,
THEMATICS, ConCavity, and INTREPID, are used for the POOL prediction. For
present purposes, the top 10% of the residues in the POOL ranking are considered
significant for participation in catalysis. The predicted set of important residues is
compared with the consensus signatures obtained for the proteins of known function. The
output from POOL is used for the SALSA scoring. SALSA scores proteins using a
scoring matrix on the aligned residues in the local spatial region of the predicted
active site. For example, proteins within a subgroup whose aligned active site residues
are similar to each other or to the consensus signature will get a high score. In
contrast, proteins that belong to different subgroups will have a low score because the
aligned active site residues do not match well.
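The top 10% cutoff amounts to truncating a ranked list. The sketch below shows one way to do this; the function name, the choice to round up, and the placeholder residue labels are assumptions, since the text does not specify how fractional counts are handled.

```python
def top_percent(ranked_residues, percent=10):
    """Keep the leading `percent` of an already ranked list, rounding up (assumed)."""
    # Integer arithmetic avoids floating-point surprises such as 0.1 * 30 > 3.
    n_keep = max(1, (len(ranked_residues) * percent + 99) // 100)
    return ranked_residues[:n_keep]


# Placeholder ranking for a hypothetical 225-residue protein, best-ranked first.
ranking = [f"residue_{rank}" for rank in range(1, 226)]
selected = top_percent(ranking)
print(len(selected), "residues retained for comparison with the consensus signatures")
```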
2.2. Analysis of Structural Genomics proteins
The Structural Genomics proteins in the superfamily can be found by submitting
all of the representative protein structures to a structure comparison server, such as
the DALI server [25]. The DALI server outputs a list of protein structures, along with
their PDB IDs, structure comparison scores, and percent sequence identities relative to
the input protein. The SG proteins with low percent sequence identity relative to the
representative proteins are the most interesting in this study, because two proteins
with high percent identity are likely to have the same function, so their function can
often be inferred from sequence alone. POOL was run for each SG protein to obtain
important residues in the active site, and again the top 10% of the ranked residues was
used. All the SG proteins were aligned with the representative proteins using the
PDBeFold service. Each SG protein was scored against each representative protein, which
generates a scoring table. The scores were calculated using the BLOSUM62 matrix, a
matrix that scores aligned residues in pairwise fashion [26]. The number 62 in the name
indicates that protein sequences that are more than 62% identical were clustered
together to generate the matrix, which is based on amino acid substitution
probabilities. The SG proteins with a score higher than the threshold (>0.4) are
predicted to have the same function as the representative proteins.
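The scoring step can be sketched as a sum of pairwise substitution scores over aligned positions, divided by the score of a perfect match. The sketch uses only a small excerpt of BLOSUM62 and the residue types listed for 1YMQ(A) and 3NIW(A) in Table 7. Normalizing by the reference self-score and ignoring whether a residue was POOL-predicted are assumptions, so the value produced is not expected to reproduce the SALSA scores in Table 6.

```python
# Excerpt of the symmetric BLOSUM62 matrix; only the pairs needed below are listed.
BLOSUM62_EXCERPT = {
    ("D", "D"): 6, ("G", "G"): 6, ("T", "T"): 5, ("R", "R"): 5,
    ("K", "K"): 5, ("N", "N"): 6, ("S", "S"): 4, ("S", "T"): 1,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score, trying both orders because the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]  # raises KeyError for pairs outside the excerpt


def normalized_score(reference: str, query: str) -> float:
    """Summed pairwise score over aligned positions, divided by the reference self-score."""
    if len(reference) != len(query):
        raise ValueError("aligned residue strings must have equal length")
    raw = sum(pair_score(r, q) for r, q in zip(reference, query))
    best = sum(pair_score(r, r) for r in reference)
    return raw / best


reference = "DDGTTGRKDGND"  # residue types of 1YMQ(A), Table 7
query = "DDGTSGRKDGND"      # residue types of 3NIW(A), Table 7, case ignored
score = normalized_score(reference, query)
print(f"normalized score = {score:.2f}; above the 0.4 threshold: {score > 0.4}")
```

With this normalization, an identical set of aligned residues scores exactly 1, and a conservative substitution such as Thr to Ser lowers the score only slightly.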
III. Results
The members of the HAD superfamily were obtained from a search through the
SCOP [27] and SFLD [28] databases. A few representative enzymes were chosen for each
subgroup. These representative proteins were analyzed using POOL to obtain a ranked
list of residues, the top-ranked residues being the most likely to participate in catalysis.
The POOL-predicted residues were used to generate the consensus signatures for the
known functional subgroups. These consensus signatures for the three subgroups are
shown in Table 4. For each protein structure, the subunit used in the alignment is
indicated, along with the PDB ID, in the first column. Each row in Table 4 represents a
protein structure, grouped according to biochemical function. Each column represents an
aligned spatial position. Residues previously reported to be important are shown in red.
POOL-predicted residues are shown in upper case.
The Consensus Signatures of the three subgroups of the HAD superfamily shown in
Table 4 were obtained from literature information and from the alignment of the
POOL-predicted residues of the proteins of known function within each subgroup. From
this alignment of the proteins of known function, a total of 23 different spatial
positions were found to be important for one or more of the three functional types. The
alignments of the 21 SG proteins to the three subgroups were generated based upon the
alignments of all SG proteins to each subgroup individually and as a whole. The
resulting local alignments are shown in Table 5.
Table 4: The consensus signatures of proteins of three subgroups of HAD superfamily. POOL predicted residues are shown in
upper case. Catalytic residues previously reported in the literature are shown in red. Rows represent individual protein structures, with
proteins of common function grouped together. Columns represent aligned spatial positions.
Position
PDB ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1JUD(A) D10 Y12 G13 T14 R41 Y91 L117 S118 N119 G120 K151 y157 s175 N177 D180
1AQ6(A) D8 Y10 G11 T12 R39 Y89 l113 S114 N115 g116 K147 Y153 s171 N173 D176
2NO4(A) D11 Y13 G14 t15 R42 Y92 l118 s119 n120 g121 K152 Y158 s176 N178 D181
1YMQ(A) D8 D10 G11 T12 T43 G44 R45 K188 D211 G212 N214 D215
1U02(A) D7 D9 G10 T11 T45 G46 R47 K161 D179 D180 T182 D183
1TJ3(A) D9 D11 n12 t13 T41 G42 r43 K163 D186 S187 N189 D190
3EF0(A) D170 l171 D172 T174 T243 Y249 r271 K280 D297 D298
2GHQ(B) D96 L97 D98 t100 t152 Y158 R178 K190 D206 N207
Table 5: The alignment of the predicted residues for the SG proteins with the consensus signatures for three HAD subgroups.
POOL-predicted residues are shown in uppercase. Previously reported catalytic residues are shown in red.
Position
PDB ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1JUD(A) D10 Y12 G13 T14 R41 Y91 L117 S118 N119 G120 K151 y157 s175 N177 D180
1AQ6(A) D8 Y10 G11 T12 R39 Y89 l113 S114 N115 g116 K147 Y153 s171 N173 D176
2NO4(A) D11 Y13 G14 t15 R42 Y92 l118 s119 n120 g121 K152 Y158 s176 N178 D181
1YMQ(A) D8 D10 G11 T12 T43 G44 R45 K188 D211 G212 N214 D215
1U02(A) D7 D9 G10 T11 T45 G46 R47 K161 D179 D180 T182 D183
1TJ3(A) D9 D11 n12 t13 T41 G42 r43 K163 D186 S187 N189 D190
3EF0(A) D170 l171 D172 T174 T243 Y249 r271 K280 D297 D298
2GHQ(B) D96 L97 D98 t100 t152 Y158 R178 K190 D206 N207
1PW5(A) D11 M12 D13 G14 T15 f43 T44 N45 N46 s47 d53 m81 n185 p186 v189 G207 D208 R209 D213
1QYI(A) D7 V8 D9 G10 V11 l38 y209 a237 T238 G239 R240 p241 E244 a266 y276 n286 p287 Y290 G322 D323 S324 D327
1YV9(A) D11 L12 D13 G14 T15 v43 T44 N46 t47 t53 m82 k185 a186 m189 G207 D208 N209 D213
1ZJJ(A) D7 M8 D9 G10 V11 l39 T40 N42 s43 m49 m77 n189 e190 y193 G209 D210 R211 D215
2B30(C) D33 F34 D35 G36 T37 C67 T68 G69 R70 K225 G247 D248 A249 N251 D252
2HI0(A) D9 m10 D11 G12 T13 v40 q105 v131 S132 N133 K134 p135 a138 a166 p167 t170 G188 D189 S190 D193
2HOQ(A) D8 L9 D10 D11 T12 d39 f90 i116 T117 g119 n120 K123 K150 h152 p153 f156 G174 D175 R176 s179
2HSZ(A) D10 L11 D12 g13 t14 n41 c91 v117 T118 N119 K120 P121 H124 G144 h153 p154 f157 g175 D176 S177 D180
2PQ0(A) D9 D11 G12 T13 T43 G44 R45 K184 G206 D207 G208 N210 D211
2PR7(A) D7 Y8 a9 g10 v11 l39 S40 d42 p43 g44 g47 e75 e76 f79 D97 D98 s99 N102
2YBD(A) D12 M14 D15 G16 T17 l44 a67 L93 t94 r95 n96 l100 r122 l135 g154 d155 Y156 D159
3DAO(A) D8 I9 D10 g11 T12 C42 S43 G44 R45 Q46 K193 G215 D216 N217 N219 D220
3EPR(A) D11 L12 D13 G14 T15 V43 T44 N46 t47 s53 m81 n184 a185 m188 G206 D207 N208 D212
3FVV(A) D10 L11 D12 H13 T14 r42 m85 v114 T115 A116 T117 n118 v121 t137 S185 D186 S187 D190
3M9L(A) D12 m13 D14 G15 T16 l92 T93 R94 n95 l99 P128 G151 D152 y153 f155 D156
3NIW(A) D9 L10 D11 G12 T13 A42 S43 G44 R45 Y69 K196 G218 D219 g220 N222 D223
3QNM(A) D9 L10 D11 D12 T13 s40 p101 l126 S127 N128 g129 f130 l133 K160 r162 p163 f166 G184 D185 S186 A189
3R09(A) D12 M14 D15 G16 T17 l44 a67 L93 T94 r95 n96 l100 r122 l135 g154 d155 Y156 D159
3UMB(A) D10 a11 Y12 G13 T14 R41 Y95 l121 S122 N123 g124 m128 K155 a157 p158 Y161 s179 s180 N181 D184
4EEK(A) D12 L13 D14 G15 V16 e43 m84 g110 S111 n112 s113 r117 K146 h148 p149 Y152 E170 D171 S172 G175 g176
4GXT(A) D43 W44 d45 N46 T47 V240 S241 A242 S243 f244 i247 l269 v296 G316 D317 S318 G320 D321
Table 6: SALSA table showing scores for the aligned functional residues of SG proteins to the representative proteins of HAD
subgroups. Several of the right-most columns of the table are missing due to space constraint, but are related by symmetry to the
bottom rows of the table. The high SALSA scores (>0.4) to the known functional subgroups, indicating a functional match, are
highlighted in green.
PDB ID 1JUD(A) 1AQ6(A) 2NO4(A) 1YMQ(A) 1U02(A) 1TJ3(A) 3EF0(A) 2GHQ(B) 1PW5(A) 1QYI(A) 1YV9(A) 1ZJJ(A) 2B30(C) 2HI0(A) 2HOQ(A) 2HSZ(A) 2PQ0(A) 2PR7(A) 2YBD(A) 3DAO(A) 3EPR(A) 3FVV(A) 3M9L(A) 3NIW(A) 3QNM(A) 3R09(A) 3UMB(A) 4EEK(A) 4GXT(A)
1JUD(A) 1 1 1 -0.16 -0.15 -0.21 -0.46 -0.52 -0.01 0.01 0.09 0.12 -0.20 0.02 0.06 0.01 -0.12 0.00 0.06 -0.02 0.09 0.00 -0.06 -0.16 0.03 0.06 0.74 0.08 -0.25
1AQ6(A) 1 1 1 -0.16 -0.15 -0.21 -0.46 -0.52 -0.01 0.01 0.09 0.12 -0.20 0.02 0.06 0.01 -0.12 0.00 0.06 -0.02 0.09 0.00 -0.06 -0.16 0.03 0.06 0.74 0.08 -0.25
2NO4(A) 1 1 1 -0.16 -0.15 -0.21 -0.46 -0.52 -0.01 0.01 0.09 0.12 -0.20 0.02 0.06 0.01 -0.12 0.00 0.06 -0.02 0.09 0.00 -0.06 -0.16 0.03 0.06 0.74 0.08 -0.25
1YMQ(A) -0.16 -0.16 -0.16 1 0.84 0.86 -0.12 -0.25 0.11 0.02 0.15 0.00 0.70 0.04 -0.16 0.02 0.94 -0.10 -0.08 0.64 0.08 -0.02 0.11 0.71 -0.22 -0.08 -0.28 -0.19 0.06
1U02(A) -0.15 -0.15 -0.15 0.84 1 0.79 -0.12 -0.26 0.11 0.02 0.16 0.00 0.61 0.04 -0.16 0.02 0.79 -0.10 -0.09 0.59 0.10 -0.02 0.12 0.56 -0.22 -0.09 -0.27 -0.22 0.04
1TJ3(A) -0.21 -0.21 -0.21 0.86 0.79 1 -0.12 -0.26 0.05 0.00 0.09 -0.06 0.65 0.01 -0.12 0.00 0.82 -0.12 -0.15 0.59 0.03 0.06 0.05 0.58 -0.15 -0.15 -0.35 -0.22 0.19
3EF0(A) -0.46 -0.46 -0.46 -0.1 -0.1 -0.1 1 0.85 -0.17 -0.33 -0.06 -0.11 -0.17 -0.28 -0.19 -0.11 -0.10 -0.25 0.01 -0.21 -0.06 -0.03 0.07 -0.08 -0.31 0.01 -0.50 -0.35 -0.10
2GHQ(B) -0.52 -0.52 -0.52 -0.3 -0.3 -0.3 0.847 1 -0.32 -0.49 -0.21 -0.26 -0.28 -0.43 -0.35 -0.26 -0.25 -0.40 -0.14 -0.36 -0.21 -0.18 -0.08 -0.24 -0.46 -0.14 -0.57 -0.50 -0.25
1PW5(A) -0.01 -0.01 -0.01 0.11 0.11 0.05 -0.17 -0.32 1 0.39 0.54 0.62 0.19 0.41 0.35 0.40 0.17 0.13 0.21 0.25 0.61 0.04 0.35 0.16 0.27 0.21 0.18 0.20 0.10
1QYI(A) 0.01 0.01 0.01 0.02 0.02 0.00 -0.33 -0.49 0.39 1 0.21 0.32 0.13 0.39 0.29 0.49 0.09 0.06 0.17 0.18 0.27 0.10 0.09 0.18 0.17 0.17 0.19 0.24 0.01
1YV9(A) 0.09 0.09 0.09 0.15 0.16 0.09 -0.06 -0.21 0.54 0.21 1 0.60 0.22 0.39 0.34 0.39 0.21 0.14 0.29 0.35 0.91 0.12 0.36 0.24 0.29 0.29 0.24 0.19 0.15
1ZJJ(A) 0.12 0.12 0.12 0.00 0.00 -0.06 -0.11 -0.26 0.62 0.32 0.60 1 0.13 0.29 0.36 0.32 0.09 0.27 0.30 0.19 0.64 0.02 0.40 0.09 0.26 0.30 0.30 0.30 0.07
2B30(C) -0.20 -0.20 -0.20 0.70 0.61 0.65 -0.17 -0.28 0.19 0.13 0.22 0.13 1 0.16 0.01 0.15 0.65 0.02 0.07 0.71 0.17 0.06 0.22 0.64 -0.04 0.07 -0.19 -0.12 0.19
2HI0(A) 0.02 0.02 0.02 0.04 0.04 0.01 -0.28 -0.43 0.41 0.39 0.39 0.29 0.16 1 0.22 0.60 0.12 0.15 0.29 0.25 0.37 0.18 0.29 0.19 0.35 0.29 0.28 0.23 0.16
2HOQ(A) 0.06 0.06 0.06 -0.16 -0.16 -0.12 -0.19 -0.35 0.35 0.29 0.34 0.36 0.01 0.22 1 0.41 -0.03 0.02 0.18 0.07 0.32 0.13 0.22 -0.01 0.45 0.18 0.24 0.38 -0.04
2HSZ(A) 0.01 0.01 0.01 0.02 0.02 0.00 -0.11 -0.26 0.40 0.49 0.39 0.32 0.15 0.60 0.41 1 0.09 0.12 0.24 0.16 0.33 0.21 0.20 0.16 0.31 0.24 0.17 0.28 0.09
2PQ0(A) -0.12 -0.12 -0.12 0.94 0.79 0.82 -0.10 -0.25 0.17 0.09 0.21 0.09 0.65 0.12 -0.03 0.09 1 -0.06 0.03 0.72 0.19 0.02 0.22 0.78 -0.09 0.03 -0.22 -0.16 0.17
2PR7(A) 0.00 0.00 0.00 -0.10 -0.10 -0.12 -0.25 -0.40 0.13 0.06 0.14 0.27 0.02 0.15 0.02 0.12 -0.06 1 -0.06 0.07 0.15 -0.09 0.03 -0.01 0.08 -0.06 0.12 0.10 0.00
2YBD(A) 0.06 0.06 0.06 -0.08 -0.09 -0.15 0.01 -0.14 0.21 0.17 0.29 0.30 0.07 0.29 0.18 0.24 0.03 -0.06 1 0.12 0.28 0.34 0.58 0.10 0.14 1.00 0.09 -0.03 0.12
3DAO(A) -0.02 -0.02 -0.02 0.64 0.59 0.59 -0.21 -0.36 0.25 0.18 0.35 0.19 0.71 0.25 0.07 0.16 0.72 0.07 0.12 1 0.30 0.12 0.28 0.70 0.06 0.12 -0.02 0.01 0.24
3EPR(A) 0.09 0.09 0.09 0.08 0.10 0.03 -0.06 -0.21 0.61 0.27 0.91 0.64 0.17 0.37 0.32 0.33 0.19 0.15 0.28 0.30 1 0.10 0.35 0.19 0.26 0.28 0.23 0.21 0.14
3FVV(A) 0.00 0.00 0.00 -0.02 -0.02 0.06 -0.03 -0.18 0.04 0.10 0.12 0.02 0.06 0.18 0.13 0.21 0.02 -0.09 0.34 0.12 0.10 1 0.22 0.12 0.06 0.35 0.02 -0.02 0.32
3M9L(A) -0.06 -0.06 -0.06 0.11 0.12 0.05 0.07 -0.08 0.35 0.09 0.36 0.40 0.22 0.29 0.22 0.20 0.22 0.03 0.58 0.28 0.35 0.22 1 0.19 0.16 0.59 0.13 0.07 0.28
3NIW(A) -0.16 -0.16 -0.16 0.71 0.56 0.58 -0.08 -0.24 0.16 0.18 0.24 0.09 0.64 0.19 -0.01 0.16 0.78 -0.01 0.10 0.70 0.19 0.12 0.19 1 0.0 0.1 -0.2 0.0 0.3
3QNM(A) 0.03 0.03 0.03 -0.22 -0.22 -0.15 -0.31 -0.46 0.27 0.17 0.29 0.26 -0.04 0.35 0.45 0.31 -0.09 0.08 0.14 0.06 0.26 0.06 0.16 0.01 1 0.1 0.3 0.3 0.1
3R09(A) 0.06 0.06 0.06 -0.08 -0.09 -0.15 0.01 -0.14 0.21 0.17 0.29 0.30 0.07 0.29 0.18 0.24 0.03 -0.06 1 0.12 0.28 0.35 0.59 0.10 0.12 1 0.09 -0.03 0.12
3UMB(A) 0.74 0.74 0.74 -0.28 -0.27 -0.35 -0.50 -0.57 0.18 0.19 0.24 0.30 -0.19 0.28 0.24 0.17 -0.22 0.12 0.09 -0.02 0.23 0.02 0.13 -0.15 0.26 0.09 1 0.3 -0.1
4EEK(A) 0.08 0.08 0.08 -0.19 -0.22 -0.22 -0.35 -0.50 0.20 0.24 0.19 0.30 -0.12 0.23 0.38 0.28 -0.16 0.10 -0.03 0.01 0.21 -0.02 0.07 -0.04 0.30 -0.03 0.27 1 -0.07
4GXT(A) -0.25 -0.25 -0.25 0.06 0.04 0.19 -0.10 -0.25 0.10 0.01 0.15 0.07 0.19 0.16 -0.04 0.09 0.17 0.00 0.12 0.24 0.14 0.32 0.28 0.27 0.13 0.12 -0.15 -0.07 1
Table 7: The functional residue alignments of high scoring SG proteins to the consensus signatures.
Position
Subgroup PDB ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
L-2 Haloacid Dehalogenase
1JUD(A) D10 Y12 G13 T14 R41 Y91 L117 S118 N119 G120 K151 y157 s175 N177 D180
1AQ6(A) D8 Y10 G11 T12 R39 Y89 l113 S114 N115 g116 K147 Y153 s171 N173 D176
2NO4(A) D11 Y13 G14 t15 R42 Y92 l118 s119 n120 g121 K152 Y158 s176 N178 D181
SG 3UMB(A) D10 Y12 G13 T14 R41 Y95 l121 S122 N123 g124 K155 Y161 s179 N181 D184
Sugar Phosphatase
1YMQ(A) D8 D10 G11 T12 T43 G44 R45 K188 D211 G212 N214 D215
1U02(A) D7 D9 G10 T11 T45 G46 R47 K161 D179 D180 T182 D183
1TJ3(A) D9 D11 n12 t13 T41 G42 r43 K163 D186 S187 N189 D190
SG
2B30(C) D33 D35 G36 T37 T68 G69 R70 K225 D248 A249 N251 D252
2PQ0(A) D9 D11 G12 T13 T43 G44 R45 K184 D207 G208 N210 D211
3DAO(A) D8 D10 g11 T12 S43 G44 R45 K193 D216 N217 N219 D220
3NIW(A) D9 D11 G12 T13 S43 G44 R45 K196 D219 g220 N222 D223
C-terminal Domain Phosphatase
3EF0(A) D170 l171 D172 T174 T243 Y249 r271 K280 D297 D298
2GHQ(B) D96 L97 D98 t100 t152 Y158 R178 K190 D206 N207
The second row of Table 5 indicates the 23 functionally important positions of the
spatially aligned residues and each of the rows 3-31 represents an individual protein
structure. Each of the 23 columns represents an aligned spatial position. The right-most
column in Table 5 specifies the PDB ID along with the subunit used in the alignment. For
example, L-2 haloacid dehalogenase from Pseudomonas sp. YL (PDB ID 1JUD)
indicates that chain A was used for aligning to the other proteins. Some of the protein
structures consist of one subunit; in this case the chain is indicated as A. For the
representative proteins, the chain A was used for L-2 haloacid dehalogenase from
Pseudomonas sp. YL (PDB ID 1JUD), L-2 haloacid dehalogenase from Xanthobacter
autrophicus (PDB ID 1AQ6), DehIVa from Burkholderia cepacia (PDB ID 2NO4),
BT4131 from Bacteroides thetaiotaomicron (PDB ID 1YMQ), Trehalose-6-phosphate
phosphatase from Thermoplasma acidophilum (PDB ID 1U02), sucrose-phosphatase
from Synechocystis sp. PCC 6803 (PDB ID 1TJ3), and Fcp1 from Schizosaccharomyces
pombe (PDB ID 3EF0) while chain B was used for Scp1 from Homo sapiens (PDB ID
2GHQ). The specific chain used was determined based on the quality of alignment of
proteins within each subgroup. Chain A was used for the SG proteins, except for the
HAD/COF-like hydrolase from Plasmodium vivax (PDB ID 2B30), for which chain C was used.
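The PDB ID plus chain notation used in these tables, and the residue labels of Table 7, are regular enough to parse mechanically. The sketch below is illustrative only and is not part of the thesis workflow; the function and field names are hypothetical. It also assumes, based on the Discussion (where g11 and g220 are described as not predicted by POOL), that a lowercase residue letter marks a residue that POOL did not predict.

```python
import re
from typing import NamedTuple


class ChainRef(NamedTuple):
    pdb_id: str  # four-character PDB ID, e.g. "1JUD"
    chain: str   # chain (subunit) identifier, e.g. "A"


class ResidueLabel(NamedTuple):
    amino_acid: str      # one-letter code, upper-cased
    number: int          # residue number in the chain
    pool_predicted: bool # ASSUMPTION: uppercase letter = predicted by POOL


def parse_chain_ref(label: str) -> ChainRef:
    """Parse a label such as '1JUD(A)' into PDB ID and chain."""
    m = re.fullmatch(r"([0-9][A-Za-z0-9]{3})\(([A-Za-z0-9])\)", label.strip())
    if m is None:
        raise ValueError(f"not a PDB-ID(chain) label: {label!r}")
    return ChainRef(m.group(1).upper(), m.group(2))


def parse_residue(label: str) -> ResidueLabel:
    """Parse a label such as 'D10' or 'y157' into its parts."""
    m = re.fullmatch(r"([A-Za-z])(\d+)", label.strip())
    if m is None:
        raise ValueError(f"not a residue label: {label!r}")
    letter = m.group(1)
    return ResidueLabel(letter.upper(), int(m.group(2)), letter.isupper())


print(parse_chain_ref("2B30(C)"))
print(parse_residue("y157"))
```

Keeping the letter case as a separate flag lets residue types be compared across proteins without losing the information carried by the lowercase entries.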
For the alignments shown in Table 5, the corresponding normalized SALSA
scores are shown in Table 6. The SALSA scores are normalized so that a perfect local
alignment of residues in the consensus signature positions has a score of 1. Note the first
eight rows and columns, with three blocks along the diagonal of Table 6, showing the
scores of the proteins of known function against each other. The diagonal blocks have
high, positive scores in the 0.793 – 1 range, indicating a good match of residues at the
local active site. The off-diagonal blocks in the first eight rows and columns have
negative scores, indicating a poor match at the local active site. These scores for the
proteins of known function against each other serve to guide the interpretation of the
scores for the SG proteins. The next 21 rows are the SG proteins. Five of these have high,
positive scores against one set of proteins of known function and either negative scores or
low, positive scores (~0.01) against the other two sets of proteins of known function. The
other 16 SG proteins do not show good matching scores against the three sets of proteins
of known function. Table 7 shows the alignment of the predicted residues for the five SG
proteins that were predicted to have the function of either L-2 haloacid dehalogenase or
sugar phosphatase.
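The interpretation rule described above (a high score against one set of proteins of known function, and negative or near-zero scores against the other two sets) can be written as a small decision function. This is a minimal sketch under stated assumptions: the cutoffs high_cutoff and low_cutoff are chosen for illustration, since no numerical thresholds are stated here, and the example scores are made-up placeholders, not values from Table 6.

```python
from statistics import mean


def assign_subgroup(scores_by_subgroup, high_cutoff=0.5, low_cutoff=0.05):
    """Return the one subgroup with a high mean score if every other
    subgroup has a low mean score; otherwise return None.

    Both cutoffs are illustrative assumptions, not values from the thesis.
    """
    means = {name: mean(vals) for name, vals in scores_by_subgroup.items()}
    high = [name for name, m in means.items() if m >= high_cutoff]
    if len(high) != 1:
        return None
    others_low = all(m <= low_cutoff for name, m in means.items() if name != high[0])
    return high[0] if others_low else None


# Hypothetical placeholder scores for two unnamed SG proteins.
sg_x = {
    "L-2 haloacid dehalogenase": [-0.20, -0.15, -0.18],
    "sugar phosphatase": [0.70, 0.60, 0.65],
    "C-terminal domain phosphatase": [-0.10, 0.01],
}
sg_y = {
    "L-2 haloacid dehalogenase": [0.05, 0.10, 0.02],
    "sugar phosphatase": [0.12, 0.08, 0.15],
    "C-terminal domain phosphatase": [-0.04, 0.09],
}

print(assign_subgroup(sg_x))
print(assign_subgroup(sg_y))
```

Returning None rather than the best-scoring subgroup mirrors the treatment of the SG proteins that do not show good matching scores against any of the three sets.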
3.1. L-2 haloacid dehalogenase
Three representative L-2 haloacid dehalogenases, from Pseudomonas sp. YL (PDB ID 1JUD),
Xanthobacter autotrophicus (PDB ID 1AQ6), and Burkholderia cepacia (PDB ID 2NO4), were
chosen for the study. The structural alignments of these proteins are shown in Figure 3.
L-2 haloacid dehalogenases catalyze the conversion of L-2 haloalkanoic acids to D-2
hydroxyalkanoic acids [29]. The core domain of these dehalogenases carries an inserted
cap domain consisting of a 4-helix bundle at the interface. Arginine at position 6
(Tables 4 and 5; R41 in 1JUD) of L-2 haloacid dehalogenases is situated away from the
substrate at the bottom of the cleft between the core and cap domains and is assumed to
assist in constructing the active site of the enzyme. Catalysis by L-2 haloacid
dehalogenase begins with the aspartate at position 1 (D10 in 1JUD), which makes a
nucleophilic attack on the substrate to form an ester intermediate [29]. Serine
(position 9; S118 in 1JUD) hydrogen-bonds to and orients the aspartate of position 1
(D10 in 1JUD) for substrate binding. Tyrosine at position 3
(Y12 in 1JUD) is predicted to abstract the halide ion from the substrate. Asparagine at
position 21 (N177 in 1JUD) and aspartate at position 23 (D180 in 1JUD) are believed to
either hydrogen-bond and activate a water molecule so that it can attack the carbonyl
carbon atom of the intermediate, or to situate this water molecule so that it can hydrogen-
bond with aspartate at position 1 (D10 in 1JUD). The position or activation of this water
molecule is mediated by aspartate at position 23 (D180 in 1JUD) via hydrogen bonding
between its oxygen atoms and lysine of position 15 (K151 in 1JUD) and tyrosine of
position 18 (Y157 in 1JUD). This water also binds to the substrate along with serine
(position 9, S118 in 1JUD) through hydrogen bonding. Another water molecule is
presumably activated by either lysine (position 15; K151 in 1JUD) or threonine (position
5; T14 in 1JUD) for the attack on the intermediate.
Figure 3: A) Structural alignment and B) functional residue alignment of L-2 haloacid
dehalogenases. Shown in yellow is Pseudomonas sp. YL L-2 haloacid dehalogenase
(PDB ID 1JUD), in cyan Xanthobacter autotrophicus (PDB ID 1AQ6), and in red Burkholderia
cepacia (PDB ID 2NO4). The alignment was made using YASARA software.
3.2. Sugar phosphatase
For the sugar phosphatase subgroup, the three proteins studied were BT4131 from
Bacteroides thetaiotaomicron, T6PP from Thermoplasma acidophilum, and SPP from
Synechocystis sp. PCC 6803, with PDB IDs 1YMQ, 1U02, and 1TJ3, respectively.
The structural alignments of sugar phosphatases are shown in Figure 4. These proteins
also have a cap domain, consisting of a 4-stranded beta-sheet with two helices [30].
Catalysis is carried out by nucleophilic attack of the aspartate at position 1 (Table 4;
D8 in 1YMQ) on the substrate's phosphorus atom to form a covalent intermediate. A water
molecule stabilizes the intermediate. Lysine at position 16 (K188 in 1YMQ) interacts with
the aspartate at position 3 (D10 in 1YMQ) to stabilize the charge on the aspartate.
Aspartate at position 3 (D10 in 1YMQ) acts as a general acid, donating its proton to one
of the oxygen atoms. Arginine at position 11 (R45 in 1YMQ) forms a salt bridge with the
aspartate at position 3 (D10 in 1YMQ). Threonine at position 9 (T43 in 1YMQ) interacts
with the phosphoryl oxygen of the substrate. Aspartates at positions 1, 3, and 20 (D8,
D10, and D211, respectively, in 1YMQ) coordinate the metal ion. The residues at positions
21 (G212 in 1YMQ) and 23 (D215 in 1YMQ) interact with a water molecule that coordinates
the metal ion and lysine (position 16; K188 in 1YMQ).
Figure 4: A) Structural alignment and B) functional residue alignment of sugar
phosphatases. Shown in green is Bacteroides thetaiotaomicron sugar phosphatase (PDB
ID 1YMQ), in magenta Thermoplasma acidophilum (PDB ID 1U02), and in blue
Synechocystis sp. PCC 6803 (PDB ID 1TJ3). The alignment was made using YASARA
software.
3.3. C-terminal domain phosphatase
Two representative proteins were chosen for the study of the C-terminal domain
phosphatase subgroup. They were Fcp1 from Schizosaccharomyces pombe and Scp1
from Homo sapiens with the PDB IDs of 3EF0 and 2GHQ, respectively. The structural
alignments of these proteins are shown in Figure 5. These proteins carry out hydrolysis
via the nucleophilic attack of the aspartate at position 1 (D170 in 3EF0) to transfer the
phosphoryl group [31]. Aspartate at position 19 (D297 in 3EF0) acts as a general base,
activating a water molecule for the reaction. Lysine in position 17 (K280 in 3EF0) forms
a salt bridge with the aspartate in position 19 (D297 in 3EF0). This aspartate (position 19)
and aspartate at position 3 (D172 in 3EF0) activate the water molecule for the hydrolysis.
Tyrosine at position 13 (Y249 in 3EF0) forms a hydrogen bond with the aspartate of
position 3 (D172 in 3EF0). Arginine at position 14 (R271 in 3EF0) hydrogen-bonds with
Serine2 and Threonine4 (2 and 4 are sequence numbers) and is found to be important for
the binding of the substrate and the C-terminal domain [31].
Figure 5: A) Structural alignment and B) functional residue alignment of C-terminal
domain phosphatases. Shown in yellow is Schizosaccharomyces pombe C-terminal
domain phosphatase (PDB ID 3EF0) and in green Homo sapiens (PDB ID 2GHQ). The
alignment was made using YASARA software.
IV. Discussion
The Consensus Signature for each of the three known functional types was
obtained from the POOL scores and from literature information about the individual
representative proteins, followed by alignment to the other representative proteins within
the same functional subgroup. All of the representative proteins were aligned to each
other for the overall spatial alignment as shown in Table 4. The alignments of the SG
proteins to the representative proteins of three subgroups indicated that one of the SG
proteins is likely to have the function of the L-2 haloacid dehalogenase and four are
likely to be sugar phosphatases. As shown in Table 7, the predicted active site residues in
chain A of RSc1362 from Ralstonia solanacearum (PDB ID 3UMB) align spatially well
with those of all of the Consensus Signature positions of L-2 haloacid dehalogenase. The
SALSA score of this SG protein to the three known L-2 haloacid dehalogenases is 0.74,
suggesting that the putative annotation of L-2 haloacid dehalogenase is likely to be
correct.
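The statement that 3UMB matches every Consensus Signature position of the L-2 haloacid dehalogenases can be checked directly from the rows of Table 7. The sketch below is an illustration, not the SALSA scoring method: it only counts columns in which the residue type is identical, ignoring letter case and residue numbers. It assumes that the entries of each row are listed in the same column order.

```python
# Rows transcribed from Table 7 (L-2 haloacid dehalogenase subgroup).
KNOWN = {
    "1JUD": "D10 Y12 G13 T14 R41 Y91 L117 S118 N119 G120 K151 y157 s175 N177 D180",
    "1AQ6": "D8 Y10 G11 T12 R39 Y89 l113 S114 N115 g116 K147 Y153 s171 N173 D176",
    "2NO4": "D11 Y13 G14 t15 R42 Y92 l118 s119 n120 g121 K152 Y158 s176 N178 D181",
}
SG_3UMB = "D10 Y12 G13 T14 R41 Y95 l121 S122 N123 g124 K155 Y161 s179 N181 D184"


def residue_types(row):
    """Return the upper-cased one-letter residue type of each entry in a row."""
    return [label[0].upper() for label in row.split()]


query = residue_types(SG_3UMB)

# Number of columns with an identical residue type, per known protein.
identical_columns = {
    pdb: sum(q == k for q, k in zip(query, residue_types(row)))
    for pdb, row in KNOWN.items()
}

for pdb, n in identical_columns.items():
    print(f"3UMB vs {pdb}: {n} of {len(query)} columns identical")
```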
Four SG proteins, HAD/COF-like hydrolase from Plasmodium vivax (PDB ID
2B30), hypothetical protein from Geobacillus kaustophilus (PDB ID 2PQ0), putative
phosphatase from Eubacterium rectale (PDB ID 3DAO), and haloacid dehalogenase-like
hydrolase from Bacteroides thetaiotaomicron (PDB ID 3NIW) were shown to align with
high SALSA scores in the 0.56 – 0.94 range to the sugar phosphatases. There is some
variability in the active sites among the sugar phosphatases, perhaps because of
differences in substrate preferences.
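The variability among the known sugar phosphatases, and its effect on matching the SG proteins, can be made explicit from the Table 7 rows. The following sketch is illustrative only: it collects the residue types seen in each column of the three known proteins and then flags SG residues whose type appears in none of them. It assumes the row entries share one column order; the column indices it reports are 0-based offsets within the row, not Consensus Signature position numbers.

```python
# Rows transcribed from Table 7 (sugar phosphatase subgroup).
KNOWN = {
    "1YMQ": "D8 D10 G11 T12 T43 G44 R45 K188 D211 G212 N214 D215",
    "1U02": "D7 D9 G10 T11 T45 G46 R47 K161 D179 D180 T182 D183",
    "1TJ3": "D9 D11 n12 t13 T41 G42 r43 K163 D186 S187 N189 D190",
}
SG = {
    "2B30": "D33 D35 G36 T37 T68 G69 R70 K225 D248 A249 N251 D252",
    "2PQ0": "D9 D11 G12 T13 T43 G44 R45 K184 D207 G208 N210 D211",
    "3DAO": "D8 D10 g11 T12 S43 G44 R45 K193 D216 N217 N219 D220",
    "3NIW": "D9 D11 G12 T13 S43 G44 R45 K196 D219 g220 N222 D223",
}


def residue_types(row):
    """Upper-cased one-letter residue type of each entry in a row."""
    return [label[0].upper() for label in row.split()]


# Set of residue types observed in each column of the known proteins.
allowed = [set(col) for col in zip(*(residue_types(r) for r in KNOWN.values()))]

# Columns where the known proteins do not all share one residue type.
variable_columns = [i for i, types in enumerate(allowed) if len(types) > 1]


def unmatched(row):
    """Labels whose residue type matches none of the known proteins."""
    return [lab for lab, ok in zip(row.split(), allowed) if lab[0].upper() not in ok]


print("variable columns (0-based):", variable_columns)
for pdb, row in SG.items():
    print(pdb, unmatched(row))
```

Matching against the set of observed types, rather than against a single reference protein, keeps a variable column from penalizing an SG protein that agrees with at least one known member.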
The residues of chain C of HAD/COF-like hydrolase from Plasmodium vivax
(PDB ID 2B30) spatially align to all the CS residues of the known sugar phosphatases
except for A249. Notice in column 21 that A249 is aligned with four different types of
residues. However, A249 does not match any of the corresponding residues for the
proteins of known function (G/D/S) at that spatial position. All of the residues of this SG
protein were predicted by POOL to be important for catalysis. The SALSA scores of
HAD/COF-like hydrolase are 0.70, 0.61, and 0.65 aligned to BT4131 from Bacteroides
thetaiotaomicron (PDB ID 1YMQ), trehalose-6-phosphate phosphatase from
Thermoplasma acidophilum (PDB ID 1U02), and sucrose-phosphatase from
Synechocystis sp. PCC 6803 (PDB ID 1TJ3), respectively. The average of these scores
was ~0.65.
Residues of chain A of hypothetical protein from Geobacillus kaustophilus (PDB
ID 2PQ0) aligned perfectly to CS of the sugar phosphatases, not counting the variable
position 21. G208 of 2PQ0 (position 21) aligned with G212 of 1YMQ(A), but not the
others in the same column. The SALSA scores of this SG protein are 0.94, 0.79, and 0.82
to those of BT4131 from Bacteroides thetaiotaomicron (PDB ID 1YMQ), trehalose-6-
phosphate phosphatase from Thermoplasma acidophilum (PDB ID 1U02), and sucrose-
phosphatase from Synechocystis sp. PCC 6803 (PDB ID 1TJ3), respectively. The average
score of 2PQ0 was ~0.85. This score is in the same range as those of the known sugar
phosphatases with each other.
Most of the aligned residues of chain A of the putative phosphatase from Eubacterium
rectale (PDB ID 3DAO) were predicted by POOL to be significant; the exception is the
residue at the spatially aligned position g11. S43 (position 9) and N217 (position 21) do
not match the CS residues of any of the known sugar phosphatases. The SALSA scoring
method gives scores of 0.64, 0.59, and 0.60 against BT4131 from Bacteroides
thetaiotaomicron (PDB ID 1YMQ), trehalose-6-phosphate phosphatase from Thermoplasma
acidophilum (PDB ID 1U02), and sucrose-phosphatase from Synechocystis sp. PCC 6803
(PDB ID 1TJ3), respectively, with an average of 0.61.
The CS residues of haloacid dehalogenase-like hydrolase Chain A from
Bacteroides thetaiotaomicron (PDB ID 3NIW) overlap with all the CS positions of the
sugar phosphatases except that of S43 (position 9); this position has a (chemically
similar) T for the known sugar phosphatases. Also, residue g220 at position 21 was not
predicted by the POOL method; note that this position is variable in the known sugar
phosphatases. The scores obtained from SALSA were 0.71, 0.56, and 0.58 for alignment to
BT4131 from Bacteroides thetaiotaomicron (PDB ID 1YMQ), trehalose-6-phosphate
phosphatase from Thermoplasma acidophilum (PDB ID 1U02), and sucrose-phosphatase
from Synechocystis sp. PCC 6803 (PDB ID 1TJ3), respectively; these average to 0.62.
Alignments of the other 15 SG proteins as well as their SALSA scores indicate that they
do not possess the function of L-2 haloacid dehalogenase, sugar phosphatase, or C-
terminal domain phosphatase.
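The average SALSA scores quoted in this section for the four predicted sugar phosphatases can be re-derived from the three individual scores reported for each protein (against 1YMQ, 1U02, and 1TJ3, in that order). The snippet below is a simple arithmetic check and uses only the values stated in the text; the variable names are illustrative.

```python
from statistics import mean

# SALSA scores against 1YMQ, 1U02, and 1TJ3, as reported in the Discussion.
scores = {
    "2B30": (0.70, 0.61, 0.65),
    "2PQ0": (0.94, 0.79, 0.82),
    "3DAO": (0.64, 0.59, 0.60),
    "3NIW": (0.71, 0.56, 0.58),
}

# Mean score per SG protein, rounded to two decimal places.
averages = {pdb: round(mean(vals), 2) for pdb, vals in scores.items()}

for pdb, avg in averages.items():
    print(f"{pdb}: {avg:.2f}")
```

The rounded means agree with the averages given in the text, and 2PQ0 has the highest of the four.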
V. Conclusions
Based on the alignment to the representative proteins and the SALSA scores,
RSc1362 from Ralstonia solanacearum is likely to function as L-2 haloacid dehalogenase
while HAD/COF-like hydrolase from Plasmodium vivax, hypothetical protein from
Geobacillus kaustophilus, putative phosphatase from Eubacterium rectale, and haloacid
dehalogenase-like hydrolase from Bacteroides thetaiotaomicron are likely to function as
sugar phosphatases. Docking studies could help to identify the likely native substrate for
each of the five proteins for which function could be assigned.
In some cases, although it is not possible at this time to predict the biochemical
function, it can be established that certain pairs of SG proteins have functions similar to
each other. For example, the putative nagD protein from Thermotoga maritima with PDB
ID 1PW5 and the hypothetical protein from Enterococcus faecalis with PDB ID 1YV9
have a SALSA similarity score of 0.54, suggesting that these two proteins have similar
function. High scores are observed for many pairs of SG proteins (Table 6). The highest
SALSA score between two SG proteins is 1. Both of these SG proteins are from
Pseudomonas fluorescens Pf-5. The two structures were reported by the same Structural
Genomics group and have different protein names and different PDB IDs, 2YBD and
3R09, but they are in fact the same protein with 100% sequence identity.
The results of this study require experimental confirmation, both of the predicted
functional importance of residues and of the predicted functions of the SG proteins. In
this study, three subgroups of the HAD superfamily were analyzed and five SG proteins
were predicted to have the function of either L-2 haloacid dehalogenase or sugar
phosphatase. None of the studied SG proteins was found likely to possess the function of
C-terminal domain phosphatase. Because HAD is a large superfamily, many of its
proteins, of both known and unknown function, remain to be studied. Some of the SG
proteins, if functionally classified, may have many potential applications in the
bioremediation of the soil and the ground water in the United States.
References
[1] Olaniran, A., Pillay, D., and Pillay, B. (2004) Haloalkane and haloacid dehalogenases
from aerobic bacterial isolates indigenous to contaminated sites in Africa demonstrate
diverse substrate specificities, Chemosphere 55, 27-33.
[2] Russell, H. H., Matthews, J. E., and Guy, W. S. (1992) TCE removal from
contaminated soil and groundwater, EPA Environmental Engineering Sourcebook.
[3] Doherty, R. E. (2000) A History of the Production and Use of Carbon Tetrachloride,
Tetrachloroethylene, Trichloroethylene and 1, 1, 1-Trichloroethane in the United States:
Part 1--Historical Background; Carbon Tetrachloride and Tetrachloroethylene,
Environmental Forensics 1, 69-81.
[4] Mcnab, W. W., Ruiz, R., and Reinhard, M. (2000) In-situ destruction of chlorinated
hydrocarbons in groundwater using catalytic reductive dehalogenation in a reactive well:
Testing and operational experiences, Environmental Science and Technology 34, 149-153.
[5] Burroughs, A. M., Allen, K. N., Dunaway-Mariano, D., and Aravind, L. (2006)
Evolutionary genomics of the HAD superfamily: understanding the structural adaptations
and catalytic diversity in a superfamily of phosphoesterases and allied enzymes, Journal
of Molecular Biology 361, 1003-1034.
[6] Lahiri, S. D., Zhang, G., Dunaway-Mariano, D., and Allen, K. N. (2006)
Diversification of function in the haloacid dehalogenase enzyme superfamily: The role of
the cap domain in hydrolytic phosphorus-carbon bond cleavage, Bioorganic Chemistry 34,
394-409.
[7] Peisach, E., Selengut, J. D., Dunaway-Mariano, D., and Allen, K. N. (2004) X-ray
crystal structure of the hypothetical phosphotyrosine phosphatase MDP-1 of the haloacid
dehalogenase superfamily, Biochemistry 43, 12770-12779.
[8] Parsons, J. F., Lim, K., Tempczyk, A., Krajewski, W., Eisenstein, E., and Herzberg,
O. (2002) From structure to function: YrbI from Haemophilus influenzae (HI1679) is a
phosphatase, Proteins: Structure, Function, and Bioinformatics 46, 393-404.
[9] Krieger, E., and Vriend, G. (2002) Models@ Home: distributed computing in
bioinformatics using a screensaver based approach, Bioinformatics 18, 315-318.
[10] Ridder, I., and Dijkstra, B. (1999) Identification of the Mg2+-binding site in the P-
type ATPase and phosphatase members of the HAD (haloacid dehalogenase) superfamily
by structural similarity to the response regulator protein CheY, Biochemical Journal
339, 223-226.
[11] Baker, D., and Sali, A. (2001) Protein structure prediction and structural genomics,
Science 294, 93-96.
[12] Lopez, G., Rojas, A., Tress, M., and Valencia, A. (2007) Assessment of predictions
submitted for the CASP7 function prediction category, Proteins: Structure, Function, and
Bioinformatics 69, 165-174.
[13] Tong, W., Wei, Y., Murga, L. F., Ondrechen, M. J., and Williams, R. J. (2009)
Partial order optimum likelihood (POOL): maximum likelihood prediction of protein
active site residues using 3D Structure and sequence properties, PLoS Computational
Biology 5, e1000266.
[14] Ondrechen, M. J., Clifton, J. G., and Ringe, D. (2001) THEMATICS: a simple
computational predictor of enzyme function from structure, Proceedings of the National
Academy of Sciences 98, 12473-12478.
[15] Capra, J. A., Laskowski, R. A., Thornton, J. M., Singh, M., and Funkhouser, T. A.
(2009) Predicting protein ligand binding sites by combining evolutionary sequence
conservation and 3D structure, PLoS Computational Biology 5, e1000585.
[16] Sankararaman, S., and Sjölander, K. (2008) INTREPID—INformation-theoretic
TREe traversal for Protein functional site IDentification, Bioinformatics 24, 2445-2452.
[17] Madura, J. D., Briggs, J. M., Wade, R. C., Davis, M. E., Luty, B. A., Ilin, A.,
Antosiewicz, J., Gilson, M. K., Bagheri, B., and Scott, L. R. (1995) Electrostatics and
diffusion of molecules in solution: simulations with the University of Houston Brownian
Dynamics program, Computer Physics Communications 91, 57-95.
[18] Gilson, M. K. (1993) Multiple‐site titration and molecular modeling: Two rapid
methods for computing energies and forces for ionizable groups in proteins, Proteins:
Structure, Function, and Bioinformatics 15, 266-282.
[19] Majtey, A., Lamberti, P., and Prato, D. (2005) Jensen-Shannon divergence as a
measure of distinguishability between mixed quantum states, Physical Review A 72,
052310.
[20] Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H.,
Shindyalov, I. N., and Bourne, P. E. (2000) The protein data bank, Nucleic Acids
Research 28, 235-242.
[21] Wang, Z., Yin, P., Lee, J. S., Parasuram, R., Somarowthu, S., and Ondrechen, M. J.
(2013) Protein function annotation with Structurally Aligned Local Sites of Activity
(SALSAs), BMC Bioinformatics 14, S13.
[22] de Beer, T. A., Berka, K., Thornton, J. M., and Laskowski, R. A. (2014) PDBsum
additions, Nucleic Acids Research 42, D292-D296.
[23] Krissinel, E., and Henrick, K. (2004) Secondary-structure matching (SSM), a new
tool for fast protein structure alignment in three dimensions, Acta Crystallographica
Section D: Biological Crystallography 60, 2256-2268.
[24] Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M.,
Meng, E. C., and Ferrin, T. E. (2004) UCSF Chimera—a visualization system for
exploratory research and analysis, Journal of Computational Chemistry 25, 1605-1612.
[25] Holm, L., and Rosenström, P. (2010) Dali server: conservation mapping in 3D,
Nucleic acids research 38, W545-W549.
[26] Eddy, S. R. (2004) Where did the BLOSUM62 alignment score matrix come from?,
Nature Biotechnology 22, 1035-1036.
[27] Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. (1995) SCOP: a
structural classification of proteins database for the investigation of sequences and
structures, Journal of Molecular Biology 247, 536-540.
[28] Akiva, E., Brown, S., Almonacid, D. E., Barber, A. E., Custer, A. F., Hicks, M. A.,
Huang, C. C., Lauck, F., Mashiyama, S. T., and Meng, E. C. (2013) The Structure–Function Linkage Database, Nucleic Acids Research, gkt1130.
[29] Hisano, T., Hata, Y., Fujii, T., Liu, J.-Q., Kurihara, T., Esaki, N., and Soda, K.
(1996) Crystal structure of L-2-haloacid dehalogenase from Pseudomonas sp. YL: an
α/β hydrolase structure that is different from the α/β hydrolase fold, Journal of Biological Chemistry 271, 20322-20330.
[30] Rao, K. N., Kumaran, D., Seetharaman, J., Bonanno, J. B., Burley, S. K., and
Swaminathan, S. (2006) Crystal structure of trehalose‐6‐phosphate phosphatase–related
protein: Biochemical and biological implications, Protein Science 15, 1735-1744.
[31] Zhang, Y., Kim, Y., Genoud, N., Gao, J., Kelly, J. W., Pfaff, S. L., Gill, G. N.,
Dixon, J. E., and Noel, J. P. (2006) Determinants for dephosphorylation of the RNA
polymerase II C-terminal domain by Scp1, Molecular Cell 24, 759-770.