30
G-domain prediction across the diversity of G protein families Hiral Sanghavi IIT Gandhinagar Richa Rashmi IIT Gandhinagar Anirban Dasgupta IIT Gandhinagar Sharmistha Majumdar ( [email protected] ) IIT Gandhinagar https://orcid.org/0000-0003-4356-7237 Article Keywords: G protein family, G domain, Spacers and Mismatch Algorithm Posted Date: December 2nd, 2021 DOI: https://doi.org/10.21203/rs.3.rs-1102638/v1 License: This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License

G-domain prediction across the diversity of G protein families

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: G-domain prediction across the diversity of G protein families

G-domain prediction across the diversity of Gprotein familiesHiral Sanghavi 

IIT GandhinagarRicha Rashmi 

IIT GandhinagarAnirban Dasgupta 

IIT GandhinagarSharmistha Majumdar  ( [email protected] )

IIT Gandhinagar https://orcid.org/0000-0003-4356-7237

Article

Keywords: G protein family, G domain, Spacers and Mismatch Algorithm

Posted Date: December 2nd, 2021

DOI: https://doi.org/10.21203/rs.3.rs-1102638/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.  Read Full License

Page 2: G-domain prediction across the diversity of G protein families

G-domain prediction across the diversity of G protein families

Hiral M. Sanghavi1, Richa Rashmi1, Anirban Dasgupta2 and Sharmistha Majumdar1*

1 Discipline of Biological Engineering, Indian Institute of Technology Gandhinagar, Gujarat, 382355, India

2 Discipline of Computer Science & Engineering, Indian Institute of Technology Gandhinagar, Gujarat,

382355, India

* Correspondence should be addressed to Sharmistha Majumdar. Tel: +91-8758549280; Email:

[email protected]

Keywords: G proteins, G domain, G box prediction

ABSTRACT

Guanine nucleotide binding proteins are characterized by a structurally and mechanistically conserved

GTP-binding domain (G domain), indispensable for binding GTP. The G domain comprises five adjacent

consensus motifs called G boxes, which are separated by amino acid spacers of different lengths. Several

G proteins, discovered over time, are characterized by diverse function and sequence. This sequence

diversity is also observed in the G box motifs (specifically the G5 box) as well as the inter-G box spacer

length. The Spacers and Mismatch Algorithm (SMA) introduced in this study can predict G-domains in a

given protein sequence, based on user-specified constraints for approximate G-box patterns and inter-box

gaps in each G protein family. The SMA parameters can be customized as more G proteins are discovered

and characterized structurally. Family-specific G box motifs including the less characterized G5 box were

predicted with higher accuracy. Overall, our analysis suggests the possible classification of G protein

families based on family-specific G box sequences and lengths of inter-G box spacers.

SMA can be implemented via a web-based server at https://labs.iitgn.ac.in/datascience/gboxes/

INTRODUCTION

GTP binding proteins (G proteins) or GTPases, are known to play important roles in different cellular

processes such as translation, cell signaling, nuclear and cytoplasmic protein transport, DNA repair and

cytoskeleton stabilization (1, 2). GTPases are part of a larger family of NTPases (Nucleotide

triphosphatase), which are proteins that bind and hydrolyze nucleotide triphosphates. NTPases are

classified according to the presence of distinct structural folds; for example, the Rossman fold and the

related FtsZ /tubulin fold, the P loop fold, the protein kinase fold, the HSP90/TopoII fold and the

HSP70/RNaseH fold (3). Structurally the P loop is formed by an underlying consensus GXXXXGKS/T

signature motif which folds into a pocket (3).

Most P loop GTPases, except for recently identified proteins such as NACHT and McrB, are classified into

two classes, namely TRAFAC and SIMIBI, based on distinct sequence and structural signatures. The

TRAFAC class includes translation factors, the Ras superfamily, Dynamin superfamily, heterotrimeric G

proteins and Septins. The SIMIBI class includes Signal Recognition particle (SRP) GTPases, MinD and

BioD related proteins. MinD and BioD related proteins lack specificity to bind GTP (3).

G proteins have a conserved G domain spanning 160-180 amino acid residues and with a molecular weight

of 18-20 kDa. The G domain can bind both GTP and GDP; this allows the G protein to switch between a

Page 3: G-domain prediction across the diversity of G protein families

GTP-bound active state and a GDP-bound inactive state (when GTP is hydrolyzed to GDP). The G domain

structurally folds into five α helices, six β strands and five loops interconnecting the helices and strands (1).

The five loops within the G domain are involved in ligand (GTP/GDP) binding (1, 4) unlike other proteins

where ligand binding sites may be alpha helical (5) or made of beta sheets (6). Each conserved loop, known

as a G box, has a consensus amino acid sequence (7): (i) G1 box (P-loop/Walker A motif): “GXXXXGKS/T”. (ii) G2 box (switch 1/ effector domain) has a conserved Thr residue (iii) G3 box (Walker B motif/Switch II):

“DXXG” (iv) G4 box: “NKXD” (v) G5 box: “EXSAX/ SAX”. The loops forming G1 and G3 boxes are not

structurally disordered as the G1 box is rigidly fixed in the middle of the G domain and the D of G3 box is

usually a part of the preceding beta strand.

The crystal structures of G proteins with bound GTP/GDP/GTP analogues illustrate the importance of the

most conserved amino acid residues in each G box of the G domain. There is a strongly conserved Lys in

the G1 box that is crucial for nucleotide binding; the side chain of this residue directly interacts with the

oxygens of the γ-phosphate of the bound nucleotide. The conserved Ser of G1 box, Thr of the G2 box and

Asp of the G3 box, do not directly contact the nucleotide but coordinate Mg2+, which bridges the β and γ phosphates of GTP. The Gly in the consensus G3 sequence DXXG is more variable than Asp. In the G4

box, the Asn contacts the C5 of purine base and the Asp forms a bifurcated hydrogen bond with the nitrogen

atoms of the guanine base, thus conferring specificity to the guanidinium base (1). Although the G5 box is

not well-characterized, structural evidence suggests that the Ala of the G5 box interacts with O6 of guanine

base (1, 8) (Suppl. Fig. 1).

The G boxes are separated from each other by amino acids (spacers) that are not involved in GTP binding.

The spacer between G1 and G3 boxes is 40-80 residues long in most small G proteins (Ras superfamily)

and 130-170 residues long in most large G proteins (Translation factor family), while the spacers between

G3 and G4 boxes is between 40-80 residues long in most G proteins (7). It is difficult to predict a consensus

spacer length between G1 and G2 since the G2 box consists of only one amino acid (Thr). The consensus

spacer between G4 and G5 is unknown due to less available structural information about the G5 box.

Since the discovery of G proteins over fifty years ago, many new GTP binding proteins have been identified.

These range from extensively studied proteins such as Ras, Rab, Ran, Rac, Roc, Arf, translational factors

and G𝞪 subunit of heterotrimeric G proteins to more recently identified members like EngB, Septins, FeoB,

AIG-1, OBG. These G proteins have very distinct cellular functions. For example, Ras is mainly involved in

regulation of cell proliferation, differentiation, morphology and apoptosis. Ran plays an important role in

nuclear-cytoplasmic transport, Rho/Rac is implicated in regulation of actin cytoskeletal organization and

cell polarity while Rab proteins are involved in intracellular vesicular cargo transport. Arf proteins play an

important role in vesicular protein trafficking via coat proteins and lipid modifying enzymes (9).

Heterotrimeric G proteins are involved in signal transduction from cell surface G protein coupled receptors

(GPCRs) to intracellular effectors(10). FeoB proteins, present only in archaeal and bacterial kingdoms, are

found to transport ferrous ions in a potassium dependent manner and have a GDI (GDP dissociation

inhibitor) domain in addition to an amino-terminal G domain (11). Septins are of fundamental importance in

cytokinesis, vesicular trafficking, phagocytosis and dendrite formation (12). AIG-1 is involved in T cell

development and survival in animals and response to biotic and abiotic stress in plants(13). Dynamins are

implicated in budding of clathrin coated vesicles from the plasma membrane, mitochondrial division and

interferon induced GTPase activity (14) while TrmE proteins are involved in tRNA wobble and uridine

modification (15). OBG proteins are postulated to play important roles in many cellular processes like

ribosome biogenesis and maturation, chromosome partitioning, DNA replication control (16). IRG (Immunity

Related GTPases) are involved in immunity against pathogens in vertebrates (17). The signal recognition

particle (SRP) along with its receptor constitutes essential cellular machinery that couples the synthesis of

nascent proteins to their proper membrane localization (18). The Era family of proteins are speculated to

Page 4: G-domain prediction across the diversity of G protein families

be involved in translocation based on their association with 16s rRNA (19) and EngB proteins are involved

in coordination of cell cycle events and assembly of 50S ribosome units in bacteria (20).The

EngA/Der/YfgK/YphC class of proteins, which are yet to be structurally and functionally characterized, have

the unique feature of two tandemly occurring GTPase domains separated by an acidic linker (20).

Moreover, some G proteins such as Septins, Dynamin, TrmE, IRG, AIG1/Toc are also known as GADs

(GTPases activated by Dimerization) since their GTPase cycle is regulated by nucleotide binding and

subsequent dimerization (21) while HflX class of proteins bind both ATP and GTP (22).

As more and more G proteins with diverse functions are identified and their corresponding structures are

elucidated, it is important to investigate if all of these GTP-binding proteins, in fact, have a canonical G-

domain with conserved G boxes and spacers, as described above. It is to be noted that there are many G

proteins in which the conserved residues in each G box are substituted by other amino acids which do not

abrogate nucleotide binding. For example, HypB (23) has Glu in place of Asp in the G3 box, YchF (24) has

Glu in place of Asp in the G4 box, Septin family of proteins (25) have Ala or Gly in place of Asn in the G4

box and FeoB (26) and Ysxc, in which the conserved Ala in the G5 box, is substituted by Thr and Ser

respectively (27, 28).

Due to the diversity in G box sequences and inter-G box spacer length, it is difficult to predict G boxes

and/or important GTP-binding residues if structural information is not available. This task becomes more

challenging in larger G proteins such as Translation factors, Septin, AIG1, HypB, which have multiple

domains in addition to the G domain. We first explored if freely available tools, that can find motifs in

sequences, could be used to predict G-boxes in diverse G proteins. These tools include ELM (Eukaryotic

Linear Motif) (29), GTP binder (30), ScanProsite (31), Pfam (32), SCANMOT (33), MEME/GLAM2 (Gapped

Local Alignment of Motifs) (34) etc. ELM looks for short linear motifs (4- 10 amino acids) in eukaryotic

protein sequences; hence, given a G protein, ELM can possibly look for one G-Box at a time but not the

entire G-domain. Pfam identifies domains (in a submitted input protein sequence) that matches an already

characterized protein domain and hence is not a good option for less studied proteins with very little domain

information. ScanProsite is a more flexible method, which can look for motifs in a protein sequence as well

as variations of the same; however, it is unable to look for different ranges of inter-motif gaps or spacers.

GLAM2 uses gapless Gibbs sampling algorithm to re-discover and refine motifs from various databases

like PROSITE, ELM using sophisticated alignment options, but does not allow for an input of inter-motif

spacers. SCANMOT utilizes pattern matching algorithms to identify multiple motifs in a given protein

sequence. It also computes the inter motif spacer lengths and uses these computed values as a constraint

when scanning sequence databases for identifying multiple such motifs. However, none of these methods

are specific for G proteins and hence do not consider the sequence diversity that exists across G protein

families. GTPbinder is specific to G proteins and predicts GTP interacting dipeptides and tripeptides in a

given protein sequence, using a machine learning based method; it uses PSSM which can consider

possible variations in the sequence but cannot look for a G- Box (which is mostly longer) or the whole G-

domain.

In this study, we describe a method (“Spacers and Mismatch Algorithm” or SMA) that identifies all possible

G domains in a given G protein sequence, after taking into account the G protein family-specific diversity.

An exhaustive analysis of individual members of twenty-three G protein families was used to compile a set

of approximate patterns for the G-boxes as well as a set of constraints for the inter-motif gaps. All this

information was then used by SMA to make its predictions. The novelty in our approach is that we use both

approximate pattern matching as well as inter-motif separation constraints to predict the locations of the G-

boxes. Our algorithm can be further customized to accommodate new information about mismatches (any

amino acid) at any position of the G box as well as variations in the inter G box spacer length.

Page 5: G-domain prediction across the diversity of G protein families

G box sequences predicted by SMA, were validated in corresponding G protein structures by observing if

the predicted G boxes and G-domain to which they belonged, were indeed involved in guanine nucleotide

binding. SMA was able to predict G protein family-specific G box consensus sequences for all G-boxes

including the less characterized G5 Box. An accurate prediction of G5 box is specifically important since it

is the last of five motifs in a G domain and hence can help estimate the carboxy-terminal boundary of the

domain.

Thus, SMA analysis demonstrated that G protein families could possibly be classified based on family-

specific G box sequences and inter-G box spacer lengths. Moreover, this method can be used to better

predict the boundaries of a G domain, within a large multi-domain protein.

SMA can be implemented via a user-friendly webserver which is available at

https://labs.iitgn.ac.in/datascience/gboxes/ . TSMA can also help in classifying any protein into one of the

G-Protein classes by looking for all possible G-Domains in its primary sequence.

RESULTS

Significant reduction in the size of protein families after removal of similarity bias

Twenty-three different G protein families (with a total of 17156 proteins across all families) were selected

based on PROSITE documentation and Pfam entry for Galpha family (Suppl. File 1). Members of the same

G protein family share very high sequence similarity amongst each other. To reduce redundancy in the

results and over-representation of any sequence, the similarity bias (see Methods) from each G protein

family was removed. As seen in Suppl Table 1, a significant reduction in the number of member proteins

was observed in Arf, EngA, EngB, Era, Galpha, OBG, Rac, Ran, Rab, Ras, SRP, translational and TrmE

families. After removal of similarity bias, a total of 14686 proteins across all families were used for further

analysis.

Individual G boxes of the G domain cannot always be identified by Multiple Sequence Alignment

It is difficult to identify G domains by Multiple Sequence Alignment (MSA) of G proteins. This is because

most G proteins, except the Ras superfamily members, are multi-domain proteins (one of these domains

being the conserved G domain); this leads to poor alignment of the G domain across G proteins of different

families. We attempted to unsuccessfully identify G domains for each of the twenty-three G protein families

by constructing separate phylogenetic trees using MSA of full-length protein sequences (Suppl. File 2). This

was not surprising since the phylogenetic tree for the Ras superfamily (includes Ras, Rac, Rab, Ran and

Arf families), which comprise small G proteins with a single G domain, also demonstrate a distinct

separation between individual families (9,40).

G boxes could only be predicted upon careful visual examination of the alignment results i.e manually

correlating the occurrence of conserved amino acids and spacers (For example, adjacent Gly and Lys

residues occurring four amino acids downstream of a Gly at the amino-terminal end of Ras proteins can be

predicted as the G1 box; a Gly residue occurring two residues downstream of an Asp residue, which again

is separated from the predicted upstream G1 box by some spacer residues, is G3 box; and so on). The

tedious manual method described above could predict G1, G3, G4 and G5 boxes only in Ras, Rab and Ran

G protein families; G1, G3 and G4 boxes for Rho, Roc, Arf, Dynamin, EngA, EngB, FeoB, HflX, IRG, OBG,

Septin, SRP and TrmE families; only G1 and G3 boxes in GB1/RHD3 and AIG1 families and no G box in

the Translational family.

Page 6: G-domain prediction across the diversity of G protein families

Thus, we were encountering the well-known problem of identifying short motifs in proteins and subsequently

identifying remote protein homologs on the basis of these conserved motifs. Earlier attempts at solving this

problem using a pattern matching algorithm (SCANMOT, 33) and Gibbs sampling algorithm (GLAM2, 34)

have made significant improvements in the field. However, none of these methods are specific for G

proteins and hence do not take into account the sequence diversity that exists across G protein families.The

difficulty in predicting consensus G boxes and their corresponding G domains in both the Ras family, with

similar protein sequences, to the Translational protein family, with dissimilar protein sequences, maybe due

to deletions, insertions or substitutions in otherwise similar protein sequences of the same G protein family.

This could lead to a wide diversity of G box sequences as well as lengths of intervening spacers in the G

domains of individual G protein families (35). Thus, although all G proteins are traditionally defined by a

conserved G domain, there may be family-specific signatures for G box sequence and spacer length.

Curating available information about the G protein families.

To further probe our hypothesis that the diversity of G domains may have family-specificity, we embarked

on an exhaustive analysis of all known G protein family members. After removal of similarity bias from the

sequences of the twenty-three selected G-Protein families (Suppl. Table 1), G box consensus sequence

and family specific spacers between consecutive G boxes were manually curated from the available

literature and PDB database as described below:

Inter-G box Spacing: Structural information (obtained using X- ray crystallography or solution NMR) of G

proteins bound to GDP, GTP or GTP analogues can help directly determine their GTP binding sites and

corresponding G domain. Most available structures for each of the twenty-three G protein families were

studied (column 4, Suppl. Table 2). The G domain of each of these proteins was mapped to the amino acid

residues in the respective protein. Briefly, individual G box sequences were manually assigned to the amino

acids that made contacts with either the phosphate (G1 and G3) or the guanine ring (G4 and G5) of

GDP/GTP. The spacers between the consecutive G boxes was calculated by subtracting the amino acid

position at the beginning of a specific G box from the amino acid position at the beginning of the succeeding

G box (for example, if G1 box of a G protein starts at position 15 and the G3 box of the same protein starts

at position 81, the spacer between them is calculated as 66). If there was structural information for more

than one protein in a family, a putative range was decided based on the minimum and maximum values of

calculated spacers for all such proteins within the same family (Suppl. Table 3). Some flexibility was allowed

in these putative ranges. For instance, in well-studied families like Ras family which have shorter proteins,

the putative spacer length was defined as +1 residues longer than the calculated spacer values, whereas

for the translational or dynamin family which have longer proteins, the spacer length was defined as +2

residues longer than the calculated spacer boundaries. In less-studied protein families such as AIG1, EngA

which have short proteins, we have defined the spacer as +5 residues than the calculated spacer

boundaries whereas for less studied protein families with long proteins like Septin and OBG, the spacer

value was defined as +7 than the calculated spacer boundaries (Suppl. Table 3).

In some families like dynamin, EngB, HflX, OBG and translational, different proteins within the same family

had significantly different inter G box spacers (Suppl. Table 4). Thus, for such families, more than one

spacer constraint was included (Suppl. Table 3) to account for the observed intra-family variability.

In this way, putative spacers were defined for the approximate distances between G1-G3, G3-G4 and G4-

G5 boxes (Suppl. Table 3) for each G protein family. Thus, we identified more precise spacers between

consecutive G boxes in each G protein family as compared to the broad range of 40-80 amino acids

between G1 box and G3 box and in between G3 box and G4 box suggested previously (7). Also, it was

Page 7: G-domain prediction across the diversity of G protein families

noted that the G4-G5 spacer length varied considerably and could be as short as 20 residues and as long

as 120 residues.

Mismatch in G Box consensus sequence: Most G proteins (with structural information) have highly

conserved G box sequences and do not differ from the consensus (i.e G1- GXXXXGK, G3-DXXG, G4-

NKXD, G5-SAX; where “X” =any amino acid) by more than one mismatch.

For example, “GFPSVGKS” (in Rbg1) and “AHIDAGK” (in S. aureus EF-G; 2XEX) are G1 box motifs with

one mismatch from the consensus motif, GXXXXGK. Also, “DIWE”, the G3 motif for human Rad; 2DPX and mouse REM2; 3Q85, has one mismatch from the consensus DXXG.

Most experimentally identified G4 boxes had 1 mismatch from the consensus sequence, NKXD [e.g.,

“TKLD” in human DNM1L (3W6N), “TQID” in human Cdc42 (1A4R), “TLRD” in human GBP1 (1DG3); “NVNE” in H. influenza YchF (1JAL);“G/AKXD” in human SEPT2 (2QAG), “T/CKXD” in Rac proteins, “TKXD” in EngB, Roc, SRP, Dynamin family], except the G4 boxes in the Roc, AIG-1, GB1/RHD3 and OBG

families which usually had two mismatches [e.g., human LRRK2 “THLD” (2ZEJ), human GIMAP7 “RKEE” (3ZJC), human GBP1 “TLRD” (1DG3), human OLA1 “NLSE” (2OHF)].

It was observed that there were many variations to the G5 box consensus sequence “SAX”. This ranged from one mismatch in “SSE” in B. subtilis YsxC, “QAH”in Legionella pneumophila YsxC, “SSX” in EngB family, “SG/QX” in many translational and dynamin family proteins, “SCX/ CAX” in the Arf family, SGL” (4ZKD), “TAL” (2H5E), “SSK” (4U5X), “SCA” (1E0S), “SSA” (1MOZ), “NAT” (2ZEJ), “SQL” (3W6N), “STR” (3HYR) to “GVG” (2OG2), “GEG” (1OKK), “GVVRNSQ” (1JWY/1JX2) in the dynamin family and“GXG/K” in SRP family. Thus, most identified G5 boxes, except GVG, GEG and GVVNRSQ, appeared to have one

mismatch from the “SAX” consensus.

Therefore, based on our observed variations in G-box sequences, we defined that up to one mismatch was

allowed in the consensus patterns for G1box (GXXXXGK), G3 box (DXXG), G4 (NKXD) and G5 box (SAX)

for each of the G protein families. The only exception was the allowance of two mismatches in the G4 boxes

of the Roc, AIG-1, GB1/RHD3 and OBG families.

A doughnut plot (Fig. 1) and Suppl. Table 3 represents the manually curated information for all members

across all G protein families. In Fig. 1, each concentric circle represents consensus sequences of different

G boxes, The longer the arc, higher is the occurrence of that consensus sequence across most G protein

families; for example, NKXD is the most prevalent (occurs in 55% of curated G proteins) G4 box (green

circle) while SAX is the most prevalent (occurs in 40% of curated G proteins) G5 box (blue circle). Fig. 1

also shows that in up to 40% of curated G proteins, the G5 box sequence was unknown.

Prediction of GTP-binding domains by Spacers and Mismatch Algorithm (SMA)

An algorithm, named “Spacers and Mismatch Algorithm” (SMA), was written (Fig. 2) to predict G domains

for the member proteins of the twenty-three different G protein families, using the manually curated G

protein family-specific constraints for G domains (Suppl. Table 3).

SMA Step1a predicted G domains in more than 75% of the member proteins in all the twenty-three G

protein families [except for Galpha (50.4%), SAR (10%), IRG (40%) and SRP (42.4%)] (Suppl. Table 5,

columns 2 & 3). The rest of the proteins in each of the twenty-three G protein families probably did not give

output due to differences in spacer length [e.g., in P24498 (Ras family), the spacer between G1 and G3 is

13 residues longer than the putative spacing (Suppl. Table 3)]. However, for many experimentally studied

Page 8: G-domain prediction across the diversity of G protein families

G proteins, multiple possible G domains were predicted for a single protein due to various possible

combinations of predicted G boxes (G1, G3, G4 and G5).

Fig. 3 shows the individual G boxes, predicted using SMA, for two proteins each from Ras, Era and

translational G protein families. These families were chosen to represent the findings because the G

domains are very similar amongst homologs in the Ras protein family, (e.g., NRAS in humans and mouse),

are very diverse in the translational family (e.g., EF2 in humans and mouse) while the Era family proteins

have not been studied extensively. One protein from each family [Q62636; 3X1W (Ras), Q5SM23; 1WF3

(Era) and Q71V39; 4C0S (translational)]] has an experimentally identified G domain (PDB structures

available) while the other protein [P55043(Ras), Q8NNB9 (Era) and A4IMD7 (translational)] has no

available structural information (Fig. 3). In the SMA generated output for proteins with solved structures,

one of the three outputs for Q62636 and one of the four outputs for Q5SM23 matched exactly with the

experimentally identified G boxes. On the other hand, for Q71V39, the entire G domain (i.e., all 5 G boxes)

was not predicted correctly in any of the 11 SMA outputs although 2 of the outputs were able to predict the

G3, G4 and G5 boxes correctly. Overall, it appears that SMA predicts multiple G domains either due to

many predicted G5 boxes [e.g., Ras (Q62636); Era (Q8NNB9); Translational (A4IMD7) or many predicted

G1-G5 boxes [e.g., translational (Q71V39), Ras (P55043), Era (Q5SM23)].

It was thus noted that, as expected, most variability was observed in the prediction of the G5 box by SMA.

SMA predicts significantly different G5 box motifs for some G protein families

“SAX” has been accepted as the G5 box consensus sequence (1, 3) since most experimentally discovered G5 boxes, including the exceptions to “SAX” across all G protein families, have been found to have either Ser or Ala interacting with the guanine ring of the GTP molecule. However, structural studies have identified

many variations to the G5 box motifs across G protein families. For example, “SGL” (4ZKD), “TAL” (2H5E), “SSK” (4U5X), “GVG” (2OG2), “GEG” (1OKK), “SCA” (1E0S), “SSA” (1MOZ), “NAT” (2ZEJ), “SQL” (3W6N), “STR” (3HYR), “GVVNRSQ” (1JWY).

It is to be noted that Step 1a of SMA, which searches for a G domain in a protein sequence using the G5

box constraint of “SAX” with one mismatch within the specified G4-G5 spacer length, essentially boils down

to searching for only one amino acid i.e either S or A. This is because one mismatch will replace either S

or A in one independent search. The variability observed in G5 box prediction by SMA (Fig. 3), may be

attributed to this obvious non-specificity in the search parameters. Thus, we decided to search for a more

specific and probable family-specific consensus sequence for the G5 box.

To increase the accuracy of G5 box prediction, the search space was increased by increasing the length

of the G5 box search string from 3 to 7 residues. SMA Step 1b (See Methods for details), searched for all

possible 7-mers after the predicted G4 box, that had either S or A in the centre and lay within the family-

specified G4-G5 spacer range. Steps 2 and 3 of SMA were then performed to predict a more precise G5

box consensus sequence per G protein family (Fig. 2).

Interestingly, the results indicated that there was no common 7-mer sequence that occurred in most

members of all G protein families. Even closely related families such as Ras, Ran, Roc and Arf showed

different 7-mer sequences. On the other hand, each G protein family (except translational, TrmE, SAR and

SRP) had one or more 7-mers that occurred in >25% of the proteins of an individual G protein family. This

indicated that there was probably no consensus G5 motif that was conserved across all G protein families,

unlike the consensus motifs for G1 (GXXXXGK), G3 (DXXG) and G4 (NKXD) boxes.

Page 9: G-domain prediction across the diversity of G protein families

An arbitrary threshold of 25% was chosen to ensure that the predicted 7-mer covered at least a quarter of

the members in a given G protein family. The motif which had the highest coverage was selected as the

new consensus G5 box sequence for the given G protein family. Also, if there were Xs at the two ends of

the predicted 7-mer, they were trimmed to give a more concise G5 box consensus sequence. Thus, the

predicted family-specific G5 box sequences had different lengths. For example, EXSAX in the Ras family,

SXXSXXS in the Roc family and ADXP in the OBG family (Suppl. Table 6).

In order to validate if SMA predicted G5 box motifs accurately, the output after SMA Step4 was matched

with G5 box motif sequences observed in corresponding available G protein structures. As seen in Suppl.

Table 7, SMA predicted correct G5 box motifs for all G protein families which had structural evidence of

“SAX” as the G5 box. (Ras, Ran, Rab, Arf, EngA, EngB, Era, HflX, FeoB, Miro). However, for G protein families which did not have “SAX” as the G5 box motif (OBG, Dynamin, GB1/RHD3, IRG, Rho/Rac, SAR,

SRP, Translational) the SMA-predicted G5 box sequence did not match the G5 box sequences observed

in available structures.

Comparison of G domains generated using “SAX” as consensus G5 box sequence versus using SMA-predicted G5 box sequence

(a) Prediction of fewer false positive G5 boxes

SMA Step 4 (Fig. 2b) was performed to search for all possible G-domains in each G protein sequence,

using the family-specific constraints of consensus sequence, spacers and allowed mismatches listed in

Suppl. Table 3 along with the additional constraint of using the new G5 box sequences predicted in Step 3.

The number of predicted G5 boxes per G protein was considerably reduced (Fig. 4, Suppl. Fig. 2) upon

using the SMA-predicted G5 box sequence (from SMA Step 3; referred to as “SMA3” from now on), as compared to the output using “SAX” as G5 box sequence (from SMA Step 1a; referred to as “SMA1a” from now on). The Ras (members have similar G domains), translational (members have diverse G domains)

and Era (less structural information) protein families were chosen to represent the findings. Fig. 4 illustrates

that for individual G proteins (vertical line) from all three G protein families, SMA3 (Fig. 4b) consistently

reduced the number of predicted G5 boxes (red) and while the number of predicted G1 boxes (blue)

remained similar.

(b) Prediction of fewer unique G boxes per G protein family

We then investigated if SMA Step 4 had any effect on the prediction of all G boxes along with the prediction

of fewer G5 boxes per G protein family. Upon using SMA3-predicted G5 box sequence (as compared to

the output from SMA-1a), the predicted number of unique G boxes (G1, G3, G4 and G5 box) were much

less i.e., <=50% in AIG-1, GB1/RHD3, Septin, IRG, Roc, EngB, FeoB, HflX, OBG, Rac, Arf and SRP families

while no reduction was seen in the Ran family (Suppl. Table 8, Suppl. Fig. 3). Fig. 5 illustrates the prediction

of fewer unique G boxes in Ras, translational and Era G protein families.

Fewer unique G boxes are predicted either due to less proteins generating output upon using the SMA3-

predicted G5 box sequence (in Arf, EngA, EngB, alpha, OBG, Rac and SAR families; Suppl. Table 5,

columns 4 and 5) or the prediction of a different G5 box sequence than the conventional “SAX”(in Galpha, GB1/RHD3, IRG, Roc families) or the experimentally reported “GXG” (in SRP family, Suppl Table 7).

(c) Spaces between consecutive G boxes before and after SMA analysis

Page 10: G-domain prediction across the diversity of G protein families

Spacers between consecutive G boxes were calculated in the G domains (predicted after SMA Step 3) of

proteins in each G protein family. These spacers were compared to the spacers obtained after SMA Step

1a, in a representative histogram plot (Fig. 6) for (A) Ras, (B) Era and (C) translational G protein families.

The percentage of proteins in a G protein family, having indicated spacer value (bars); mean of inter-box

spacing (dashed line), before (blue), and after (yellow) SMA-Step 3, varies between families as well as

between specific spacers. These parameters are either identical in Ras (G1-G3, G3-G4) and Era (G1-G3)

families (Fig. 6A, 6B), quite similar in Era (G3-G4) and translational (G1-G3, G3-G4 and G4-G5) (Fig.s 6B

and 6C) or very different in Ras (G4-G5) and Era (G4-G5).

The percentage of proteins having similar spacers before and after SMA-Step 3 were found to be

similar in all the G protein families except for Hflx, FeoB and SRP (for G1-G3 spacer), FeoB, Miro and SRP

(for G3-G4 spacer) and AIG1, Arf, EngB, Era, Galpha, IRG, HflX, Miro, OBG, Rab, Rac, Ran, Ras, SAR,

Septin, SRP, TrmE (for G4-G5 spacer) (Suppl. Table 9).

Thus, it appears that in many G protein families, the percentage of proteins having similar G4-G5 spacers

as well as the mean of the G4-G5 predicted spacings, differs significantly before and after SMA3. However,

the G1-G3 as well as the G3-G4 spacers do not differ in most families. This implies that there is a significant

difference between the SMA3-predicted G5 motif and “SAX”, which was the accepted G5 consensus.

Prediction of G protein family specific G1, G3 and G4 box consensus sequences:

SMA Step 3 predicted G protein family-specific G5 box consensus sequences that were previously not

reported (Suppl. Table 6). It was intriguing to then ask if there were also G protein family-specific G1, G3

and G4 box motifs beyond the consensus “GXXXXGK”, “DXXG”, “NKXD sequences respectively. In other words, can the consensus motifs be modified and more specifically, are there any family-specific patterns,

which could possibly give better representation of the “X” residues (where X=any of the twenty amino acids).

G1, G3 and G4 box sequences were predicted for all the proteins in each G protein family after SMA Step

4 and visualized as sequence logos. More distinct family-specific G box consensus motifs were predicted

in most G protein families except AIG-1, GB1/RHD3, OBG, Roc and translational families. G protein family

specific G box consensus motifs of all the proteins in Ras, Era and translational protein families are shown

in Fig. 7. For example, the first “X” in G3 box is found to be “T” in all three protein families whereas, the second “X” in G3 box can either be “P” or “A” (Fig. 7). Several interesting family-specific signatures were

observed. For instance, the predicted G3 consensus was “DXXS/D” in the AIG1 protein family but

“D/WXXG” in Arf while the predicted G4 consensus was “TKXD”in Dynamin, SRP, Rac and EngB G protein families and “AKXD” in Septin (correlates with corresponding PDB structures). Some families had very specific predicted G box consensus motifs. For example, G3 box (DLPG) in FeoB, G1 (GESGAGK) and G4

(TKVD) boxes in IRG and all four G boxes (GDGGTGK, DTAG, NKVD and SAKSN) in Ran. (Suppl. Fig. 4)

The diversity of predicted G boxes in the translational family is consistent with the documented sequence

diversity of its member proteins; similar G box diversity is also seen in Roc, OBG, AIG-1 and GB1 families

(Suppl Table 1, Suppl. Fig. 4). On the other hand, the similarity of predicted G boxes in the less studied Era

family members may imply greater homology in their G domains since Ras family members, which have

high homology, also appear to have very similar predicted G boxes.

It is interesting to note that the amino acids at the amino and carboxy-terminal ends of each G box were

conserved across all families; these conserved residues were in fact, parts of the previously accepted G

Page 11: G-domain prediction across the diversity of G protein families

box consensus motifs which had defined amino acid signatures instead of “X”. [ i.e., the G and GK

“GXXXXGK”(G1), D and G in“DXXG” (G3) and NK and D“NKXD (G4)].

To test the specificity of the SMA output, it was validated by using proteins, other than canonical GTP

binding proteins, as input (Suppl. File 3). For example, no output (after SMA Step 1a) was generated when

hemoglobin chain A or hemoglobin chain B were used as query sequences. Moreover, ATP binding proteins

(e.g., RBBA protein in E. coli) which have a P loop (G1 box) but no other G box, do not generate any output.

In proteins like human OLA1 which bind to both ATP and GTP, the generated output has one combination

of predicted G boxes (G domain) that binds GTP.

To maximize the number of predicted G domains per G protein, the family-specific G box consensus

sequence and inter G box spacer constraints used to run SMA were kept broad. To test the significance of

the SMA results and to ensure that the broad constraints did not end up predicting G domains in any string,

a negative control experiment was performed. Suppl. Fig. 5 illustrates that after shuffling the protein

sequences, the number of G-Domains predicted per sequence (red) becomes significantly less. This further

validates the specificity of SMA prediction.

The predicted G box sequences does not depend on the size of the G protein family

Fig. 7 demonstrates that for certain G boxes in some G protein families, there is significant diversity in

individual positions within each box (e.g., translational family). It can be argued that this observed diversity

may be due to the varying sizes of the different G protein families i.e., a protein family with more members

would show more variation in G box motifs merely because of larger representation than a protein family

with fewer members. To probe whether the diversity of predicted G box sequences was a function of the

size (number of member proteins) of a G protein family, the entropy of individual G boxes was calculated

for each family. The calculated entropy values gave insights about how random the “X” position in each of the G boxes (i.e., GXXXXGK” for G1, “DXXG” for G3 and “NKXD” for G4) is and how randomness of individual “X” positions impact the total entropy of the G-Box.

Fig. 8 investigates the correlation between the total entropy of each G box and the size of each G protein

family. The Translational family is the largest G protein family with 2870 member proteins; however, the

entropy calculated for each of its G boxes is not significantly high. On the other hand, small G protein

families like Roc, with only 20 members, demonstrate highest entropy for G1, G3 and G5 boxes while the

GB1/RHD3 family, with 50 members, has the highest entropy for G4 box. The IRG (11 members), FeoB

(19 members) and Ran (30 members) families have the lowest entropy for G1, G4 and G3, G5 boxes

respectively despite having similar sizes to that of Roc and IRG families (Fig. 8). Thus, it appears that the

observed variation in G box motifs does not correlate with the size of the protein family but maybe because

of other properties of individual proteins.

Position of predicted G5 box sequence on the loops of available and predicted structures provide

confidence to the sequence based G5 box analysis

G boxes are the most conserved regions of G domains (3). The conserved G boxes of G proteins occur in

loop regions (8), which are less ordered than other secondary structural elements of any protein structure

(36).This is unlike transcription factors, in which most conserved sequences fall on alpha helices or the

lipocalin protein family which have conserved beta sheets (6). We decided to test whether the sequence-

based prediction of G protein family-specific G box sequences mapped on to loop-like structures of the

corresponding protein.

Page 12: G-domain prediction across the diversity of G protein families

Predicted G box motifs (using the newly identified G5 box sequence) were mapped onto available protein

structures downloaded from the PDB database [Fig.9a; Q62636; 3X1W (Ras), Q5SM23; 1WF3 (Era) and

Q71V39; 4C0S (translational)] as well as ITASSER predicted protein structures [Fig.9b; P55043 (Ras),

Q8NNB9 (Era) and A4IMD7 (translational)] from Ras, Era and translational G protein families. Fig.9

demonstrates that the predicted G box sequences mapped to the loop regions in both the available (Fig.9a)

and predicted (Fig.9b) protein structures. This structural alignment provides confidence to the sequence-

based G box prediction. Additionally, the fold of the G domain is found to be very similar between the two

proteins of the same G protein family [Q62636 and P55043 (Ras), Q5SM23 and Q8NNB9 (Era) and

Q71V39 and A4IMD7 (translational)].

DISCUSSION

The GTP binding domains of all G proteins have been characterized by a set of consensus G box

sequences (G1-G5) and inter-box spacers (3, 7). However, with the recent identification of many G protein

families, it is observed that there is a vast diversity in the sequence of G proteins. This diversity is reflected

in the sequences of the G-boxes and more specifically in inter-G box spacer length and the G5 box

consensus sequence. Here, we present a tool called SMA, which can identify G protein family-specific G

box sequences and inter-G box spacer length, based on the constraints within individual families.

SMA is a one-of-a-kind algorithm (Fig. 2a, b, c), which can perform an approximate search for multiple

sequences occurring with a given range of gaps between them. In other words, unlike tools like GTP binder

(30), ELM (29), ScanProsite (31), Pfam (32), SCANMOT (33), MEME/GLAM2 (34). SMA offers the

advantage of using multiple proteins in the input file and searches for all G boxes simultaneously instead

of one after the other. It uses consensus patterns with allowed variations as well as maximum and minimum

gaps allowed between each pattern. Moreover, all the constraints in the algorithm are customizable and

the user has the flexibility to change the G box sequence including the number of allowed mismatches as

well as the inter G box spacer length. This is especially useful, if the protein of interest has very low amino

acid sequence similarity with well-studied GTP binding proteins and/or is less studied i.e., little or no

information is available from structural or biochemical studies. It is to be noted that SMA is a generalized

algorithm, which is not restricted to finding amino acid motifs in protein sequences. The algorithm can be

appropriately modified to identify patterns in any sequence including those of biological importance like

DNA or RNA.

An exhaustive analysis of twenty-three different G protein families, demonstrated that G box sequences

and inter-G box gaps are often class-specific. Thus, it is proposed that G protein families could possibly be

classified based on family-specific G box sequences and inter-G box spacer lengths. Moreover, SMA was

able to come up with better representations of the family-specific G box motifs. For instance, it was possible

to predict if the “X”s in the established G1 box motif (GXXXXGKS/T) could be better represented by a specific set of residues rather than any of the twenty residues (Fig. 7). SMA, was also able to predict G5

boxes for the different G protein families. This was significant since the G5 box marks the carboxy-terminal

boundary of a G domain and hence demonstrates that SMA can be used to predict more precise boundaries

of a G domain, within a large multi-domain protein.

Analysis of the sequence divergence between different G protein families may provide mechanistic insights

into the differential basis of GTP binding and hydrolysis in different G proteins like Ras and Dynamin. For

instance, Dynamin proteins have low nucleotide affinity and low GTP hydrolysis rates with a GTPase cycle

that is regulated by nucleotide binding and subsequent dimerization (GAD) as compared to Ras proteins

which have low intrinsic nucleotide exchange rate, high nucleotide affinity and are regulated by GEF, GAP

and GDI proteins (21). It is tempting to speculate if these differences can be attributed to the difference in

Page 13: G-domain prediction across the diversity of G protein families

inter G box spacers (Suppl. Table 3), G4 motifs (Ras; “NKXD” in 121P, 1X1R and Dynamin; “TKXD” in 1DYN, 1JWY) and G5 motifs (Ras; “SAK” in 121P,1X1R and Dynamin; “SQK” in 1DYN, “GVVNRSQ” in 1JWY) between these two families. Interestingly, fusion Dynamins like Mfn1/2 lack the consensus G5 box

while the consensus G5 box is “SQX” in fission Dynamins like Drp1 and human Dynamin-1.

Some G proteins can bind and/or hydrolyse more than one nucleotide apart from GTP. This may be

explained by the presence of additional domains (e.g., the Hflx proteins, which have a G domain as well as

an ATP binding ND1 domain (22). Alternately the nucleotide specificity may be altered by changes in the

G4 box motif (“NKXD”). For example, within the OBG family, the G4 motif is “NLXD” in human YchF/OLA1, which functions as an ATPase rather than a GTPase (37) and “NhXE” (where h is a hydrophobic residue) in Oryza Sativa YchF, which binds both ATP and GTP with similar binding specificities (38).

The EngA/Der/YfgK/YphC family is characterized by two tandemly occurring GTPase domains separated

by an acidic linker (20). SMA’s ability to independently predict two distinct domains along with

corresponding G boxes in individual EngA members provides confidence to the method. Moreover, the

accuracy of SMA prediction was further validated by the fact that no G domain was predicted in the center

of the EngA proteins.

It is also interesting to note that some G proteins have different insertions within the G domain For example,

eIF-2𝛄 has a zinc ribbon insertion between the G2 and G3 boxes (3) while AIG-1 proteins have a

hydrophobic motif insertion between G3 and G4 boxes (13). SMA predicts structurally identified G boxes

and the corresponding G domain for some of the AIG-1 and eIF2 proteins. SMA predictions for these

families may improve as more structural information becomes available and the SMA parameters can be

changed accordingly.

Detailed analysis of SMA results can also be used to explore interesting evolutionary relationships amongst

homologs of the same protein. For example, Rho1 protein in E. histolytica, B. vulgaris, P. sativum does not

give any output with specified G box sequences and inter G box spacers. This is because, in E. histolytica

Rho1, the putative G5 box (EASSV) is only 28 amino acids away from the putative G4 box (LKVD) which

differs from the defined G4-G5 spacer range of 40-50 amino acids, that is present in most Rho1 homologs

(Suppl. Table 3). Rho1 from B. vulgaris and P. sativum, lacks a putative G3 (NKXD) box and the putative

G5 box (ECSSK) is 37 amino acids away from the putative G4 box (TKLD) which differs from the defined

G4-G5 spacer range (40-50 amino acids). On the other hand, SMA predicts identical G1 (GDGACGK), G3

(DTAG) and G5 (ECSAK) boxes in Rho1 homologs from D. melanogaster, A. gossypii, C. elegans, S.

pombe, K. lactis, S. cerevisiae and C. albicans but different G4 box sequence (NKXD in D. melanogaster

and C. elegans and CKXD in A. gossypii, S. pombe, K. lactis, S. cerevisiae and C. albicans). These

observations could be further investigated to understand the nucleotide binding properties of the different

Rho1 homologs.

It was also noted that for certain G proteins, the inter G box spacer length was conserved within homologs

present in the same kingdom of life but differed between kingdoms. For example, the IF2/eIF5B proteins of

the translational family, which is functionally conserved in all domains of life, has respective G1-G3, G3-G4

and G4-G5 spacer lengths of 46, 54 and 34/ 36 in bacteria (e.g., IF2 structures from B. stearothermophilus

and T.thermophilus), 67/72/77, 56 and 33/35 in Archaea (IF2 structures from P. furiosus, P. abyssi and S.

solfatricus) and 64/72, 54 and 66/68 in Eukarya (e.g., eIF5B from S. cerevisiae, M. thermautotrophicus).

Similar paralog-specific spacers are also observed in eukaryotic RFs (eRF3, EF1a2, ET-Tu) as well as the

OBG family (OBG, DRG, NOG1, Ygr210 compared to YchF/YyaF). These interesting observations about

diversity of inter G box spacers within the translational family, agrees with reported evidence of the

divergence of translational factor subfamilies (39). This also suggests that certain families like the

Page 14: G-domain prediction across the diversity of G protein families

translational factors may be further classified into distinct subfamilies based on sequence divergence and

possibly different cellular functions.

Our current observations are, however, biased by the PROSITE classification of G protein families, our

manual curation of family-specific G box sequences and our imposed constraints for allowed mismatches

in individual G box sequences and inter G box spacer length. Thus, any exceptions to the above constraints

will be missed and may need to be changed, as more structural information is obtained about members of

different G protein families. Moreover, it is to be noted that SMA had an inherent bias in G5 box prediction

since it only looked for patterns (with allowed mismatches) which had “SAX” in the centre. Also, the G5

boxes that are identified using a 25% threshold, miss under-represented putative G5 box sequences. This

is specifically important for less-studied families such as AIG-1, Era, GB1/RHD3, IRG and Roc, in which

there is very little available structural information for individual members. On the other hand, for families

like Ras, for which there is a wealth of structures, our imposed constraints may stand the test of time.

MATERIALS AND METHODS

Selection of G protein families for further analysis

There are multiple databases like PROSITE, Pfam that can be used for the study. PROSITE, which is a

collection of protein domains and families along with the structural or functional patterns associated with

them, was chosen for the study because it had curated information for some G protein families (EngA,

EngB, Era,OBG, Translational, TrmE) which were not available on Pfam. “Guanine nucleotide-binding” was used as the search string to find PROSITE [Pubmed id 23161676] entries corresponding to G proteins. 21

documentation entries were identified to have “Guanine nucleotide binding” profiles. 4 out of these 21 entries include proteins which interact with GTP binding proteins but do not bind to GTP themselves

[PDOC00210 G-protein coupled receptors family 1, PDOC01002 G-protein gamma subunit, PDOC00574

Trp-Asp (WD-40) repeats and PDOC51717 Very large inducible GTPase (VLIG)]. PDOC00017 ATP/GTP-

binding site motif A (P-loop) includes proteins containing only the P loop and not the entire G domain.

PDOC00962 Hydrogenases expression/synthesis hypA family does not have any GTP binding proteins.

Also, PDOC51721 Circularly permuted (CP)-type proteins are excluded as this family consists of proteins

that contain G boxes in a different order from the canonical G1-G2-G3-G4-G5 pattern, e.g., DAR GTPase

3 in Arabidopsis thaliana has G boxes in the order G4-G1-G2-G3 (40).

The Uniprot/Swiss-prot true positive sequences for each of the remaining 14 entries were selected for

further analyses. True positive sequences on PROSITE are the sequences that are curated with a strict

cut-off value in order to exclude any entries without a pattern (in our case, the five G boxes). These include

the following guanine nucleotide-binding (G) domain signatures and profiles: PDOC51720 AIG1,

PDOC51714 Bms1, PDOC00362 Dynamin, PDOC51712 EngA, PDOC51706 EngB, PDOC51713 Era,

PDOC51711 FeoB, PDOC51715 GB1/RHD3, PDOC51705 HflX, PDOC51716 IRG, PDOC51710 OBG,

PDOC51719 Septin, PDOC00273 Translational (tr), PDOC51709 TrmE. In addition to the above entries,

the following entries, which were not represented in the results of the search string “Guanine nucleotide binding”, were included: PDOC51417 (small GTPase family which includes all the canonical small G proteins such as Arf, Miro, Ran, Rab, Ras, Rho/Rac, Roc and SAR1) and PDOC00272 (SRP54 family, a

SIMIBI class protein). Bms1 family was excluded from this study due to the unavailability of sufficient

structural information of any of its protein members. Also, since PROSITE does not have a separate

documentation entry for alpha subunits of heterotrimeric G proteins, the sequences available from Pfam

were used for analysis.

Page 15: G-domain prediction across the diversity of G protein families

An excel file (with uniprot id, protein name, length of the protein, protein sequence and PDB structure

information) was retrieved for true positive sequences of all the 20 PROSITE documentation entries from

PROSITE (Galpha from Pfam) and used as the input file for our analysis (Suppl..File 1)

Removal of similarity bias from each PROSITE documentation entry:

Levenshtein distance (edit distance) (https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf) was used to

compare all the member protein sequences of the same G protein family amongst each other. Only one

sequence from all the sequences that showed >=90% similarity was retained to reduce redundancy and

remove the similarity bias from further analysis. (https://github.com/RichaRashmi-projects/G-Protein-

Project.git )

Multiple sequence alignment of proteins and phylogenetic tree generation for all G protein families:

After removal of similarity bias, FASTA files (containing the non-redundant protein members of each of the

G protein families) were used to perform multiple sequence alignments followed by phylogenetic analysis.

These were performed using ClustalW and maximum likelihood method respectively, provided in the MEGA

X software package (41).

Spacers and Mismatch algorithm (SMA)

The Spacers and Mismatch Algorithm (SMA), uses a data-based approach to predict G domains for the

GTP-binding proteins (Fig. 2b, c). It takes special consideration to define a more precise G5 box and

consequently a better carboxy-terminal boundary of a G domain. Following are the steps of SMA:

Step1-G- boxes prediction

A python script (https://github.com/RichaRashmi-projects/G-Protein-Project.git ) was written, to search for all possible G-domains (along with putative G boxes) in each G protein sequence, using the family-specific constraints of consensus sequence, spacers and allowed mismatches listed in Suppl. Table 3. Each protein sequence was used as a string. Four different G boxes (sub-strings; 1.”GXXXXGK”, 2. “DXXG”, 3.”NKXD” and 4.”SAX”) were searched in the string, one at a time. After the identification of the first conserved amino acid of a G box, a sliding window was used to find the complete G box sequence. After all four G boxes (G1-G3-G4-G5) were predicted, the algorithm checked to see if the defined spacings between the consecutive G boxes was maintained. This led to the prediction of the entire G domain with predicted G box sequences that had defined inter-box spacings. The sequences and start positions of each predicted G box were saved along with their respective protein IDs. The algorithm implemented in the script is shown in Fig.2b.

Step 1 was implemented twice. (a), SAX with one allowed mismatch was used as the G5 box consensus sequence (Fig. 2b, Fig. 2c Step 1a) (b) Next, to predict a better and possibly longer G5 consensus sequence centred around SAX, we modified the search in 3 ways-(i) “XXXSAXX” with no mismatch (ii) “XXXSAXX” with one mismatch and (iii) “XXXAXXX” with no mismatch , where X can be any of the 20 amino acids (Fig. 2c Step 1b). Then G-domain identification for each of the 23 G-protein families was performed again using these modified G5 box sequences.

Step 2-G5 box Extraction

Only the G5 results obtained in Step 1b above were used for further analysis. These results, which did not include any other information about the rest of the G domain, contained the id of the protein from which the sequence came, the actual sequence and its position in the respective protein. All the results from Step 1b (options i,ii,iii) were merged separately for each G-Protein family. Redundant results, which had the same G5 box sequence and position in a particular protein, were removed (Fig. 2c)

Page 16: G-domain prediction across the diversity of G protein families

Step 3 - Prediction of new G5 box consensus sequence

To find the underlying pattern in the merged G5 box sequences, the amino acids in the modified G5 box 7-mer sequences were replaced with 1, 2, 3 and 4 Xs at all the possible positions (Fig. 2c). Since, the modified G5 box consensus sequence was only 7 amino acids long, a maximum of 4 amino acids were replaced with X, so as not to lose the underlying pattern completely.

The “coverage” for each predicted consensus G5 motif (with Xs) was calculated as the percentage of proteins from that particular G protein family that had the given motif. Position weighted matrices (PWM) were also calculated for the G5 motifs using ggseqlogo package available on GitHub (43). The sequence logos of predicted G boxes were visualized using the ggseqlogo package, a visualization tool which uses polygons to draw elongated letters. The height of each letter was computed based on their relative frequencies at the specified position.The G5 motifs which had the highest coverage and were prominent in the PWM were selected as the new consensus G5 box sequence for the given G protein family. Xs at the two ends of 7-mer consensus motifs were trimmed to give more concise G5 box consensus sequence (Suppl. Table 6).

Step 4- G boxes prediction with new G5 box sequences

Step 1 of SMA was repeated to search for all possible G-domains (along with putative G boxes) in each G protein sequence, using the family-specific constraints of consensus sequence, spacers and allowed mismatches listed in Suppl.Table 3 along with the additional constraint of using the new G5 box sequences predicted in Step 3

The examples of each step of the workflow of SMA can be found in the Supplementary information.

Significance testing of predicted G-domains

To test the significance of the SMA results, a negative control dataset was created by shuffling (fifty times)

the protein sequences for each G protein family. The number of G-Domains predicted per protein (by SMA)

was counted after each shuffle. After 50 shuffles, the total number of predicted G domains per sequence

was averaged and compared with the total number of predicted G domains (after running SMA-3) in the

actual protein sequence.

Generation of protein models and visualization of predicted G boxes

Structures of one member each from Ras (P55043), Era (Q8NNB9) and translational (A4IMD7) G protein

families were predicted using ITASSER (42). Briefly, the amino acid sequences of these proteins were

submitted to the ITASSER server with no specified template. The ITASSER server identified suitable

templates depending on sequence similarity searches from the PDB database. Monte Carlo simulations

were used to assemble the full- length conformations of identified templates and models were generated

for the sequence of interest. All the conformations were confirmed and cluster centroids were identified,

which were then used to build the final models after refinement of cluster centroids.

In addition, one protein each from Ras (Q62636), Era (Q5SM23) and translational (Q71V39) G protein

families were extracted from PDB.

The predicted G boxes for all six of the mentioned proteins were visualized as loops using PyMol (43, 44).

Entropy Calculations:

Page 17: G-domain prediction across the diversity of G protein families

To investigate if the classes with more number of proteins have more variation in the individual G boxes,

entropy of each G box was calculated using the entropy function in scipy (stats.entropy). This was

performed by adding the entropy of each position of the predicted G box sequence. Scatter plots were

generated where the X axis represents the total number of proteins (size) per G protein family and the Y

axis represents total entropy of each G box.

Web-Application: SMA can be implemented via a user-friendly web-application which is available at

https://labs.iitgn.ac.in/datascience/gboxes/ .The web application uses the Python and Django web

framework. The sequence of a G-Protein along with user-specified G-Box sequences and inter G-Box gaps

are given as input. It outputs the G-Domains with predicted G5 box sequences, before and after SMA-Step

3. The portal can also help classify any protein into one of the G-Protein families by looking for all possible

G-Domains in its primary sequence.

AVAILABILITY: The individual Python scripts used for the analysis of SMA are available at

https://github.com/RichaRashmi-projects/G-Protein-Project.git

ACKNOWLEDGEMENTS: We acknowledge Dr. Umashankar Singh and Dr. Sairam S. Mallajosyula for providing valuable suggestions about the study. Special acknowledgements to Aniket Agarwal, NIT Arunachal Pradesh and Raviraj Sukhadiya for helping set up the web portal. Thanks to Dr. Murali Krishna Enduri for helping HS write, understand, and troubleshoot Fig. 3C, Dr. Althaf Shaik and Ms. Deepshikha Ghosh for their timely help with PyMol, Mr. Divyesh Patel for suggestions about trying MEGA for MSA and phylogenetic tree analysis. FUNDING: This work was supported by IIT Gandhinagar (computers and network access required for collection, analysis and interpretation of data), SERB (Science and Engineering Research Board, Government of India) grant ECR/2016/000479 and DBT Ramalingaswami Fellowship BT/RLF/Re-entry/43/2013 and BT/PR16074/BID/7/569/2016, awarded to SM.

References 1. Wittinghofer A, Vetter IR. 2011. Structure-Function Relationships of the G Domain, a Canonical Switch Motif. Annual Review of Biochemistry 80:943–971. 2. Bourne HR, Sanders DA, McCormick F. 1991. The GTPase superfamily: conserved structure and

molecular mechanism. Nature 349:117–127. 3. Leipe DD, Wolf YI, Koonin EV, Aravind L. 2002. Classification and evolution of P-loop GTPases and

related ATPases. Edited by J. Thornton. Journal of Molecular Biology 317:41–72. 4. Paduch M, Jelen F, Otlewski J. 2001. Structure of small G proteins and their regulators. Acta

biochimica Polonica 48:829–850. 5. Jones S. 2004. An overview of the basic helix-loop-helix proteins. Genome Biology 5:226. 6. Flower DR, North ACT, Sansom CE. 2000. The lipocalin protein family: structural and sequence

overview. Biochimica et Biophysica Acta (BBA) - Protein Structure and Molecular Enzymology 1482:9–24.

7. Dever TE, Glynias MJ, Merrick WC. 1987. GTP-binding domain: three consensus sequence elements with distinct spacing. Proceedings of the National Academy of Sciences of the United States of America 84:1814–1818.

8. Vetter IR. 2014. The Structure of the G Domain of the Ras Superfamily BT - Ras Superfamily Small G Proteins: Biology and Mechanisms 1: General Features, Signaling, p. 25–50. In Wittinghofer, A (ed.), . Springer Vienna, Vienna.

9. Rojas AM, Fuentes G, Rausell A, Valencia A. 2012. The Ras protein superfamily: Evolutionary tree and role of conserved amino acids. The Journal of Cell Biology 196:189–201.

10. Duc NM, Kim HR, Chung KY. 2017. Recent Progress in Understanding the Conformational Mechanism of Heterotrimeric G Protein Activation. Biomolecules & therapeutics 25:4–11.

11. Sestok AE, Linkous RO, Smith AT. 2018. Toward a mechanistic understanding of Feo-mediated ferrous iron uptake. Metallomics 10:887–898.

Page 18: G-domain prediction across the diversity of G protein families

12. Abbey M, Gaestel M, Menon MB. 2019. Septins: Active GTPases or just GTP-binding proteins? Cytoskeleton 76:55–62.

13. Wang Z, Li X. 2009. IAN/GIMAPs are conserved and novel regulators in vertebrates and angiosperm plants. Plant signaling & behavior 4:165–167.

14. Ramachandran R, Schmid SL. 2018. The dynamin superfamily. Current Biology 28:R411–R416. 15. Fislage M, Wauters L, Versées W. 2016. Invited review: MnmE, a GTPase that drives a complex

tRNA modification reaction. Biopolymers 105:568–579. 16. Kint C, Verstraeten N, Hofkens J, Fauvart M, Michiels J. 2014. Bacterial Obg proteins: GTPases at

the nexus of protein and DNA synthesis. Critical Reviews in Microbiology 40:207–224. 17. Howard J. 2008. The IRG proteins: A function in search of a mechanism. Immunobiology 213:367–

375. 18. Akopian D, Shen K, Zhang X, Shan S. 2013. Signal Recognition Particle: An Essential Protein-

Targeting Machine. Annual Review of Biochemistry 82:693–721. 19. Tu C, Zhou X, Tarasov SG, Tropea JE, Austin BP, Waugh DS, Court DL, Ji X. 2011. The Era

GTPase recognizes the GAUCACCUCC sequence and binds helix 45 near the 3′ end of 16S rRNA. Proceedings of the National Academy of Sciences 108:10156 LP – 10161.

20. Verstraeten N, Fauvart M, Versées W, Michiels J. 2011. The Universally Conserved Prokaryotic GTPases. Microbiology and Molecular Biology Reviews 75:507 LP – 542.

21. Gasper R, Meyer S, Gotthardt K, Sirajuddin M, Wittinghofer A. 2009. It takes two to tango: regulation of G proteins by dimerization. Nature Reviews Molecular Cell Biology 10:423–429.

22. Srinivasan K, Dey S, Sengupta J. 2019. Structural modules of the stress-induced protein HflX: an outlook on its evolution and biological role. Current Genetics 65:363–370.

23. Gasper R, Scrima A, Wittinghofer A. 2006. Structural Insights into HypB, a GTP-binding Protein That Regulates Metal Binding. Journal of Biological Chemistry 281:27492–27502.

24. Teplyakov A, Obmolova G, Chu SY, Toedt J, Eisenstein E, Howard AJ, Gilliland GL. 2003. Crystal structure of the YchF protein reveals binding sites for GTP and nucleic acid. Journal of bacteriology 185:4031–4037.

25. Sirajuddin M, Farkasovsky M, Zent E, Wittinghofer A. 2009. GTP-induced conformational changes in septins and implications for function. Proceedings of the National Academy of Sciences of the United States of America 106:16592–16597.

26. Guilfoyle A, Maher MJ, Rapp M, Clarke R, Harrop S, Jormakka M. 2009. Structural basis of GDP release and gating in G protein coupled Fe2+ transport. The EMBO journal 28:2677–2685.

27. Ruzheinikov SN, Das SK, Sedelnikova SE, Baker PJ, Artymiuk PJ, Garcı́a-Lara J, Foster SJ, Rice DW. 2004. Analysis of the Open and Closed Conformations of the GTP-binding Protein YsxC from Bacillus subtilis. Journal of Molecular Biology 339:265–278.

28. Chan K-H, Wong K-B. 2011. Structure of an essential GTPase, YsxC, from Thermotoga maritima. Acta crystallographica Section F, Structural biology and crystallization communications 67:640–646.

29. Dinkel H, Roey K, Michael S, Kumar M, Uyar B, Altenberg B, Milchevskaya V, Schneider M, K??hn H, Behrendt A, Dahl SL, Damerell V, Diebel S, Kalman S, Klein S, Knudsen AC, M??der C, Merrill S, Staudt A, Thiel V, Welti L, Davey NE, Diella F, Gibson TJ. 2016. ELM 2016 - Data update and new functionality of the eukaryotic linear motif resource. Nucleic Acids Research 44:D294–D300.

30. Chauhan JS, Mishra NK, Raghava GPS. 2010. Prediction of GTP interacting residues, dipeptides and tripeptides in a protein from its evolutionary information. BMC bioinformatics 11:301.

31. de Castro E, Sigrist CJA, Gattiker A, Bulliard V, Langendijk-Genevaux PS, Gasteiger E, Bairoch A, Hulo N. 2006. ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Research 34:W362–W365.

32. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A. 2016. The Pfam protein families database: towards a more sustainable future. Nucleic acids research 44:D279–D285.

33. Chakrabarti S, Anand AP, Bhardwaj N, Pugalenthi G, Sowdhamini R. 2005. SCANMOT: searching for similar sequences using a simultaneous scan of multiple sequence motifs. Nucleic Acids Research 33:W274–W276.

34. Frith MC, Saunders NFW, Kobe B, Bailey TL. 2008. Discovering Sequence Motifs with Arbitrary Insertions and Deletions. PLOS Computational Biology 4:e1000071.

35. Golubchik T, Wise MJ, Easteal S, Jermiin LS. 2007. Mind the Gaps: Evidence of Bias in Estimates of Multiple Sequence Alignments. Molecular Biology and Evolution 24:2433–2442.

Page 19: G-domain prediction across the diversity of G protein families

36. Subramani A, Floudas CA. 2012. Structure prediction of loops with fixed and flexible stems. J Phys Chem B 116:6670–6682.

37. Koller-Eichhorn R, Marquardt T, Gail R, Wittinghofer A, Kostrewa D, Kutay U, Kambach C. 2007. Human OLA1 Defines an ATPase Subfamily in the Obg Family of GTP-binding Proteins. Journal of Biological Chemistry 282:19928–19937.

38. Cheung M-Y, Li X, Miao R, Fong Y-H, Li K-P, Yung Y-L, Yu M-H, Wong K-B, Chen Z, Lam H-M. 2016. ATP binding by the P-loop NTPase OsYchF1 (an unconventional G protein) contributes to biotic but not abiotic stress responses. Proc Natl Acad Sci USA 113:2648.

39. Atkinson GC. 2015. The evolutionary and functional diversity of classical and lesser-known cytoplasmic and organellar translational GTPases across the tree of life. BMC Genomics 16:78–78.

40. Hill TA, Broadhvest J, Kuzoff RK, Gasser CS. 2006. Arabidopsis SHORT INTEGUMENTS 2 Is a Mitochondrial DAR GTPase. Genetics 174:707.

41. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. 2018. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Molecular Biology and Evolution 35:1547–1549.

42. Yang J, Yan R, Roy A, Xu D, Poisson J, Zhang Y. 2015. The I-TASSER Suite: protein structure and function prediction. Nat Meth 12:7–8.

43. Schrödinger, LLC. 2015. The JyMOL Molecular Graphics Development Component, Version 1.8. 44. Zhang Y, Wen J, Yau SS-T. 2019. Phylogenetic analysis of protein sequences based on a novel k-

mer natural vector method. Genomics. 111(6), 1298-1305 45. V. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10 (8): 707--710

Figure Legends

Figure 1: Doughnut plot representing consensus spacing between adjacent G boxes and consensus G box sequences across G protein families, collected by manual curation. Each concentric circle represents consensus sequences of different G boxes. G1 box (GXXXXGK, innermost grey circle), G3 box (DXXG, orange), G4 box (different shades of green), G5 box (outermost circle, different shades of blue). The white space in between consecutive concentric circles represents the amino acid spacers between consecutive boxes.

Figure 2: (A) Flowchart representing the entire methodology followed in the study (B) Workflow of G domain prediction by Spacers and Mismatch Algorithm (SMA) (up to Step 1a in Fig. 2c) (C) G5 box prediction.

Figure 3: Prediction of multiple G domains per protein. (A) Q62636 (Ras), Q5SM23 (Era) and Q71V39 (translational) are proteins with structurally identified G domains (B) P55043(Ras), Q8NNB9 (Era) and A4IMD7 (translational) are proteins with no structural information. For each G box, different predicted motifs were depicted in different colours.

Figure 4: Comparison of predicted G domain boundaries (in Ras, Era and translational G protein families), before and after SMA-Step 3. Number of G domains predicted per protein (A) using “SAX” as G5 box (B) using newly identified G5 box after SMA-Step3. The X axis represents all the proteins in the G protein family. Each vertical line on the X axis represents each individual protein of the family. The Y axis represents the length of proteins (amino acids). Red dot marks the start of the predicted G1 box and blue dot marks the start of predicted G5 box. Each dotted line represents an individual G domain predicted per protein.

Figure 5: Comparison of unique G boxes (in Ras, Era and translational G protein families) predicted before and after SMA-Step 3. Number of unique G box sequences predicted (A) using “SAX” as G5 box (B) using newly identified G5 box after SMA-Step 3. X axis = total number of unique G boxes predicted in a G protein family; Y axis = frequency of occurrence i.e., the number of proteins (in G protein family), in which the predicted unique G box occurs). G1 (peach), G3 (green), G4(blue), G5(purple).

Page 20: G-domain prediction across the diversity of G protein families

Figure 6: Comparison of amino acid spacers between consecutive G boxes predicted before and after SMA-Step 3. (A) Ras, (B) Era (C) translational families. X axis = range of spacer length (amino acids); percentage of proteins in G protein family, having indicated spacer value (bars); mean of inter-box spacing (dashed line) before (blue), and after (yellow) SMA-Step 3. The mean of inter-box spacers was calculated by dividing the sum of individual spacer values (number of amino acids) in the defined spacer range by the total number of spacer values, where each spacer value represents the inter G box spacer in individual predicted G domains in a G protein family.

Figure 7: G protein family-specific G box motifs predicted using SMA3. (A) Ras, (B) Era (C) translational families. X axis= amino acid positions in respective G box; Y axis= probability of occurrence of an amino acid from 0 to 1.

Figure 8. Correlation between G box entropy and G protein family size: Scatter plots where the X axis represents the total number of proteins (size) per G protein family and the Y axis represents total entropy of each G box (A) G1 box (GXXXXGK), (B) G3 box (DXXG), (C) G4 box (NKXD) (D) G5 box predicted by SMA. Each G protein family is represented by a different colour as shown in the key. Figure 9: Superposition of the predicted G boxes to the secondary structure of G proteins.

Predicted G domains (gray), visualized using PyMol, were mapped onto (a) available protein structures downloaded from PDB [Q62636 (Ras), Q5SM23 (Era), Q71V39 (translational)] (b) I TASSER predicted protein structures [P55043(Ras), Q8NNB9 (Era), A4IMD7 (translational)]. G1 box (red), G3 box (green), G4 box(blue), predicted G5 boxes (magenta). GTP/GDP are shown in black in (a).

Page 21: G-domain prediction across the diversity of G protein families

Figure 1: Doughnut plot representing consensus spacing between adjacent G boxes and consensus G

box sequences across G protein families, collected by manual curation. Each concentric circle

represents consensus sequences of different G boxes. G1 box (GXXXXGK, innermost grey circle),

G3 box (DXXG, orange), G4 box (different shades of green), G5 box (outermost circle, different

shades of blue). The white space in between consecutive concentric circles represents the amino acid

spacers between consecutive boxes.

Page 22: G-domain prediction across the diversity of G protein families

Get the data for G protein families from

PROSITE

Removal of similarity

bias >=90%

Multiple Sequence

alignment of proteins

Spacers and Mismatches

based Algorithm

(SMA)

Entropy Calculation

Generation of

phylogenetic tree based on MSA of

proteins

Input file: separate file for each G

protein family

Search for G1 box sequence -GXXXXGK with specified mismatch

Search for G3 box sequence -DXXG with specified mismatch following the identified G1 box

Search for G4 box sequence -NKXD with specified mismatch following the identified G3 box

Search for G5 box sequence -SAX with specified mismatch following the identified G4 box

Do identified G1, G3, G4 and G5 box meet the specified spacer constraint?

Yes No

Print the positon and sequence of each G box in the input protein sequence

Do not print anything for this protein.

Move to the next protein.Continue

the probe for all the possible outputs in the same protein and print them.

A B C

XXXSAXXwith no

mismatch as G5 Box

XXXAXXX with no

mismatch as G5 Box

XXXSAXX with one

mismatch as G5 Box

G5-Box Extraction

Data for G protein families (after removing similarity bias)

SAX with one mismatch as G5

Box

Insert 1 ‘X’ at all possible positions

Insert 2 ‘X’ at all possible

positions

Insert 3 ‘X’ at all possible positions

Insert 4 ‘X’ at all possible positions

Calculate coverage

Select the Sequence with best coverage as new G5- box sequence

Get G-domains with predicted G5-Box

sequence

G-domain prediction

Step 3

Step 1

Step 2

Step 4

a b

Spacers and Mismatches based Algorithm (SMA) workflow

Extension of SMA workflow

Figure 2: (A) Flowchart representing the entire methodology followed in the study(B) Workflow of G domain prediction by Spacers and Mismatch Algorithm (SMA) (upto Step 1a in Fig. 2c) (C) G5 box prediction

Page 23: G-domain prediction across the diversity of G protein families

Figure 3: Prediction of multiple G domains per protein. (A) Q62636 (Ras), Q5SM23 (Era) and Q71V39 (translational) are proteins with structurally

identified G domains (B) P55043(Ras), Q8NNB9 (Era) and A4IMD7 (translational) are proteins with no structural information. For each G box, different

predicted motifs were depicted in different colours

Page 24: G-domain prediction across the diversity of G protein families

Figure 4: Comparison of predicted G domain boundaries (in Ras, Era and translational G protein families), before and

after SMA-Step 3. Number of G domains predicted per protein (A) using “SAX” as G5 box (B) using newly identified G5 box after SMA-Step3. The X axis represents all the proteins in the G protein family. Each vertical line on the X

axis represents each individual protein of the family. The Y axis represents the length of proteins (amino acids). Red

dot marks the start of the predicted G1 box and blue dot marks the start of predicted G5 box. Each dotted line

represents an individual G domain predicted per protein.

Page 25: G-domain prediction across the diversity of G protein families

0

50

100

1 62 124

Number of Unique G−Boxes of Ras family

Fre

quency o

f occure

nce o

f uniq

ue G

box s

equences

A.

0

50

100

1 89 124

Number of Unique G−Boxes of Ras family

Fre

quency o

f occure

nce o

f uniq

ue G

box s

equences

B.

0

5

10

15

20

25

1 94 189

Number of Unique G−Boxes of Era family

Fre

quency o

f occure

nce o

f uniq

ue G

box s

eq

uences

0

5

10

15

20

25

1 124 189

Number of Unique G−Boxes of Era family

Fre

quency o

f occure

nce o

f uniq

ue G

box s

eq

uences

0

100

200

300

1 2414 4829

Number of Unique G−Boxes of Translational family

Fre

quency o

f occure

nce o

f uniq

ue G

box s

equences

0

100

200

300

1 2458 4829

Number of Unique G−Boxes of Translational family

Fre

quency o

f occure

nce o

f uniq

ue G

box s

equences

G−Boxes G1 G3 G4 G5

Page 26: G-domain prediction across the diversity of G protein families

Figure 6: Comparison of amino acid spacers between consecutive G boxes predicted before and after SMA-Step 3.

(A) Ras, (B) Era (C) translational families. X axis = range of spacer length (amino acids); percentage of proteins in G

protein family, having indicated spacer value (bars); mean of inter-box spacing (dashed line) before (blue), and after

(yellow) SMA-Step 3. The mean of inter-box spacers was calculated by dividing the sum of individual spacer values

(number of amino acids) in the defined spacer range by the total number of spacer values, where each spacer value

represents the inter G box spacer in individual predicted G domains in a G protein family.

Page 27: G-domain prediction across the diversity of G protein families

Figure 7: G protein family-specific G box motifs predicted using SMA3. (A) Ras, (B) Era (C) translational families.

X axis= amino acid positions in respective G box; Y axis= probability of occurrence of an amino acid from 0 to 1.

Page 28: G-domain prediction across the diversity of G protein families

0

5

10

15

0 2000 4000 6000 8000

Size of G protein familys

Entr

opy

G1−Box A.

0

5

10

15

0 2000 4000 6000 8000

Size of G protein family

Entr

opy

G3−Box B.

0

5

10

15

0 2000 4000 6000 8000

Size of G protein family

Entr

opy

G4−Box C.

0

5

10

15

0 2000 4000 6000 8000

Size of G protein family

Entr

opy

G5−Box D.

Protein

AIG1

Arf

Dynamin

EngA

EngB

Era

FeoB

Galpha

GB1

Hflx

IRG

Miro

Obg

Rab

Rac

Ran

Ras

Roc

SAR

Septin

SRP

Translational

tRme

Page 29: G-domain prediction across the diversity of G protein families

Figure 9: Superposition of the predicted G boxes to the secondary structure of G proteins. Predicted G domains

(gray), visualized using PyMol, were mapped onto (a) available protein structures downloaded from PDB [Q62636

(Ras), Q5SM23 (Era), Q71V39 (translational)] (b) I TASSER predicted protein structures [P55043(Ras), Q8NNB9

(Era), A4IMD7 (translational)]. G1 box (red), G3 box (green), G4 box(blue), predicted G5 boxes (magenta).

GTP/GDP are shown in black in (a).

Page 30: G-domain prediction across the diversity of G protein families

Supplementary Files

This is a list of supplementary �les associated with this preprint. Click to download.

Supplementalinfo.pdf

Suppl�les.zip