Correct Protonation States and Relevant Waters = Better Computational Simulations?

Francesca Spyrakis1, Luca Dellafiora1, Chenxiao Da2, Glen E. Kellogg2,* and Pietro Cozzini1,*

1Department of Food Science, University of Parma, Parma, Italy
2Department of Medicinal Chemistry & Institute for Structural Biology and Drug Discovery, Virginia Commonwealth University, Richmond, Virginia, 23298-0540 USA

* Corresponding authors

Glen E. Kellogg, Department of Medicinal Chemistry & Institute for Structural Biology and Drug Discovery, Virginia Commonwealth University, Box 980540, Richmond, Virginia, 23298-0540 USA; phone +01 828-6452; fax: +01 827-3664; E-mail: [email protected]

Pietro Cozzini, Department of Food Science, University of Parma, Parco Area delle Scienze 17/A, Parma, Italy; phone: +39 0521 905669; fax: +390 0522 905556; E-mail: [email protected]

Keywords: water conservation, pKa prediction, hydropathic interactions, tautomerism.

Abstract: The unique physicochemical properties of water make it the most important molecule for life. Water molecules have many roles, direct and indirect, related to both biological structure and function. This paper: 1) reviews tools for the prediction of water conservation in and around protein active sites, by empirical (knowledge-based) algorithms and by methods based on thermodynamic principles; 2) reviews principles and approaches to predict pKa for both protein residue ensembles and for ligands; and 3) discusses the HINT biomolecular interaction model and forcefield, which is based on experimental measurements of LogPo/w, the 1-octanol/water partition coefficient, and which implicitly incorporates all of these solution phenomena as well as others such as tautomerism and entropy. Lastly, it must be considered that the “real” biological environment is a continuum of nano-states and it may not be possible to represent it as a single discrete all-atom model.

INTRODUCTION

Water is the Most Important Molecule for Life

The majority of molecular interactions in living systems occur in an aqueous environment. But water behavior is complex: it shows unusual properties and anomalous values of many thermodynamic parameters (e.g., melting point, boiling point and heat of vaporization). The geometry of a water molecule – its 3D structure – is responsible for these properties and, more interestingly, for their consequences on life processes.

While the volume of water on Earth is less than 0.1% of the Earth’s total volume, all of the water is at or near the surface and 71% of that surface is covered in water. Estimates for the fraction of water in the human body range from 50 to 60%. Water is a seemingly simple molecule, having a volume of only 16-18 Å³ and just three atoms, but those three atoms pack a lot of punch! A single water can engage in four hydrogen bonds – two as donor and two as acceptor – along more or less tetrahedral axes. Water has a very high polarity (relative permittivity = 80), and its mere presence in the neighborhood significantly changes the electrostatic interactions of other species. The boundary between bulk water and a molecule like a protein is a fascinating place because of the radical drop in relative permittivity (2-6 for a typical protein, excluding the usually charged surface sidechains [1]). More importantly, and in ways we may never completely understand, these “physical” properties of water have led to our notion of life, arguably making water the most important molecule for biological processes.
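The size of this permittivity drop can be made concrete with Coulomb's law. The short Python sketch below is illustrative only: it treats each medium as a uniform dielectric, and the 4 Å charge separation and the interior value of 4 are assumptions chosen from within the range quoted above, not values taken from any specific structure.

```python
# Coulomb's law in units convenient for biomolecular modeling.
# 332.06 converts e^2/Angstrom to kcal/mol.
COULOMB_KCAL = 332.06

def coulomb_energy(q1, q2, distance_angstrom, relative_permittivity):
    """Interaction energy (kcal/mol) of two point charges in a uniform dielectric."""
    return COULOMB_KCAL * q1 * q2 / (relative_permittivity * distance_angstrom)

# Illustrative ion pair: +1 and -1 charges 4 Angstrom apart (assumed geometry).
in_water = coulomb_energy(+1.0, -1.0, 4.0, 80.0)    # bulk water
in_protein = coulomb_energy(+1.0, -1.0, 4.0, 4.0)   # assumed protein-interior value

print(f"bulk water:       {in_water:.2f} kcal/mol")
print(f"protein interior: {in_protein:.2f} kcal/mol")
```

Under these assumptions the same ion pair interacts twenty times more strongly in the low-permittivity medium, which is why the placement of even a single water can matter for computed electrostatics.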

In addition to its bulk properties as a solvent, etc., water can act in numerous ways on the nano (individual molecule) scale: as a lubricant, as a buffer, as a proton conductor, as a sidechain extender, to change an H-bond donor into an acceptor or vice versa, and many more. Water molecules can be found barely associated with a surface residue, deeply stored in surface cavities, as a part of binding site walls, or directly involved in the binding process between protein and ligand or protein and protein, and in many other places. Its charge and dipolar nature allow water to be an excellent intermediary of binding by acting as a bridge between protein and ligand or protein and protein [2]. In fact, Klebe [3] demonstrated that in about two-thirds of protein–ligand complexes at least one water molecule is involved in the binding. It has been reported that there are, on average, 10-11 waters per 1000 Å² at a protein-protein interface [4].

Waters are not just “spectators” to ligand binding or protein-protein associations; one view is that they may actually be the most important component of a biological system. All of its features make water a key element in the structure and function of macromolecular complexes [5]. In addition, entropy changes, as a critical component of Gibbs free energy, largely arise from changes in water structure. Of course, solvation/desolvation energies are directed by changes in water/macromolecule and water/ligand structure: water solvates proteins and/or ligands by “surrounding” them in solution, which leads to solvation and desolvation energy terms whenever two or more molecules undergo an encounter. Also, the “hydrophobic effect”, which represents a significant fraction of most biological interactions, is really a misnomer. The cause of the apparent attraction of hydrophobic moieties, whether in protein-protein associations, in ligand binding or other scenarios, is actually their mutual desire to not be associated with water. Thus, proteins will generally fold to expose their most polar regions to the bulk water solvent and “protect” their hydrophobic regions in a core that is mostly water free and impenetrable to solvent.

Unfortunately, water structure is difficult to experimentally observe or predict in biological systems: X-ray diffraction experiments cannot always distinguish between water, ions and experimental artifacts. This is more of a problem with low-resolution structures, where electron density envelopes can be vague or poorly defined, making it impossible to discern the identity or role of the encapsulated molecular species. Since the solvent of choice for proteins is water, as they will likely fold improperly in others, nuclear magnetic resonance (NMR) experiments in water are normally unable to identify specific and tightly-bound waters, although recent work with micelles [6] and with solid state NMR [7] is promising. Prediction methods that place water in the absence of experimental data are largely empirical in nature and usually are based on molecular mechanics forcefields calibrated against experimental structures. The most popular of these methods is the GRID program of Goodford [8], but other similar tools have been reported [9].

Waters, Protons and Molecular Modeling

Thus, it continues to be surprising that so much molecular modeling and computational chemistry is performed without explicit water. For example, the popular ZDOCK program of Weng et al. [10] currently has no provision for water in its docking and scoring algorithms. It seems obvious that when we build a model for computational simulations, we must think of a multi-body model; i.e., for a biological system we must consider at least three bodies – the protein, the ligand and water(s). These water molecules give the proteins a great deal of flexibility [11]. Hydrated proteins contain waters that account for 20 – 60% of the protein’s mass and have residence times that, depending on how they are associated with or buried within the protein cavity, range from nanoseconds for external waters to tens of milliseconds for more internal waters [11]. We [12], and others [13], have shown that every water molecule plays a different role, perhaps as a leading man or as a supporting actor. As a consequence, to predict, as correctly as possible, the interactions and energetics in biological complexes, we are forced to model the multi-body interactions of the protein, ligand and, at a minimum, some “important” waters. Moreover, the gold standard starting point for modeling is generally X-ray diffraction data, meaning that the initial models for the molecules are without hydrogens and protons. Determining the correct protonation state for acidic and basic protein sidechains and functional groups on ligands should therefore be considered mandatory in order to obtain realistic predictions of binding energies.

One of the best-known cases that graphically illustrates the different behavior of water molecules is that of HIV-1 protease [14, 15]. This protease is essential for the HIV replication cycle and is, along with reverse transcriptase and integrase, a fundamental enzyme of the HIV retrovirus. It is a homodimeric enzyme with two chains of 99 amino acids each, possessing a two-fold axis of symmetry and a fairly large number of water molecules, of which six are well conserved in the binding pocket region. The number of observed water molecules in any structure is a function of the crystal quality, the temperature at which the experiment is performed, and the resolution of the X-ray diffraction experiment. The binding site in unliganded HIV-1 protease [15] is perfectly symmetric, with Asp25 belonging to the first chain and facing Asp25’, its twin in the second chain. Both coordinate the catalytic water, usually referred to as Wat300. This water is consistently displaced when a ligand binds, as that is its role in the functional protein.

Another conserved water, termed Wat301, is detected in both the free form of the protein and in the majority of protein-ligand complexes (see Figure 1). Miller [16], Erickson [17], Swain [18], Jaskolski [19], Abdel-Meguid [20], Wlodawer and Vondrasek [21] investigated and elucidated the function and the relevance of Wat301 in binding recognition. In fact, this molecule is located on the symmetry axis of the HIV-1 dimer and bridges the two subunits and most of the bound ligands. It establishes hydrogen bonds with the protein backbone amide nitrogen atoms of Ile50 and its symmetry partner Ile50’. The most successful HIV-1 protease ligands position two carbonyl oxygen atoms or similar acceptors near Wat301.

Because this water is so well recognized, one might suppose that crystallographers are assuming its presence in their electron density maps, but the “stability” of the position is testified by the low B factor values and, computationally, by GRID analyses that show the region occupied by Wat301 as being very energetically favorable for its presence. As mentioned above, this water has been a critical design feature with respect to HIV-1 protease inhibitors. Early stage inhibitors were designed to provide appropriate hydrogen bond acceptors for the two protons of the water – a task made easier by the inherent symmetry of the site. However, in one of the triumphs of structure-based drug discovery, ligands that were designed to displace this water [22, 23] were even tighter binders. The additional free energy was hypothesized to arise from entropy [14], i.e., to be generated by release of the water to the bulk solvent, but Fornabaio et al. suggested that it was more of an enthalpic/desolvation effect [24].

There are a few other water molecules buried within the ligand binding site that are not always detected by crystallographic analysis: Wat313 (and its symmetrical Wat313’) and Wat313bis (and its symmetrical Wat313bis’) being the most conserved. The interaction energies of twelve waters, observed with varying degrees of conservation in the HIV-1 protease active site, were calculated by Fornabaio et al. [24] using the HINT program (vide infra). It turns out that, while the value of understanding and accounting for the contribution of “important” water molecules in biological computations is obvious, the identification of these waters is not always so obvious or easy: it is not simply a matter of proximity. Amadasi et al. proposed a method to identify statistically “relevant” waters in proteins from their structure [25]. This sort of tool may provide a rationale for selection of waters that can be strategically targeted for “displacement by design” such as Wat301 in HIV-1 protease with a net free energy gain for ligand binding.

The other key consideration in building atomic-level models of biological systems is the protons. Since it is impossible to obtain all (or usually any) hydrogen positions from X-ray analyses, and the pH at which crystals are grown is generally chosen for the quality of the resulting crystals for diffraction rather than for biological relevance, experiment does not reveal the ionization states of the acidic or basic sidechains (or ligand functional groups). In general, when hydrogens are considered, the assumption is made that all residues are protonated as they would be at pH 7. In our work [24], the ionization states of the two catalytic aspartates (25 and 25’) were examined and modeled based on the reports of Smith [26] and Wang [27], who had earlier shown with NMR that only one of these aspartates can be protonated in the pH range 2.5 – 6.5. In one complex, where the Glu-Asp-Leu peptide was liganded to HIV-1 protease in solution at pH values between 3.0 and 5.0, Louis et al. [28] demonstrated pH-dependent binding as different protonation states of the two protein aspartates and of the peptide were accessed. The availability of the corresponding PDB structure with good resolution (2.9 Å) allowed Spyrakis et al. to illustrate a computational method to explore the relationship between binding free energy and protonation state for such biomolecular systems [29]. In the tripeptide-HIV-1 case there were actually over 4000 potential ionization state models; determining the correct model or subset of models is clearly a prerequisite for accurate predictions of ligand binding energy. Development of a fast method for such a determination could translate into improved accuracy in virtual screening and structure-based drug design.
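The combinatorial growth behind such a number is easy to see in a small sketch. The Python below simply enumerates every combination of states over a list of titratable sites; the site names and the restriction to two states per site are assumptions for illustration, and the actual sites and tautomers considered in [29] are not reproduced here. As a point of reference, twelve independent two-state sites would already give 2^12 = 4096 combinations.

```python
from itertools import product

def enumerate_protonation_models(sites):
    """Yield every combination of states for a dict {site_name: [allowed states]}."""
    names = list(sites)
    for combo in product(*(sites[name] for name in names)):
        yield dict(zip(names, combo))

# Hypothetical site list: the two catalytic aspartates plus one ligand carboxylate.
sites = {
    "Asp25": ["deprotonated", "protonated"],
    "Asp25'": ["deprotonated", "protonated"],
    "ligand_COOH": ["deprotonated", "protonated"],
}

models = list(enumerate_protonation_models(sites))
print(len(models))  # 2 * 2 * 2 combinations

# Number of models for twelve independent two-state sites.
print(2 ** 12)
```

Each enumerated model would then have to be built and scored separately, which is why a fast way to prune the list is attractive.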

This chapter explores these two issues, water role and acid/base ionization states, in much more detail. We will review the current state-of-the-art in computational predictions for both. This is more than an academic exercise: water is a critical element that should absolutely be taken into account in all in silico analyses for screening, discovering or designing ligands and biologically active compounds [30]. There are many examples of the importance of considering water molecules in simulations, but the results can be discordant: considering key water molecules sometimes improves the simulations, e.g., as illustrated by Huang and Shoichet [31], but sometimes the opposite is observed [32]. We believe that this just indicates that we still have a lot to learn! In the final section of the chapter, we will describe the tool – HINT (Hydropathic INTeractions) – that we have been developing for the past two decades and relate some of the lessons we have learned concerning water, ionization states and related phenomena in atomic-scale biological modeling and simulations.

WATER CONSERVATION

Water is omnipresent in biological systems and it can act simply as solvent, as well as a product or substrate in many enzymatic reactions [33]. Both “hiding” the hydrophobic regions and the resulting intermolecular H-bonds are driving forces of protein folding. Moreover, water can play fundamental mechanistic roles and actively take part in the structures and functions of proteins [34]. Water is a unique multi-functional molecule in all biological (and many non-biological) environments.

While it sometimes appears that water molecules are merely filling space, particularly when observed in locations that are not thermodynamically favorable [35, 36], there is likely purpose and biological importance behind these waters even if we cannot discern it. At the simplest level, these water molecules can help define the shape and the plasticity of binding or interaction sites and thus govern ligand specificity and affinity [37]. Tame et al. [38] and Sleigh et al. [39] report an example of water molecules acting as flexible adapters at the OppA (oligo peptide binding protein)-ligand interface, while Quiocho et al. [40] show that water molecules can influence binding affinity and specificity in the L-arabinose-binding protein.

Most of the information regarding the position of water molecules in proteins is derived from X-ray crystallographic studies. However, as noted above, even this experiment can give false or misleading results, especially when the resolution is poor. Davis, Teague and Kleywegt reviewed the limitations of crystallographic data for structure-based drug design nearly a decade ago [41], but the situation has not really changed dramatically. Certainly, water molecules possessing relatively high mobility in the binding site or in rapid exchange with the bulk solvent [33] will be difficult to detect and characterize.

Despite the limitations of experiment, the proper evaluation of the role and behavior of water molecules near binding interfaces remains of crucial relevance. Thus, over the past two to three decades, many computational tools have been developed to validate, predict and classify water molecules with respect to virtual screening and docking calculations. We review below some of the most commonly used algorithms. Most of the tools focus on a decision point – whether a particular water molecule will be conserved (or water site will be occupied). There has been a large volume of other work, not described here, where the goal is to learn more about the energetics of such waters.

Empirical and Knowledge-based Tools for Prediction of Water Conservation

Consolv [42] implements an algorithm that couples a k-nearest-neighbors classifier with a genetic algorithm [43] and performs an environmental characterization of each water molecule in ligand-free crystallographic structures. This leads to a prediction of which water molecules will remain and which will be removed from the ligand-protein interface after the binding event. In order to characterize the micro-environment of each crystallographic water molecule by comparison with test water binding sites, four properties are evaluated: the water molecule’s crystallographic temperature (B) factor, the number of hydrogen bonds between the water molecule and protein estimated using the Hbond program [44], the protein surface topography (the number of protein atoms within 3.6 Å of the water), and the hydrophilicity of neighboring protein atoms. The latter measures the tendency of surrounding atoms to bind water molecules and is based on Kuhn’s study of the frequency of hydration for each atom type in 56 high-resolution protein structures [45]. After training on a set of 13 non-homologous proteins, Consolv was able to predict the conservation (or displacement) of water molecules in the binding sites of 7 different proteins with an accuracy of 75%. One issue is that it has difficulties in predicting displacement of water molecules caused by ligand polar groups, i.e., where there is no net desolvation.
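The classification step can be sketched as a plain weighted k-nearest-neighbors vote over the four descriptors. This is a minimal illustration and not the Consolv code: the training waters below are invented and pre-scaled to the 0-1 range, the value of k is arbitrary, and the `weights` argument merely stands in for the feature weights that a genetic algorithm could tune.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3, weights=(1.0, 1.0, 1.0, 1.0)):
    """Majority vote among the k training waters closest to `query`.

    Each feature vector is (B-factor, H-bond count, neighbor-atom count,
    hydrophilicity), assumed to be scaled to comparable ranges beforehand.
    """
    def dist(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented, pre-scaled toy data; not taken from the Consolv training set.
train = [
    ((0.10, 0.90, 0.80, 0.70), "conserved"),
    ((0.20, 0.80, 0.90, 0.80), "conserved"),
    ((0.15, 0.70, 0.70, 0.90), "conserved"),
    ((0.90, 0.10, 0.20, 0.30), "displaced"),
    ((0.80, 0.20, 0.10, 0.20), "displaced"),
    ((0.70, 0.30, 0.30, 0.10), "displaced"),
]

buried_water = knn_predict(train, (0.12, 0.85, 0.75, 0.80))
mobile_water = knn_predict(train, (0.85, 0.15, 0.20, 0.25))
print(buried_water, mobile_water)
```

In this toy setting a low-B-factor, well-coordinated water is voted "conserved" and a mobile, poorly coordinated one "displaced"; a real application would depend entirely on the quality of the training set and the scaling of the descriptors.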

WaterScore [46] was proposed in 2003 by García-Sosa, Mancera and Dean. They performed a multivariate logistic regression analysis to establish a statistical correlation between the structural characteristics of water molecules in the binding sites of apo structures of proteins and the probability of observing water molecules in the same locations in protein-ligand complexes. Here, the B-factor, the solvent-contact surface area, the total hydrogen bond energy and the number of protein–water contacts were found to quantitatively discriminate between tightly bound water molecules and displaceable ones. This method provides a potentially useful tool for identifying those waters that should be included in structure-based drug design and docking simulations. On a test set of 25 protein structures, WaterScore shows an accuracy of 64%, but like Consolv, it does not consider displacement caused by competition between polar groups of the ligand and the water molecules.
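The form of such a logistic model is simple to write down. In the Python sketch below the coefficients and intercept are placeholders chosen only to show the mechanics; they are not the published WaterScore parameters, and the feature values are invented.

```python
import math

def conservation_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(c_i * x_i))))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder coefficients (hypothetical, for illustration only).
# Feature order: B-factor, solvent-contact surface area,
# total hydrogen bond energy, number of protein-water contacts.
coefficients = (-0.05, -0.10, -0.30, 0.20)
intercept = 0.0

# Invented example waters: one tightly bound, one loosely bound.
tight = conservation_probability((12.0, 0.0, -6.0, 5.0), coefficients, intercept)
loose = conservation_probability((45.0, 10.0, -1.0, 1.0), coefficients, intercept)
print(f"tight: {tight:.2f}  loose: {loose:.2f}")
```

A threshold on the resulting probability (0.5 is the conventional default) then turns the continuous output into a conserved/displaced call.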

Lu et al. [37] performed a similar but more comprehensive analysis of water molecules at the protein-ligand interface of 392 high-resolution crystallographic complexes to determine the number of water molecules bound to the ligand in protein-ligand complexes and to understand the factors that influence water binding. They also analyzed the propensity of protein residues and ligand atom types to bind water molecules. In their study, there were 1829 ligand-bound water molecules (72% interfacial and 18% at the surface) and the average number of ligand-bound waters was 4.6 per complex, 76% of which were bridging waters, i.e., with polar interactions with both ligand and protein. Among all factors analyzed, polar van der Waals surface area may be most important in describing the number of bound waters, while there is only a weak correlation with the resolution of the structures. It was also noted that, in analyzing the isotropic B-factors for these complexes, water molecules with more than three polar interactions show less mobility than protein atoms. Also, polar moieties have higher hydration propensity than nonpolar moieties. The arginine side chain shows the highest value for proteins, and propensity largely correlates with charge for ligand atoms.

3D-RISM (three-dimensional reference interaction site model) [47] is based on a statistical-mechanical theory of molecular solvation that has the advantage of quantitatively estimating the thermodynamic properties and 3D distribution functions of the solvent sites. Its advantage is in the localization of water molecules, e.g., in a study of hen egg-white lysozyme, the number of water molecules predicted to be entrapped in the sites compared well with the positions commonly observed in crystallographic structures of the protein – under different conditions and resolutions. In addition, the 3D-RISM methodology was able to resolve water molecules that are ambiguously determined by X-ray diffraction. Furthermore, it is suggested that MD simulations are poorly suited for this kind of structure-based analysis because the starting coordinates of the waters overly influence these simulations.

SUPERSTAR [48] is an empirical knowledge-based approach for identifying interaction sites in proteins based on the Cambridge Structural Database (CSD) IsoStar [49] library, which is a repository of experimental information about non-bonded interactions that occur in X-ray crystal structures of small molecules. SUPERSTAR combines structural data from the Protein Data Bank (PDB) [50] with IsoStar to generate composite propensity maps of binding sites depicted in the form of scatterplots that show the distribution of these interactions observed in structures of a particular functional group (the probe) around another. A protein’s binding pocket is fragmented such that the residues of the binding site are considered as small molecules or small structure fragments that can be mapped to these probes. Finally, composite three-dimensional maps showing the propensity of certain probe types to be at positions around the residues of the binding sites are generated and displayed. Barillari et al. showed that this software could also be used to identify protein hydration sites [51].

AcquaAlta [52] is an algorithm that reproduces the polar interactions of water molecules at ligand-protein interfaces. From a rigorous analysis of small-molecule crystal structures, again from the CSD, data about the geometry of water molecules involved in the interactions with generic functional groups were collected. Then, in order to define an empirical ranking of hydration propensity, the interaction energies between water and hydrated functional groups were obtained with Gaussian 03 ab initio calculations. This yielded a knowledge-based algorithm that was validated on a test set of twenty crystallographic structures with resolutions ranging from 0.97 to 2.60 Å. For validation, the experimental waters were deleted and subsequently recalculated using AcquaAlta. After orientation and minimization, the match between calculated and experimental water positions was 76%. Interestingly, algorithm accuracy was not influenced substantially by the resolution of the crystal structures.
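A delete-and-recalculate validation of this kind reduces to counting how many experimental waters are recovered by a predicted water within some distance tolerance. The sketch below shows one simple way to compute such a match rate; the 1.4 Å tolerance and the coordinates are assumptions for illustration, since the matching criterion used in [52] is not given here.

```python
import math

def match_rate(predicted, experimental, tolerance=1.4):
    """Fraction of experimental waters with a predicted water within `tolerance` Angstrom."""
    if not experimental:
        return 0.0
    matched = sum(
        1 for exp in experimental
        if any(math.dist(exp, pred) <= tolerance for pred in predicted)
    )
    return matched / len(experimental)

# Invented oxygen coordinates (Angstrom).
experimental = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0), (9.0, 9.0, 9.0)]
predicted = [(0.5, 0.0, 0.0), (5.0, 1.0, 0.0), (0.0, 5.0, 0.2), (20.0, 20.0, 20.0)]

rate = match_rate(predicted, experimental)
print(f"{rate:.0%} of experimental waters recovered")
```

Note that this simple definition lets one predicted water match more than one experimental water; a stricter one-to-one assignment would give a lower or equal rate.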

WaterDock [53] is a protocol that is an adjunct to AutoDock Vina designed to predict the binding site of water molecules. While water binding sites can be predicted through Molecular Dynamics (MD) or Monte Carlo (MC) simulations, protocols relying on these techniques can be very computationally expensive.

WaterDock proposes a rapid method (requiring only a few seconds) that is not only able to predict the position of water molecules, but also able to predict if a water molecule can be displaced by polar or nonpolar groups. The method is validated against high-resolution crystallographic structures, molecular dynamics and neutron diffraction data. With a validation set of high-resolution X-ray structures, the authors obtained a success rate of 88% in the prediction of consensus binding sites for water. Moreover, by combining data mining, heuristic and machine learning techniques, they created a model that is able to assign the probability that a water molecule will remain or be displaced after the binding event with an accuracy of 75%.

Santos et al. [5] report an elegant example, applied to cytochrome P450 2D6 (CYP2D6), of how water molecules can be described and analyzed by integrating molecular dynamics and docking simulations. They define the correct water binding sites as those where water molecules remain at least 80% of the time during the simulation. In testing the effect of including water molecules in docking simulations, it is concluded that the inclusion of some water molecules can influence the reliability of docking, but their role is not unique. In some cases, these waters lead to a slight improvement in predictions, while in others there is no overall effect.
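An occupancy criterion of this type can be evaluated directly from trajectory frames. The Python sketch below is a minimal illustration, assuming each frame is just a list of water oxygen coordinates and that a site counts as occupied when any oxygen lies within a cutoff of it; the 1.5 Å cutoff and the toy trajectory are assumptions, and only the 80% threshold comes from the text.

```python
import math

def site_occupancy(site, frames, cutoff=1.5):
    """Fraction of frames in which any water oxygen lies within `cutoff` Angstrom of `site`."""
    if not frames:
        return 0.0
    occupied = sum(
        1 for waters in frames
        if any(math.dist(site, w) <= cutoff for w in waters)
    )
    return occupied / len(frames)

def is_stable_site(site, frames, cutoff=1.5, min_occupancy=0.8):
    """Apply the 80% residence criterion to a candidate site."""
    return site_occupancy(site, frames, cutoff) >= min_occupancy

# Toy trajectory: a water sits near the origin in 9 of 10 frames.
frames = [[(0.2, 0.0, 0.0)] for _ in range(9)] + [[(6.0, 0.0, 0.0)]]

print(site_occupancy((0.0, 0.0, 0.0), frames))   # 0.9
print(is_stable_site((0.0, 0.0, 0.0), frames))   # True
print(is_stable_site((6.0, 0.0, 0.0), frames))   # False
```

This definition does not require that the same water molecule occupies the site throughout; tracking molecular identity would give residence times rather than occupancies.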

AQUARIUS [54] provides a unique low-level knowledge-based method to identify water sites in proteins. Using electron density maps derived by crystallography, the algorithm defines the likely position of water molecules within a protein by mapping each amino acid to a data set of crystal structures.

CS-Map [55] predicts the favorable binding position for waters (or other “solvent” probes) in binding sites taking into account van der Waals, electrostatic and solvation contributions. The algorithm is divided into three sequential steps: 1) a rigid body search in which the probes move within the electrostatic and desolvation fields of the protein. In order to allow the ligand the freedom to move towards regions with favorable electrostatic and desolvation characteristics, atomic overlap is penalized but van der Waals contributions are not taken into account; 2) the van der Waals contributions are considered and a more accurate continuum model is used to calculate the electrostatic and solvation contributions. Starting from conformations generated in the previous step, the free energy of protein-ligand complexes is minimized and the flexibility of the ligand is taken into account; 3) the docked positions are clustered and then ranked on the basis of their average free energies.
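The third step, clustering followed by ranking on average free energy, can be illustrated with a simple greedy scheme. This sketch is not the CS-Map clustering procedure: the greedy seed-based assignment, the 2.0 Å radius and the pose data are all assumptions made to keep the example short.

```python
import math

def cluster_and_rank(poses, radius=2.0):
    """Greedy clustering of (xyz, energy) poses, then ranking by mean energy.

    Poses are visited from lowest to highest energy; a pose joins the first
    cluster whose seed lies within `radius` Angstrom, otherwise it seeds a
    new cluster. Returns (seed_xyz, mean_energy) pairs, best cluster first.
    """
    clusters = []  # each cluster: {"seed": xyz, "energies": [...]}
    for xyz, energy in sorted(poses, key=lambda p: p[1]):
        for cluster in clusters:
            if math.dist(cluster["seed"], xyz) <= radius:
                cluster["energies"].append(energy)
                break
        else:
            clusters.append({"seed": xyz, "energies": [energy]})
    ranked = [
        (c["seed"], sum(c["energies"]) / len(c["energies"])) for c in clusters
    ]
    return sorted(ranked, key=lambda r: r[1])

# Invented probe positions (Angstrom) and free energies (kcal/mol).
poses = [
    ((0.0, 0.0, 0.0), -5.0),
    ((0.5, 0.0, 0.0), -4.0),
    ((10.0, 0.0, 0.0), -6.0),
    ((10.5, 0.0, 0.0), -1.0),
]

ranked = cluster_and_rank(poses)
for seed, mean_energy in ranked:
    print(seed, mean_energy)
```

In this toy case the cluster containing the single best pose is ranked second, because ranking by the cluster average rewards sites that are consistently favorable rather than sites with one fortunate pose.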

Fold-X [56] is an empirical forcefield originally developed for rapid free energy calculations in proteins that is also capable of predicting the positions of bound water (and metal ions) in protein structures. In a test set of 50 protein structures, it was shown that the Fold-X forcefield predicted, on average, 76% of crystallographic waters in contact with at least two polar groups with a root-mean deviation of 0.68 Å between predicted and observed positions. To build the forcefield, water molecules (or metal ions), along with their coordinating atoms and two atoms covalently bonded to them, were extracted from high-resolution X-ray crystal structures. Superposition of the resulting triads upon the same sidechain backbone atoms in the protein structure of interest leads to a “cloud” of water molecules (or metal ions) around coordinating atoms in the protein. Clustering of these clouds yields at least one spatial-representative center for canonical positions of interaction between the water (or metal ion) and the protein. All canonical waters are initially placed, but those that clash with the other protein atoms are removed and those within a defined distance threshold are fused. The optimal positions of the predicted water molecules are determined through an energy minimization.
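The final filtering stage described above (remove clashing candidates, fuse those that fall close together) can be sketched as follows. The thresholds, the running-centroid fusion rule and the coordinates are assumptions for illustration and are not taken from the Fold-X implementation; the subsequent energy minimization is omitted.

```python
import math

def place_waters(candidates, protein_atoms, clash_dist=2.2, fuse_dist=1.0):
    """Drop candidates that clash with protein atoms, then fuse close neighbors.

    Returns the centroid of each fused group of surviving candidates.
    """
    kept = [
        w for w in candidates
        if all(math.dist(w, atom) >= clash_dist for atom in protein_atoms)
    ]
    groups = []  # each group: [sum_x, sum_y, sum_z, count]
    for w in kept:
        for group in groups:
            center = tuple(s / group[3] for s in group[:3])
            if math.dist(center, w) <= fuse_dist:
                for i in range(3):
                    group[i] += w[i]
                group[3] += 1
                break
        else:
            groups.append([w[0], w[1], w[2], 1])
    return [tuple(s / g[3] for s in g[:3]) for g in groups]

# Invented coordinates (Angstrom): one protein atom and four candidate waters.
protein_atoms = [(0.0, 0.0, 0.0)]
candidates = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.4, 0.0, 0.0), (8.0, 0.0, 0.0)]

waters = place_waters(candidates, protein_atoms)
print(waters)
```

Here the first candidate is rejected as a clash, the next two are fused into a single site midway between them, and the last survives unchanged.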

HINT Relevance [25] is a simple metric for water conservation derived from two structural properties of water molecules in X-ray structures. The first property is Rank [57], a geometric-only parameter that scores the water site with respect to its potential ability to hold the water. The second parameter is the HINT score of the water, with its orientation optimized with respect to its environment. Despite an obvious appearance of information overlap between these two parameters, they are not well correlated [12]. Using a Bayesian-like approach and this pair of parameters a function was constructed and trained with liganded and unliganded protein structures in which 59/68 (87%) of waters in a test set were correctly predicted with respect to conservation or displacement on ligand binding. A correlation was also found between results and crystallographic data quality: 35/38 (92%) of the waters were correctly predicted for proteins with resolution ≤ 2.0 Å. Further discussion of HINT and Relevance can be found in a later section.

Thermodynamics Tools for Prediction of Water Conservation

JAWS [58] is a grid-based Monte Carlo molecular simulation methodology developed to determine the positions of water molecules in the binding sites of proteins or in protein-ligand complexes. Occupancies and absolute binding free energies of water molecules are computed using a statistical thermodynamics approach. While the method was validated on a relatively small test set of 5 protein structures (neuraminidase, scytalone dehydratase, major urinary protein 1, β-lactoglobulin and COX-2), it was demonstrably accurate in identifying hydration sites when compared to high-resolution crystallographic data.

DOWSER [59] predicts favorable water sites based on the average interaction energy during a short molecular dynamics simulation. To determine whether an interior cavity is occupied by water molecules, a cutoff energy value of -10 kcal mol⁻¹ was established. While performing a 10 ns molecular dynamics (MD) simulation, Damjanovic et al. [60] applied DOWSER to identify water molecules in the cavity of Staphylococcal nuclease. Internal water molecules were identified on the basis of their coordination state and characterized in terms of their residence times, average location, dipole moment fluctuations, hydrogen bonding interactions and interaction energies. Water molecules possessing residence times of several nanoseconds and small mean-square displacements were in agreement with crystallographic data.
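The energy criterion described above can be sketched as a simple filter. This is only an illustration and not DOWSER's own code: the function name, the data layout and the example energies are hypothetical, and treating a site as occupied only when its energy is strictly below the cutoff is an assumption. Only the -10 kcal/mol value comes from the text.

```python
# Hypothetical sketch of an energy-cutoff filter for candidate water sites.
# Only the -10 kcal/mol cutoff is taken from the text; everything else is illustrative.
CUTOFF_KCAL_MOL = -10.0


def occupied_sites(site_energies, cutoff=CUTOFF_KCAL_MOL):
    """Return names of sites whose average interaction energy is below the cutoff.

    site_energies: dict mapping a site name to its average water-protein
    interaction energy in kcal/mol (more negative = more favorable).
    """
    return [name for name, energy in site_energies.items() if energy < cutoff]


# Made-up energies for three candidate cavity sites.
example_sites = {"cavity_A": -14.2, "cavity_B": -6.5, "cavity_C": -10.0}
hydrated = occupied_sites(example_sites)
print(hydrated)
```

With these made-up numbers only `cavity_A` passes, because `cavity_C` sits exactly at the cutoff and the sketch uses a strict comparison.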

WaterMap [61, 62] is an approach designed to evaluate the relative free energies of binding for series of ligands. Molecular dynamics simulations are used to generate the positions of water sites, and the free energies for displacing water molecules from these sites, relative to the bulk solvent, are estimated with a method based on inhomogeneous solvation theory. The authors showed that the approach is able to predict the binding free energies of a set of congeneric pairs of Factor Xa ligands. Furthermore, they recently highlighted the nontrivial effects of binding moieties in dry sites and described their contributions to the binding affinity [63]. The displacement of water molecules from hydrophobic sites by complementary hydrophobic groups in ligands has been established as a principal driving force of protein-ligand binding. However, one can also occasionally observe situations in which a region of the ligand binding site is so unfavorable for water molecules that a void is formed. These dry regions are expected to play an important role in the binding event. Wang et al. [63] investigated these aspects quantitatively by combining WaterMap free energy differences with additional terms attributable to occupation of the dry regions by ligand atoms.

IFST (inhomogeneous fluid solvation theory) [64, 65] also uses short molecular simulations to calculate the thermodynamics of water molecules in the water binding sites, thus revealing information about bound water energetic contributions. An advantage of this method is that the free energy is divided into enthalpic and entropic contributions. As suggested by the name “inhomogeneous fluid”, the theory treats the solute as fixed and computes the solvation energy and entropy as integrals over the space occupied by solvent. IFST has been implemented in STOW (Solvation Thermodynamics of Ordered Water) [2], which can use as input data water coordinates obtained from any MD package. These coordinates are then used to compute the integral used by IFST and the output consists of the energetic contributions of each water molecule.

Barillari et al. [51] pointed out that a better understanding of the role and nature of water molecules at ligand-protein interfaces could greatly improve the efficacy of rational molecular design and synthesis. A thermodynamic analysis of water molecules in six protein structures was carried out by calculating the absolute binding free energy of 54 key water molecules in these structures using the double-decoupling method with replica exchange thermodynamic integration in Monte Carlo simulations. A statistical approach was proposed to distinguish between two classes of water molecules: those that are conserved and not removed by any ligand, and those that can be displaced. It was shown that evaluating the propensity of a water molecule to be displaced from its binding site at the protein-ligand interface could provide useful information to drive compound design towards removing specific waters.

Hamelberg and McCammon [66] proposed an analogous approach. They performed a double-decoupling calculation to obtain the standard free energy of binding for water molecules in the binding pockets of the crystallographic structures of two different proteins (anionic trypsin complexed with benzyldiamine and HIV-1 protease complexed with the KNI-272 inhibitor). One of the key conclusions is that this method is useful for identifying those water molecules that should be targeted for ligand-driven displacement in rational drug design, especially when the goal is to improve the binding affinity of a lead compound.

pH AND IONIZATION STATES

The structure and, consequently, the function, of a protein is related to the ionization state of a number of critical residues, whose pKa values are, in turn, strongly influenced by the electrostatics of the protein environment [67]. It is clear that the determination/prediction of the correct protonation states of ionizable groups represents a fundamental, even if still partially unsolved, problem. Equally clear is that this problem involves a bit of circular logic, in that the residue influences the environment even as the environment influences the residue. Inserting a ligand into this system adds another layer of complexity, in that many, if not most, small molecule drug-like moieties are themselves acids and/or bases, or contain functional groups that are affected by changes in pH or in the nano environment. The ubiquitous presence of water adds still more complexity. In this section of the paper we examine these issues in two parts: first with respect to the protein and then with respect to the ligands.

From the Protein’s Point of View

The primary ionizable residues (Lys, Arg, His, Asp and Glu) are involved in regulating most of a protein’s properties such as function, dynamics, structure and inter-molecular interactions. These residues represent 25% of the types in an average protein [68, 69]. In order to perform reliable in silico simulations, structure-based methodologies need to be able to properly predict the pKa values for these residues and determine their associated electrostatic energies. At a minimum, we should know what state (ionized, unionized) they are in. It is important to highlight that the binding of a proton represents the simplest reaction that a protein can undergo! Unfortunately, as described above, this is a quite difficult experimental problem that is exacerbated by the scale and complexity of biological systems. Thus, the computational estimation of amino acid pKas is not only an alternative tool where experimental measurements are not possible but, in many cases, the only tool for this purpose. Understanding these phenomena allows us to link structure and energetics and, consequently, structure and function, because such predictions are made by applying physics-based models of electrostatic forces to high-resolution structures of proteins [70].

Many studies have clearly demonstrated that prediction of pKa values is anything but straightforward [71-75]. While the pKas of surface residues are rather easy to reproduce, since they depend mostly on the dielectric properties of water (in which they are immersed) and on the flexibility at the protein-water interface, the pKas of internal residues are often anomalous and very different from the normal values in water [76, 77]. Nevertheless, even though early simulations were only able to provide estimations slightly better than the NULL model [78], which assumes that pKa values are equal to those of the isolated residues in water, more recent approaches provide estimations with errors near 1 pKa unit, similar to the protein-induced shifts [79].
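As a concrete picture of the NULL model baseline, the sketch below assigns every residue the pKa of its isolated form in water and scores the result with an RMSD. The reference pKa values are approximate, commonly quoted numbers that differ by a few tenths of a unit between published sets, and the "observed" values are invented purely for illustration; none of them are taken from the studies cited above.

```python
import math

# Approximate pKa values of isolated residues in water (illustrative only;
# published reference sets differ slightly).
NULL_MODEL_PKA = {
    "ASP": 3.8, "GLU": 4.5, "HIS": 6.5, "CYS": 9.0,
    "TYR": 10.0, "LYS": 10.5, "ARG": 12.5,
}


def null_model_prediction(residue_type):
    """NULL model: the predicted pKa ignores the protein environment entirely."""
    return NULL_MODEL_PKA[residue_type.upper()]


def rmsd(predicted, observed):
    """Root-mean-square deviation between two equally long lists of pKa values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Invented "experimental" values: one residue shifted down, one unshifted, one shifted up.
observed = [("ASP", 2.8), ("GLU", 4.5), ("HIS", 7.5)]
predictions = [null_model_prediction(res) for res, _ in observed]
null_rmsd = rmsd(predictions, [value for _, value in observed])
print(round(null_rmsd, 2))
```

Any structure-based method is only useful to the extent that its RMSD on the same residues falls below this baseline.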

In principle, methods for predicting pKas calculate the free energy term associated with the hypothetical transfer of the group of interest from the solvent to the protein, and add this term to the pKa of the same residue in that solvent, i.e., as in the following equation:

$$\mathrm{p}K_a(\text{protein}) = \mathrm{p}K_a(\text{solvent}) + \Delta\mathrm{p}K_a(\text{solvent} \rightarrow \text{protein})$$

The interactions made by the transferred residue with the surrounding environment and of course the pKa (solvent) term are basically electrostatic in nature. It follows that any method that aims to properly model and predict the behavior of ionizable residues in proteins must also properly calculate electrostatics and other relevant energy terms. Electrostatic effects can be determined by applying macroscopic frameworks, such as the Poisson-Boltzmann Equation (PBE) or Generalized Born (GB) equation, where the system is described as a continuum and all interactions are determined by solving the macroscopic electrostatic equations, or microscopic frameworks, which work at the atomic level of detail and calculate the thermodynamic properties by statistical averaging [80].
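The transfer relation above can be made numerical with the standard thermodynamic link between a free energy difference and a pKa shift, ΔpKa = ΔΔG / (RT ln 10). The sketch below assumes that ΔΔG is the deprotonation free energy in the protein minus that in the solvent, in kcal/mol; the function names and the example numbers are illustrative and do not come from any particular program.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol K)


def delta_pka_from_free_energy(ddg_deprotonation, temperature=298.15):
    """Convert a transfer free energy (kcal/mol) into a pKa shift.

    ddg_deprotonation is assumed to be
    dG_deprotonation(protein) - dG_deprotonation(solvent).
    A positive value means deprotonation is harder in the protein,
    so the pKa goes up.
    """
    return ddg_deprotonation / (R_KCAL_PER_MOL_K * temperature * math.log(10.0))


def pka_in_protein(pka_solvent, ddg_deprotonation, temperature=298.15):
    """pKa(protein) = pKa(solvent) + dpKa(solvent -> protein)."""
    return pka_solvent + delta_pka_from_free_energy(ddg_deprotonation, temperature)


# About 1.36 kcal/mol corresponds to one pKa unit at room temperature.
one_unit = delta_pka_from_free_energy(1.364)
# A buried acid destabilized by 2.73 kcal/mol in its charged form.
shifted = pka_in_protein(4.0, 2.73)
print(round(one_unit, 2), round(shifted, 2))
```

The conversion shows why errors of about 1 pKa unit correspond to free energy errors of little more than 1 kcal/mol, which is a demanding target for any electrostatic model.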

Poisson-Boltzmann methods were first adopted by Tanford and Kirkwood [81] for calculating the pKas of titratable groups in proteins even before the three-dimensional structure of any protein was available. This first macroscopic framework was termed the TK model and described the protein as an impenetrable sphere characterized by a low dielectric constant in the interior, where some ionizable groups are located, and by a high dielectric constant on the protein’s exterior. Following the growing knowledge of protein structure obtained in the 1960s and 1970s, through the development of biomacromolecular X-ray crystallography and high-field NMR techniques, adjustments were made in the model in 1974 [82] so that the actual accessibility of the titratable groups was considered. These developments made the calculation, as well as the experimental determination, of pKas easier and more accurate. Nonetheless, errors of over 1 pKa unit are quite normal with this model, particularly when the related titratable groups experience large shifts from the reference (normal) values.

Bashford and Karplus [83] delivered a significant advancement in 1990 by using more detailed structural information and solving the PBE with a finite difference method (FDPB). They observed that protein dynamics and side-chain flexibility significantly affect pKa values. Other enhancements to PBE in this period included deeper exploitation of information encoded in 3D-structures [84], better optimization of partial charges [85], the inclusion of both neutral and charged forms of each titratable residue in the calculation protocols [86] and a new multiple-site titration algorithm for the proper treatment of proteins bearing more than one titratable site [87]. Further innovations [78, 88, 89] were the development of methodologies that use dielectric constants to mimic protein flexibility. In particular, Karshikoff [88] assigned a different dielectric constant to each residue, using a combination of the FDPB and the Tanford-Roxby iterative procedures. Later, Nielsen and Vriend [90] showed that better pKa predictions could be obtained by introducing an explicit step for optimizing the hydrogen bond network into the FDPB. You and Bashford simulated the local flexibility of sidechains carrying polar protons through conformer ensembles where the proton positions were systematically varied [91]. Similarly, a Monte Carlo-based method for computing statistical averages over protonation states was applied to lysozyme, myoglobin and hemoglobin to extend the agreement between predicted and experimentally determined pKa data [92].
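The multiple-site titration problem mentioned above can be illustrated with a toy model in which every protonation microstate is enumerated exactly. This is a sketch under stated assumptions, not the algorithm of any cited work: each site has an intrinsic pKa, charged sites interact through a pairwise coupling given in units of kT, and the exhaustive sum stands in for the Monte Carlo sampling that becomes necessary when the number of sites is large. All names and numbers are hypothetical.

```python
import itertools
import math

LN10 = math.log(10.0)


def mean_protonation(ph, sites, couplings):
    """Average protonation of each site from an exact sum over 2**N microstates.

    sites:     list of (intrinsic_pka, charge_when_protonated); use 0 for an
               acid (charge -1 when deprotonated) and +1 for a base.
    couplings: dict {(i, j): w} with w the interaction (in kT) between unit
               charges on sites i and j; like charges repel when w > 0.
    """
    n = len(sites)
    partition = 0.0
    occupancy = [0.0] * n
    for state in itertools.product((0, 1), repeat=n):  # 1 = protonated
        charges = [q_prot - (1 - x) for x, (_, q_prot) in zip(state, sites)]
        # Free energy of the microstate in kT, relative to all sites deprotonated.
        energy = sum(x * LN10 * (ph - pka) for x, (pka, _) in zip(state, sites))
        energy += sum(w * charges[i] * charges[j] for (i, j), w in couplings.items())
        weight = math.exp(-energy)
        partition += weight
        for i, x in enumerate(state):
            occupancy[i] += x * weight
    return [value / partition for value in occupancy]


# One isolated acid at pH = pKa is half protonated.
single = mean_protonation(4.0, [(4.0, 0)], {})
# Two identical acids that repel when both are charged hold on to their protons.
coupled = mean_protonation(4.0, [(4.0, 0), (4.0, 0)], {(0, 1): 2.0})
print(single, coupled)
```

The coupled case shows the effect that motivates multiple-site treatments: the apparent pKa of each group depends on the state of its neighbors, so sites cannot be titrated one at a time.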

These results suggested that well-solvated residue sidechains are an important factor in modeling proton binding. The FDPB_MF rotamer repacking method [93], which samples side chain conformational space and calculates multibody protein-solvent interactions, provides accurate predictions of the effects on pKa of solvent exposure, ionic strength variations, structural reorganization and sequence mutation: over a subset of acidic residues extracted from five different proteins, the reported root mean square deviation (RMSD) was about 0.3 pKa units.

The Multi-Conformation Continuum Electrostatics (MCCE) approach combines FDPB electrostatic calculations with an explicit sampling of the different possible positions of side chains, hydrogens and ligands [94-96]. In its first application [95], MCCE predicted the pKas of 166 residues in 12 proteins with an RMSD of 0.83, and it is now one of the most commonly used methods for incorporating protein local flexibility in pKa prediction calculations. Among its capabilities, MCCE can be used to analyze the conformational effects associated with charge changes due to structural protein adjustments, the structural and ionization effects from variations in pH, and the packing of sidechain rotamers as a function of pH. A hybrid pKa method that applies MD simulations or ab initio techniques to represent the conformational ensembles for the ionized and neutral forms of titratable residues, and then submits these models to MCCE calculations, was recently reported [80]. Many programs and groups base their calculations on solutions of the PBE [97-104], which are obtained on volume-filling grids or by boundary element methods.

Generalized Born (GB) is another, but simpler, macroscopic framework method. The GB equation is an extension of the original Born approach [105] to multi-particle systems such as proteins, and is a faster method for describing and computing the solvation energies of organic molecules in solution. One of the first applications of GB was reported in 1990 by Still et al. [106] in a successful calculation of the solvation energies of small organic molecules that accounted for both electrostatic damping and solvation. After specific modifications for the study of macromolecules by Case and co-workers [107], the GB approximation accurately predicted most pKa shifts of titratable residues in lysozyme, myoglobin and bacteriorhodopsin, especially when local conformational changes were taken into account. Since then, many other tools based on the GB equation have been developed and successfully applied [108-116].

Microscopic frameworks have received greater attention in recent years because they are believed to be more accurate and to have a more solid theoretical basis, although they are more demanding and time consuming. The availability of faster and cheaper computational power is certainly a major factor in their increasing popularity. However, the effects of introducing necessary simplifications should be carefully evaluated because the theoretical content of the methods could be inadvertently compromised. Three classes of microscopic methods are in use. The first includes information from quantum mechanics (QM), the second includes information from extensive MD and the third is largely empirical.

Hybrid QM/MM methods, such as that of Jensen et al. [117], demonstrate that the combination of QM with a continuum treatment of the solvent presents a very accurate method for predicting and rationalizing pKa values of ionizable residues. Interestingly, pKa values were largely determined by hydrogen bonds rather than by long-range charge-charge interactions. QM/MM was able to predict pKas even for cases where traditional forcefield methods failed. The commonly observed errors arise from intramolecular interactions, from solvation energies or from not considering local changes in conformation. Another approach [118], based on the generalized solvent boundary potential (GSBP) method for dealing with electrostatic interactions under spherical boundary conditions, offers a finely tuned balance between QM/MM and MM/MM interactions. The effects of the bulk solvent and macromolecule atoms outside of the microscopic region are handled at the Poisson-Boltzmann level. The reported success of these GSBP-based QM/MM simulations [118] highlighted the importance of both properly treating electrostatic effects in hybrid calculations and avoiding commonly used truncation schemes based on extensive conformational sampling. Most significantly, this approach proved the validity of studying chemical events of very complex biomolecular systems with multi-scale frameworks.

A QM/MM-FEP (free energy perturbation) hybrid approach was used by Ghosh and Cui to calculate the pKa of residue 66 in two mutants (V66E and V66D) of Staphylococcal nuclease [119]. The potential used was SCC-DFTB/MM-GSBP [119]. Despite the observation of small local conformational adjustments in the V66E mutant (in particular, the Glu flipped out from the protein’s interior during titration), its predicted pKa was close to the experimental value. However, in the V66D case, the calculated pKa, due to a partial unfolding of a β-sheet region in the simulation, was significantly higher than observed [120]. It was suggested, in contrast to what had previously been stated [117, 121], that the lack of electronic polarization for the protein interior is unlikely to be a major source of error for pKa shift calculations and that a more important factor is the sampling achieved in the free energy simulations [122]. In particular, the use of enhanced sampling techniques that focus on structural transitions coupled to titration, e.g., Hamiltonian Replica Exchange [123], is suggested.

Molecular Dynamics simulations, where pH is included as an external parameter, have also been extensively adopted for residue-level pKa predictions. MD calculations allow the system to experience different protonation states as a function of changes in the chemical environment and in external pH. In 2002, van Gunsteren and co-workers [124] presented an algorithm that generates trajectories for a Boltzmann-distributed ensemble of protonation states by a combination of MD and Monte Carlo (MC) simulations. In particular, at each MC step, the protonation state of a titratable group is changed and the free energy difference of protonation/deprotonation is calculated by MD simulation in phase space by FEP. All reference free energy differences were determined with the thermodynamic integration (TI) method [125].

Constant pH Molecular Dynamics (CpHMD) [126] was first introduced by Mongan, Case and McCammon, implementing it with GB electrostatics, where protonation states are modeled with different charge sets, and sampling a Boltzmann distribution of protonation states that were generated by MC to titrate residues. Between MC steps, the system evolves according to GB-solvated MD, and at each MC step a titratable site and a new protonation state are randomly chosen and the transition free energy associated with that process is calculated. Then, these estimated energies are used to apply the Metropolis criterion and determine if the transition is to be accepted or refused. Close agreement between predicted and experimental pKa values was observed; nevertheless, the need for improving and accelerating conformational sampling was clear, as was the importance of properly coupling protonation and conformational adjustments.
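The Monte Carlo step described above can be sketched for a single titratable site. The transition free energy used here follows the form usually quoted for discrete constant pH schemes, kT ln10 (pH - pKa,ref) plus the difference between the electrostatic free energy change in the protein and in the reference compound; treat that expression, the sign convention (written for protonation) and all names as assumptions of this sketch. The MD propagation between MC steps is omitted entirely.

```python
import math
import random

KT_KCAL_300K = 0.0019872 * 300.0  # k_B * T in kcal/mol at 300 K


def transition_free_energy(ph, pka_ref, dg_elec, dg_elec_ref, protonating,
                           kt=KT_KCAL_300K):
    """Free energy (kcal/mol) of a trial change in protonation state.

    dg_elec and dg_elec_ref are the electrostatic free energy changes for
    protonating the site in the protein and in the reference compound.
    A deprotonation attempt is the reverse process, hence the sign flip.
    """
    dg_protonation = kt * math.log(10.0) * (ph - pka_ref) + (dg_elec - dg_elec_ref)
    return dg_protonation if protonating else -dg_protonation


def metropolis_accept(dg, kt=KT_KCAL_300K, rng=random):
    """Metropolis criterion: always accept downhill moves, otherwise exp(-dG/kT)."""
    if dg <= 0.0:
        return True
    return rng.random() < math.exp(-dg / kt)


def sample_protonated_fraction(ph, pka_ref, n_steps=20000, seed=0):
    """Titrate a reference compound (no environment term) by MC alone."""
    rng = random.Random(seed)
    protonated = True
    count = 0
    for _ in range(n_steps):
        dg = transition_free_energy(ph, pka_ref, 0.0, 0.0,
                                    protonating=not protonated)
        if metropolis_accept(dg, rng=rng):
            protonated = not protonated
        count += 1 if protonated else 0
    return count / n_steps


fraction = sample_protonated_fraction(ph=5.0, pka_ref=4.0)
print(round(fraction, 3))
```

For the reference compound the sampled fraction should approach the Henderson-Hasselbalch value 1/(1 + 10^(pH - pKa)), which is a convenient sanity check before an environment-dependent electrostatic term is added.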

Enhancements to CpHMD by Brooks and coworkers now allow ionizable groups to constantly switch between protonated and unprotonated forms [127, 128]. These methods are based on the λ dynamics approach for free energy calculations [129], and the GB implicit solvent model is used to calculate forces on spatial and titration coordinates. The latter allows rapid convergence of the pKas, which is not easily obtainable with explicit solvent models, and the analytical computation of forces on the titration coordinates. Although CpHMD has been demonstrated to provide accurate and robust predictions for pKa, and has introduced a new tool for studying pH-dependent protein dynamics and folding [130, 131], recent experiments have revealed that attractive electrostatic interactions are often overestimated [132, 133], and that even small errors in computing electrostatic solvation energies can alter the relative deprotonation free energies and, consequently, the pKa shift. Other errors typically arise from small distortions in the conformation or distribution of conformations and, given the GB model’s dependence on conformational sampling, CpHMD methods are generally inapplicable to polyionic systems such as nucleic acids.

Considering these limitations, Wallace and Shen [134] proposed an extension of CpHMD by using an explicit solvent representation able to capture a more dynamic view of the ionization equilibria at the protein interior. This methodology estimated the free energy of protein solvation using the GB implicit solvent model, and considered the system conformational dynamics with a more accurate explicit solvent model. In addition, to enhance both protonation and conformational state sampling, and to accelerate the sampling convergence, a pH-based replica exchange method was used. Promising results were obtained on a set of five different proteins: RMSD of 0.74 from the experimental data. In addition, protein native structures are preserved, a more realistic description of the conformational flexibility in the hydrophobic cluster is provided, and solvent mediated ion-pair interactions are more properly modeled. Thus, even if only a slight improvement of the target pKa prediction was achieved, the explicit solvent CpHMD is more physically realistic and allows the system to reach convergence after very short (1 ns) runs per replica.

By coupling accelerated MD (aMD) [135] with thermodynamic integration (TI), McCammon and coworkers showed that the number of conformational transitions was significantly increased, the sampling of internal degrees of freedom of the solute was enhanced, and both low and high-energy regions of the potential surface were efficiently sampled [136]. Once validated in this way, aMD was coupled with the constant pH methodology, substituting for the commonly used MD, to build CpHaMD [137]. CpHaMD uses the same GB implicit solvation, with MC sampling based on GB-derived energies, as in the standard approach [126]. However, better sampling of conformational space, faster convergence of constant pH simulations, and better agreement between predicted and experimental pKa values were observed. In particular, the method proved to be most relevant for determining the pKa shifts of more buried residues. Again, the importance of coupling protonation and conformational adjustment was shown. More recently, the accelerated MD approach was implemented in the framework of ab initio calculations [138].
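For readers unfamiliar with how aMD enhances sampling, the sketch below shows the commonly used boost potential, which raises the energy surface wherever the potential falls below a chosen threshold and leaves it untouched elsewhere. The functional form is the one usually associated with aMD, but the text above does not spell it out, so treat it, the parameter names and the example numbers as assumptions of this illustration.

```python
import math


def amd_boost(potential, threshold, alpha):
    """Boost energy added to the potential when it lies below the threshold.

    boost = (E - V)**2 / (alpha + E - V) for V < E, and 0 otherwise, with
    E the threshold energy and alpha a tuning parameter that controls how
    strongly energy wells are flattened.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if potential >= threshold:
        return 0.0
    gap = threshold - potential
    return gap * gap / (alpha + gap)


def reweighting_factor(boost, kt):
    """Weight exp(boost / kT) used to recover unbiased ensemble averages."""
    return math.exp(boost / kt)


# A frame 10 kcal/mol below the threshold with alpha = 10 kcal/mol.
boost_in_well = amd_boost(potential=-100.0, threshold=-90.0, alpha=10.0)
# A frame above the threshold is left untouched.
boost_above = amd_boost(potential=-80.0, threshold=-90.0, alpha=10.0)
print(boost_in_well, boost_above)
```

Because the boost is largest deep inside energy wells, barriers are effectively lowered, which is consistent with the increase in conformational transitions reported above.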

Baptista and coworkers proposed two other CpHMD methods in order to evaluate the complementarity of the MM/MD and PBE methods, which sample, respectively, different conformations at the same pH and different ionization states for the same conformation. With the implicit titration method, fractional protonation states are periodically updated by solving the PBE concurrently with the MD runs [139], while in the stochastic titration approach PBE and MC are used to generate discrete protonation states. A constant-(pH,E) MD method, which included the treatment of protonatable groups with hydrogen tautomerism [140] and redox groups [141], was also proposed. When applied to cytochrome c3 of Desulfovibrio vulgaris, better results were obtained when a high protein dielectric (εp) was assigned, probably due to high heme-heme interactions at low εp. This is in contrast to expectations, as standard methods do not predict a εp dependence [142].

Empirical methods are based either directly on functions whose parameters are obtained and optimized against experimental pKa values contained in large databases, or indirectly on related experimental physicochemical molecular properties.

PROPKA is an approach proposed by Li et al. [143] that calculates pKas in five different steps: 1) structure analysis/identification of ionizable groups; 2) determination of the position of backbone NH protons and sidechains of Asn, Gln, Trp, His and Arg residues possibly involved in H-bonds; 3) preliminary calculation of pKa values for ionizable groups based on the protonation states of other, more easily determined, ionizable residues; 4) iterative determination of all pKa values, usually with one to three iteration runs; 5) reporting of the predicted pKa values and specific ΔpKa terms, i.e., ΔpKGlobalDes correlating global desolvation pKa shifts to the number of excess protein atoms, ΔpKLocalDes correlating local desolvation pKa shifts to the local protein atoms, ΔpKSDC-HB describing pKa shifts due to sidechain hydrogen bonds, ΔpKBKB-HB describing pKa shifts due to backbone hydrogen bonds and ΔpKchg-chg describing the relationship between pKa shifts and the charge-charge interactions between buried residues. When applied to 233 carboxyls, 12 cysteines, 45 histidines and 24 lysines in various proteins, PROPKA was able to predict pKa values with an RMSD of 0.79, which is comparable to other PBE-based methodologies, but performed better for Asp, Glu and Cys residues with highly shifted pKas. Thus, the relationship between protein structure and the chemical properties of its ionizable groups was shown to be understandable and quantitatively predictable by means of a few simple empirical rules. Later versions [144] provide a direct link between the structure and pKa predictions by calculating the contribution from each titratable residue to the pH-dependent unfolding free energy at a given pH.
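The additive character of such an empirical scheme can be sketched in a few lines: a reference pKa is corrected by a sum of environment-dependent shift terms of the kinds listed in step 5. This is not PROPKA's code or parameterization. The term names only mirror the categories in the text, and every numerical value below is made up for illustration.

```python
def empirical_pka(model_pka, shifts):
    """Reference pKa plus the sum of environment-dependent shift terms."""
    return model_pka + sum(shifts.values())


# Made-up shift terms (in pKa units) for a hypothetical buried Asp residue.
example_shifts = {
    "global_desolvation": 1.2,   # burial raises the pKa of an acid
    "local_desolvation": 0.4,
    "sidechain_hbond": -0.8,     # hydrogen bonds stabilize the charged form
    "backbone_hbond": -0.3,
    "charge_charge": 0.5,
}

predicted = empirical_pka(3.8, example_shifts)
largest_term = max(example_shifts, key=lambda name: abs(example_shifts[name]))
print(round(predicted, 2), largest_term)
```

Keeping the terms separate, as in the dictionary above, is what makes the result interpretable: the individual contributions show which structural feature is responsible for a shifted pKa.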

Czodrowski concluded that a combination of different methodologies provides better results than using a single approach [145]. PROPKA and MCCE [96] (vide supra), recognized as the best performing tools in recent benchmarks [79, 146], were compared for five approved drugs for which the crystallographic protein-ligand complex has been solved. The recently introduced BIPS (Binary Protonation State) measure was used to translate pKas into discrete protonation states; i.e., only fully protonated or fully deprotonated residues are considered. Both methods were equally able to predict most of the sites, but the combination of the two approaches is recommended for a better and more nuanced evaluation of atypical protonation states. An alternative multi-task regression model approach was recently adopted by Skolidis et al. to improve pKa predictions. Here, data of related classes of compounds were mixed and classified. On a published set of 698 monoprotic compounds that were divided into fifteen different classes, this method, when compared to linear Gaussian regression models, performed best in 85% of all the experiments [147].

HINT Computational Titration (CT) [29, 148-150] is a protocol developed by Kellogg, Cozzini, Mozzarelli and coworkers that exhaustively explores the manifold of ionization state combinations available to a protein-ligand complex by building the full set of models, optimizing the orientations of the rotatable hydrogens (e.g., in -OH, -NH2, etc.) and scoring each model. The term “isocrystallographic” was introduced [29] to represent the notion that many molecular models including hydrogens can be fit into the electron density envelope provided by an X-ray crystallographic experiment, which, as noted above, does not generally observe the hydrogens. Also of interest, the Boltzmann-weighted energies calculated for the manifold of states may represent a better target quantity for correlating experimental and computational free energies. More discussion of HINT CT is provided below.
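The enumeration and Boltzmann weighting described above can be sketched generically. The code below is not the HINT implementation: the site names, the toy energy function standing in for a real scoring function, and the energy values are all hypothetical, and energies are assumed to be in kcal/mol with lower values more favorable.

```python
import itertools
import math

KT_KCAL = 0.593  # approximate k_B * T in kcal/mol near 298 K


def enumerate_models(site_states):
    """Yield every combination of ionization states as a dict {site: state}."""
    names = list(site_states)
    for combo in itertools.product(*(site_states[name] for name in names)):
        yield dict(zip(names, combo))


def boltzmann_average(energies, kt=KT_KCAL):
    """Boltzmann-weighted mean energy over a set of models."""
    lowest = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / kt) for e in energies]
    return sum(e * w for e, w in zip(energies, weights)) / sum(weights)


def toy_energy(model):
    """Hypothetical stand-in for a real scoring function."""
    energy = 0.0
    if model["ligand_amine"] == "protonated" and model["Asp_A"] == "deprotonated":
        energy -= 2.0  # salt bridge between ligand amine and carboxylate
    for site in ("Asp_A", "Asp_B"):
        if model[site] == "protonated":
            energy += 0.5  # small penalty for a neutral acid
    return energy


sites = {
    "ligand_amine": ("neutral", "protonated"),
    "Asp_A": ("deprotonated", "protonated"),
    "Asp_B": ("deprotonated", "protonated"),
}
models = list(enumerate_models(sites))
energies = [toy_energy(m) for m in models]
weighted = boltzmann_average(energies)
print(len(models), round(weighted, 2))
```

Three two-state sites already give eight models, and the count doubles with every additional site, which is why the exhaustive approach is practical mainly for the residues and ligand groups lining a binding site.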

MoKaBio is a tool recently developed by Cruciani and coworkers [151] to automatically identify the ionizable residues in a protein, generate the fingerprint of the surrounding environment and predict their relative pKas based on comparison to information deposited in a database of 434 experimental protein pKas. In preparing the database, protein structures were used to generate multisphere fingerprints (encoding the character, i.e., physicochemical descriptors and distances, of the surrounding chemical environment) for each ionizable residue for which an experimental pKa was available. For prediction, a fingerprint for each new ionizable site is generated and a set of similarity indices (SI) is calculated by comparing the fingerprint of the new residue with all database fingerprints. The pKa is predicted by considering both the pKas and the SIs of the ten most similar sites in the database. When applied to a test set of 117 residues extracted from 10 proteins, MoKaBio achieved an RMSD of 0.78, while PROPKA and MCCE gave RMSDs of 0.91 and 0.81, respectively. It is also of note that MoKaBio takes into account protein conformational flexibility by using multiple protein models and by averaging the predicted pKas over each model.
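The database lookup described above amounts to a nearest-neighbor prediction. The sketch below is not MoKaBio's algorithm: cosine similarity stands in for its similarity index, a similarity-weighted mean stands in for whatever combination rule it applies to the most similar sites, and the two-component fingerprints are hypothetical and assumed to be non-negative. Only the idea of using the ten most similar database sites comes from the text.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two fingerprint vectors (assumed non-negative)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_pka(query, database, k=10):
    """Similarity-weighted mean pKa of the k most similar database sites.

    database: list of (fingerprint, experimental_pka) pairs.
    """
    scored = sorted(
        ((cosine_similarity(query, fp), pka) for fp, pka in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(similarity for similarity, _ in scored)
    if total <= 0.0:
        raise ValueError("no database site resembles the query")
    return sum(similarity * pka for similarity, pka in scored) / total


# Toy database: one acid-like and one base-like environment.
toy_database = [((1.0, 0.0), 4.0), ((0.0, 1.0), 10.0)]
exact_match = predict_pka((1.0, 0.0), toy_database)
in_between = predict_pka((1.0, 1.0), toy_database)
print(exact_match, round(in_between, 2))
```

The sketch also makes the method's main limitation visible: a query whose environment resembles nothing in the database receives either no prediction or a poorly supported average.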

The pKa Cooperative was established to evaluate the strengths and weaknesses of existing approaches as a first step towards developing new algorithms or combining the strongest components of existing methodologies [80]. The cooperative has brought together a number of laboratories with expertise and interest in theoretical, computational and experimental studies of protein electrostatics with the primary aims of providing multiple responses to the urgent need for reliable and useful methods for structure-based calculation of pKa values and electrostatic energies in proteins, and of improving the understanding of the physics underlying electrostatic effects in proteins (see http://www.pkacoop.org) [70]. In the 1st Blind Prediction Challenge of 2009 [80], computational groups were asked to make blind predictions and reproduce an unpublished set of 100 pKa values of residues introduced by site-directed mutagenesis in the interior of Staphylococcal nuclease (SNase), where very large magnitude pKa shifts relative to their values in water were observed.

The Challenge tested and compared many computational approaches including: calculations based on empirical methods [152-154], calculations based on continuum electrostatic methods [155-160], calculations based on constant pH/MD [67, 161-163] and calculations based on other microscopic or semi-microscopic methods [120, 164]. In general, all methods gave comparable performances and none did significantly better than the others, thus suggesting that all have problems with their underlying physics [80]. Failures and pitfalls were commonly related to the use of single input protein structures, thus omitting micro and macro flexibility [165], to the lack of metrics for rigorous comparison of calculated and experimental data [166], and lastly to the need for methods that qualitatively describe and characterize the performance of pKa calculation algorithms.

In particular, MD-based methodologies encountered significant problems in modulating the protonation of buried residues because strong interactions between neighboring groups tend to persist for long times, resulting in slow and difficult convergence. Longer trajectories are usually too demanding and time consuming; thus, different strategies supporting the use of CpHaMD [136] were highlighted. Alternatively, enhanced sampling techniques such as replica exchange may help in reaching a faster and more stable convergence [167]. Empirical approaches such as MoKaBio [151] found difficulties in identifying enough similarity in the original training set for the introduced mutations, while PROPKA was not able to properly predict the conformational rearrangement following the introduction of a charged group into a hydrophobic environment [168].

In summary, better results for the empirical methods could likely be obtained by enlarging the training datasets with a diverse set of residues that include significantly shifted pKa values. Further attention should also be paid to the coupling between ionizable groups, since the ionizations of different residues, even if distant, mutually and strongly influence each other [169]. Also, conformational reorganization induced by the ionization of internal groups should be taken into account, since preliminary analyses have already shown that the ionization of internal groups can trigger local conformational reorganization [170-172]. Lastly, the effects of explicit water on pKa shifts should be better investigated.

From the Ligand’s Point of View

The ionization states of small molecules can strongly affect their ability to interact with their targets, their absorption, distribution, metabolism and excretion (ADME) profile and their rate and site of metabolism. Indeed, pKa is often considered an early indicator of possible ADME-related problems. pKa is also a commonly used descriptor in QSAR models [173]. Because in silico approaches are a valuable adjunct to experimental methods (they require no physical samples and their results can help in interpreting experiments), many developments towards improving pKa prediction methods have been undertaken in recent years. In the case of small molecules, it is possible to analytically solve the pKa prediction problem; methods in this category are considered ab initio approaches. A much broader category is represented by empirical approaches, which include linear free energy relationships (LFER), multiple linear regression (MLR), quantitative structure-activity (and structure-property) relationships (QSAR/QSPR), artificial neural networks (ANN), and many other methods.
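Why a ligand's pKa matters so much for its ionization state can be seen from the Henderson-Hasselbalch relation. The sketch below is a generic illustration for a single, isolated ionizable group; the function name `fraction_ionized` and its arguments are chosen here for illustration and do not come from any of the programs discussed.

```python
def fraction_ionized(pka, ph, kind):
    """Fraction of a monoprotic group that is charged at a given pH.

    kind = "acid": HA <-> A(-) + H(+); the charged form is A(-).
    kind = "base": BH(+) <-> B + H(+); the charged form is BH(+),
                   and pka refers to the conjugate acid BH(+).
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")
```

At pH equal to pKa the group is half ionized, and each further pH unit changes the ratio of the two forms tenfold, so an error of one pKa unit in a prediction can move a group from mostly neutral to mostly charged.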

Ab initio approaches [174-176] normally use thermodynamic cycles since deprotonation is easier to determine in the gas phase. Great accuracy, comparable to experimental determination, can be reached, but these calculations with “first principles” high-level theory are still computationally expensive. Common problems arise from the necessity of dealing with small datasets of structurally related compounds and from inherent conformational flexibility that often makes identifying the global energy minimum difficult [177]. Although many studies have been published [174, 176-186], ab initio, density functional theory and semi-empirical methods [187, 188] are, because of their high computational cost, generally not suitable for analyses of large data sets or virtual screening experiments.

Linear free energy relationship (LFER) methods are based on the existence of “a linear correlation between the logarithm of a rate constant or equilibrium constant for one series of reactions and the logarithm of the rate constant or equilibrium constant for a related series of reactions” [189]. According to the following equation, pKas are directly correlated to the Gibbs free energy:

$$-\Delta G^0 = RT \ln K_a$$


$$\mathrm{p}K_a = \frac{\Delta G^0}{2.303\,RT}$$

LFER methods were mostly used in the early stages of pKa prediction [190, 191], but still remain useful tools [192], as they are implemented in software packages such as ACD/pKa [193], Epik [194], Pallas/pKalc [195] and SPARC [196].
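The relation between pKa and the standard Gibbs free energy of deprotonation given above can be made concrete with a short numerical sketch. The function names and the default temperature of 298.15 K are choices made here for illustration; `math.log(10)` is used in place of the rounded factor 2.303.

```python
from math import log

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)

def pka_from_free_energy(delta_g_kj, temperature=298.15):
    """pKa from the standard free energy of deprotonation (kJ/mol)."""
    return delta_g_kj / (log(10.0) * R_KJ * temperature)

def free_energy_from_pka(pka, temperature=298.15):
    """Standard free energy of deprotonation (kJ/mol) from a pKa."""
    return pka * log(10.0) * R_KJ * temperature
```

At 298.15 K one pKa unit corresponds to about 5.7 kJ/mol (roughly 1.4 kcal/mol), which indicates how accurately a free energy must be computed to reach a given pKa accuracy.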

SPARC, in particular, uses LFER and Perturbed Molecular Orbital (PMO) theory for describing resonance, solvation, electrostatic and quantum effects. As described in publications [197, 198], SPARC fragments each molecule into a reaction center “C” and a perturber “P” which are, respectively, the smallest subunit that has the potential to ionize and lose a proton to the solvent, and the molecular structure appended to that reaction center. The pKa of a molecule is determined by the contributions of both C and P with the following equation:

$$\mathrm{p}K_a = (\mathrm{p}K_a)_C + \delta_P(\mathrm{p}K_a)_C$$

where (pKa)C corresponds to the ionization behavior of the reaction center and δP(pKa)C represents the change in ionization behavior induced by the perturber structure. In particular, the δP(pKa)C contributions are directly calculated by the program, in terms of potential mechanisms for the interaction of P and C, as the sum:

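$$\delta_P(\mathrm{p}K_a)_C = \delta_{\text{ele}}(\mathrm{p}K_a)_C + \delta_{\text{res}}(\mathrm{p}K_a)_C + \delta_{\text{solv}}(\mathrm{p}K_a)_C + \delta_{\text{H-bond}}(\mathrm{p}K_a)_C$$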

with δele(pKa)C, δres(pKa)C, δsolv(pKa)C and δH-bond(pKa)C accounting for, respectively, the differential electrostatic interactions, resonance, solvation and H-bonding of P with the protonated and deprotonated states of C [199]. These calculated microscopic ionization constants can then be used for determining macroscopic constants and related characteristics, the only limitation being the number of parameterized substituents and reaction centers, for which the related information needs to be experimentally determined. When applied to a set of 3685 compounds including multiprotic molecules with up to six centers, the model was quite accurate with an RMSD of 0.37.
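The two steps described above, summing perturber contributions onto a reaction-center pKa and combining microscopic constants into macroscopic ones, can be sketched as follows. This is not SPARC code: the function names are hypothetical, the perturbation values must be supplied by the user, and the micro-to-macro conversion shown is the standard textbook relation for a diprotic molecule with two ionizable sites.

```python
from math import log10

def perturbed_pka(pka_center, perturbations):
    """Reaction-center pKa plus the summed perturber contributions.

    perturbations: mapping such as {"ele": ..., "res": ..., "solv": ...,
    "hbond": ...}; the keys are illustrative labels only.
    """
    return pka_center + sum(perturbations.values())

def macro_pkas_diprotic(pk_a, pk_b, pk_ab, pk_ba):
    """Macroscopic pKa1 and pKa2 of H2A from its four microscopic pKs.

    pk_a, pk_b : first proton lost from site a or from site b.
    pk_ab      : second proton lost from site b after site a has ionized.
    pk_ba      : second proton lost from site a after site b has ionized.
    The thermodynamic cycle requires pk_a + pk_ab == pk_b + pk_ba.
    """
    pka1 = -log10(10.0 ** -pk_a + 10.0 ** -pk_b)  # Ka1 = k_a + k_b
    pka2 = log10(10.0 ** pk_ab + 10.0 ** pk_ba)   # 1/Ka2 = 1/k_ab + 1/k_ba
    return pka1, pka2
```

For two equivalent, non-interacting sites the macroscopic values differ from the microscopic one by ±log10(2), the purely statistical splitting, which is why microstates must be tracked explicitly for multiprotic compounds.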

QSPR (QSAR) methods are, in a sense, free form empirical tools where the first step is the collection of a large reference set of experimentally determined pKas, followed by the identification of descriptors that are able to correlate structural and pKa variations, and finally, the definition of a statistically validated relationship between experimental values and these descriptors. The first QSPRs for pKa values were published in the early 1940s [200, 201]. The predictive power of these methodologies is directly related to the quality of the correlation (linear or non-linear) between the chemical descriptors and the modeled pKa values [202]. The use of the term “quality” implies that the training set encompasses the range of compounds to be predicted and that, while the selection of descriptors can be unconstrained, they should in the end be physically or chemically relevant and meaningful.
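The three QSPR steps (reference set, descriptor, validated relationship) can be sketched in their simplest form: a one-descriptor linear model with a leave-one-out cross-validated q2. This is a generic illustration, not the procedure of any cited program; the function names are hypothetical and any data passed in must come from the user.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def q2_leave_one_out(x, y):
    """Cross-validated q2 = 1 - PRESS / SS_total, leaving out one compound."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict it.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total
```

A q2 much lower than the fitted r2 is the usual warning sign that the descriptor set describes the training compounds rather than the underlying chemistry.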

MoKa, for predicting the pKas of small organic molecules [203], was developed by the authors of MoKaBio (vide supra). MoKa is a 3D QSPR method where the descriptors are 3D Molecular Interaction Fields (MIFs) generated by GRID [8] using the topological distances to describe the molecular structure around each ionizable center. MoKa: 1) computes MIFs for a large database of fragments; 2) describes each new molecule with the pre-computed database MIFs; 3) describes the entire molecular structure by summing binned MIFs at each topological distance; and 4) calculates the statistical correlation between experimental pKas and molecular descriptors for each class of ionizable group. The use of pre-computed MIFs by MoKa overcomes the necessity of dealing with 3D structures, and pKa values are rapidly calculated from the 2D connectivity matrices. The methodology was developed, trained and cross-validated with a large set of 24617 pKa values and then tested on 28 novel compounds giving r2 = 0.85 and RMSD = 0.90. Not surprisingly, the prediction of pKas for novel groups not properly represented in the original fragment dataset was challenging.

A 3D QSAR method, Comparative molecular field analysis (CoMFA) [204], was also applied to pKa prediction [205-207], but the success of this approach was found to be extremely dependent on the conformations and alignments of the molecules analyzed, which is often the major drawback in 3D QSAR.

A decision tree approach using the SMARTS (SMiles ARbitrary Target Specification) language and the MDL MACCS (Molecular ACCess System) keys was developed by Crippen and colleagues [208]. The model, for the pKa prediction of monoprotic compounds, is based on SMARTS strings from a training set of 1693 compounds that define a decision tree where the leaf nodes provide pKa predictions. Unlike other methods, the model was derived by using only one training set from which a set of 139 optimized SMARTS descriptors was obtained. The validation with leave-some-out (10%) cross-validation showed a q2 of 0.91 and RMSD of 0.80, while analyses of an external test set produced r2 = 0.94 and RMSD = 0.68.

Artificial neural networks (ANNs) [209] have been applied in QSPR/QSAR modeling [210, 211], including pKa predictions, because they can often detect complex non-linear relationships in data. Kernel methods have also been used for approaching the ionization problem [212]. Recently, the dissociation constants of several compounds were predicted with the Iterative Similarity Optimal Assignment Kernel (ISOAK) [213], a kernel specifically designed for comparing molecular structure graphs that had been previously applied in prospective virtual screening. Later, when applied to 698 compounds [214], the ISOAK model performed similarly to the semiempirical frontier electron theory approach of Tehan and coworkers [188, 215], without requiring any structural optimization.

The Qσ inductive descriptor, introduced by Gasteiger and coworkers, models the inductive effect of neighboring atoms with respect to a central atom in a molecule [216]. It is determined as the sum of the partial atomic sigma charges qσ of all atoms within a fixed distance d, and can be considered a direct quantitative measure of the σ electron-withdrawing or donating ability of neighboring atoms for that central atom. The good correlation observed for aliphatic carboxylic acids between the new descriptor and the Taft σ* constants (r2 = 0.848, standard error = 0.436) demonstrated its applicability for measurements of the inductive effects of substituent groups, and provided an alternative approach for estimating σ* constants directly from molecular structures. In the proposed pKa model other descriptors accounting for accessibility of the central atom, accessibility and polarizability of the acidic oxygen atom in an acid, the π-electronegativity of the α-carbon in an acid and an indicator variable for α-amino acids were used. The analyses performed with 1122 aliphatic carboxylic acids yielded an RMSD of 0.42 and r2 = 0.81.

While only monoprotic molecules were originally considered, Jelfs et al. [217] extended the analyses to multiprotic molecules by developing an algorithm that applies multiple predictive models in order to reproduce the correct ionization order within a single compound. The adopted descriptors included semi-empirical properties such as partial charge and electrophilic superdelocalizability and information-based descriptors such as tree-structured fingerprints and 2D substructure flags. A simple algorithm assigns pKas in a stepwise manner to all the ionizable groups in a molecule, starting from that presumed to be most basic and ending with the most acidic group. However, errors were encountered when dealing with highly basic groups possessing similar pKa values, leading to the development of a more robust method that included a further comparison between the two most basic groups. RMSDs of 0.48 and 0.81 were obtained for the training and test sets, respectively. The two classes of descriptors complement one another well, resulting in predictive models for a variety of compound categories, i.e., alcohols, amidines, amines, anilines, carboxylic acids, guanidines, imidazoles, imines, phenols, pyridines and pyrimidines. The speed and accuracy of the model also led to the development of an associated web application [218].

Rupp and coworkers’ recent review [202] reports that, among the QSPR-based programs, the most accurate and widely used are ADMET Predictor [219], ADME Boxes [220], Marvin [221], MoKa [222], Pipeline Pilot [223] and OCHEM [224]. This review also nicely summarizes the advantages, problems, recent progress and challenges in pKa prediction for small molecules. Again, as reported for the protein case, many problems are related to conformational flexibility, since conformation may significantly affect the formation of intramolecular hydrogen bonds and thus impact the pKa of a compound, and the presence of more than one ionizable group on the same molecule, implying the necessity of considering multiple microstates. Last, but not least, the quality of pKa data, which is often not easily accessible in electronic form, needs to be improved. It is likely that the most significant improvements in pKa prediction could be achieved in ab initio calculation approaches and in statistical models based on kernel learning, although the interpretation of the physicochemical models of the latter is anything but simple. The combination of first principles-based methods with QSPR descriptors could also be a novel and productive approach, but a reliable combination has yet to be found [177].

THE HINT MODEL: WHERE WATER IS THE LEADING ACTOR

For a number of years we have been developing an interaction model and forcefield based on a rather simple premise: that the thermodynamically rich information encoded in experimental measurements of the log of the partition coefficient for 1-octanol/water (LogPo/w) can reveal a wealth of information concerning the biological environment. This model, HINT, has been reviewed previously [225-227], so we will not repeat the basic theory here. It is sufficient to state that the forces and interactions that drive a molecule to be soluble in one of these two solvents over the other one are the same forces and interactions that drive ligand binding, protein-protein associations, etc. Indeed, the logarithm of the ratio of solubilities in the two phases, P, is proportional to a free energy (ΔG) for solute transfer [228]. LogPo/w has been measured for decades as it is a uniquely useful physicochemical descriptor, and many computational methods have been developed to predict LogPo/w [227]. HINT exploits these data to describe all biological interactions. A particular strength is that HINT reveals and quantifies hydrophobic interactions within this free energy framework. Also, since the entire experiment is performed in water rather than in vacuo, solvation and desolvation, as well as related phenomena, are represented in a natural way. Thus, the prediction of water and water conservation, evaluation of ionization states, exploration of tautomers, etc. are accessible through a unified free energy framework within the HINT model.
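The link between the partition coefficient and a transfer free energy can be stated numerically. The sketch below uses the standard thermodynamic relation ΔG = −RT ln P for moving a solute from water into 1-octanol; it is not part of the HINT code, and the function name and default temperature are assumptions made for illustration.

```python
from math import log

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)

def transfer_free_energy(log_p, temperature=298.15):
    """Free energy (kJ/mol) of solute transfer from water to 1-octanol.

    log_p is the base-10 logarithm of the octanol/water partition
    coefficient; a positive log_p gives a negative (favorable) value.
    """
    return -R_KJ * temperature * log(10.0) * log_p
```

Each LogPo/w unit therefore corresponds to about 5.7 kJ/mol at 298.15 K, favoring the octanol phase when positive and the water phase when negative.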

As noted above, we coined a term – isocrystallographic – in our publication on the computational titration (CT) of an HIV-1 protease-peptide complex [29]. In effect, any alternate molecular model that fits within the experimental electron density envelope of an X-ray crystal structure is isocrystallographic with the published structure model. This is simply an acknowledgment that many subtle and not so subtle chemical details, especially involving protons, are not visible in macromolecular crystallography and must be inferred from other experiments, treated with default protocols (e.g., acidic and basic residues are protonated in their “pH 7” form), or assigned through somewhat questionable assumptions (e.g., all isolated electron density peaks “must” be water molecules). In addition, Yvonne Martin has recently highlighted the necessity of also considering tautomers, present in about a quarter of drug-like compounds, in computer-aided drug discovery [229]. Thus, envisioning singular models for protein-ligand, protein-protein, etc., complexes is deceptively satisfying. The “true” situation is much more complex, especially since biology takes place at temperatures far above liquid nitrogen temperatures where most X-ray crystallography is performed. Clearly, this is another consideration beyond structure flexibility in building molecular models for drug discovery or chemical biology, since these models must represent many energetically relevant and populated “states” in an intuitive and accessible way [230].

Hydrophobic Interactions

While it is not a force, and “hydrophobic bonds” are definitely not created, the hydrophobic effect is a real phenomenon of crucial importance for life. It is a consequence of the biomolecular ensemble preferentially exposing more polar parts to the water (solvent), with which they can make hydrogen bonds. In turn, the hydrophobic parts of the ensemble are pushed together, thus giving the illusion of the hydrophobic force [231]. As a further consequence, previously coordinated water may be released to bulk, which increases the entropy of the ensemble. To date, no quantum mechanics calculation or first principles molecular mechanics force field has modeled the hydrophobic effect or its entropic contribution to free energy. However, molecular dynamics simulations with explicit water have the potential to reach this goal [33, 232-234]. The HINT model views all biomolecular interactions empirically and we postulate that the water and 1-octanol solvents, respectively, are representative of polar and hydrophobic regions in biomacromolecules. A long list of studies using HINT has shown that the HINT score correlates very well with experimental free energy of binding and/or gives valuable insight into the biomolecular system [24, 29, 148-150, 235-242].

We also suspected that this information is potentially valuable in macromolecular X-ray crystallography, especially in cases where the data resolution is poor (e.g., > 3.0 Å). We tested this [243] by incorporating a HINT term within the CNS [244] refinement target function, and evaluating the resulting structure models with a suite of commonly used metrics such as Ramachandran scores and MolProbity clash scores [245]. We examined 25 high-resolution (≤ 1.5 Å) structures and simulated low-resolution datasets [246]; the results are shown in Figure 2. Ideally, the normalized intramolecular HINT score (Figure 2a) would maintain a value of 1 as resolution decreases – if the intramolecular interaction networks are conserved. For both native CNS and CNS incorporating its optional electrostatic term, this is not the case (after ~3.0-3.5 Å), but the CNS+HINT result does show the desired trend. The Ramachandran plot (Figure 2b) indicates that, even at simulated resolution > 4.0 Å, 83% of residues are in favored regions with CNS+HINT vs. 72% with native CNS. Clash scores (Figure 2c) are also dramatically improved. Finally (Figure 2d), the HINT force field has a larger cumulative effect on maintaining/modeling hydrophobic networks than on polar networks, although HINT performs as well as the CNS+electrostatics protocol on the latter.

Prediction of Water and the HINT Relevance Metric

The water Relevance metric [25] was initially conceived for the situation of a ligand binding into a protein pocket and displacing one or more water molecules, while leaving others behind with a variety of roles. Its robustness and the relative ease with which it could be calculated suggested that Relevance might be able to reveal useful information about water molecules in other environments.

One of the long-standing questions regarding protein-protein interactions has been related to the many water molecules found at these interfaces. Some progress is notable: Baker and colleagues have shown [36, 247] that designing protein-protein interfaces with discrete water molecules can be a critical feature; Janin has examined interface water [248, 249] in modeling protein-protein structure and interactions; Nussinov has shown [250] that water provides alternative strategies for proteins to optimally associate; and Pisabarro examined solvent at protein-protein interfaces [251, 252] in order to improve protein contact predictions. However, most tools for predicting positions of waters, while quite successful in protein-ligand systems as described above, have rarely been used in conjunction with protein-protein docking [253] or for guiding protein-protein associations [254]. Developing an understanding of hydrated protein-protein interfaces is a significant goal with wide-ranging implications.

With this in mind, we applied [35] Relevance to a large data set of protein-protein interfaces in 179 high-resolution X-ray crystal structures of protein-protein complexes from the PDB. There were 4741 unique waters, selected by being within 4.0 Å of both proteins of the complex. We showed that only about 21% of the waters were truly bridging, i.e., being Relevant with respect to both proteins, while 26% appeared to be only trapped at the interface, i.e., Relevant with respect to neither protein. This latter unfavorable interaction motif, which we termed a hydrophobic bubble [35] when the water was in a hydrophobic environment, is conserved – it was observed for 69% of the water molecules within (i.e., with solvent-accessible surface areas ≤ 10 Å2) the interface. It may be that some instability is required in protein-protein interfaces to ensure dynamic associations. For example, the role of some waters found at the interface in the colicin E9 endonuclease-immunity protein 2 structure (1.77 Å) was described as “aggravating” the interaction [36] and it was proposed [255] that water is a “lubricant” in protein folding and interaction.
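The bookkeeping behind such an interface census can be sketched as below. The sketch assumes that each water already carries a Relevance value with respect to each of the two proteins and that a user-supplied cutoff decides whether it counts as Relevant; the cutoff, the label "one-sided" for waters Relevant to only one protein, and the function names are hypothetical and are not taken from the HINT software.

```python
from collections import Counter

def classify_water(relevance_a, relevance_b, threshold):
    """Label one interface water from its Relevance to proteins A and B."""
    relevant_a = relevance_a >= threshold
    relevant_b = relevance_b >= threshold
    if relevant_a and relevant_b:
        return "bridging"   # Relevant with respect to both proteins
    if relevant_a or relevant_b:
        return "one-sided"  # Relevant with respect to only one protein
    return "trapped"        # Relevant with respect to neither protein

def interface_census(waters, threshold):
    """Fraction of waters in each class; waters is a list of (rel_a, rel_b)."""
    if not waters:
        raise ValueError("at least one water is required")
    counts = Counter(classify_water(a, b, threshold) for a, b in waters)
    labels = ("bridging", "one-sided", "trapped")
    return {label: counts[label] / len(waters) for label in labels}
```

Separating the "trapped" class explicitly is what allows the unfavorable waters discussed above to be counted rather than silently discarded.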

We have also expanded our algorithm for generating water solvent arrays around proteins or in binding pockets to use Relevance-based criteria to predict the locations of water molecules computationally [256]. It is especially noteworthy that the Relevance metric does not require (experimentally-determined) crystallographic data like B-factors in contrast to some of the other tools described above. However, the fact that as many as one-in-four waters are found in energetically unfavorable positions suggests that such predictions will be challenging.

Computational Titration and Ligand Tautomerization

With our CT algorithm [29, 148-150, 257], we have illustrated the multiplicity of states due to ionizations, have incorporated the energetic contributions of Relevant water molecules, and suggested that a Boltzmann-like treatment of the ensemble may lead to more realistic interaction energy estimates. This algorithm enumerates the potential combinations of ionization states for a complex or interface, but is subject to combinatorial explosion. For example, if one were considering the case of a carboxylic acid interacting with an amine, there would be six possibilities: 1) neither moiety protonated; 2) only one oxygen of the acid protonated; 3) only the other oxygen of the acid protonated; 4) only the amine protonated; 5) the amine and the first oxygen of the acid both protonated; and 6) the amine and the second oxygen of the acid both protonated. Clearly, with ten or more protonatable groups in a system, e.g., at a protein-protein interface, there could be many millions of combinations. We are implementing machine learning, genetic algorithmic and multi-processor-enabled tools to facilitate this calculation. It is important to note, however, that we need to generate a reasonably sized sample that includes highly relevant (energetically more accessible) states along with less accessible states in order to calculate the weighted ensemble energies, which we suggest are more useful estimates of experimental binding energies.
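The enumeration and weighting steps described above can be sketched in a few lines. This is not the CT implementation: the state labels, the function names and the use of a plain Boltzmann average over user-supplied energies (in kcal/mol) are assumptions made for illustration.

```python
from itertools import product
from math import exp, prod

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def count_states(groups):
    """Number of combinations without enumerating them."""
    return prod(len(states) for states in groups.values())

def enumerate_states(groups):
    """Yield every combination of per-group protonation states."""
    names = list(groups)
    for combo in product(*(groups[name] for name in names)):
        yield dict(zip(names, combo))

def boltzmann_weighted_energy(energies, temperature=298.15):
    """Boltzmann-weighted mean of the state energies (kcal/mol)."""
    rt = R_KCAL * temperature
    lowest = min(energies)  # shift energies to avoid overflow in exp()
    weights = [exp(-(e - lowest) / rt) for e in energies]
    return sum(w * e for w, e in zip(weights, energies)) / sum(weights)

# The carboxylic acid / amine example from the text: 3 x 2 = 6 states.
acid_amine = {
    "acid": ["deprotonated", "O1 protonated", "O2 protonated"],
    "amine": ["neutral", "protonated"],
}
```

Because `count_states` grows as the product of the per-group state counts, it can be used to decide beforehand whether full enumeration is feasible or whether a sampled subset of states is needed.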

We are also extending CT to include more effects, e.g., additional acid/base functionalities (like sulfates and phosphates), a more extensive and sophisticated pKa penalty function library, and ligand tautomerism [229, 258, 259], in order to build more comprehensive and accurate state ensembles. The latter deserves more attention. Tautomerism naturally fits into the CT protocol, with the only caveat being that enumerating possible ligand tautomers, of which there can be many, adds more microstate combinations to the manifold. The other technical issue is the de novo identification of molecules that are subject to tautomerism and evaluating the relative internal energies (penalties and rewards) of the various forms. This work is ongoing [260]. The importance of this problem can be illustrated with the pterins (2-aminopteridin-4(3H)-one), which function as cofactors in enzyme catalysis. There are multiple tautomers (Figure 3a) of which five are reasonably energetically accessible. Of particular interest is that the lowest energy conformation of these is not the one observed in complex (Figure 3b) with the ricin A chain [229, 261]. In our protocol, tautomers are detected by tree searches of ligand structures using a SMARTS-like approach – with specific mol2 atom types – and are then enumerated and scored in situ. HINT analysis of pterin shows the same energetic trends as others have reported [261, 262], but at a very low computational cost. Clearly, a virtual screening or docking protocol that does not examine the possibility of tautomerization will likely yield flawed results.

SUMMARY AND CONCLUSIONS

The review above showed that much substantive progress is being made in the ability to computationally predict water conservation and the pKas of protein and ligand functional groups in the biological milieu. We have not attempted to rank or rate the various approaches, as it is quite obvious that most – if not all – have strengths and weaknesses that are situation dependent. Ab initio approaches suffer from more difficult and time-consuming calculations, but are not compromised by the quality or extent of training data, unlike the empirical and QSPR approaches.

We do have a prejudice for “unified” approaches that couple all of these related problems into a single framework. Hence, our continued development of HINT and its ancillary tools for predicting water placement, water Relevance, computational titration, ligand tautomerism and, most importantly, hydropathic networks. One result that has come through loudly is that the search for a single molecular model that is the “true” one is likely to be fruitless. There are many molecular models that are isocrystallographic under the conditions of the diffraction experiment, but many many more at the temperatures at which biology happens! In this sense, we are lukewarm on the idea of using scoring functions to exactly reproduce a crystal structure – although there is certainly value in being close. The real world problem for drug discovery, understanding biology, etc. is more related to getting the energetics correct. With that shift in perspective, some problems change their importance. For example, a bridging water acting as a donor with respect to two otherwise repulsive acceptors, e.g., a pair of carboxylates, may assume a large number of positions in space with exactly the same stabilizing energetic effect. Likewise, does it matter which is protonated when the two oxygens of a carboxylic acid are equidistant to a strong acceptor? So, to answer the question of the title, “do correct protonation states and relevant waters = better computational simulations?”, the answer is … maybe. Certainly, cognizance of these effects is crucial, but their level of representation should correspond to the level of detail needed in the result. One shouldn’t perform an all atom, explicit solvent, multiple nanosecond timescale MD simulation when a lesser computation, employing only the handful of waters that are truly important, would suffice.

The growing availability of CPU cycles for computational biology research is a boon for understanding. The next one to two decades will likely unravel some of the most significant problems in structure solution, exploitation, manipulation and, indeed, prediction. Less than a decade ago, few believed that the G-protein-coupled receptors would ever be crystallized. Now, several unique and exciting structures are released every year and the pace is increasing. Few can say what new disease treatments will come from just these structures!

LIST OF ABBREVIATIONS

ADME = Absorption, Distribution, Metabolism and Excretion
aMD = accelerated Molecular Dynamics
ANN = Artificial Neural Networks
CoMFA = Comparative Molecular Field Analysis
CpHaMD = Constant pH accelerated Molecular Dynamics
CpHMD = Constant pH Molecular Dynamics
CSD = Cambridge Structural Database
CT = Computational Titration
FDPB = Finite Difference Poisson-Boltzmann
FEP = Free Energy Perturbation
GB = Generalized Born
GSBP = Generalized Solvent Boundary Potential
IFST = Inhomogeneous Fluid Solvation Theory
ISOAK = Iterative Similarity Optimal Assignment Kernel
LFER = Linear Free Energy Relationships
MACCS = Molecular ACCess System
MC = Monte Carlo
MCCE = Multi-Conformation Continuum Electrostatics
MD = Molecular Dynamics
MIFs = Molecular Interaction Fields
MLR = Multiple Linear Regression
MM = Molecular Mechanics
NMR = Nuclear Magnetic Resonance
PBE = Poisson-Boltzmann Equation
PMO = Perturbed Molecular Orbital
QM = Quantum Mechanics
QSAR = Quantitative Structure Activity Relationships
QSPR = Quantitative Structure Property Relationships
RMSD = Root Mean Square Deviation
SMARTS = SMiles ARbitrary Target Specification
STOW = Solvation Thermodynamics of Ordered Water

ACKNOWLEDGMENTS

G.E.K. would like to acknowledge his colleague, Dr. Neel Scarsdale, for his long-term partnership in bringing many of these ideas to fruition, and his group members, past and present, who have contributed in many ways to the development and promotion of HINT. Drs. Donald J. Abraham (VCU) and Andrea Mozzarelli (Parma) have had major roles in these projects from the beginning.

Figure 1. 3D Representation of the Ligand Binding Pocket of HIV-1 Protease (PDB code: 1HIH) [263]. The surface defining the volume of the cavity is represented in gray, the protein is represented in ribbon style, Ile50 and the ligand are represented in stick style (color coded by atom type), and water molecules are represented as red balls that define their occupied space. Wat301 is located at the center of the cavity (near Ile50), while the other relevant waters, Wat313, Wat313bis and the symmetric Wat313’ and Wat313bis’, are located at the top of the pocket. All other waters and some portions of the protein have been deleted for clarity.

Figure 2. HINT-Assisted X-ray Refinement [243]. Low-resolution data were simulated and the resulting models were refined for 25 high-resolution PDB structures using CNS. a) Intramolecular HINT score for refined models – only with HINT is the quality maintained through low resolutions; b) Ramachandran scores (fraction of residues in favored regions); c) MolProbity clash scores (note that other metrics, e.g., RMSD and Rfree, also indicate that low-resolution structures refined with HINT are of higher quality and more similar to the high-resolution models); and d) the quality is maintained by both polar and hydrophobic portions of the HINT forcefield, and the optional electrostatics term of CNS is of marginal value – really only shifting scores to more positive values at all resolutions.

Figure 3. Tautomerism. a) Pterin presents multiple tautomers that differ in hydrogen position; five are shown; b) tautomers 1 and 3 are shown as docked to the ricin A-chain; and c) mol2 representations in a SMARTS-like formula for the five tautomers together with five measures of their energetics. Egas, their calculated internal molecular energies using a 6-31G** basis set [261], Ggas, their relative free energies from MP2/aug-cc-pVDZ calculations [262], and Evac, their vacuum enthalpies from the Tripos forcefield (ε=80), all show that tautomer 1 is the most stable; however, for the bound state, Gint, Sybyl net interaction energies [261], and Gbind, HINT-calculated binding scores (translated by the usual factor of 515 score units = 1 kcal mol⁻¹), indicate that tautomer 3 is more stable. As suggested by Yan et al. [261] and illustrated in b, above, tautomer 3 is able to form more hydrogen bonds than 1 in its bound state.
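The score-to-energy translation quoted in the Figure 3 caption can be written out as a one-line conversion. The sketch below is only illustrative: the function name is hypothetical, and the negative sign rests on the assumption that more positive HINT scores denote more favorable interactions, so that a higher score maps to a more negative free energy.

```python
HINT_UNITS_PER_KCAL = 515.0  # factor quoted in the Figure 3 caption


def hint_score_to_kcal(score, units_per_kcal=HINT_UNITS_PER_KCAL):
    """Translate a HINT score into an approximate energy in kcal/mol.

    Assumption of this sketch: a more positive score is more favorable,
    hence the sign flip. Only the scaling factor comes from the caption.
    """
    return -score / units_per_kcal


# A hypothetical score advantage of 1030 units for one tautomer over another
print(hint_score_to_kcal(1030.0))  # -2.0
```

On this scale, a difference of a few hundred score units between two bound tautomers corresponds to less than 1 kcal mol⁻¹.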

REFERENCES




[1] Simonson T, Brooks CL, 3rd. Charge Screening and the Dielectric Constant of Proteins: Insights from Molecular Dynamics. J Am Chem Soc 1996; 118: 8452‐8458.

[2] Li Z, Lazaridis T. Computing the thermodynamic contributions of interfacial water. Methods Mol Biol 2012; 819: 393‐404.

[3] Klebe G. Virtual ligand screening: strategies, perspectives and limitations. Drug Discov Today 2006; 11: 580‐594.

[4] Teyra J, Pisabarro MT. Characterization of interfacial solvent in protein complexes and contribution of wet spots to the interface description. Proteins 2007; 67: 1087‐1095.

[5] Santos R, Hritz J, Oostenbrink C. Role of water in molecular docking simulations of cytochrome P450 2D6. J Chem Inf Model 2010; 50: 146‐154.

[6] Nucci NV, Pometun MS, Wand AJ. Site‐resolved measurement of water‐protein interactions by solution NMR. Nat Struct Mol Biol 2011; 18: 245‐249.

[7] Li S, Su Y, Luo W, Hong M. Water‐protein interactions of an arginine‐rich membrane peptide in lipid bilayers investigated by solid‐state nuclear magnetic resonance spectroscopy. J Phys Chem B 2010; 114: 4063‐4069.

[8] Goodford PJ. A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J Med Chem 1985; 28: 849‐857.

[9] Kellogg GE, Fornabaio M, Chen DL, Abraham DJ. New application design for a 3D hydropathic map based search for potential water molecules bridging between protein and ligand. Internet Electr J Mol Design 2005; 4: 194‐209.

[10] Pierce BG, Hourai Y, Weng Z. Accelerating protein docking in ZDOCK using an advanced 3D convolution library. PLoS One 2011; 6: e24657.

[11] Csermely P. Water and cellular folding processes. Cell Mol Biol (Noisy‐le‐grand) 2001; 47: 791‐800.

[12] Amadasi A, Spyrakis F, Cozzini P, Abraham DJ, Kellogg GE, Mozzarelli A. Mapping the energetics of water‐protein and water‐ligand interactions with the "natural" HINT forcefield: predictive tools for characterizing the roles of water in biomolecules. J Mol Biol 2006; 358: 289‐309.

[13] Li Z, Lazaridis T. Thermodynamic contributions of the ordered water molecule in HIV‐1 protease. J Am Chem Soc 2003; 125: 6636‐6637.

[14] Lu Y, Yang CY, Wang S. Binding free energy contributions of interfacial waters in HIV‐1 protease/inhibitor complexes. J Am Chem Soc 2006; 128: 11830‐11839.

[15] Pillai B, Kannan KK, Hosur MV. 1.9 Å x‐ray study shows closed flap conformation in crystals of tethered HIV‐1 PR. Proteins 2001; 43: 57‐64.

[16] Miller M, Schneider J, Sathyanarayana BK, Toth MV, Marshall GR, Clawson L, Selk L, Kent SB, Wlodawer A. Structure of complex of synthetic HIV‐1 protease with a substrate‐based inhibitor at 2.3 Å resolution. Science 1989; 246: 1149‐1152.

[17] Erickson J, Neidhart DJ, VanDrie J, Kempf DJ, Wang XC, Norbeck DW, Plattner JJ, Rittenhouse JW, Turon M, Wideburg N, et al. Design, activity, and 2.8 Å crystal structure of a C2 symmetric inhibitor complexed to HIV‐1 protease. Science 1990; 249: 527‐533.

[18] Swain AL, Miller MM, Green J, Rich DH, Schneider J, Kent SB, Wlodawer A. X‐ray crystallographic structure of a complex between a synthetic protease of human immunodeficiency virus 1 and a substrate‐based hydroxyethylamine inhibitor. Proc Natl Acad Sci U S A 1990; 87: 8805‐8809.

[19] Jaskolski M, Tomasselli AG, Sawyer TK, Staples DG, Heinrikson RL, Schneider J, Kent SB, Wlodawer A. Structure at 2.5‐Å resolution of chemically synthesized human immunodeficiency virus type 1 protease complexed with a hydroxyethylene‐based inhibitor. Biochemistry 1991; 30: 1600‐1609.

[20] Abdel‐Meguid SS, Metcalf BW, Carr TJ, Demarsh P, DesJarlais RL, Fisher S, Green DW, Ivanoff L, Lambert DM, Murthy KH, et al. An orally bioavailable HIV‐1 protease inhibitor containing an imidazole‐derived peptide bond replacement: crystallographic and pharmacokinetic analysis. Biochemistry 1994; 33: 11671‐11677.

[21] Wlodawer A, Vondrasek J. Inhibitors of HIV‐1 protease: a major success of structure‐assisted drug design. Annu Rev Biophys Biomol Struct 1998; 27: 249‐284.

[22] Lam PY, Jadhav PK, Eyermann CJ, Hodge CN, Ru Y, Bacheler LT, Meek JL, Otto MJ, Rayner MM, Wong YN, et al. Rational design of potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science 1994; 263: 380‐384.

[23] Schaal W, Karlsson A, Ahlsen G, Lindberg J, Andersson HO, Danielson UH, Classon B, Unge T, Samuelsson B, Hulten J, Hallberg A, Karlen A. Synthesis and comparative molecular field analysis (CoMFA) of symmetric and nonsymmetric cyclic sulfamide HIV‐1 protease inhibitors. J Med Chem 2001; 44: 155‐169.

[24] Fornabaio M, Spyrakis F, Mozzarelli A, Cozzini P, Abraham DJ, Kellogg GE. Simple, intuitive calculations of free energy of binding for protein‐ligand complexes. 3. The free energy contribution of structural water molecules in HIV‐1 protease complexes. J Med Chem 2004; 47: 4507‐4516.

[25] Amadasi A, Surface JA, Spyrakis F, Cozzini P, Mozzarelli A, Kellogg GE. Robust classification of "relevant" water molecules in putative protein binding sites. J Med Chem 2008; 51: 1063‐1067.

[26] Smith R, Brereton IM, Chai RY, Kent SB. Ionization states of the catalytic residues in HIV‐1 protease. Nat Struct Biol 1996; 3: 946‐950.

[27] Wang YX, Freedberg DI, Grzesiek S, Torchia DA, Wingfield PT, Kaufman JD, Stahl SJ, Chang CH, Hodge CN. Mapping hydration water molecules in the HIV‐1 protease/DMP323 complex in solution by NMR spectroscopy. Biochemistry 1996; 35: 12694‐12704.

[28] Louis JM, Dyda F, Nashed NT, Kimmel AR, Davies DR. Hydrophilic peptides derived from the transframe region of Gag‐Pol inhibit the HIV‐1 protease. Biochemistry 1998; 37: 2105‐2110.

[29] Spyrakis F, Fornabaio M, Cozzini P, Mozzarelli A, Abraham DJ, Kellogg GE. Computational titration analysis of a multiprotic HIV‐1 protease‐ligand complex. J Am Chem Soc 2004; 126: 11764‐11765.

[30] Roberts BC, Mancera RL. Ligand‐protein docking with water molecules. J Chem Inf Model 2008; 48: 397‐408.

[31] Huang N, Shoichet BK. Exploiting ordered waters in molecular docking. J Med Chem 2008; 51: 4862‐4865.

[32] Birch L, Murray CW, Hartshorn MJ, Tickle IJ, Verdonk ML. Sensitivity of molecular docking to induced fit effects in influenza virus neuraminidase. J Comput Aided Mol Des 2002; 16: 855‐869.

[33] de Beer SB, Vermeulen NP, Oostenbrink C. The role of water molecules in computational drug design. Curr Top Med Chem 2010; 10: 55‐66.

[34] Ball P. Water as an active constituent in cell biology. Chem Rev 2008; 108: 74‐108.

[35] Ahmed MH, Spyrakis F, Cozzini P, Tripathi PK, Mozzarelli A, Scarsdale JN, Safo MA, Kellogg GE. Bound water at protein‐protein interfaces: partners, roles and hydrophobic bubbles as a conserved motif. PLoS One 2011; 6: e24712.

[36] Meenan NA, Sharma A, Fleishman SJ, Macdonald CJ, Morel B, Boetzel R, Moore GR, Baker D, Kleanthous C. The structural and energetic basis for high selectivity in a high‐affinity protein‐protein interaction. Proc Natl Acad Sci U S A 2010; 107: 10080‐10085.

[37] Lu Y, Wang R, Yang CY, Wang S. Analysis of ligand‐bound water molecules in high‐resolution crystal structures of protein‐ligand complexes. J Chem Inf Model 2007; 47: 668‐675.

[38] Tame JR, Murshudov GN, Dodson EJ, Neil TK, Dodson GG, Higgins CF, Wilkinson AJ. The structural basis of sequence‐independent peptide binding by OppA protein. Science 1994; 264: 1578‐1581.

[39] Sleigh SH, Seavers PR, Wilkinson AJ, Ladbury JE, Tame JR. Crystallographic and calorimetric analysis of peptide binding to OppA protein. J Mol Biol 1999; 291: 393‐415.

[40] Quiocho FA, Wilson DK, Vyas NK. Substrate specificity and affinity of a protein modulated by bound water molecules. Nature 1989; 340: 404‐407.

[41] Davis AM, Teague SJ, Kleywegt GJ. Application and limitations of X‐ray crystallographic data in structure‐based ligand and drug design. Angew Chem Int Ed Engl 2003; 42: 2718‐2736.

[42] Raymer ML, Sanschagrin PC, Punch WF, Venkataraman S, Goodman ED, Kuhn LA. Predicting conserved water‐mediated and polar ligand interactions in proteins using a K‐nearest‐neighbors genetic algorithm. J Mol Biol 1997; 265: 445‐464.

[43] Punch WF, Goodman ED, Pei M, Chia‐Shun L, Hovland P, Enbody R. Further research on feature selection and classification using genetic algorithms. Proceedings of the International Conference on Genetic Algorithms; 1993.

[44] Overington J, Johnson MS, Sali A, Blundell TL. Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc Biol Sci 1990; 241: 132‐145.

[45] Kuhn LA, Swanson CA, Pique ME, Tainer JA, Getzoff ED. Atomic and residue hydrophilicity in the context of folded protein structures. Proteins 1995; 23: 536‐547.

[46] Garcia‐Sosa AT, Mancera RL, Dean PM. WaterScore: a novel method for distinguishing between bound and displaceable water molecules in the crystal structure of the binding site of protein‐ligand complexes. J Mol Model 2003; 9: 172‐182.

[47] Imai T, Hiraoka R, Kovalenko A, Hirata F. Locating missing water molecules in protein cavities by the three‐dimensional reference interaction site model theory of molecular solvation. Proteins 2007; 66: 804‐813.

[48] Verdonk ML, Cole JC, Taylor R. SuperStar: a knowledge‐based approach for identifying interaction sites in proteins. J Mol Biol 1999; 289: 1093‐1108.

[49] Bruno IJ, Cole JC, Lommerse JP, Rowland RS, Taylor R, Verdonk ML. IsoStar: a library of information about nonbonded interactions. J Comput Aided Mol Des 1997; 11: 525‐537.

[50] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235‐242.

[51] Barillari C, Taylor J, Viner R, Essex JW. Classification of water molecules in protein binding sites. J Am Chem Soc 2007; 129: 2577‐2587.

[52] Rossato G, Ernst B, Vedani A, Smiesko M. AcquaAlta: a directional approach to the solvation of ligand‐protein complexes. J Chem Inf Model 2011; 51: 1867‐1881.

[53] Ross GA, Morris GM, Biggin PC. Rapid and accurate prediction and scoring of water molecules in protein binding sites. PLoS One 2012; 7: e32036.

[54] Pitt WR, Goodfellow JM. Modelling of solvent positions around polar groups in proteins. Protein Eng 1991; 4: 531‐537.

[55] Kortvelyesi T, Dennis S, Silberstein M, Brown L, 3rd, Vajda S. Algorithms for computational solvent mapping of proteins. Proteins 2003; 51: 340‐351.

[56] Schymkowitz JW, Rousseau F, Martins IC, Ferkinghoff‐Borg J, Stricher F, Serrano L. Prediction of water and metal binding sites and their affinities by using the Fold‐X force field. Proc Natl Acad Sci U S A 2005; 102: 10147‐10152.

[57] Kellogg GE, Chen DL. The importance of being exhaustive. Optimization of bridging structural water molecules and water networks in models of biological systems. Chem Biodivers 2004; 1: 98‐105.

[58] Michel J, Tirado‐Rives J, Jorgensen WL. Prediction of the water content in protein binding sites. J Phys Chem B 2009; 113: 13337‐13346.

[59] Zhang L, Hermans J. Hydrophilicity of cavities in proteins. Proteins 1996; 24: 433‐438.

[60] Damjanovic A, Garcia‐Moreno B, Lattman EE, Garcia AE. Molecular dynamics study of water penetration in staphylococcal nuclease. Proteins 2005; 60: 433‐449.

[61] Abel R, Young T, Farid R, Berne BJ, Friesner RA. Role of the active‐site solvent in the thermodynamics of factor Xa ligand binding. J Am Chem Soc 2008; 130: 2817‐2831.

[62] Young T, Abel R, Kim B, Berne BJ, Friesner RA. Motifs for molecular recognition exploiting hydrophobic enclosure in protein‐ligand binding. Proc Natl Acad Sci U S A 2007; 104: 808‐813.

[63] Wang L, Berne BJ, Friesner RA. Ligand binding to protein‐binding pockets with wet and dry regions. Proc Natl Acad Sci U S A 2012; 109: 1326‐1330.

[64] Lazaridis T. Inhomogeneous fluid approach to solvation thermodynamics. 1. Theory. J Phys Chem B 1998; 102: 3531‐3541.

[65] Lazaridis T. Inhomogeneous fluid approach to solvation thermodynamics. 2. Applications to simple fluids. J Phys Chem B 1998; 102:

[66] Hamelberg D, McCammon JA. Standard free energy of releasing a localized water molecule from the binding pockets of proteins: double‐decoupling method. J Am Chem Soc 2004; 126: 7683‐7689.

[67] Williams SL, Blachly PG, McCammon JA. Measuring the successes and deficiencies of constant pH molecular dynamics: a blind prediction study. Proteins 2011; 79: 3381‐3388.

[68] Kim J, Mao J, Gunner MR. Are acidic and basic groups in buried proteins predicted to be ionized? J Mol Biol 2005; 348: 1283‐1298.

[69] Spassov VZ, Ladenstein R, Karshikoff AD. Optimization of the electrostatic interactions between ionized groups and peptide dipoles in proteins. Protein Sci 1997; 6: 1190‐1196.

[70] Nielsen JE, Gunner MR, Garcia‐Moreno BE. The pKa Cooperative: a collaborative effort to advance structure‐based calculations of pKa values and electrostatic effects in proteins. Proteins 2011; 79: 3249‐3259.

[71] Bashford D. Macroscopic electrostatic models for protonation states in proteins. Front Biosci 2004; 9: 1082‐1099.

[72] Chen J, Brooks CL, 3rd, Khandogin J. Recent advances in implicit solvent‐based methods for biomolecular simulations. Curr Opin Struct Biol 2008; 18: 140‐148.

[73] Garcia‐Moreno EB, Fitch CA. Structural interpretation of pH and salt‐dependent processes in proteins with computational methods. Methods Enzymol 2004; 380: 20‐51.

[74] Gunner MR, Mao J, Song Y, Kim J. Factors influencing the energetics of electron and proton transfers in proteins. What can be learned from calculations. Biochim Biophys Acta 2006; 1757: 942‐968.

[75] Wallace JA, Shen JK. Predicting pKa values with continuous constant pH molecular dynamics. Methods Enzymol 2009; 466: 455‐475.

[76] Isom DG, Cannon BR, Castaneda CA, Robinson A, Garcia‐Moreno B. High tolerance for ionizable residues in the hydrophobic interior of proteins. Proc Natl Acad Sci U S A 2008; 105: 17784‐17788.

[77] Isom DG, Castaneda CA, Cannon BR, Velu PD, Garcia‐Moreno EB. Charges in the hydrophobic interior of proteins. Proc Natl Acad Sci U S A 2010; 107: 16096‐16100.

[78] Antosiewicz J, McCammon JA, Gilson MK. The determinants of pKas in proteins. Biochemistry 1996; 35: 7819‐7833.

[79] Stanton C, Houk K. Benchmarking pKa prediction methods for residues in proteins. J Chem Theory Comput 2008; 4: 951‐966.

[80] Alexov E, Mehler EL, Baker N, Baptista AM, Huang Y, Milletti F, Nielsen JE, Farrell D, Carstensen T, Olsson MH, Shen JK, Warwicker J, Williams S, Word JM. Progress in the prediction of pKa values in proteins. Proteins 2011; 79: 3260‐3275.

[81] Tanford C, Kirkwood J. Theory of protein titration curves. I. General equations for impenetrable spheres. J Am Chem Soc 1957; 79: 5333‐5339.

[82] Reynolds JA, Gilbert DB, Tanford C. Empirical correlation between hydrophobic free energy and aqueous cavity surface area. Proc Natl Acad Sci U S A 1974; 71: 2925‐2927.

[83] Bashford D, Karplus M. pKa's of ionizable groups in proteins: atomic detail from a continuum electrostatic model. Biochemistry 1990; 29: 10219‐10225.

[84] Potter M, Gilson MK, McCammon J. Small molecule pKa prediction with continuum electrostatic calculations. J Am Chem Soc 1994; 116: 10298‐10299.

[85] Demchuk E, Wade R. Improving the continuum dielectric approach to calculating pKa's of ionizable groups in proteins. J Phys Chem 1996; 100: 17373‐17387.

[86] Yang AS, Gunner MR, Sampogna R, Sharp K, Honig B. On the calculation of pKas in proteins. Proteins 1993; 15: 252‐265.

[87] Gilson MK. Multiple‐site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins. Proteins 1993; 15: 266‐282.

[88] Karshikoff A. A simple algorithm for the calculation of multiple site titration curves. Protein Eng 1995; 8: 243‐248.

[89] Teixeira VH, Cunha CA, Machuqueiro M, Oliveira AS, Victor BL, Soares CM, Baptista AM. On the use of different dielectric constants for computing individual and pairwise terms in Poisson‐Boltzmann studies of protein ionization equilibrium. J Phys Chem B 2005; 109: 14691‐14706.

[90] Nielsen JE, Vriend G. Optimizing the hydrogen‐bond network in Poisson‐Boltzmann equation‐based pK(a) calculations. Proteins 2001; 43: 403‐412.

[91] You TJ, Bashford D. Conformation and hydrogen ion titration of proteins: a continuum electrostatic model with conformational flexibility. Biophys J 1995; 69: 1721‐1733.

[92] Beroza P, Case DA. Including Side Chain Flexibility in Continuum Electrostatic Calculations of Protein Titration. J Phys Chem 1996; 100: 20156‐20163.

[93] Barth P, Alber T, Harbury PB. Accurate, conformation‐dependent predictions of solvent effects on protein ionization constants. Proc Natl Acad Sci U S A 2007; 104: 4898‐4903.

[94] Alexov EG, Gunner MR. Incorporating protein conformational flexibility into the calculation of pH‐dependent protein properties. Biophys J 1997; 72: 2075‐2093.

[95] Georgescu RE, Alexov EG, Gunner MR. Combining conformational flexibility and continuum electrostatics for calculating pK(a)s in proteins. Biophys J 2002; 83: 1731‐1748.

[96] Song Y, Mao J, Gunner MR. MCCE2: improving protein pKa calculations with extensive side chain rotamer sampling. J Comput Chem 2009; 30: 2231‐2247.

[97] Brooks BR, Brooks CL, 3rd, Mackerell AD, Jr., Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M. CHARMM: the biomolecular simulation program. J Comput Chem 2009; 30: 1545‐1614.

[98] Grant A, Pickup B, Nicholls A. A smooth permittivity function for Poisson‐Boltzmann solvation methods. J Comput Chem 2001; 22: 608‐640.

[99] Jo S, Vargyas M, Vasko‐Szedlar J, Roux B, Im W. PBEQ‐Solver for online visualization of electrostatic potential of biomolecules. Nucleic Acids Res 2008; 36: W270‐275.

[100] Lu B, Cheng X, Huang J, McCammon JA. An Adaptive Fast Multipole Boundary Element Method for Poisson‐Boltzmann Electrostatics. J Chem Theory Comput 2009; 5: 1692‐1699.

[101] Lu B, Cheng X, Huang J, McCammon JA. AFMPB: An Adaptive Fast Multipole Poisson‐Boltzmann Solver for Calculating Electrostatics in Biomolecular Systems. Comput Phys Commun 2010; 181: 1150‐1160.

[102] Rocchia W, Sridharan S, Nicholls A, Alexov E, Chiabrera A, Honig B. Rapid grid‐based construction of the molecular surface and the use of induced surface charge to calculate reaction field energies: applications to the molecular systems and geometric objects. J Comput Chem 2002; 23: 128‐137.

[103] Yu Z, Holst MJ, Cheng Y, McCammon JA. Feature‐preserving adaptive mesh generation for molecular shape modeling and simulation. J Mol Graph Model 2008; 26: 1370‐1380.

[104] Zhou YC, Feig M, Wei GW. Highly accurate biomolecular electrostatics in continuum dielectric environments. J Comput Chem 2008; 29: 87‐97.

[105] Hoijtink G, de Boer E, van der Meer P, Weijland W. Reduction potentials of various aromatic hydrocarbons and their univalent anions. Rec Trav Chim 1956; 75: 487‐503.

[106] Still W, Tempczyk A, Hawley R, Hendrickson T. Semianalytical treatment of solvation for molecular mechanics and dynamics. J Am Chem Soc 1990; 112: 6127‐6129.

[107] Onufriev A, Bashford D, Case DA. Modification of the Generalized Born Model Suitable for Macromolecules. J Phys Chem B 2000; 104: 3712‐3720.

[108] Feig M, Im W, Brooks CL, 3rd. Implicit solvation based on generalized Born theory in different dielectric environments. J Chem Phys 2004; 120: 903‐911.

[109] Fenley AT, Gordon JC, Onufriev A. An analytical approach to computing biomolecular electrostatic potential. I. Derivation and analysis. J Chem Phys 2008; 129: 075101.

[110] Gallicchio E, Levy RM. AGBNP: an analytic implicit solvent model suitable for molecular dynamics simulations and high‐resolution modeling. J Comput Chem 2004; 25: 479‐499.

[111] Gallicchio E, Zhang LY, Levy RM. The SGB/NP hydration free energy model based on the surface generalized born solvent reaction field and novel nonpolar hydration free energy estimators. J Comput Chem 2002; 23: 517‐529.

[112] Gordon JC, Fenley AT, Onufriev A. An analytical approach to computing biomolecular electrostatic potential. II. Validation and applications. J Chem Phys 2008; 129: 075102.

[113] Im W, Lee MS, Brooks CL, 3rd. Generalized born model with a simple smoothing function. J Comput Chem 2003; 24: 1691‐1702.

[114] Lee MS, Feig M, Salsbury FR, Jr., Brooks CL, 3rd. New analytic approximation to the standard molecular volume definition and its application to generalized Born calculations. J Comput Chem 2003; 24: 1348‐1356.

[115] Schaefer M, Bartels C, Leclerc F, Karplus M. Effective atom volumes for implicit solvent models: comparison between Voronoi volumes and minimum fluctuation volumes. J Comput Chem 2001; 22: 1857‐1879.

[116] Sigalov G, Fenley A, Onufriev A. Analytical electrostatics for biomolecules: beyond the generalized Born approximation. J Chem Phys 2006; 124: 124902.

[117] Jensen JH, Li H, Robertson AD, Molina PA. Prediction and rationalization of protein pKa values using QM and QM/MM methods. J Phys Chem A 2005; 109: 6634‐6643.

[118] Schaefer P, Riccardi D, Cui Q. Reliable treatment of electrostatics in combined QM/MM simulation of macromolecules. J Chem Phys 2005; 123: 014905.

[119] Riccardi D, Schaefer P, Cui Q. pKa calculations in solution and proteins with QM/MM free energy perturbation simulations: a quantitative test of QM/MM protocols. J Phys Chem B 2005; 109: 17715‐17733.

[120] Ghosh N, Cui Q. pKa of residue 66 in Staphylococcal nuclease. I. Insights from QM/MM simulations with conventional sampling. J Phys Chem B 2008; 112: 8387‐8397.

[121] Macdermaid CM, Kaminski GA. Electrostatic polarization is crucial for reproducing pKa shifts of carboxylic residues in Turkey ovomucoid third domain. J Phys Chem B 2007; 111: 9036‐9044.

[122] Li H, Fajer M, Yang W. Simulated scaling method for localized enhanced sampling and simultaneous "alchemical" free energy simulations: a general method for molecular mechanical, quantum mechanical, and quantum mechanical/molecular mechanical simulations. J Chem Phys 2007; 126: 024106.

[123] Fukunishi H, Watanabe O, Takada S. On the Hamiltonian replica exchange method for efficient sampling of biomolecular systems: Application to protein structure prediction. J Chem Phys 2002; 116: 9058‐9067.

[124] Burgi R, Kollman PA, Van Gunsteren WF. Simulating proteins at constant pH: An approach combining molecular dynamics and Monte Carlo simulation. Proteins 2002; 47: 469‐480.

[125] van Gunsteren WF, Beutler T, Fraternali F, King P, Mark A, Smith P. Computation of free energy in practice: choice of approximations and accuracy limiting factors. In: Van Gunsteren WF, Weiner P, Wilkinson A, eds. Computer simulation of biomolecular systems, theoretical and experimental applications. Vol 2. Leiden, the Netherlands: ESCOM Science Publisher; 1993: 315‐348.

[126] Mongan J, Case DA, McCammon JA. Constant pH molecular dynamics in generalized Born implicit solvent. J Comput Chem 2004; 25: 2038‐2048.

[127] Khandogin J, Brooks CL, 3rd. Constant pH molecular dynamics with proton tautomerism. Biophys J 2005; 89: 141‐157.

[128] Lee MS, Salsbury FR, Jr., Brooks CL, 3rd. Constant‐pH molecular dynamics using continuous titration coordinates. Proteins 2004; 56: 738‐752.

[129] Kong X, Brooks CL, 3rd. λ‐Dynamics: a new approach to free energy calculations. J Chem Phys 1996; 105: 2314‐2423.

[130] Khandogin J, Brooks CL, 3rd. Linking folding with aggregation in Alzheimer's beta‐amyloid peptides. Proc Natl Acad Sci U S A 2007; 104: 16880‐16885.

[131] Khandogin J, Raleigh DP, Brooks CL, 3rd. Folding intermediate in the villin headpiece domain arises from disruption of a N‐terminal hydrogen‐bonded network. J Am Chem Soc 2007; 129: 3056‐3057.

[132] Chen J, Im W, Brooks CL, 3rd. Balancing solvation and intramolecular interactions: toward a consistent generalized Born force field. J Am Chem Soc 2006; 128: 3728‐3736.

[133] Okur A, Wickstrom L, Simmerling C. Evaluation of Salt Bridge Structure and Energetics in Peptides Using Explicit, Implicit, and Hybrid Solvation Models. J Chem Theory Comput 2008; 4: 488‐498.

[134] Wallace J, Shen J. Continuous constant pH molecular dynamics in explicit solvent with pH‐based replica exchange. J Chem Theory Comput 2011; 7: 2617‐2629.

[135] Darve E, Wilson M, Pohorille A. Calculating free energies using a scaled‐force molecular dynamics algorithm. Mol Simul 2002; 28: 113‐144.

[136] de Oliveira CA, Hamelberg D, McCammon JA. Coupling Accelerated Molecular Dynamics Methods with Thermodynamic Integration Simulations. J Chem Theory Comput 2008; 4: 1516‐1525.

[137] Williams SL, de Oliveira CA, McCammon JA. Coupling Constant pH Molecular Dynamics with Accelerated Molecular Dynamics. J Chem Theory Comput 2010; 6: 560‐568.

[138] Bucher D, Pierce LC, McCammon JA, Markwick PR. On the Use of Accelerated Molecular Dynamics to Enhance Configurational Sampling in Ab Initio Simulations. J Chem Theory Comput 2011; 7: 890‐897.

[139] Baptista AM, Martel PJ, Petersen SB. Simulation of protein conformational freedom as a function of pH: constant‐pH molecular dynamics using implicit titration. Proteins 1997; 27: 523‐544.

[140] Machuqueiro M, Baptista AM. Acidic range titration of HEWL using a constant‐pH molecular dynamics method. Proteins 2008; 72: 289‐298.

[141] Machuqueiro M, Baptista AM. Molecular dynamics at constant pH and reduction potential: application to cytochrome c(3). J Am Chem Soc 2009; 131: 12586‐12594.

[142] Baptista AM, Teixeira V, Soares CM. Constant‐pH MD method based on stochastic protonation changes. J Chem Phys 2002; 117: 4184‐4192.

[143] Li H, Robertson AD, Jensen JH. Very fast empirical prediction and rationalization of protein pKa values. Proteins 2005; 61: 704‐721.

[144] Rostkowski M, Olsson MH, Sondergaard CR, Jensen JH. Graphical analysis of pH‐dependent properties of proteins predicted using PROPKA. BMC Struct Biol 2011; 11: 6.

[145] Czodrowski P. Who cares for the protons? Bioorg Med Chem 2012;

[146] Davies MN, Toseland CP, Moss DS, Flower DR. Benchmarking pK(a) prediction. BMC Biochem 2006; 7: 18.

[147] Skolidis G, Hansen K, Sanguinetti G, Rupp M. Multi‐task learning for pK(a) prediction. J Comput Aided Mol Des 2012; 26: 883‐895.

[148] Fornabaio M, Cozzini P, Mozzarelli A, Abraham DJ, Kellogg GE. Simple, intuitive calculations of free energy of binding for protein‐ligand complexes. 2. Computational titration and pH effects in molecular models of neuraminidase‐inhibitor complexes. J Med Chem 2003; 46: 4487‐4500.

[149] Kellogg GE, Fornabaio M, Chen DL, Abraham DJ, Spyrakis F, Cozzini P, Mozzarelli A. Tools for building a comprehensive modeling system for virtual screening under real biological conditions: The Computational Titration algorithm. J Mol Graph Model 2006; 24: 434‐439.

[150] Tripathi A, Fornabaio M, Spyrakis F, Mozzarelli A, Cozzini P, Kellogg GE. Complexity in modeling and understanding protonation states: computational titration of HIV‐1‐protease‐inhibitor complexes. Chem Biodivers 2007; 4: 2564‐2577.

[151] Milletti F, Storchi L, Cruciani G. Predicting protein pK(a) by environment similarity. Proteins 2009; 76: 484‐495.

[152] Carstensen T, Farrell D, Huang Y, Baker NA, Nielsen JE. On the development of protein pKa calculation algorithms. Proteins 2011; 79: 3287‐3298.

[153] Olsson MH. Protein electrostatics and pKa blind predictions; contribution from empirical predictions of internal ionizable residues. Proteins 2011; 79: 3333‐3345.

[154] Shan J, Mehler EL. Calculation of pK(a) in proteins with the microenvironment modulated‐screened coulomb potential. Proteins 2011; 79: 3346‐3355.

[155] Gunner MR, Zhu X, Klein MC. MCCE analysis of the pKas of introduced buried acids and bases in staphylococcal nuclease. Proteins 2011; 79: 3306‐3319.

[156] Meyer T, Kieseritzky G, Knapp EW. Electrostatic pKa computations in proteins: role of internal cavities. Proteins 2011; 79: 3320‐3332.

[157] Song Y. Exploring conformational changes coupled to ionization states using a hybrid Rosetta‐MCCE protocol. Proteins 2011; 79: 3356‐3363.

[158] Warwicker J. pKa predictions with a coupled finite difference Poisson‐Boltzmann and Debye‐Huckel method. Proteins 2011; 79: 3374‐3380.

[159] Witham S, Talley K, Wang L, Zhang Z, Sarkar S, Gao D, Yang W, Alexov E. Developing hybrid approaches to predict pKa values of ionizable groups. Proteins 2011; 79: 3389‐3399.

[160] Word JM, Nicholls A. Application of the Gaussian dielectric boundary in Zap to the prediction of protein pKa values. Proteins 2011; 79: 3400‐3409.

[161] Arthur EJ, Yesselman JD, Brooks CL, 3rd. Predicting extreme pKa shifts in staphylococcal nuclease mutants with constant pH molecular dynamics. Proteins 2011; 79: 3276‐3286.

[162] Machuqueiro M, Baptista AM. Is the prediction of pKa values by constant‐pH molecular dynamics being hindered by inherited problems? Proteins 2011; 79: 3437‐3447.

[163] Wallace JA, Wang Y, Shi C, Pastoor KJ, Nguyen BL, Xia K, Shen JK. Toward accurate prediction of pKa values for internal protein residues: the importance of conformational relaxation and desolvation energy. Proteins 2011; 79: 3364‐3373.

[164] Kato M, Warshel A. Using a charging coordinate in studies of ionization induced partial unfolding. J Phys Chem B 2006; 110: 11566‐11570.

[165] Nielsen JE, McCammon JA. On the evaluation and optimization of protein X‐ray structures for pKa calculations. Protein Sci 2003; 12: 313‐326.

[166] Nicholls A. What do we know?: simple statistical techniques that help. Methods Mol Biol 2011; 672: 531‐581.

[167] Khandogin J, Brooks CL, 3rd. Toward the accurate first‐principles prediction of ionization equilibria in proteins. Biochemistry 2006; 45: 9363‐9373.

[168] Bas DC, Rogers DM, Jensen JH. Very fast prediction and rationalization of pKa values for protein‐ligand complexes. Proteins 2008; 73: 765‐783.

[169] Pey AL, Rodriguez‐Larrea D, Gavira JA, Garcia‐Moreno B, Sanchez‐Ruiz JM. Modulation of buried ionizable groups in proteins with engineered surface charge. J Am Chem Soc 2010; 132: 1218‐1219.

[170] Chimenti MS, Khangulov VS, Robinson AC, Heroux A, Majumdar A, Schlessman JL, Garcia‐Moreno B. Structural reorganization triggered by charging of lys residues in the hydrophobic interior of a protein. Structure 2012; 20: 1071‐1085.

[171] Damjanovic A, Schlessman JL, Fitch CA, Garcia AE, Garcia‐Moreno EB. Role of flexibility and polarity as determinants of the hydration of internal cavities and pockets in proteins. Biophys J 2007; 93: 2791‐2804.

[172] Damjanovic A, Wu X, Garcia‐Moreno EB, Brooks BR. Backbone relaxation coupled to the ionization of internal groups in proteins: a self‐guided Langevin dynamics study. Biophys J 2008; 95: 4091‐4101.

[173] Wan H, Ulander J. High‐throughput pKa screening and prediction amenable for ADME profiling. Expert Opin Drug Metab Toxicol 2006; 2: 139‐155.

[174] da Silva C, da Silva E, Nascimento M. Ab initio calculations of absolute pKa values in aqueous solution I. carboxylic acids. J Phys Chem A 1999; 103: 11194‐11199.

[175] da Silva E, Svendsen H. Prediction of the pKa values of amines using ab initio methods and free‐energy perturbations. Ind Eng Chem Res 2003; 42:

[176] Schuurmann G, Cossi M, Barone V, Tomasi J. Prediction of the pKa of carboxylic acids using the ab initio continuum‐solvation model PCM‐UAHF. J Phys Chem A 1998; 102: 6706‐6712.

[177] Zhang S, Baker J, Pulay P. A reliable and efficient first principles‐based method for predicting pK(a) values. 2. Organic acids. J Phys Chem A 2010; 114: 432‐442.

[178] Brown TN, Mora‐Diez N. Computational determination of aqueous pKa values of protonated benzimidazoles (part 1). J Phys Chem B 2006; 110: 9270‐9279.

[179] Gross KC, Seybold PG, Peralta‐Inga Z, Murray JS, Politzer P. Comparison of quantum chemical parameters and Hammett constants in correlating pK(a) values of substituted anilines. J Org Chem 2001; 66: 6919‐6925.

[180] Jang YH, Hwang S, Chang SB, Ku J, Chung DS. Acid dissociation constants of melamine derivatives from density functional theory calculations. J Phys Chem A 2009; 113: 13036‐13040.

[181] Liptak MD, Gross KC, Seybold PG, Feldgus S, Shields GC. Absolute pK(a) determinations for substituted phenols. J Am Chem Soc 2002; 124: 6421‐6427.

[182] Liptak MD, Shields GC. Accurate pK(a) calculations for carboxylic acids using complete basis set and Gaussian‐n models combined with CPCM continuum solvation methods. J Am Chem Soc 2001; 123: 7314‐7319.

[183] Liu S, Pedersen LG. Estimation of molecular acidity via electrostatic potential at the nucleus and valence natural atomic orbitals. J Phys Chem A 2009; 113: 3648‐3655.

[184] Magill AM, Cavell KJ, Yates BF. Basicity of nucleophilic carbenes in aqueous and nonaqueous solvents‐theoretical predictions. J Am Chem Soc 2004; 126: 8717‐8724.

[185] Parthasarathi R, Padmanabhan J, Elango M, Chitra K, Subramanian V, Chattaraj PK. pKa prediction using group philicity. J Phys Chem A 2006; 110: 6540‐6544.

[186] Schmidt am Busch M, Knapp EW. Accurate pKa determination for a heterogeneous group of organic molecules. Chemphyschem 2004; 5: 1513‐1522.

[187] Ohta K. Prediction of pKa values for alkylphosphonic acids. Bull Chem Soc Jpn 1992; 65: 2543‐2545.

[188] Tehan B, Lloyd E, Wong M, Pitt W, Gancia E, Manallack D. Estimation of pKa using semiempirical molecular orbital methods. Part 2: Application to amines, anilines and various nitrogen containing heterocyclic compounds. Quant Struct Act Relat 2002; 21: 473‐485.

[189] Muller P. Glossary of terms used in physical organic chemistry. Pure Appl Chem 1994; 66: 1077‐1184.

[190] Jaffé H. A re‐examination of the Hammett equation. Chem Rev 1953; 53: 191‐261.

[191] Wells P, Linear free energy relationships, Academic Press, 1968.

[192] Hilal S, Karickhoff SW, Carreira LA. A rigorous test for SPARC's chemical reactivity models: Estimation of more than 4300 ionization pKas. Quant Struct Act Relat 1995; 14: 348‐355.

[193] Advanced Chemistry Development Inc., Toronto, Canada. www.acdlabs.com.

[194] Schrodinger LLC, New York, USA. www.schrodinger.com.

[195] CompuDrug Inc., Sedona, Arizona, USA. www.compudrug.com.

[196] University of Georgia, Athens, Georgia, USA. www.ibmlc2.chem.uga.edu/sparc.

[197] Hilal S, Carreira L, Melton C, Karickhoff S. Prediction of electron affinity by computer. Quant Struct Act Relat 1993; 12: 389.

[198] Karickhoff S, McDaniel V, Melton C, Vellino A, Nute D, Carreira L. Predicting chemical‐reactivity by computer. Environ Toxicol Chem 1991; 10: 1405‐1416.

[199] Hilal SH, El‐Shabrawy Y, Carreira LA, Karickhoff SW, Toubar SS, Rizk M. Estimation of the ionization pK(a) of pharmaceutical substances using the computer program Sparc. Talanta 1996; 43: 607‐619.

[200] Albert A, Rubbo S, Goldacre R, Darcy M, Stove J. The influence of chemical constitution on antibacterial activity II. A general survey of the acridine series. Br J Exp Pathol 1945; 26: 160‐192.

[201] Bell P, Roblin RJ. Studies in chemotherapy. VII. A theory of the relation of structure to activity of sulfanilamide type compounds. J Am Chem Soc 1942; 64: 2905‐2917.

[202] Rupp M, Korner R, Tetko IV. Predicting the pKa of small molecules. Comb Chem High Throughput Screen 2011; 14: 307‐327.

[203] Milletti F, Storchi L, Sforna G, Cruciani G. New and original pKa prediction method using grid molecular interaction fields. J Chem Inf Model 2007; 47: 2172‐2181.

[204] Cramer RD, Patterson DE, Bunce JD. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 1988; 110: 5959‐5967.

[205] Gargallo R, Sotriffer CA, Liedl KR, Rode BM. Application of multivariate data analysis methods to comparative molecular field analysis (CoMFA) data: proton affinities and pKa prediction for nucleic acids components. J Comput Aided Mol Des 1999; 13: 611‐623.

[206] Kim KH, Martin YC. Direct prediction of dissociation constants (pKa's) of clonidine‐like imidazolines, 2‐substituted imidazoles, and 1‐methyl‐2‐substituted‐imidazoles from 3D structures using a comparative molecular field analysis (CoMFA) approach. J Med Chem 1991; 34: 2056‐2060.

[207] Kim KH, Martin YC. Substituent effects from 3D structures using comparative molecular field analysis. 1. electronic effects of substituted benzoic acids. J Org Chem 1991; 56: 2723‐2729.

[208] Lee AC, Yu JY, Crippen GM. pKa prediction of monoprotic small molecules the SMARTS way. J Chem Inf Model 2008; 48: 2042‐2053.

[209] Livingston D, Ed., Artificial Neural Networks: Methods and Applications (Methods in Molecular Biology), Humana Press, 2008.

[210] Habibi‐Yangjeh A, Danandeh‐Jenagharad M, Nooshyar M. Prediction acidity constant of various benzoic acids and phenols in water using linear and nonlinear QSPR models. Bull Kor Chem Soc 2005; 26: 2007‐2016.

[211] Jover J, Bosque R, Sales J. Neural network based QSPR study for predicting pKa of phenols in different solvents. QSAR Comb Sci 2007; 26: 385‐397.

[212] Hofmann T, Scholkopf B, Smola A. Kernel methods in machine learning. Ann Stat 2008; 36: 1171‐1220.

[213] Rupp M, Proschak E, Schneider G. Kernel approach to molecular similarity based on iterative graph similarity. J Chem Inf Model 2007; 47: 2280‐2286.

[214] Rupp M, Korner R, Tetko IV. Estimation of dissociation constants using graph kernels. Mol Inf 2010; 29: 731‐740.

[215] Tehan B, Lloyd E, Wong E, Pitt W, Montana J, Manallack D, Gancia E. Estimation of pKa using semiempirical molecular orbital methods. Part 1: Application to phenols and carboxylic acids. Quant Struct Act Relat 2002; 21: 457‐472.

[216] Zhang J, Kleinoder T, Gasteiger J. Prediction of pKa values for aliphatic carboxylic acids and alcohols with empirical atomic charge descriptors. J Chem Inf Model 2006; 46: 2256‐2266.

[217] Jelfs S, Ertl P, Selzer P. Estimation of pKa for druglike compounds using semiempirical and information‐based descriptors. J Chem Inf Model 2007; 47: 450‐459.

[218] Ertl P, Selzer P, Muhlbacher J. Web‐based cheminformatics tools deployed via corporate intranets. Drug Discovery Today 2004; 2: 201‐207.

[219] Simulations Plus Inc., Lancaster, California, USA. www.simulations‐plus.com.

[220] Pharma Algorithms, Toronto, Canada. www.ap‐algorithms.com.

[221] ChemAxon Ltd., Budapest, Hungary. www.chemaxon.com/marvin.

[222] Molecular Discovery Ltd., Pinner, United Kingdom. www.moldiscovery.com.

[223] Accelrys Inc., San Diego, California, USA. www.accelrys.com.

[224] Helmholtz Center Munich, Neuherberg, Germany. www.ochem.eu, www.qspr.eu.

[225] Kellogg GE, Abraham DJ. Hydrophobicity: Is LogPo/w more than the sum of its parts? Eur J Med Chem 2000; 35: 651‐661.

[226] Kellogg GE, Fornabaio M, Spyrakis F, Lodola A, Cozzini P, Mozzarelli A, Abraham DJ. Getting it right: modeling of pH, solvent and "nearly" everything else in virtual screening of biological targets. J Mol Graph Model 2004; 22: 479‐486.

[227] Sarkar A, Kellogg GE. Hydrophobicity: shake flasks, protein folding and drug discovery. Curr Top Med Chem 2010; 10: 67‐83.

[228] Abraham DJ, Leo AJ. Extension of the fragment method to calculate amino acid zwitterion and side chain partition coefficients. Proteins 1987; 2: 130‐152.

[229] Martin YC. Let's not forget tautomers. J Comput Aided Mol Des 2009; 23: 693‐704.

[230] Spyrakis F, Bidon‐Chanal A, Barril X, Luque FJ. Protein flexibility and ligand recognition: challenges for molecular modeling. Curr Top Med Chem 2011; 11: 192‐210.

[231] Tanford C. Amphiphile orientation: physical chemistry and biological function. Biochem Soc Trans 1987; 15 Suppl: 1S‐7S.

[232] Jana M, Bandyopadhyay S. Restricted dynamics of water around a protein‐carbohydrate complex: Computer simulation studies. J Chem Phys 2012; 137: 055102.

[233] Ramakrishnan V, Rajagopalan R. Dynamics and thermodynamics of water around EcoRI bound to a minimally mutated DNA chain. Phys Chem Chem Phys 2012;

[234] Spyrakis F, Faggiano S, Abbruzzetti S, Dominici P, Cacciatori E, Astegno A, Droghetti E, Feis A, Smulevich G, Bruno S, Mozzarelli A, Cozzini P, Viappiani C, Bidon‐Chanal A, Luque FJ. Histidine E7 dynamics modulates ligand exchange between distal pocket and solvent in AHb1 from Arabidopsis thaliana. J Phys Chem B 2011; 115: 4138‐4146.

[235] Cashman DJ, Kellogg GE. A computational model for anthracycline binding to DNA: tuning groove‐binding intercalators for specific sequences. J Med Chem 2004; 47: 1360‐1374.

[236] Cozzini P, Fornabaio M, Marabotti A, Abraham DJ, Kellogg GE, Mozzarelli A. Simple, intuitive calculations of free energy of binding for protein‐ligand complexes. 1. Models without explicit constrained water. J Med Chem 2002; 45: 2469‐2483.

[237] Da C, Telang N, Barelli P, Jia X, Gupton JT, Mooberry SL, Kellogg GE. Pyrrole‐Based Antitubulin Agents: Two Distinct Binding Modalities are Predicted for C‐2 Analogs in the Colchicine Site. ACS Med Chem Lett 2012; 3: 53‐57.

[238] Salsi E, Bayden AS, Spyrakis F, Amadasi A, Campanini B, Bettati S, Dodatko T, Cozzini P, Kellogg GE, Cook PF, Roderick SL, Mozzarelli A. Design of O‐acetylserine sulfhydrylase inhibitors by mimicking nature. J Med Chem 2010; 53: 345‐356.

[239] Spyrakis F, Amadasi A, Fornabaio M, Abraham DJ, Mozzarelli A, Kellogg GE, Cozzini P. The consequences of scoring docked ligand conformations using free energy correlations. Eur J Med Chem 2007; 42: 921‐933.

[240] Marabotti A, Spyrakis F, Facchiano A, Cozzini P, Alberti S, Kellogg GE, Mozzarelli A. Energy‐based prediction of amino acid‐nucleotide base recognition. J Comput Chem 2008; 29: 1955‐1969.

[241] Spyrakis F, Cozzini P, Bertoli C, Marabotti A, Kellogg GE, Mozzarelli A. Energetics of the protein‐DNA‐water interaction. BMC Struct Biol 2007; 7: 4.

[242] Cozzini P, Fornabaio M, Mozzarelli A, Spyrakis F, Kellogg GE, Abraham DJ. Water: How to evaluate its contribution in protein‐ligand interactions. Int J Quantum Chem 2006; 106: 647‐651.

[243] Koparde VN, Scarsdale JN, Kellogg GE. Applying an empirical hydropathic forcefield in refinement may improve low‐resolution protein X‐ray crystal structures. PLoS One 2011; 6: e15920.

[244] Brunger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse‐Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS, Read RJ, Rice LM, Simonson T, Warren GL. Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta Crystallogr D Biol Crystallogr 1998; 54: 905‐921.

[245] Chen VB, Arendall WB, 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: all‐atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 2010; 66: 12‐21.

[246] Schroder GF, Levitt M, Brunger AT. Super‐resolution biomolecular crystallography with low‐resolution data. Nature 2010; 464: 1218‐1222.

[247] Joachimiak LA, Kortemme T, Stoddard BL, Baker D. Computational design of a new hydrogen bond network and at least a 300‐fold specificity switch at a protein‐protein interface. J Mol Biol 2006; 361: 195‐208.

[248] Dey S, Pal A, Chakrabarti P, Janin J. The subunit interfaces of weakly associated homodimeric proteins. J Mol Biol 2010; 398: 146‐160.

[249] Janin J. Wet and dry interfaces: the role of solvent in protein‐protein and protein‐DNA recognition. Structure 1999; 7: R277‐279.

[250] Keskin O, Ma B, Nussinov R. Hot regions in protein‐protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 2005; 345: 1281‐1294.

[251] Samsonov S, Teyra J, Pisabarro MT. A molecular dynamics approach to study the importance of solvent in protein interactions. Proteins 2008; 73: 515‐525.

[252] Samsonov SA, Teyra J, Anders G, Pisabarro MT. Analysis of the impact of solvent on contacts prediction in proteins. BMC Struct Biol 2009; 9: 22.

[253] Samsonov SA, Teyra J, Pisabarro MT. Docking glycosaminoglycans to proteins: analysis of solvent inclusion. J Comput Aided Mol Des 2011; 25: 477‐489.

[254] Kier LB, Cheng CK, Testa B. A cellular automata model of ligand passage over a protein hydrodynamic landscape. J Theor Biol 2002; 215: 415‐426.

[255] Sundaralingam M, Sekharudu YC. Water‐inserted alpha‐helical segments implicate reverse turns as folding intermediates. Science 1989; 244: 1333‐1337.

[256] Kellogg GE, Tripathi A, Zaidi S. Unpublished data.

[257] Bayden AS, Fornabaio M, Scarsdale JN, Kellogg GE. Web application for studying the free energy of binding and protonation states of protein‐ligand complexes based on HINT. J Comput Aided Mol Des 2009; 23: 621‐632.

[258] Milletti F, Storchi L, Sforna G, Cross S, Cruciani G. Tautomer enumeration and stability prediction for virtual screening on large chemical databases. J Chem Inf Model 2009; 49: 68‐75.

[259] Oellien F, Cramer J, Beyer C, Ihlenfeldt WD, Selzer PM. The impact of tautomer forms on pharmacophore‐based virtual screening. J Chem Inf Model 2006; 46: 2342‐2354.

[260] Da C, Kellogg GE. manuscript in preparation.

[261] Yan X, Day P, Hollis T, Monzingo AF, Schelp E, Robertus JD, Milne GW, Wang S. Recognition and interaction of small rings with the ricin A‐chain binding site. Proteins 1998; 31: 33‐41.

[262] Jaramillo P, Coutinho K, Canuto S. Solvent effects in chemical processes. water‐assisted proton transfer reaction of pterin in aqueous environment. J Phys Chem A 2009; 113: 12485‐12495.

[263] Priestle JP, Fassler A, Rosel J, Tintelnot‐Blomley M, Strop P, Grutter MG. Comparative analysis of the X‐ray structures of HIV‐1 and HIV‐2 proteases in complex with CGP 53820, a novel pseudosymmetric inhibitor. Structure 1995; 3: 381‐389.
