CHAPTER1
INTRODUCTION
CHAPTER-1 : INTRODUCTION
1.1 General
Over the past two decades, the centre of gravity (the intellectual focus) of medicinal
chemistry has shifted dramatically from how to make a molecule to what molecule to
make (1). The challenge now is to acquire the information needed to make appropriate
decisions regarding the use of resources in drug design. The input for the drug design
effort is therefore increasingly quantitative, building upon recent developments in
molecular structure description, combinatorial mathematics, statistics and computer
simulations. Collectively, these areas have led to a vital transformation in drug design,
which has been referred to as quantitative information analysis (1).
Another approach to drug design is employed when there is a paucity of information
about an effector and its structure. This is generally the case when the design
goal is a physical or pharmacodynamic property for which there is no effector. This approach
depends upon eliciting information by probing a biological system with a series
of molecules. The relationship of the molecular structure to some defined property is
crafted into a mathematical model. The term used for this approach is quantitative
structure-activity relationship (QSAR) (2,3). The importance of QSAR accelerated with
the explosive growth in combinatorial chemistry, which makes it possible to
synthesize and test thousands of compounds in a short time. The QSAR methodology is
essential for processing this mammoth body of information into predictive models. From a
good-quality model it is possible to predict and to design compounds for synthesis and
testing that have a good chance of being active.
QSAR is based on the hypothesis that changes in molecular structure are reflected in proportional
changes in the observed response or biological activity. Thus it is a tool for numerically
estimating biochemical endpoints of interest for substances for which experimental data
are missing. On June 16, 2005, the International Academy of Mathematical Chemistry
(IAMC) was founded in Dubrovnik, Croatia, by Milan Randić. In 2009 the academy had
81 members from twenty countries. One of its members, Jerome Karle, is a Nobel laureate.
1.2 Introduction to Mathematical QSAR modeling:
The present revolution in the drug discovery process derives considerable thrust from
recent progress in combinatorial chemistry and high-throughput computational
techniques. Random testing of chemicals is not the best method to obtain the
maximum possible information from a combinatorial library, while the synthesis and
screening of a very large number of compounds is extremely costly. QSAR is a widely
accepted predictive and diagnostic tool used for finding associations between chemical
structures and biological activity. QSAR has emerged and evolved in trying to fulfill
the medicinal chemist’s need and desire to predict biological response (4). In a seminal
paper, Kubinyi describes the history of QSAR (5).
In the early days of modern science, in the nineteenth century, Auguste Comte, a French
philosopher, made a comment in the context of the progress of early chemistry: “Every
attempt to employ mathematical methods in the study of chemical questions must be
considered profoundly irrational and contrary to the spirit of chemistry.” Furthermore: “If
mathematical analysis should ever hold an important place in chemistry, it would
occasion a swift and general degeneration of that science.” Later on, however, the integration
of structure and activity led to one of the most admired research areas, viz. QSAR.
The medicinal chemist's intuitive mindset is based on the assumption that there is an inherent
association between chemical structure and biological activity. This creates a platform
for using mathematical tools to correlate structural descriptors (predictors, regressors) with
biological activity (response or target variable). Although researchers rely on the
fundamental similarity principle, which states that similar structures have similar activities (6),
and despite significant advances in computer-based similarity searching and
similarity-based virtual screening (7-9), it is still a challenge to devise meaningful concepts
and uses of structural similarity (10-17). There are different types of computational methods
in QSAR for different levels of data complexity (18): two-dimensional (2D),
three-dimensional (3D) and higher-dimensional methods. At present, QSAR science,
founded on the systematic use of mathematical models and on the multivariate point of
view, is one of the basic tools of modern drug and pesticide design and has an increasing
role in environmental sciences. QSAR models exist at the intersection of chemistry,
statistics and biology. Data used in QSAR evaluation are obtained either from the
literature or generated specifically for QSAR-type analysis. A structure-activity model is
defined and limited by the nature and quality of the data used in model development and
should be applied only within the model’s applicability domain.
The ideal QSAR should: (1) consider a sufficient number of molecules for a reasonable
statistical representation; (2) have a wide range of quantified end-point potency (i.e.
several orders of magnitude) for regression models; (3) be applicable for reliable
predictions for new chemicals (validation and applicability domain); and (4) allow
mechanistic information on the modeled endpoint to be obtained.
The limiting factor in developing QSARs is the availability of high-quality experimental
data. In QSAR analysis, it is imperative that the input data be both accurate and precise in
order to develop a meaningful model. In fact, it must be emphasized that any resulting QSAR
model is only as valid statistically as the data that led to its development.
A variety of properties have also been used in QSAR modeling. These include physico-chemical,
quantum chemical and binding properties. Examples of molecular properties
are electron distribution, spatial disposition (conformation, geometry and shape) and
molecular volume. Physico-chemical properties include descriptors for the hydrophobic,
electronic and steric properties of a molecule as well as properties such as solubility
and ionization constant. Quantum chemical properties include charge and energy values.
Binding properties are concerned with biological macromolecules and are important in
receptor-mediated responses. In the modern QSAR approach, it is becoming quite common to
use a wide set of theoretical molecular descriptors of different kinds, able to capture all
the structural aspects of a chemical, to translate the molecular structure into numbers. A
molecular descriptor is the final result of a logical and mathematical procedure which
transforms chemical information encoded in a symbolic representation of a molecule into
a useful number or the result of some standardized experiment (19). The term “useful”
has a double meaning: the number can give more insight into the
interpretation of the molecular properties and/or is able to take part in a model for the
prediction of some interesting property of the molecule (19). Different descriptors are
different ways or perspectives of viewing a molecule, taking into account the various features
of its chemical structure: not only counts of atoms or groups, but also two-dimensional features
from the topological graph or three-dimensional features from a minimum-energy conformation.
Many software packages calculate wide sets of different theoretical descriptors, from 2D graphs
to 3D x, y, z coordinates. Some of the more widely used are mentioned here: ADAPT (20-21),
CODESSA (22), MolConn-Z (23) and DRAGON (24).
It has been estimated that more than 3000 molecular descriptors (19, 25-26) are now
available, and most of them have been summarized and explained. Modeling methods
used in the development of QSARs are of two types, depending on the modeled response: the
potency of an end-point (a defined value such as EC50) or a category/class (such as
mutagen/non-mutagen). For potency modeling, the most widely used mathematical technique is
Multiple Regression Analysis (MRA). Regression analysis is a simple approach to
developing a statistical model that can predict the values of a dependent (response) variable
based upon the values of the independent (explanatory) variables. It is an
attempt to explain the variation in a dependent variable using the variation in the
independent variables. The most valuable and correct use of regression is in making
predictions; a regression equation describes a statistical association and does not by
itself establish causation. If the independent variable(s)
sufficiently explain the variation in the dependent variable, the model can be used for
prediction. This leads to a result that is easy to understand and, for this reason, most
QSARs are derived using regression analysis. Regression analysis is a powerful means
of establishing a correlation between independent variables (molecular descriptors X) and
a dependent variable Y, such as biological activity:
\[ Y = b + a X_1 + c X_2 \]
where b is a constant and a and c are the regression coefficients of the molecular descriptors X_1 and X_2.
The current trend is the efficient use of computational techniques in
order to increase the drug-like character and diversity of the compounds proposed for
study (27-28). Estrada and coworkers (29) reviewed the use of topological indices in
drug design and discovery. Natarajan and Nirdosh (30) worked on the application of
topological indices to QSAR modeling and the selection of mineral collectors.
In order to be successful, this process of in silico compound selection must incorporate
additional target-specific information (experimental inhibition values). Conventionally, the
computational screening of chemical libraries is a four-step process (31):
(i) Assembly of the compounds from a group of building blocks;
(ii) Computation of a set of structural descriptors for each chemical compound;
(iii) Dimensionality reduction by selecting from the descriptor set a chemical space that
is relevant for the investigated target;
(iv) Compound selection using a statistical criterion that implements a similarity,
diversity, or drug-like paradigm.
A large number of structural descriptors, many of them traditionally used in QSAR, are
employed in order to transform the structural features of molecules into numerical form:
physico-chemical or empirical (log P, molecular polarizability); constitutional (number of
aromatic rings, number of rotatable bonds, number of hydrogen-bond donors, number of
hydrogen-bond acceptors); structure keys and fingerprints; graph invariants (cyclomatic
number, atom pairs, path counts); topological indices (28) (Wiener (32-33), Randić (34), Kier
and Hall (35-36), Balaban (37-38)); geometric (polar surface area, molecular volume); quantum
(HOMO energy, atomic charges); and grid (various steric, electrostatic, and lipophilic
fields).
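As an illustration of how a topological index turns a structure into a single number, the Wiener index can be computed as the sum of the shortest-path (bond-count) distances between all pairs of atoms in the hydrogen-suppressed molecular graph. The following is a minimal Python sketch; the function name `wiener_index` and the adjacency-dictionary representation of the carbon skeleton are illustrative assumptions and are not taken from any of the software packages cited above.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered pairs of atoms.

    `adjacency` maps each atom label to the list of its bonded neighbours
    (hydrogen-suppressed graph; an assumed, illustrative representation).
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives the topological distance from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end


# Carbon skeletons of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same constitutional descriptors (four carbons, three C-C bonds) but different Wiener indices, which is why such indices are useful for encoding branching.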
Three major categories of descriptors exist in QSAR work: (1) The simplest type (one-dimensional
information) provides constitutional information such as molecular weight and the number and
types of atoms and bonds in a condensed formula, i.e. counts for a whole molecule or its
chemical fragments and groups, like the number of aliphatic ethers or tertiary amines; (2)
2-D descriptors are based on two-dimensional properties of either fragments (substituent
constants σ, π, MR) or the whole molecule (log P, reactivity), and include topological indices
based on atoms and their bond connectivities; (3) 3-D descriptors reflect the three-dimensional
nature of molecular structure (conformation, isomerism) and surrounding space
(stereochemistry). Another way to classify descriptors is according to
fragmental or whole-molecule properties (39).
QSAR studies offer a rich variety of structural descriptors, while the typical
number of compounds in a QSAR study is usually between 10 and 100, though it can easily exceed this range.
In order to be efficient, the in silico compound screening must use descriptors that require
small computational resources, thus explaining the wide popularity of counts of atom
types, counts of functional groups, fingerprints, constitutional descriptors, graph
invariants and topological indices.
1.3 Historical Background and various approaches used in Modeling:
QSARs are based on the assumption that the structure of a molecule (i.e. its geometric,
steric and electronic properties) must contain the features responsible for its physical,
chemical and biological properties, and on the ability to represent the chemical properties by
one or more numerical descriptors. It has been nearly 40 years since QSAR modeling
was first introduced into the practice of agrochemistry, drug design, toxicology, and industrial
and environmental chemistry. Its growing power in the following years may also be attributed
to the rapid and extensive development of methodologies and computational techniques
that have made it possible to delineate and refine the many variables. The approaches used in
the modeling are as follows:
Crum-Brown-Fraser approach:
More than a century ago, Crum-Brown and Fraser (40) expressed the idea that the
physiological action of a substance in a certain biological system (Φ) was a function
(f) of its chemical constitution C:
\[ \Phi = f(C) \]
Thus, an alteration in chemical constitution, ΔC, would be reflected by an alteration in
biological activity, ΔΦ. In the following years, on the physical organic front, the seminal
work of Hammett gave rise to the σ-ρ culture (41).
Hansch QSAR approach:
In 1962, Hansch et al. (42) published their study on the structure-activity relationships of
plant growth regulators and their dependence on Hammett constants and
hydrophobicity (43-44). Using the octanol/water system, a whole series of partition
coefficients was measured, and thus a new hydrophobic scale was introduced. The
parameter π, which is the relative hydrophobicity of a substituent, was defined in a
manner analogous to the definition of sigma (41):
\[ \pi_X = \log P_X - \log P_H \]
where P_X and P_H represent the partition coefficients of a derivative and of the parent molecule,
respectively.
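As a small worked example, the substituent constant π can be evaluated directly from its definition as the difference between the logarithms of the two octanol/water partition coefficients, π = log P_X − log P_H. The sketch below assumes this standard definition; the function name `hansch_pi` and the numerical partition coefficients are hypothetical and chosen only for illustration.

```python
import math


def hansch_pi(p_derivative, p_parent):
    """Relative hydrophobicity of a substituent: log10(P_X) - log10(P_H)."""
    return math.log10(p_derivative) - math.log10(p_parent)


# Hypothetical partition coefficients (not measured values).
p_parent = 100.0       # parent compound, P_H
p_derivative = 400.0   # substituted derivative, P_X

pi_x = hansch_pi(p_derivative, p_parent)
print(f"pi = {pi_x:.3f}")  # positive: the substituent increases hydrophobicity
```

A positive π indicates that the substituent makes the molecule more hydrophobic than the parent, and a negative π that it makes it more hydrophilic.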
The contributions of Hammett and Taft (41,43) together laid the basis for the development of
the QSAR paradigm by Hansch and Fujita (42,44), which combines the
hydrophobic constant with Hammett’s electronic constant to yield the linear Hansch
equation and its many extended forms. There is a consensus in current predictive
toxicology that Corwin Hansch is the founder of modern QSAR. In the classic article (45)
it was illustrated that, in general, biological activity for a group of ‘congeneric’
chemicals can be described by a comprehensive model:
\[ \log(1/C) = a\,\pi + b\,\sigma + c\,E_S + d \qquad (1) \]
in which C, the toxicant concentration at which an endpoint is manifested (e.g. 50%
mortality or effect), is related to a hydrophobicity term, π (a substituent constant
denoting the difference in hydrophobicity between a parent compound and a substituted
analog, since replaced by a more general molecular term, the log of the
1-octanol/water partition coefficient, log Kow), an electronic term, σ (originally the
Hammett substituent constant), and a steric term (typically Taft’s substituent constant
E_S); a, b, c and d are fitted coefficients. Because of the curvilinear, or bilinear, relationship
between log 1/C50 and hydrophobicity normally found in single-dose tests, the quadratic
π² term was later introduced into the model.
The rationale for the above equation was given by McFarland (46). He hypothesized that the
relative activity of a biologically active molecule, such as a toxicant, is dependent on (i) the
probability (Pr1) that the toxicant reaches its site of action, (ii) the probability (Pr2) that
the toxicant will interact with the target at this site, and (iii) the external concentration or
dose.
The delineation of these models led to explosive development in QSAR analysis and
related approaches (47).
In the years after the 1960s, the need to solve new problems, together with the contributions of
many other investigators, generated thousands of variations of the Hansch approach to
QSAR modeling, as well as approaches that are formally completely new. Hans
Könemann (48) and Gilman Veith (49), who in the early 1980s developed multiclass-based,
hydrophobicity-dependent models for industrial organic chemicals, must share credit for the
revival of QSAR.
It is evident from literature analysis that the QSAR world has undergone profound
changes since the pioneering work of Hansch, considered the founder of modern QSAR
modeling (44-45,47).
1.4: Typical Steps in QSAR Modeling(50):
The following steps should be followed while developing a QSAR model:
1. Response data collection: Experimental measurements (imperfections) and
biological material (variations) are error-prone, but the resulting data should be normally
distributed. Systematic errors should be absent.
2. Selection of congeners: Congeners must be similar enough to guarantee both the same
interaction mechanism and a wide potency range of several log units.
3. Clustering: Divide the series into chemical groups of more specific homologous
variations.
4. 3D-QSAR: Build molecular models and perform conformational analysis and
alignment.
5. Descriptor selection: Calculate parameters that numerically represent structural
features of the compounds.
6. Model generation by statistical means: PCA for complexity reduction; SLR,
MLR and PLS for linear regression; clustering; and factorial design.
7. Internal model validation: Use LOO cross-validation to assess and improve the Q2
criterion.
8. Test set: Evaluate the preliminary equations on the test set.
9. Interpretation: Interpret the final model on the basis of various statistical
parameters, e.g. R2, AR2, Fisher value (F-value), Pogliani’s factor (Q), Se, etc.
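The internal validation in step 7 can be sketched as follows. This is a minimal illustration for a single descriptor, assuming one common convention for the cross-validated coefficient, Q2 = 1 − PRESS/SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the total sum of squares about the mean of all observed activities. The function names and the data are hypothetical, and other Q2 conventions exist.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_q2(xs, ys):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without compound i, then predict compound i.
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (intercept + slope * xs[i])) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


# Hypothetical descriptor values and activities for five compounds.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.1, 1.9, 3.2, 3.8, 5.1]
print(f"Q2 (LOO) = {loo_q2(descriptor, activity):.3f}")
```

Because each compound is predicted by a model that never saw it, Q2 is normally lower than the fitted R2 and can even become negative for a model with no predictive ability.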
1.5: Rationale and Objective:
It is interesting to note that 16 different forms of carbonic anhydrase (CA) occur
in the mammalian body, each having a specific function (51). Diseases caused by
disturbed acid-base secretion chemistry in the body, particularly in the eye, have been
linked to the dysfunctional activity of several types of carbonic anhydrase (52). Excess
secretion of aqueous humour in the eye can cause a pressure gradient that permanently
damages eye tissue. Drugs that reduce the rate of formation of aqueous
humour can be used to treat diseases such as macular edema and open-angle glaucoma. It is
believed that CA-II enzymes contribute to the secretion of aqueous humour through the
production of bicarbonate ions (53).
The objective of the present work is to develop QSAR models to predict inhibition values
of aromatic and heteroaromatic sulphonamides towards the CA-II isozyme. These models could
help in finding potential drugs to treat eye diseases such as glaucoma and macular
edema. Furthermore, insight could be obtained into which molecular
features are important when developing inhibitors of the CA-II enzyme, in the
hope that the pathway or mechanism of inhibition can be more clearly understood.
1.6 About carbonic Anhydrase:
1.6.1 Versatility of Carbonic Anhydrase:
Carbonic anhydrase (CA) is an enzyme that assists the rapid interconversion of carbon dioxide
and water into carbonic acid, protons and bicarbonate ions. This enzyme was first
identified in 1933, in the red blood cells of the ox, by Meldrum and Roughton (54). Since then, it
has been found to be abundant in all mammalian tissues, plants, algae and bacteria (55).
Mammalian carbonic anhydrase (56) occurs in about 16 slightly different forms depending
upon the tissue or cellular compartment in which they are located. These isozymes have
some sequence variations leading to specific differences in their activity. The isozyme
found in some muscle fibers has low enzyme activity compared to that secreted by
salivary glands. While most carbonic anhydrase isozymes are soluble and secreted, some
are bound to the membranes of specific epithelial cells (57).
Carbonic anhydrases are important enzymes found in red blood cells, gastric mucosa,
pancreatic cells and renal tubules. The catalytic action of CA is fundamental for respiration
and the transport of CO2 between metabolizing tissues and excretion sites, secretion of
electrolytes in a variety of tissues and organs, pH regulation and homeostasis, CO2
fixation (in algae and green plants) (58-59), several metabolic biosynthetic pathways, such
as gluconeogenesis, lipogenesis and ureagenesis, bone resorption, calcification and
tumorigenicity (in vertebrates) (60).
This ancient enzyme has five different classes, namely α, β, γ, δ and ε (60-61). Members of
these different classes share very little sequence or structural similarity, yet they all
perform the same function and require a zinc ion at the active site. Carbonic anhydrase
from mammals belongs to the α-class, the plant enzymes belong to the β-class, and
the enzyme from methane-producing bacteria that grow in hot springs forms the γ-class (57,60).
It is to be noted that α enzymes are monomers, while γ enzymes are trimeric.
Although the β enzyme is a dimer, there are four zinc ions bound to the structure,
indicating four possible enzyme active sites.
The α-CAs found in mammals are divided into four broad subgroups, which in turn consist of
several isoforms (60-61):
* Cytosolic CAs (CA-1, CA-2, CA-3, CA-7 and CA-8)
* Mitochondrial CAs (CA-5 and CA-5B)
* Secreted CAs (CA-6)
* Membrane-associated CAs (CA-4, CA-9, CA-12, CA-14 and CA-15)
There are three additional “acatalytic” CA isoforms (CA-8, CA-10 and CA-11) whose
functions remain unclear.
Thus, Carbonic anhydrase is a versatile enzyme in the living world. In our lungs, oxygen
diffuses into the blood and is transported to all the cells of our body by red blood cells.
Carbon dioxide diffuses out of the cells and most of it is converted to carbonic acid to be
carried to the lungs. Carbonic anhydrase present in red blood cells aids in the conversion
of carbon dioxide to carbonic acid and bicarbonate ions. When red blood cells reach the
lungs, the same enzyme helps to convert the bicarbonate ions back to carbon dioxide,
which we breathe out.
In plants, gaseous carbon dioxide is stored in the form of bicarbonate ions. Carbonic
anhydrase plays a role in converting bicarbonate ions back to carbon dioxide for
photosynthesis.
1.6.2 Catalytic action of Human Carbonic anhydrase-II:
Carbonic anhydrases are enzymes that catalyze the hydration of carbon dioxide and the
dehydration of bicarbonate:
CO2 + H2O ↔ HCO3- + H+
These carbonic anhydrase reactions are of great importance in a number of tissues.
Examples include (62):
• Parietal cells in the stomach secrete massive amounts of acid (i.e. hydrogen ions or
protons) into the lumen and a corresponding amount of bicarbonate ion into the blood.
• Pancreatic duct cells do essentially the opposite, with bicarbonate as their main
secretory product.
• Secretion of hydrogen ions by the renal tubules is a critical mechanism for
maintaining acid-base and fluid balance.
• Carbon dioxide generated by metabolism in all cells is removed from the body by
red blood cells that convert most of it to bicarbonate for transport, then back to
carbon dioxide to be exhaled from the lungs.
Carbonic anhydrases are metalloenzymes consisting of a single polypeptide chain
complexed to an atom of zinc (63). They are incredibly active catalysts, with a turnover rate
of about 10^6 reactions per second. An anhydrase is defined as an enzyme that catalyses
the removal of a water molecule from a compound, and so it is the “reverse” reaction
that gives carbonic anhydrase its name, because it removes a water molecule from
carbonic acid. A close-up rendering of the active site of human carbonic anhydrase II
shows three histidine residues and a hydroxide group coordinating the zinc ion at the
centre. The zinc prosthetic group in the enzyme is coordinated in three positions by
histidine side chains. The fourth coordination position is occupied by water. This causes
polarization of the hydrogen-oxygen bond, making the oxygen slightly more negative,
thereby weakening the bond. The properties of carbonic anhydrase can be summarised as (63):
• It lies in a deep pocket 15 Å from the protein surface.
• Zn is tetrahedrally coordinated.
• The ligands are three histidine groups (94, 96 and 119) and a water molecule.
• Ionisation occurs through general base catalysis by Glu 106 or Glu 117.
• The zinc also binds H2O (low pH) or hydroxyl (high pH).
Fig-1
Carbonic anhydrase catalyzes the reversible hydration of CO2 to form bicarbonate anion
and a proton:
CO2 + H2O ↔ HCO3- + H+
The steps in the hydration reaction are as follows (63):
1. Start (active site): The zinc is coordinated by the imidazole rings of three histidines
(94, 96 and 119) and an OH- ion. The geometry of the active site is tetrahedral.
2. CO2 binding: A water molecule (HOH 338) is displaced by CO2. The main-chain
-NH of Thr 199 orients and polarises the CO2 molecule.
3. Nucleophilic attack: The zinc-bound OH- attacks the carbon of CO2 to form HCO3-.
4. HCO3- dissociation: A water molecule (HOH 263) replaces the HCO3- product.
5. H+ dissociation/shuttle: The proton product dissociates from HOH 263 and is
transferred in three steps along a “wire” of H-bonded waters to His 64.
6. His 64 rotation/flip: The protonated His 64 side chain rotates from the “in”
position to the “out” position, where it is exposed to bulk solvent on the enzyme
exterior.
7. Return to start: After releasing the H+ product, His 64 rotates back to the starting (“in”)
position. The enzyme is ready for another cycle of catalysis (Fig. 2-3).
Fig-2
1.6.3 Inhibition of Carbonic Anhydrase-II:
Studies seeking correlations between the physico-chemical properties and the biological activity
of sulphonamides indicated the dominating role played by their proton-ligand formation
constant, more commonly known as the pKa of the sulphonamide (64-66).
At physiological pH, aromatic and heterocyclic unsubstituted sulphonamides
(R-SO2NH2), which are known to inhibit CA-II, have an ionisable sulphonamide group (pKa ~
6-10). Upon binding, the sulphonamide group displaces the water molecule from the zinc
coordination sphere. Substitution of a hydrogen of the -SO2NH2 group substantially decreases the
activity due to steric hindrance (67-68). The aromatic side chains of the sulphonamide interact
with hydrophobic amino acid residues in the binding site,
e.g. Phe131, Leu141, Val143 and Ala45, and stabilize the binding. Unsubstituted amides,
i.e. R-CO2NH2, such as urethane and phenyl carbamate, are a second, albeit much less potent,
class of known CA-II inhibitors. In contrast to sulphonamides, these compounds are basic
and are much weaker CA-II inhibitors. Anions such as SCN-, ClO4- and I- are also weak
inhibitors, with Ki (binding constant) values of 8-13/um (69-70). The CA-II inhibition
mechanism by sulphonamides is shown in Fig. 3 (60).
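To make the role of pKa concrete, the fraction of a weak acid present in its deprotonated (ionised) form at a given pH can be estimated from the Henderson-Hasselbalch relation, fraction = 1 / (1 + 10^(pKa − pH)). This relation, the function name `fraction_ionised` and the choice of pH 7.4 as the physiological value are additions made here for illustration; they are not taken from the cited studies.

```python
def fraction_ionised(pka, ph=7.4):
    """Fraction of an acidic group (e.g. -SO2NH2) deprotonated at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Scan the pKa range quoted in the text for unsubstituted sulphonamides.
for pka in (6.0, 7.4, 8.0, 10.0):
    print(f"pKa {pka:4.1f}: {100 * fraction_ionised(pka):5.1f}% ionised at pH 7.4")
```

Under these assumptions a sulphonamide with pKa 6 is almost fully ionised at physiological pH, whereas one with pKa 10 is almost entirely neutral, which suggests why pKa is a natural descriptor to include in a QSAR for this class of inhibitors.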
Fig.-3
1.7 Multiple Regression Analysis:
The term “regression” was first coined in the nineteenth century by Francis Galton, a cousin
of Charles Darwin, to describe a biological phenomenon (71). It was later extended by
G. U. Yule and K. Pearson (72) to a more general statistical context, and R. A. Fisher (73)
advanced this work further.
Regression analysis is defined as the analysis of relationships among variables (in our
case the molecular descriptors) for building predictive models. It is one of the most widely used
statistical tools because it provides a simple method for establishing relationships among
variables. Regression analysis can be understood from the following chart (74):
Regression Analysis
Regression analysis describes the relationship between
- independent variables xi (definition: can be determined without experimental error),
and
- dependent variables yi (contain experimental error).
It is a mathematically exact procedure for the treatment of data with experimental errors
(mean value, standard deviation).
Minimization of the sum of squared errors (= squared deviations between yi and ycalc)
produces the best fit of the observed values to a certain model.
Hypothesis: there is a significant relationship (95% level) between xi and yi values: yes/no
The variables used are of two kinds: dependent and independent. In drug design, activity
is taken as the dependent variable, while the parameters are considered as independent
variables. If only one variable is used for modeling the activity, the regression is called a
simple regression and the corresponding expression is called a simple regression equation.
In multiple regression analysis, the expression contains more than one independent
variable.
The regression expression takes the following form:
Activity = y = b0 + b1x1 + b2x2 + b3x3 + … (1)
where x1, x2, x3, etc. are the correlating parameters/independent variables/molecular
descriptors used to develop statistically significant model(s); b0 is a constant, while b1,
b2, b3, etc. are the coefficients of the molecular descriptors used. The coefficients are the
values in the regression equation used for predicting the dependent variable from the
independent variables. They tell us about the nature of the relationship between the variables:
the sign and magnitude of bi (i = 1, 2, …) determine how and to what extent each molecular
descriptor contributes to a statistically significant model. Such a model/relationship is
usually called a Quantitative Structure-Activity Relationship (QSAR) (75). It is worth noting
that molecular descriptors are directly related to the structure of the organic molecules
acting as drugs. Where topological indices are used as molecular descriptors, the
model provides a 1:1 correlation between structure and activity, because a
topological index is a numerical representation of structure. The explicit determination of
the regression equation is taken as the final product of the QSAR analysis. The model, i.e.
the regression equation obtained, may be used to evaluate the importance of the molecular
descriptors (topological indices) used, to analyze the effect of
changing the values of the molecular descriptors, or to forecast biological activity for a given
set of molecular descriptors. We need to examine some of the basic characteristics of
multiple regression. In a QSAR model, the goal is to develop a formula for making
predictions about the biological activity, based on the observed values of the molecular
descriptors, specifically topological indices (76). For prediction studies, multiple regression
makes it possible to combine many molecular descriptors (topological indices) to produce
optimal predictions of the biological activity. Multiple regression is also able to
separate the effects of the molecular descriptors (topological indices) on the biological
activity, so that one can examine the unique contribution of each molecular descriptor
(topological index).
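The fitting of a multiple regression equation of the form Activity = b0 + b1x1 + b2x2 + … can be sketched in a few lines. The example below is a minimal illustration that solves the least-squares normal equations (XᵀX)b = Xᵀy by Gaussian elimination using only the Python standard library; the function names `solve` and `fit_mlr` and the data set are hypothetical, and a production analysis would normally rely on an established statistics package instead.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_mlr(descriptors, activity):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    rows = [[1.0] + list(d) for d in descriptors]  # leading 1.0 gives b0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, activity)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: two descriptors per compound, activities generated
# from Activity = 1 + 2*x1 + 0.5*x2 so the fit can be checked by eye.
descriptors = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
activity = [4.0, 5.5, 9.0, 10.5, 14.0]

b0, b1, b2 = fit_mlr(descriptors, activity)
print(f"Activity = {b0:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")
```

The sign and size of each fitted coefficient can then be read off directly, which is the sense in which the regression equation separates the contributions of the individual descriptors.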
In the last three decades, statisticians have developed many more sophisticated methods
that achieve similar goals (77-79). Among these methods, logistic regression, Poisson regression,
structural equation models and survival analysis are of particular importance.
We need to examine the various statistical parameters used to describe a model as “best”. Of
these, R-squared, adjusted R-squared, t-statistics, standard error and multicollinearity
are the most widely used and are explained here in brief.
Multiple regression always produces the “best” set of linear predictions for a given set of
data. The most common statistical parameter for judging this is the coefficient of
determination, R2 (pronounced “r-squared”), considered a measure of how “good”
the model is. R2 is the proportion of variance in the dependent variable which can be
predicted from the independent variables. It is an overall measure of the strength of association
and does not reflect the contribution of each particular independent variable. Its value
always lies between 0 and 1. Researchers feel terrific if they get an R2 of 0.75 and
terrible if the R2 is only 0.10. While it is certainly true that higher is better, a small R2
alone is no reason to reject a model; despite a small R2 we can still get a clear confirmation.
Eq. (1) is more correctly written with the standard error of each coefficient given in
parentheses, that is, as
Activity = b0 + b1 (±) x1 + b2 (±) x2 + … (2)
where each (±) entry stands for the standard error of the coefficient preceding it.
In the above equation we assume that x1, x2, etc. are the true molecular descriptors.
If we divide each coefficient by its standard error, we get its t-statistic. For reasonably
large samples, the t-statistic has a distribution that is essentially a standard normal
distribution when the true coefficient is zero. Another statistical parameter is the
adjusted R2, symbolized as R2A. The adjusted R2 attempts to yield a more honest estimate of
the R2 for the test set. Regression models with many independent variables have a natural
advantage over models with few independent variables in predicting the dependent variable;
the adjusted R2 removes that advantage. It is a modification of R2 that adjusts for the
number of independent variables, and it is always less than or equal to the original R2.
When the number of observations is small and the number of predictors is large, there will
be a much greater difference between R2 and the adjusted R2; when the number of
observations is large relative to the number of predictors, the difference is small.
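The adjustment can be illustrated with the commonly used formula R2A = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of compounds and k the number of descriptors. This particular formula, the function name and the numbers below are assumptions for illustration; the text above does not state which variant was used.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R2 for n observations and k independent variables (intercept not counted)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than fitted parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same hypothetical R2 and number of descriptors, two different sample sizes.
small_sample = adjusted_r_squared(0.80, n=12, k=3)   # approximately 0.725
large_sample = adjusted_r_squared(0.80, n=100, k=3)  # approximately 0.794
print(small_sample, large_sample)
```

With only 12 compounds the penalty for three descriptors is substantial, whereas with 100 compounds the adjusted value is almost the same as R2.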
The standard error of estimate is also an important statistical parameter. It can be
interpreted as the standard deviation of the dependent variable after the effects of the
independent variables have been removed. If we perfectly predict the dependent variable
from the set of independent variables (which corresponds to an R2 of 1.0), the standard
error of estimate is zero. On the other hand, when R2 is zero (no predictive power), the
standard error of estimate is essentially the same as the standard deviation of the
dependent variable.
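The standard error of estimate, the standard errors of the coefficients and the resulting t-statistics can be sketched together as follows. The function regression_summary and the five data points are hypothetical; the formulas are the usual least-squares ones, with n - p degrees of freedom for p fitted parameters.

```python
import numpy as np

def regression_summary(X, y):
    """Least-squares fit returning coefficients, their standard errors and t-statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept column
    p = A.shape[1]                            # number of fitted parameters
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - p)              # residual variance
    se_coef = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    return {
        "coef": coef,
        "se_est": np.sqrt(s2),                # standard error of estimate
        "se_coef": se_coef,                   # standard error of each coefficient
        "t": coef / se_coef,                  # coefficient divided by its standard error
    }

# Hypothetical one-descriptor example (made-up values).
summary = regression_summary([[1], [2], [3], [4], [5]], [2, 4, 5, 4, 5])
print(summary["coef"], summary["se_est"], summary["t"])
```

The se_coef values are the quantities that appear in parentheses after each coefficient when an equation is written in the form of eq. (2).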
In evaluating any regression model, it is just as important to think about what is not in
the model as what is in it. There are two possible reasons for putting a variable in a
regression model. The first is that we want to know the effect of the variable on the
dependent variable; the second is that we want to control for the variable. It is worth
mentioning, however, that multiple regression models make no distinction between the
study variables and the control variables (80).
To decide whether a control variable is needed, one asks whether that variable has a
causal effect on the dependent variable and whether it is correlated with the variable
that is the focus of the study. If the answer to both questions is “yes”, the variable
should be included in the model. If we conclude that a particular variable has a strong
effect on the dependent variable but is unrelated to the independent variables already in
the model, there is no need to include it.
Sample size (the number of compounds in drug modeling) has a profound effect on tests of
statistical significance. One should keep sample size in mind when looking at the results
of significance tests. The general principle is: “In a small sample, statistically
significant coefficients should be taken seriously, but a non-significant coefficient is
extremely weak evidence for the absence of an effect”.
Statisticians often describe small samples as having low power to test hypotheses. There
is another, entirely different problem with small samples that is frequently confused with
the issue of power. Most of the test statistics that researchers use, such as t-tests,
F-tests and chi-square tests, are only approximations. These approximations are usually
quite good when the sample is large but may deteriorate markedly when the sample is small.
One must also look carefully at the magnitude of the coefficient to see if it is large
enough to have theoretical or practical importance (81).
Thus multiple regression produces the correlation equation used for modeling. The meaning
of the different statistical parameters that accompany a correlation equation is given below:
Meaning of statistical parameters in a correlation equation:
Log Ki = 1.15 (±0.2) J - 1.46 (±0.4) F + 7.82 (±0.2)
(n = 25; r2 = 0.945; AR2 = …; Se = 0.196; F = 78.6; Q = 0.841)
Log Ki: binding constant that causes inhibition
1.15, -1.46, 7.82: values of the regression coefficients and the constant term
(±0.2), (±0.4), (±0.2): 95% confidence limits of the coefficients and the constant term
J, F (in the equation): parameters (molecular descriptors)
n: number of compounds
r2: correlation coefficient, a measure of the relative quality of the model
AR2: modification of R2 (adjusted R2)
Se: standard error of estimate (standard deviation of the dependent variable)
F (in the statistics): Fisher value, a measure of statistical significance
Q: Pogliani's quality factor, a measure of internal predictivity
Multiple regression is designed precisely to separate the effects of two or more
independent variables on a dependent variable when the independent variables are
correlated with one another, but there is a limit to what regression can do. This problem
goes by the name of multicollinearity, and its extreme case is the one where two variables
are perfectly correlated.
Multicollinearity is a statistical phenomenon in which two or more predictor variables in
a multiple regression model are highly correlated, meaning that one can be linearly
predicted from the others with a non-trivial degree of accuracy. Multicollinearity does
not have to be so extreme to cause problems and, unfortunately, those problems often go
undetected. Multicollinearity has another effect: the possibility of concluding that two
variables have no effect when one or the other of them actually has a strong effect.
Multicollinearity is something that nearly all users of multiple regression have heard
about. A methods section should never claim that “multicollinearity is not present”, since
this will generally be untrue. A better statement is something along the lines of “there
was no problem with multicollinearity”. Multicollinearity does not reduce the predictive
power or reliability of the model as a whole; it only affects calculations regarding
individual predictors. It is therefore a problem if one is interested in the effects of
individual predictors. Multicollinearity reduces the effective amount of information
available to assess the unique effects of the predictors.
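The point that multicollinearity leaves the overall fit intact while degrading the estimates for individual predictors can be illustrated with a small simulation. Everything in this sketch is hypothetical: the data are random numbers drawn with a fixed seed, and ols_with_se is a helper written only for this example.

```python
import numpy as np

def ols_with_se(X, y):
    """Least-squares fit returning coefficients, their standard errors and R2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, se, r2

rng = np.random.default_rng(0)                 # fixed seed so the run is repeatable
n = 50
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                  # second descriptor unrelated to x1
x2_coll = x1 + 0.05 * rng.normal(size=n)       # second descriptor nearly identical to x1
noise = 0.5 * rng.normal(size=n)

# The same underlying relationship is used in both cases.
y_indep = 1.0 + 2.0 * x1 + 1.0 * x2_indep + noise
y_coll = 1.0 + 2.0 * x1 + 1.0 * x2_coll + noise

_, se_indep, r2_indep = ols_with_se(np.column_stack([x1, x2_indep]), y_indep)
_, se_coll, r2_coll = ols_with_se(np.column_stack([x1, x2_coll]), y_coll)

print("R2:", r2_indep, r2_coll)
print("standard error of b1:", se_indep[1], se_coll[1])
```

In a run of this kind both models fit the data well, but the standard error of the coefficient of x1 is many times larger when the second descriptor is nearly a copy of the first, so its individual effect can no longer be estimated reliably.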
Beyond those truths, there is an enormous amount of confusion and mythology surrounding
multicollinearity. Multicollinearity may be extreme or near-extreme. Extreme
multicollinearity means that at least two of the independent variables in a regression
equation are perfectly related by a linear function. Suppose we are trying to estimate
the model:
Y = A + B1X1 + B2X2 + B3X3 + U (3)
Suppose also that in our sample it happens to be the case that
X1 = 2 + 3X2 (4)
Then the correlation between X1 and X2 is 1.0 and we have a case of extreme
multicollinearity. Extreme multicollinearity makes it impossible to get separate estimates
for the coefficients B1 and B2. Recall that multiple regression separates out the effects
of two or more variables even though they are correlated with each other. To do this,
there must be some remaining variation in each X variable when the other X variables are
held constant. If two variables are perfectly correlated, then when one is kept constant
the other must be constant as well. Hence, it is impossible to separate their effects on
the dependent variable. Multicollinearity only affects the coefficient estimates for those
variables that are collinear. This is true for both extreme and near-extreme
multicollinearity. To check for the effects of multicollinearity, the variance inflation
factor (VIF) and condition numbers are examined, and ridge regression analysis may be
carried out.
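The extreme case of eqs. (3) and (4) and the two diagnostics named above can be sketched as follows. The functions vif and condition_number, the convention of reporting an infinite VIF when the auxiliary R2 is numerically 1, and the data values are assumptions for illustration; ridge regression itself is not shown.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    values = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2_j = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        # Treat a numerically perfect auxiliary fit as infinite inflation.
        values[j] = np.inf if np.isclose(r2_j, 1.0) else 1.0 / (1.0 - r2_j)
    return values

def condition_number(X):
    """Ratio of largest to smallest singular value of the standardized descriptor matrix."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.cond(Z)

# Hypothetical sample in which X1 = 2 + 3*X2 holds exactly, as in eq. (4).
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x1 = 2.0 + 3.0 * x2
x3 = np.array([0.3, 1.1, 0.2, 0.9, 0.4, 1.5])
X_extreme = np.column_stack([x1, x2, x3])

# Design matrix of eq. (3): four columns but only three independent directions.
design = np.column_stack([np.ones(len(x2)), X_extreme])
rank = np.linalg.matrix_rank(design)

vif_extreme = vif(X_extreme)                   # infinite for X1 and X2, finite for X3
cond_extreme = condition_number(X_extreme)     # extremely large
vif_ok = vif(np.column_stack([x2, x3]))        # X1 dropped: both values finite
print(rank, vif_extreme, cond_extreme, vif_ok)
```

Because the design matrix has fewer independent columns than coefficients, B1 and B2 cannot be estimated separately, while the VIF of X3 is unaffected by the collinearity between X1 and X2.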
References:
1. Kier, L.B.; Hall, L.H., Quantitative Information Analysis: The New Centre of
Gravity in Medicinal Chemistry, Med. Chem. Res., 1997, 7, 335.
2. Balaban, A.T., From Chemical Graphs to 3D Molecular Modeling. In From Chemical
Topology to Three-Dimensional Geometry, Balaban, A.T., Ed.; Plenum, New York,
1997, 1.
3. Balaban, A.T.; Motoc, I.; Bonchev, D.; Mekenyan, O., Topological Indices for
Structure-Activity Correlations, Top. Curr. Chem., 1983, 114, 21.
4. Seydel, J.K.; Schaper, K.J., Chemische Struktur und biologische Aktivität von
Wirkstoffen: Methoden der Quantitativen Struktur-Wirkung-Analyse, Verlag Chemie,
Weinheim, 1979, 1.
5. Kubinyi, H., Quant. Struct.-Act. Relat., 2002, 21, 348.
6. Johnson, M.; Maggiora, G.M. Concepts and applications of molecular similarity,
John Wiley & Sons: New York, 2006.
7. Willett, P.; Barnard, J.M.; Downs, G.M. J. Chem. Inf. Comput. Sci. 1998, 38, 983.
8. Andrew G.C.; Graham R.W. Perspectives in Drug Discovery and Design, Stevenage
Netherlands: UK. 1998, 321.
9. Engel, T., J. Chem. Inf. Model. 2006, 46, 2267.
10. Willett, P., J. Med. Chem., 2005, 48, 4183.
11. Willett, P., Drug Discov. Today, 2006, 11, 1046.
12. Medina-Franco, J. L.; Maggiora, G. M.; Giulianotti, M. A.; Pinilla, C.; Houghten, R.
A. Chem. Biol. Drug Des., 2007, 70, 393.
13. Martínez-Mayorga, K.; Medina-Franco, J.L.; Giulianotti, M.A.; Pinilla, C.; Dooley,
C.T.; Appel, J. R.; Houghten, R.A. Bioorg. Med. Chem. 2008, 16, 5932.
14. Breneman C. M.; Bennett, K. P.; Embrechts, M. J.; Bi, J.; Demiriz, A.; Lockwood,
L.; Momma, M.; Sukumar, N., 21st National Meeting, American Chemical Society,
San Diego, 2001.
15. Winkler, D. A. Mol. Biotechnol., 2004, 27(2), 138.
16. Hopfinger, A.; Wang, S.; Tokarski, J.; Jin, B.; Albuquerque, M.; Madhav, P.;
Duraiswami, C. J. Am. Chem. Soc., 1997, 119, 1050.
17. Livingstone, D.J., Predicting Chemical Toxicity and Fate, CRC Press LLC: Boca
Raton, FL, 2004, 151.
18. Stuper, A. J.; Jurs, P. C., J. Chem. Inf. Comput. Sci. 1976, 16, 99.
19. Todeschini, R.; Consonni, V., Handbook of Molecular Descriptors, Wiley-VCH,
Weinheim (Germany), 2000.
20. http://research.chem.psu.edu/pcjgroup/ADAPT.html
21. Mekenyan, O.; Bonchev, D., Acta Pharm. Jugosl., 1986, 36, 225.
22. Katritzky, A.R.; Lobanov, V.S., CODESSA, version 5.3, University of Florida,
Gainesville, 1994.
23. Molconn-Z, Ver. 4.05, Hall Consult., Quincy, MA, 2003.
24. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M., DRAGON: Software for the
calculation of molecular descriptors, version 5.4 for Windows, 2006, Talete srl, Milan,
Italy.
25. Devillers, J.; Balaban, A.T., Topological Indices and Related Descriptors in QSAR
and QSPR, Gordon and Breach Sci. Pub., Amsterdam, 1999, 130, 138, 210.
26. Karelson, M., Molecular Descriptors in QSAR/QSPR, Wiley-Interscience, New York,
2000.
27. Fauchère, J. L.; Boutin, J.A.; Henlin, J. M.; Kucharczyk, N.; Ortuno, J. C.,
Chemom. Intell. Lab. Syst., 1998, 43, 43.
28. Ivanciuc, O.; Balaban, A.T., The Graph Description of Chemical Structures, in:
Devillers, J.; Balaban, A.T. (Eds.), Topological Indices and Related Descriptors
in QSAR and QSPR, Gordon and Breach Science Publishers, Amsterdam, 1999, p. 59.
29. Estrada, E.; Patlewicz, G.; Uriarte, E., Ind. J. Chem., 2003, 42A, 1315.
30. Natarajan, B.R.; Kamalakanan, P.; Nirdosh, I., Ind. J. Chem., 2003, 42A, 1330.
31. Ivanciuc, O.; Klein, D.J., Croat. Chem. Acta, 2002, 75(2), 577.
32. Wiener, H., J. Am. Chem. Soc., 1947, 69, 2636.
33. Wiener, H., J. Am. Chem. Soc., 1947, 69, 17.
34. Randic, M., J. Am. Chem. Soc., 1975, 97, 6609.
35. Kier, L.B.; Hall, L.H., Molecular Connectivity in Structure-Activity Analysis,
Research Studies Press, Wiley, Chichester, 1986.
36. Kier, L.B.; Hall, L.H., Molecular Connectivity and Drug Research, Academic
Press, New York, 1976.
37. Balaban, A.T., Chem. Phys. Lett., 1982, 89, 399.
38. Balaban, A.T., Pure Appl. Chem., 1983, 55, 199.
39. Winkler, D.A., Brief. Bioinform., 2002, 3(1), 73.
40. Crum-Brown, A.; Fraser, T.R., Trans. R. Soc. Edinburgh, 1868, 25, 151, 693.
41. Hammett, L.P., Chem. Rev., 1935, 17(1), 125.
42. Hansch, C.; Maloney, P.P.; Fujita, T.; Muir, R.M., Nature, 1962, 194, 178.
43. Taft, R.W., J. Am. Chem. Soc., 1952, 74, 3126.
44. Hansch, C.; Fujita, T., J. Am. Chem. Soc., 1964, 86, 1616.
45. Hansch, C.; Leo, A.; Taft, R.W., Chem. Rev., 1991, 91(2), 165.
46. McFarland, J.W., J. Med. Chem., 1970, 13, 1092.
47. Hansch, C.; Leo, A., Exploring QSAR: Fundamentals and Applications in Chemistry
and Biology, ACS Professional Reference Book, American Chemical Society,
Washington, D.C., 1995.
48. Konemann, H., Toxicology, 1981, 19, 209.
49. Veith, G.D.; Call, D.J.; Brooke, L.T., Can. J. Fish. Aquat. Sci., 1983, 40, 743.
50. Scior et al. Current medicinal chemistry, 2009, 16, (32), 4298.
51. Supuran, C.T., Curr. Pharm Des. 2008, 14 (7) 603.
52. Maren, T. H., J. Glaucoma 1995, 4, 49.
53. Gray, W.D.; Maren, T.H.; Sisson, G.M.; Smith, F.H., J. Pharmacol. Exp. Ther.,
1957, 121, 160.
54. Meldrum, N.U.; Roughton, F.J.W., J. Physiol., 1933, 80(2), 113.
55. Moroney, J. V.; Bartlett, S. G.; Samuelson, G., Plant, Cell and Environment, 2001,
24, 141.
56. Parkkila, S., Springer, 2000, 90, 79.
57. Tripp, B.C.; Smith, K.; Ferry, J.G., J. Biol. Chem., 2001, 276, 48615.
58. Hewett-Emmett, D.; Tashian, R.E., Mol. Phylogenet. Evol., 1996, 5, 50.
59. Bradfield, J. R. G., Nature, 1947, 467.
60. Supuran, C.T.; Scozzafava, A.; Conway, J., Carbonic Anhydrase: Its Inhibitors and
Activators, CRC Press, New York, 2004.
61. Chegwidden, W.R.; Carter, N.D.; Edwards, Y.H., Carbonic Anhydrase: New
Horizons, Birkhäuser, Basel, Switzerland, 2000.
62. Ray, W. J., Biochemistry 1983, 22, 4625.
63. Lindskog S., Pharmacology and Therapeutics, 1997, 74 (1) , 1.
64. Duffel, M.W.; Ing, I.S.; Segarra, T.M.; Dixon, J.A.; Barfknecht, C.F.;
Schoenwald, R.D., J. Med. Chem., 1986, 29, 1488.
65. Hies, M. A.; Masereel, B.; Rolin, S.; Scozzafava, A.; Calpaeneu, G.; Cimpeaneu,
V.; Supuran, C.T., Bioorg. Med. Chem. 2004, 12, 2772.
66. Supuran, C. T.; Scozzafava, A.; Menabuoni, L.; Minicion, F.; Bribanys, C. T., Eu. J.
Pharm Sci., 1999, 8, 317.
67. Thakur, A.; Thakur, M.; Khadikar, P.V.; Supuran, C.T., Bioorg. Med. Chem. 2004,
12, 789.
68. Agrawal, V.K.; Khadikar, P.V., Bioorg. Med. Chem. Lett., 2003, 13, 447.
69. Verpoorte, J.A.; Mehta, S.I.; Edsall, J.T., J. Biol. Chem., 1967, 244, 1421.
70. Supuran, C. T., Nature Rev. Drug Discov. 2008, 7, 168.
71. Galton, F., Presidential Address, Section H, Anthropology, 1885.
72. Pearson, K.; Yule, G.U.; Blanchard, N.; Lee, A., Biometrika (Biometrika Trust),
1903, 2(2), 211.
73. Fisher, R.A., J. R. Stat. Soc. (Blackwell Publishing), 1922, 85(4), 597.
74. 3D-QSAR in Drug Design: Vol. 3, Recent Advances, edited by Kubinyi, H.;
Folkers, G.; Martin, Y.C., Kluwer Academic Publishers, New York, Boston,
Dordrecht, 2002.
75. Kubinyi, H., Quant. Struct.-Act. Relat., 1994, 13, 285.
76. Uriarte L., Curr. Top. Med. Chem., 2007, 7(10), 1015.
77. McFarland, J.W.; Gans, D.J., Quant. Struct.-Act. Relat., 1994, 13, 11.
78. Livingstone, D., Data Analysis for Chemists: Applications to QSAR and Chemical
Product Design, Oxford University Press, 1996.
79. Golbraikh, A.; Shen, M.; Xiao, Z.; Xiao, Y.D.; Lee, K.H.; Tropsha, A.,
J. Comput.-Aided Mol. Des., 2003, 17, 241.
80. Armstrong, J.S., Int. J. Forecasting, 2012, 28 (93), 689.
81. Chatterjee, S.; Hadi, A.S.; Price, B., Regression Analysis by Example (3rd ed.),
John Wiley and Sons, 2000.