CHAPTER 1

INTRODUCTION

1.1 General

Over the past two decades, the centre of gravity (the intellectual focus) of medicinal
chemistry has shifted dramatically from how to make a molecule to what molecule to
make.(1) The challenge now is to acquire the information needed to make appropriate
decisions about the use of resources in drug design. The input to the drug design
effort is therefore increasingly quantitative, building upon recent developments in
molecular structure description, combinatorial mathematics, statistics and computer
simulation. Collectively, these areas have led to a vital transformation in drug design,
which has been referred to as quantitative information analysis.(1)

Another approach to drug design is employed when there is a paucity of information
about an effector and its structure. This is generally the case when the design
goal is a physical or pharmacodynamic property for which there is no effector. This
approach depends upon the elucidation of information by probing a biological system with
a series of molecules. The relationship of molecular structure to some defined property is
crafted into a mathematical model. The term used for this approach is quantitative
structure-activity relationship (QSAR).(2,3) The importance of QSAR grew rapidly with
the explosive growth of combinatorial chemistry, which makes it possible to
synthesize and test thousands of compounds in a short time. The QSAR methodology is
essential for processing this mammoth amount of information into predictive models. From a
good-quality model it is possible to predict and design compounds for synthesis and
testing that have a good chance of being active.

QSAR is based on the hypothesis that changes in molecular structure are reflected in
proportional changes in the observed response or biological activity. It is thus a tool for
numerically estimating biochemical endpoints of interest for substances for which
experimental data are missing. On 16 June 2005, the International Academy of Mathematical
Chemistry (IAMC) was founded in Dubrovnik, Croatia, by Milan Randić. In 2009 the academy
had 81 members from twenty countries. One of its members, Jerome Karle, was awarded the
Nobel Prize.

1.2 Introduction to Mathematical QSAR modeling:

The present revolution in the drug discovery process derives considerable thrust from
recent progress in combinatorial chemistry and high-throughput computational
techniques. Random testing of chemicals is not the best way to obtain the
maximum possible information from a combinatorial library, and the synthesis and
screening of a very large number of compounds is extremely costly. QSAR is a widely
accepted predictive and diagnostic tool for finding associations between chemical
structure and biological activity. It emerged and evolved in response to
the medicinal chemist’s need and desire to predict biological response.(4) In a seminal
paper, Kubinyi describes the history of QSAR.(5)

Early in the development of modern science, in the 19th century, the French philosopher
Auguste Comte commented on the progress of early chemistry: “Every
attempt to employ mathematical methods in the study of chemical questions must be
considered profoundly irrational and contrary to the spirit of chemistry.” Furthermore: “If
mathematical analysis should ever hold an important place in chemistry, it would
occasion a swift and general degeneration of that science.” Later, however, the integration
of structure and activity led to one of the most admired research areas, viz. QSAR.

The medicinal chemist's intuitive mindset rests on the assumption that there is an inherent
association between chemical structure and biological activity. This creates a platform
for using mathematical tools to correlate structural descriptors (predictors, regressors)
with biological activity (the response or target variable). Although researchers rely on the
fundamental similarity principle, which states that similar structures have similar
activities,(6) and despite significant advances in computer-based similarity searching and
similarity-based virtual screening,(7-9) it is still a challenge to devise meaningful concepts
and uses of structural similarity.(10-17) QSAR offers different types of computational
methods for different levels of data complexity:(18) two-dimensional (2D),
three-dimensional (3D) and higher-dimensional methods. At present, QSAR science,
founded on the systematic use of mathematical models and on the multivariate point of
view, is one of the basic tools of modern drug and pesticide design and has an increasing
role in the environmental sciences. QSAR models exist at the intersection of chemistry,
statistics and biology. The data used in QSAR evaluation are either obtained from the
literature or generated specifically for QSAR-type analysis. A structure-activity model is
defined and limited by the nature and quality of the data used in its development and
should be applied only within the model’s applicability domain.

The ideal QSAR should: (1) consider a sufficient number of molecules for a reasonable
statistical representation; (2) cover a wide range of quantified end-point potency (i.e.
several orders of magnitude) for regression models; (3) be applicable to the reliable
prediction of new chemicals (validation and applicability domain); and (4) allow
mechanistic information on the modeled endpoint to be obtained.

The limiting factor in developing QSARs is the availability of high-quality experimental
data. In QSAR analysis, it is imperative that the input data be both accurate and precise
if a meaningful model is to be developed. In fact, it must be emphasized that any resulting
QSAR model is only as valid statistically as the data that led to its development.

A variety of properties have also been used in QSAR modeling. These include
physico-chemical, quantum chemical and binding properties. Examples of molecular properties
are electron distribution, spatial disposition (conformation, geometry and shape) and
molecular volume. Physico-chemical properties include descriptors for the hydrophobic,
electronic and steric properties of a molecule, as well as properties such as solubility
and ionization constant. Quantum chemical properties include charge and energy values.
Binding properties are concerned with biological macromolecules and are important in
receptor-mediated responses. In the modern QSAR approach, it is becoming quite common to
use a wide set of theoretical molecular descriptors of different kinds, able to capture all
the structural aspects of a chemical and so translate the molecular structure into numbers.
A molecular descriptor is the final result of a logical and mathematical procedure which
transforms chemical information encoded in a symbolic representation of a molecule into
a useful number, or the result of some standardized experiment.(19) The term “useful”
has a double meaning: the number can give more insight into the
interpretation of the molecular properties and/or can take part in a model for the
prediction of some interesting property of the molecule.(19) Different descriptors are
different ways or perspectives of viewing a molecule, taking into account the various
features of its chemical structure: not only counts of atoms or groups, but also
two-dimensional features from the topological graph or three-dimensional features from a
minimum-energy conformation. Many software packages calculate wide sets of theoretical
descriptors, from 2D graphs to 3D x, y, z coordinates. Some of the more widely used are
ADAPT,(20-21) CODESSA,(22) Molconn-Z(23) and DRAGON.(24)

It has been estimated that more than 3000 molecular descriptors(19,25-26) are now
available, and most of them have been summarized and explained. Modeling methods
used in the development of QSARs are of two types according to the modeled response: the
potency of an end-point (a defined value such as EC50) or a category/class (such as
mutagen/non-mutagen). For potency modeling, the most widely used mathematical technique is
Multiple Regression Analysis (MRA). Regression analysis is a simple approach to
developing a statistical model that can predict the values of a dependent (response)
variable from the values of the independent (explanatory) variables. It attempts to
explain the variation in the dependent variable using the variation in the
independent variables. The most valuable and correct use of regression is in making
predictions. Regression is often read as an explanation of causation, although a good fit
by itself does not establish a causal link. If the independent variable(s)
sufficiently explain the variation in the dependent variable, the model can be used for
prediction. This leads to a result that is easy to understand, and for this reason most
QSARs are derived using regression analysis. Regression analysis is a powerful means
of establishing a correlation between independent variables (molecular descriptors X) and
a dependent variable Y, such as biological activity:

$Y = b + aX_1 + cX_2 + \dots$

where b is a constant and a, c are the regression coefficients of the molecular descriptors.
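As an illustration of the regression relationship described above, the following minimal Python sketch fits a two-descriptor model by ordinary least squares through the normal equations. The function name fit_mlr, the descriptor values and the activities are hypothetical, chosen only so that the expected constant and coefficients are known in advance; it is a sketch under these assumptions, not the procedure used in this work.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = b + a1*x1 + a2*x2 + ... via the normal equations."""
    n, p = len(X), len(X[0]) + 1
    # Design matrix with a leading column of ones for the constant term.
    A = [[1.0] + list(row) for row in X]
    # Augmented normal equations: (A^T A | A^T y).
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
         + [sum(A[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

# Hypothetical data: two descriptors per compound, activities generated
# from Y = 1 + 2*X1 + 3*X2 so the fit can be checked by eye.
X = [[0.5, 1.0], [1.0, 0.0], [1.5, 2.0], [2.0, 1.0], [2.5, 3.0]]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]

b, a, c = fit_mlr(X, y)
print(f"Y = {b:.3f} + {a:.3f}*X1 + {c:.3f}*X2")
```

With real data the activities would not lie exactly on a plane, and the fitted coefficients would then carry standard errors.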

The current trend is the efficient use of computational techniques to
increase the drug-like character and diversity of the compounds proposed for
study.(27-28) Estrada and coworkers(29) reviewed the use of topological indices in
drug design and discovery. Natarajan and Nirdosh(30) worked on the application of
topological indices to QSAR modeling and the selection of mineral collectors.

To be successful, this process of in silico compound selection must incorporate
additional target-specific information (experimental inhibition values). Conventionally, the
computational screening of chemical libraries is a four-step process:(31)

(i) Assembly of the compounds from a group of building blocks;

(ii) Computation of a set of structural descriptors for each chemical compound;

(iii) Dimensionality reduction, by selecting from the descriptor set a chemical space that
is relevant for the investigated target;

(iv) Compound selection with a statistical parameter that implements a similarity,
diversity, or drug-like paradigm.
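The similarity-based selection in step (iv) can be sketched as follows. The Tanimoto coefficient is used here as one common similarity measure; it is an assumption of this example and is not named in the text. The fingerprints (sets of "on" bit positions) and compound names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    # Two empty fingerprints are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0

# Hypothetical library and query fingerprints (illustration only).
library = {
    "cmpd_1": {1, 4, 7, 9},
    "cmpd_2": {1, 4, 8},
    "cmpd_3": {2, 3, 5},
}
query = {1, 4, 7}

# Rank library compounds by decreasing similarity to the query.
scores = {name: tanimoto(query, fp) for name, fp in library.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A diversity-based selection would use the same coefficient but prefer compounds with low mutual similarity.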

A large number of structural descriptors, many of them traditionally used in QSAR, are
used to translate the structural features of molecules into numerical form:
physico-chemical or empirical (log P, molecular polarizability); constitutional (number of
aromatic rings, number of rotatable bonds, number of hydrogen-bond donors, number of
hydrogen-bond acceptors); structure keys and fingerprints; graph invariants (cyclomatic
number, atom pairs, path counts); topological indices(28) (Wiener,(32-33) Randić,(34) Kier
and Hall,(35-36) Balaban(37-38)); geometric (polar surface area, molecular volume); quantum
(HOMO energy, atomic charges); and grid (various steric, electrostatic, and lipophilic
fields).

Three major categories of descriptors exist in QSAR work: (1) The simplest type
(one-dimensional information) provides constitutional information such as molecular weight
and the number and types of atoms and bonds in a condensed formula, i.e. counts for a whole
molecule or for its chemical fragments and groups, such as the number of aliphatic ethers or
tertiary amines. (2) 2-D descriptors are based on two-dimensional properties of either
fragments (substituent constants σ, π, MR) or the whole molecule (log P, reactivity); they
include topological indices based on atoms and their bond connectivities. (3) 3-D
descriptors reflect the three-dimensional nature of molecular structure (conformation,
isomerism) and the surrounding space (stereochemistry). Another way to classify descriptors
is according to fragmental or whole-molecule properties.(39)

QSAR studies offer a rich variety of structural descriptors, while the number of
compounds in a typical QSAR study is usually between 10 and 100 and can easily exceed this.
To be efficient, in silico compound screening must use descriptors that require
small computational resources, which explains the wide popularity of counts of atom
types, counts of functional groups, fingerprints, constitutional descriptors, graph
invariants and topological indices.
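To show how cheaply a topological index can be computed, the sketch below evaluates the Wiener index (the sum of shortest-path distances over all pairs of atoms in the hydrogen-suppressed graph) with a breadth-first search. The adjacency-list encoding and the function name wiener_index are choices made for this illustration only.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered pairs of vertices."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives topological distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    # Every pair was counted twice (once from each end).
    return total // 2

# Hydrogen-suppressed carbon skeletons of two C4H10 isomers.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), wiener_index(isobutane))  # 10 9
```

The more branched isomer has the smaller index, which is the kind of structural information such descriptors encode.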

1.3 Historical Background and various approaches used in Modeling:

QSARs are based on the assumption that the structure of a molecule (i.e. its geometric,
steric and electronic properties) must contain the features responsible for its physical,
chemical and biological properties, and on the ability to represent the chemical properties
by one or more numerical descriptors. It has been nearly 40 years since QSAR modeling
was first introduced into the practice of agrochemistry, drug design, toxicology, and
industrial and environmental chemistry. Its growing power in the following years may also be
attributed to the rapid and extensive development of methodologies and computational
techniques that have made it possible to delineate and refine the many variables involved.
The approaches used in the modeling are as follows:

Crum-Brown-Fraser approach:

More than a century ago, Crum-Brown and Fraser(40) expressed the idea that the
physiological action of a substance in a certain biological system (Φ) was a function
(f) of its chemical constitution (C):

$\Phi = f(C)$

Thus, an alteration in chemical constitution, ΔC, would be reflected by an alteration in
biological activity, ΔΦ. In the following years, on the physical organic front, the seminal
work of Hammett gave rise to the σ-ρ culture.(41)

Hansch QSAR approach:

In 1962, Hansch et al.(42) published their study on the structure-activity relationships of
plant growth regulators and their dependence on Hammett constants and
hydrophobicity.(43-44) Using the octanol/water system, a whole series of partition
coefficients was measured, and thus a new hydrophobic scale was introduced. The
parameter π, which is the relative hydrophobicity of a substituent, was defined in a
manner analogous to the definition of sigma:(41)

$\pi_X = \log P_X - \log P_H$

where P_X and P_H represent the partition coefficients of a derivative and the parent
molecule, respectively.
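As a small worked example, π for a substituent is the log P of the derivative minus the log P of the parent molecule. The sketch below applies this to a chloro substituent on benzene; the log P values are commonly quoted octanol/water figures used here only for illustration, and the function name substituent_pi is hypothetical.

```python
def substituent_pi(log_p_derivative, log_p_parent):
    """Hydrophobic substituent constant: pi_X = log P_X - log P_H."""
    return log_p_derivative - log_p_parent

# Commonly quoted octanol/water log P values (illustrative inputs).
LOG_P_BENZENE = 2.13        # parent molecule
LOG_P_CHLOROBENZENE = 2.84  # chloro derivative

pi_cl = substituent_pi(LOG_P_CHLOROBENZENE, LOG_P_BENZENE)
print(f"pi(Cl) = {pi_cl:.2f}")
```

A positive π indicates that the substituent makes the molecule more hydrophobic than the parent; a negative π indicates the opposite.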

The contributions of Hammett and Taft(41,43) together laid the basis for the development of
the QSAR paradigm by Hansch and Fujita,(42,44) which combines the
hydrophobic constant with Hammett’s electronic constant to yield the linear Hansch
equation and its many extended forms. There is consensus in current predictive
toxicology that Corwin Hansch is the founder of modern QSAR. In the classic article(45)
it was illustrated that, in general, the biological activity of a group of ‘congeneric’
chemicals can be described by a comprehensive model:

$\log(1/C) = a\pi + b\sigma + cE_s + d$

in which C, the toxicant concentration at which an endpoint is manifested (e.g. 50%
mortality or effect), is related to a hydrophobicity term, π, an electronic term, σ
(originally the Hammett substituent constant), and a steric term, E_s (typically Taft’s
substituent constant). The term π is a substituent constant denoting the difference in
hydrophobicity between a parent compound and a substituted analog; it has since been
replaced with a more general molecular term, the logarithm of the 1-octanol/water partition
coefficient (log Kow). Because of the curvilinear, or bilinear, relationship between
log 1/C50 and hydrophobicity normally found in single-dose tests, a quadratic π² term was
later introduced into the model.
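A quadratic hydrophobicity term implies that activity passes through an optimum hydrophobicity. As a short worked sketch, assume a parabolic model in log Kow alone; the coefficient symbols k_1, k_2 and k_3 are introduced here purely for illustration and are not taken from the cited work.

```latex
\begin{align*}
\log(1/C) &= -k_1 (\log K_{ow})^2 + k_2 \log K_{ow} + k_3, \qquad k_1 > 0 \\
\frac{\mathrm{d}\,\log(1/C)}{\mathrm{d}\,\log K_{ow}} &= -2 k_1 \log K_{ow} + k_2 = 0 \\
(\log K_{ow})_{\mathrm{opt}} &= \frac{k_2}{2 k_1}
\end{align*}
```

Compounds more or less hydrophobic than this optimum are predicted to be less active, which is the curvilinear behaviour the quadratic term was introduced to capture.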

The rationale for the above equation was given by McFarland.(46) He hypothesized that the
relative activity of a biologically active molecule, such as a toxicant, depends on (i) the
probability (Pr1) that the toxicant reaches its site of action, (ii) the probability (Pr2)
that the toxicant will interact with the target at this site, and (iii) the external
concentration or dose.

The delineation of these models led to explosive development in QSAR analysis and
related approaches.(47)

In the years after the 1960s, the need to solve new problems, together with the
contributions of many other investigators, generated thousands of variations of the Hansch
approach to QSAR modeling, as well as approaches that are formally completely new. Hans
Könemann(48) and Gilman Veith,(49) who in the early 1980s developed multiclass-based,
hydrophobicity-dependent models for industrial organic chemicals, must share credit for the
revival of QSAR.

It is evident from an analysis of the literature that the QSAR world has undergone profound
changes since the pioneering work of Hansch, considered the founder of modern QSAR
modeling.(44-45,47)

1.4: Typical Steps in QSAR Modeling(50):

The following steps should be followed while developing a QSAR model:

1. Response data collection: Experimental measurements (imperfections) and
biological material (variations) are error-prone, but the resulting data should be normally
distributed. Systematic errors should be absent.

2. Selection of congeners: Congeners should be similar enough to guarantee both the same
interaction mechanism and a wide potency range of several log units.

3. Clustering: Divide the series into chemical groups of more specific homologous
variations.

4. 3D-QSAR: Build molecular models and perform conformational analysis and
alignment.

5. Descriptor selection: Calculate parameters that numerically represent the structural
features of the compounds.

6. Model generation by statistical means: PCA for complexity reduction; SLR,
MLR and PLS for linear regression; clustering and factorial design.

7. Internal model validation: Use LOO (leave-one-out) cross-validation to improve the Q2
criterion.

8. Test set: Evaluate the preliminary equations on the test set.

9. Interpretation: Interpret the final model on the basis of various statistical
parameters, e.g. R2, AR2, Fisher value (F-value), Pogliani’s factor (Q-number), Se, etc.
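The internal validation in step 7 can be sketched for the simplest case of one descriptor. The example below assumes one common definition, Q2 = 1 - PRESS/SS_tot, where PRESS is the sum of squared leave-one-out prediction errors; the descriptor and activity values are hypothetical, and the function names are chosen for this illustration only.

```python
def fit_line(x, y):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def r2_and_q2(x, y):
    """Fitted R2 and leave-one-out cross-validated Q2 for y = b0 + b1*x."""
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    b0, b1 = fit_line(x, y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (c0 + c1 * x[i])) ** 2
    return 1.0 - ss_res / ss_tot, 1.0 - press / ss_tot

# Hypothetical descriptor values and activities (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

r2, q2 = r2_and_q2(x, y)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

Q2 never exceeds the fitted R2 for a least-squares model, so a large gap between the two is a warning sign of overfitting.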

1.5: Rationale and Objective:

It is interesting to note that 16 different forms of carbonic anhydrase (CA) occur
in the mammalian body, each having a specific function.(51) Diseases caused by
problematic acid-base secretion chemistry in the body, particularly in the eye, have been
linked to the dysfunctional activity of several types of carbonic anhydrase.(52) Excess
secretion of aqueous humour in the eye can produce a pressure gradient that permanently
damages eye tissue. Drugs that reduce the rate of formation of aqueous
humour can be used to treat diseases such as macular edema and open-angle glaucoma. It is
believed that certain CA-II enzymes contribute to the secretion of aqueous humour through
the production of bicarbonate ions.(53)

The objective of the present work is to develop QSAR models to predict the inhibition values
of aromatic and heteroaromatic sulphonamides towards the CA-II isozyme. These models could
help in finding potential drugs to treat eye diseases such as glaucoma and macular
edema. Furthermore, they may give insight into which molecular
features are important when developing inhibitors of the CA-II enzyme, in the
hope that the pathway or mechanism of inhibition can be more clearly understood.

1.6 About Carbonic Anhydrase:

1.6.1 Versatility of Carbonic Anhydrase:

Carbonic anhydrase (CA) is an enzyme that assists the rapid interconversion of carbon
dioxide and water into carbonic acid, protons and bicarbonate ions. This enzyme was first
identified in 1933, in the red blood cells of the ox, by Meldrum and Roughton.(54) Since
then, it has been found to be abundant in all mammalian tissues, plants, algae and
bacteria.(55) Mammalian carbonic anhydrase(56) occurs in about 16 slightly different forms,
depending upon the tissue or cellular compartment in which they are located. These isozymes
have some sequence variations, leading to specific differences in their activity. The
isozyme found in some muscle fibers has low enzyme activity compared to that secreted by
the salivary glands. While most carbonic anhydrase isozymes are soluble and secreted, some
are bound to the membranes of specific epithelial cells.(57)

Carbonic anhydrases are important enzymes found in red blood cells, gastric mucosa,
pancreatic cells and renal tubules. The catalytic action of CA is fundamental to respiration
and the transport of CO2 between metabolizing tissues and excretion sites; the secretion of
electrolytes in a variety of tissues and organs; pH regulation and homeostasis; CO2
fixation (in algae and green plants);(58-59) several metabolic biosynthetic pathways, such
as gluconeogenesis, lipogenesis and ureagenesis; and bone resorption, calcification and
tumorigenicity (in vertebrates).(60)

This ancient enzyme has five different classes, namely α, β, γ, δ and ε.(60-61) Members of
these different classes share very little sequence or structural similarity, yet they all
perform the same function and require a zinc ion at the active site. Carbonic anhydrase
from mammals belongs to the α-class, the plant enzymes belong to the β-class, and
the enzyme from methane-producing bacteria that grow in hot springs forms the γ-class.(57,60)
It is to be noted that the alpha enzymes are monomers, while the gamma enzymes are trimeric.
Although the beta enzyme is a dimer, there are four zinc ions bound to the structure,
indicating four possible enzyme active sites.

The α-CAs found in mammals are divided into four broad subgroups, which in turn consist of
several isoforms:(60-61)

* Cytosolic CAs (CA-1, CA-2, CA-3, CA-7 and CA-13)

* Mitochondrial CAs (CA-5A and CA-5B)

* Secreted CAs (CA-6)

* Membrane-associated CAs (CA-4, CA-9, CA-12, CA-14 and CA-15)

There are three additional “acatalytic” CA isoforms (CA-8, CA-10 and CA-11) whose
functions remain unclear.

Thus, carbonic anhydrase is a versatile enzyme in the living world. In our lungs, oxygen
diffuses into the blood and is transported to all the cells of our body by red blood cells.
Carbon dioxide diffuses out of the cells, and most of it is converted to carbonic acid to be
carried to the lungs. Carbonic anhydrase present in red blood cells aids in the conversion
of carbon dioxide to carbonic acid and bicarbonate ions. When red blood cells reach the
lungs, the same enzyme helps to convert the bicarbonate ions back to carbon dioxide,
which we breathe out.

In plants, gaseous carbon dioxide is stored in the form of bicarbonate ions. Carbonic
anhydrase plays a role in converting bicarbonate ions back to carbon dioxide for
photosynthesis.

1.6.2 Catalytic action of Human Carbonic anhydrase-II:

Carbonic anhydrases are enzymes that catalyze the hydration of carbon dioxide and the
dehydration of bicarbonate:

CO2 + H2O ↔ HCO3- + H+

These carbonic anhydrase reactions are of great importance in a number of tissues.
Examples include:(62)

• Parietal cells in the stomach secrete massive amounts of acid (i.e. hydrogen ions or
protons) into the lumen and a corresponding amount of bicarbonate ion into the blood.

• Pancreatic duct cells do essentially the opposite, with bicarbonate as their main
secretory product.

• Secretion of hydrogen ions by the renal tubules is a critical mechanism for
maintaining acid-base and fluid balance.

• Carbon dioxide generated by metabolism in all cells is removed from the body by
red blood cells that convert most of it to bicarbonate for transport, then back to
carbon dioxide to be exhaled from the lungs.

Carbonic anhydrases are metalloenzymes consisting of a single polypeptide chain
complexed to an atom of zinc.(63) They are incredibly active catalysts, with a turnover rate
of about 10^6 reactions per second. An anhydrase is defined as an enzyme that catalyses
the removal of a water molecule from a compound, so it is this “reverse” reaction
that gives carbonic anhydrase its name, because it removes a water molecule from
carbonic acid. A close-up rendering of the active site of human carbonic anhydrase-II
shows three histidine residues and a hydroxide group coordinating the zinc ion at the
centre. A zinc prosthetic group in the enzyme is coordinated in three positions by
histidine side chains. The fourth coordination position is occupied by water. This causes
polarization of the hydrogen-oxygen bond, making the oxygen slightly more negative and
thereby weakening the bond. The properties of the carbonic anhydrase active site can be
summarised as:(63)

• It lies in a deep pocket 15 Å from the protein surface.

• Zn is tetrahedrally coordinated.

• The ligands are three histidine groups (94, 96 and 119) and a water molecule.

• Ionisation occurs through general base catalysis by Glu 106 or Glu 117.

• The fourth ligand is H2O (low pH) or hydroxyl (high pH).

Fig-1

Carbonic anhydrase catalyzes the reversible hydration of CO2 to form bicarbonate anion

and a proton:

CO2 + H2O ↔ HCO3- + H+

The steps in the hydration reaction are as follows:(63)

1. Start (active site): The zinc is coordinated by the imidazole rings of three histidines
(94, 96 and 119) and an OH- ion. The geometry of the active site is tetrahedral.

2. CO2 binding: A water molecule (HOH338) is displaced by CO2. The main-chain
-NH of Thr 199 orients and polarises the CO2 molecule.

3. Nucleophilic attack: The zinc-bound OH- attacks the carbon of CO2 to form HCO3-.

4. HCO3- dissociation: A water molecule (HOH263) replaces the HCO3- product.

5. H+ dissociation/shuttle: The proton product dissociates from HOH263 and is
transferred in three steps along a “wire” of H-bonded waters to His 64.

6. His 64 rotation/flip: The protonated His 64 side chain rotates from the “in”
position to the “out” position, where it is exposed to bulk solvent on the enzyme
exterior.

7. Return to start: After releasing the H+ product, His 64 rotates back to the starting
(“in”) position. The enzyme is ready for another cycle of catalysis (Fig. 2-3).

Fig-2

1.6.3 Inhibition of Carbonic Anhydrase-II:

Studies seeking a correlation between the physico-chemical properties and the biological
activity of sulphonamides indicated the dominating role played by their proton-ligand
formation constant, more commonly known as the pKa of the sulphonamide.(64-66)

Aromatic and heterocyclic unsubstituted sulphonamides (R-SO2NH2), which are known to
inhibit CA-II, have a sulphonamide group that is ionisable at physiological pH
(pKa ~ 6-10). Upon binding, the sulphonamide group displaces the water molecule from the
zinc coordination sphere. Substitution of a hydrogen of the -SO2NH2 group substantially
decreases the activity because of steric hindrance.(67-68) The aromatic side chains of the
sulphonamide interact with hydrophobic amino acid residues in the binding site,
e.g. Phe131, Leu141, Val143 and Ala45, and stabilize the binding. Unsubstituted amides,
i.e. R-CO2NH2, such as urethane and phenyl carbamate, are a second, albeit much less potent,

class of known CA-II inhibitors. In contrast to sulphonamides, these compounds are basic
and are much weaker CA-II inhibitors. Anions such as SCN-, ClO4- and I- are also weak
inhibitors, with Ki (binding constant) values of 8-13 µM.(69-70) The mechanism of CA-II
inhibition by sulphonamides is shown in Fig. 3.(60)
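To make the role of pKa concrete, the fraction of a sulphonamide present in its ionised (deprotonated) form at a given pH can be estimated from the Henderson-Hasselbalch relation. The sketch below treats the sulphonamide as a simple monoprotic acid and takes physiological pH as 7.4; both are assumptions of this illustration, as is the function name.

```python
def fraction_ionised(pka, ph=7.4):
    """Fraction of a monoprotic acid in its deprotonated form at the given pH."""
    # Henderson-Hasselbalch: [A-]/[HA] = 10**(pH - pKa)
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

# Scan the pKa range quoted in the text (about 6 to 10).
for pka in (6.0, 7.4, 8.0, 10.0):
    print(f"pKa {pka:4.1f}: {100 * fraction_ionised(pka):6.2f} % ionised at pH 7.4")
```

Across the quoted pKa range the ionised fraction changes from nearly complete to well under one percent, which is one way to see why pKa can dominate such correlations.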

Fig.-3

1.7 Multiple Regression Analysis:

The term “regression” was first coined in the nineteenth century by Francis Galton, a
cousin of Charles Darwin, to describe a biological phenomenon.(71) It was later extended by
G. U. Yule and K. Pearson(72) to a more general statistical context, and R. A. Fisher(73)
advanced this work further.

Regression analysis is defined as the analysis of relationships among variables (in our
case the molecular descriptors) for building predictive models. It is one of the most widely
used statistical tools because it provides a simple method for establishing relationships
among variables. Regression analysis can be understood from the following chart:(74)

Regression Analysis

Mathematically exact procedure for the treatment of data with experimental errors
(mean value, standard deviation).

Minimization of the sum of squared errors (= squared deviations between yi and ycalc)
produces the best fit of the observed values to a certain model.

Regression analysis describes the relationship between

- independent variables xi (definition: can be determined without experimental error),
and

- dependent variables yi (contain experimental error).

Hypothesis: there is a significant relationship (95% level) between xi and yi values: yes/no

The variables used are thus dependent and independent variables. In drug design, the
activity is taken as the dependent variable, while the parameters are considered
independent variables. If only one variable is used for modeling the activity, the
regression is called a simple regression and the corresponding expression a simple
regression equation. In multiple regression analysis, the expression contains more than one
independent variable.

The regression expression takes the following form:

$\text{Activity} = y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots$    (1)

Here, x1, x2, x3, etc. are the correlating parameters (independent variables, molecular
descriptors) used to develop statistically significant models; b0 is a constant, while b1,
b2, b3, etc. are the coefficients of the molecular descriptors used. The coefficients are
the values in the regression equation used to predict the dependent variable from the
independent variables. They tell us about the nature of the relationship between the
variables. The sign and magnitude of bi (i = 1, 2, …) decide how, and to what extent, the
molecular descriptors participate in a statistically significant model. Such a model or
relationship is usually called a quantitative structure-activity relationship (QSAR).(75)
It is noticeable that molecular descriptors are directly related to the structure of the
organic molecules acting as drugs. Where topological indices are used as molecular
descriptors, the model provides a 1:1 correlation between structure and activity, because a
topological index is a numerical representation of structure. The explicit determination of
the regression equation is taken as the final product of the QSAR analysis. The model, i.e.
the regression equation obtained, may be used to evaluate the importance of the molecular
descriptors (topological indices), to analyze the effects of changing the values of the
molecular descriptors, or to forecast biological activity for a given set of molecular
descriptors. We need to examine some of the basic characteristics of multiple regression.
In a QSAR model, the goal is to develop a formula for making predictions about the
biological activity based on the observed values of the molecular descriptors, specifically
topological indices.(76) For prediction studies, multiple regression makes it possible to
combine many molecular descriptors (topological indices) to produce optimal predictions of
the biological activity. Multiple regression can also separate the effects of the molecular
descriptors (topological indices) on the biological activity, so that one can examine the
unique contribution of each molecular descriptor (topological index).

In the last three decades, statisticians have developed many more sophisticated methods
that achieve similar goals.(77-79) Among these methods, logistic regression, Poisson
regression, structural equation models and survival analysis are of particular importance.

We need to examine various statistical parameters in order to describe a model as the
“best”. Of these, R-squared, adjusted R-squared, t-statistics, standard error and
multicollinearity are the most commonly used and are explained here in brief.

20

Multiple regression always produces the "best" set of linear predictions for a given set of data. The most common statistical parameter for judging how "good" a model is is the coefficient of determination, R2 (pronounced r-squared). R2 is the proportion of variance in the dependent variable that can be predicted from the independent variables. It is an overall measure of the strength of association and does not reflect the contribution of any particular independent variable. Its value always lies between 0 and 1. Researchers feel terrific if they get an R2 of 0.75 and terrible if the R2 is only 0.10. Although it is certainly true that higher is better, there is no reason to reject a model merely because the R2 is small. Even with a small R2, the coefficient of an individual descriptor can be statistically significant and so give clear confirmation of its effect.
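
The definition of R2 as a proportion of variance can be made concrete with a short sketch. The function name r_squared and the four activity values are illustrative assumptions, not data from this work; the bounds of 0 and 1 shown here hold for least-squares predictions that include a constant term.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Made-up activities for four compounds.
observed = [1.0, 2.0, 3.0, 4.0]

# Perfect prediction: all of the variance is accounted for.
r2_perfect = r_squared(observed, observed)

# Predicting the mean for every compound: none of the variance is accounted for.
r2_mean_only = r_squared(observed, [2.5] * 4)

# Predictions that only separate "low" from "high" compounds: part of the variance.
r2_partial = r_squared(observed, [1.5, 1.5, 3.5, 3.5])

print(r2_perfect, r2_mean_only, r2_partial)
```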

Eq. (1) is more correctly written with the standard error of each coefficient given in parentheses, that is, as

Activity = b0 + b1 (±) x1 + b2 (±) x2 + ... (2)

In the above equation we assume that x1, x2, etc. are the true molecular descriptors.

If we divide each coefficient by its standard error, we get its t-statistic. When the true coefficient is zero and the sample is reasonably large, the t-statistic has a distribution that is essentially a standard normal distribution. Another statistical parameter is the adjusted R2, symbolized as R2A. The adjusted R2 attempts to yield a more honest estimate of the R2 for the test set. Regression models with many independent variables have a natural advantage over models with few independent variables in predicting the dependent variable. The adjusted R2 removes that advantage. It is a modification of the R2 that adjusts for the number of independent variables, and it is always less than or equal to the original R2. When the number of observations is small and the number of predictors is large, there will be a much greater difference between R2 and adjusted R2, and vice versa.
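
Both quantities can be sketched in a few lines. The adjusted R2 below uses the usual textbook formula, 1 - (1 - R2)(n - 1)/(n - p - 1), which is assumed here rather than quoted from this work; the data and the function names are likewise illustrative only.

```python
import numpy as np


def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def coefficient_t_statistics(X, y):
    """Return (coefficients, standard errors, t-statistics) of a least-squares fit.

    X has one column per descriptor; the constant term is added here.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    residual_variance = residuals @ residuals / (n - p - 1)
    covariance = residual_variance * np.linalg.inv(A.T @ A)
    standard_errors = np.sqrt(np.diag(covariance))
    return coef, standard_errors, coef / standard_errors


# The same R2 is penalised far more when compounds are few relative to predictors.
adj_small_sample = adjusted_r2(0.75, n=10, p=3)
adj_large_sample = adjusted_r2(0.75, n=100, p=3)

# Made-up data: the activity depends on x1 only, x2 is an irrelevant descriptor.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
noise = np.array([0.1, -0.2, 0.15, -0.05, 0.2, -0.1, -0.15, 0.05])
y = 1.0 + 2.0 * x1 + noise

coef, se, t = coefficient_t_statistics(np.column_stack([x1, x2]), y)
print(adj_small_sample, adj_large_sample)
print(np.round(t, 2))
```

In this sketch the t-statistic of x1 is much larger in magnitude than that of x2, which is how the relevant descriptor is told apart from the irrelevant one.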

The standard error of estimate is also an important statistical parameter. It can be interpreted as the standard deviation of the dependent variable after the effects of the independent variables have been removed. If we could perfectly predict the dependent variable from the set of independent variables (which corresponds to an R2 of 1.0), the standard error of estimate would be zero. On the other hand, when the R2 is zero (no predictive power), the standard error of estimate is essentially the same as the standard deviation of the dependent variable.
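
The two limiting cases can be checked with a small sketch. The formula used, the square root of the residual sum of squares divided by (n - p - 1), is the usual definition and is assumed here; the function name and the data are illustrative. With p predictors and an R2 of zero the result exceeds the standard deviation by the factor sqrt((n - 1)/(n - p - 1)), which is why the two are only "essentially" the same.

```python
import math
import statistics


def standard_error_of_estimate(observed, predicted, p):
    """Standard error of estimate for a model with p independent variables."""
    n = len(observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, predicted))
    return math.sqrt(ss_residual / (n - p - 1))


# Made-up activities for five compounds.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_observed = statistics.mean(observed)

# R2 = 1: every activity is predicted exactly.
se_perfect = standard_error_of_estimate(observed, observed, p=1)

# R2 = 0: the model predicts the mean for every compound (no descriptors, p = 0).
se_no_power = standard_error_of_estimate(observed, [mean_observed] * 5, p=0)

print(se_perfect, se_no_power, statistics.stdev(observed))
```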

In evaluating any regression model, it is just as important to think about what is not in the model as what is in it. There are two possible reasons for putting a variable in a regression model. The first is that we want to know the effect of the variable on the dependent variable; the second is that we want to control for the variable. However, it is worth mentioning that multiple regression itself makes no distinction between the study variables and the control variables. (80)

To decide whether a control variable is important, one must ask whether that variable has a causal effect on the dependent variable and whether it is correlated with the variables that are the focus of the study. If the answer to both questions is "yes", the variable should be included in the model. If, on the other hand, the variable has a strong effect on the dependent variable but is unrelated to the independent variables already in the model, there is no need to include it.

Sample size (the number of compounds in drug modeling) has a profound effect on tests of statistical significance. One should keep the sample size in mind when looking at the results of significance tests. The general principle is: "In a small sample, statistically significant coefficients should be taken seriously, but a non-significant coefficient is extremely weak evidence for the absence of an effect".

Statisticians often describe small samples as having low power to test hypotheses. There is another, entirely different problem with small samples that is frequently confused with the issue of power. Most of the test statistics that researchers use, such as t-tests, F-tests and chi-square tests, are only approximations. These approximations are usually quite good when the sample is large but may deteriorate markedly when the sample is small. In a large sample, by contrast, even a trivial effect can be statistically significant, so one must look carefully at the magnitude of the coefficient to see if it is large enough to have theoretical or practical importance. (81)

Thus multiple regression produces the correlation equation used for modeling. The meaning of the different statistical parameters reported along with a correlation equation is given below:

Meaning of statistical parameters in a correlation equation:

Log Ki = 1.15 (±0.2) J - 1.46 (±0.4) F + 7.82 (±0.2)

(n = 25; r2 = 0.945; AR2= Se = 0.196; F = 78.6; Q = 0.841)

Log Ki: binding constant that causes inhibition
1.15, -1.46, 7.82: values of the regression coefficients and the constant term
(±0.2), (±0.4): 95% confidence of the coefficients and the constant term
J, F (in the equation): parameters
n: number of compounds
r2: squared correlation coefficient, the measure of the relative quality of the model
AR2: modification of R2 (adjusted R2)
Se: standard error of estimate (standard deviation of the dependent variable after the effects of the independent variables have been removed)
F (in parentheses): Fisher value, a measure of statistical significance
Q: Pogliani's quality factor, a measure of internal predictivity

Multiple regression is designed precisely to separate the effects of two or more independent variables on a dependent variable when the independent variables are correlated with one another, but there is a limit to what regression can do. This problem goes by the name of multicollinearity, and its extreme case is the one where two variables are perfectly correlated.

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. Multicollinearity does not have to be so extreme to cause problems and, unfortunately, those problems often go undetected. Multicollinearity has another effect: the possibility of concluding that two variables have no effect when one or the other of them actually has a strong effect.

Multicollinearity is something that nearly all users of multiple regression have heard about. No methods section should ever claim that "multicollinearity is not present", since this will generally be untrue. A better statement is something along the lines of "there was no problem with multicollinearity". Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. It is a problem if one is interested in the effects of individual predictors. Multicollinearity reduces the effective amount of information available to assess the unique effects of the predictors.
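
The claim that multicollinearity leaves the model as a whole intact while degrading the estimates for individual predictors can be illustrated with simulated data. Everything in the sketch below (the random data, the seed, the function name) is an assumption made for demonstration and is unrelated to the compounds studied in this work.

```python
import numpy as np


def fit_with_standard_errors(X, y):
    """Least-squares fit returning (coefficients, standard errors, R2)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    residual_variance = residuals @ residuals / (n - p - 1)
    se = np.sqrt(np.diag(residual_variance * np.linalg.inv(A.T @ A)))
    r2 = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return coef, se, r2


rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)

# Second predictor: once unrelated to x1, once almost a copy of x1.
x2_independent = rng.normal(size=n)
x2_collinear = x1 + rng.normal(scale=0.05, size=n)

results = {}
for label, x2 in [("independent", x2_independent), ("collinear", x2_collinear)]:
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + noise
    coef, se, r2 = fit_with_standard_errors(np.column_stack([x1, x2]), y)
    results[label] = (se, r2)
    print(label, "R2 =", round(r2, 3), "standard errors =", np.round(se, 3))
```

In both runs the R2 stays high, but the standard errors of the two slope coefficients are many times larger when the predictors are nearly collinear.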

Beyond those truths, there is an enormous amount of confusion and mythology surrounding multicollinearity. Multicollinearity may be extreme or near-extreme. Extreme multicollinearity means that at least two of the independent variables in a regression equation are perfectly related by a linear function. Suppose we are trying to estimate the model:

Y = A + B1X1 + B2X2 + B3X3 + U (3)

Suppose also that in our sample, it happens to be the case that

X1 = 2 + 3X2 (4)

Then the correlation between X1 and X2 is 1.0 and we have a case of extreme multicollinearity. Extreme multicollinearity makes it impossible to get separate estimates for the coefficients B1 and B2. Recall that multiple regression separates out the effects of two or more variables even though they are correlated with each other. To do this, there must be some remaining variation in each X variable when the other X variables are held constant. If two variables are perfectly correlated, then when one is kept constant the other must be constant as well. Hence, it is impossible to separate their effects on the dependent variable. Multicollinearity only affects the coefficient estimates for those variables that are collinear. This is true for both extreme and near-extreme multicollinearity. To check the effects of multicollinearity, the Variance Inflation Factor (VIF) and condition numbers are to be calculated as part of a ridge regression analysis.
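
The extreme case of eq. (4) and the two diagnostics named above can be sketched as follows. The values of X2 and X3 are made up; the VIF of a predictor is computed with its usual definition, 1/(1 - R2) from regressing that predictor on the others, which is assumed here rather than taken from any particular software.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X (one column per independent variable)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2_j = 1.0 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)


# Made-up values of X2 and X3 for six compounds.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Extreme multicollinearity, exactly as in eq. (4): X1 = 2 + 3 X2.
x1_exact = 2.0 + 3.0 * x2
design_exact = np.column_stack([np.ones(6), x1_exact, x2, x3])
rank_exact = np.linalg.matrix_rank(design_exact)  # fewer independent columns than coefficients
corr_exact = np.corrcoef(x1_exact, x2)[0, 1]
cond_exact = np.linalg.cond(design_exact)         # very large condition number

# Near-extreme multicollinearity: X1 deviates slightly from eq. (4).
x1_near = x1_exact + np.array([0.1, 0.0, -0.1, 0.1, 0.0, -0.1])
vifs_near = variance_inflation_factors(np.column_stack([x1_near, x2, x3]))

print(rank_exact, corr_exact, cond_exact)
print(np.round(vifs_near, 1))
```

With exact collinearity the design matrix has four columns but only three independent ones, so B1 and B2 cannot be estimated separately. In the near-extreme case the coefficients can be estimated, but the VIF values of X1 and X2 are very large while that of X3 remains small, in line with the statement that only the collinear variables are affected.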

References:

1. Kier, L.B.; Hall, L.H., Quantitative Information Analysis: The new centre of

Gravity in Medicinal Chemistry, Med. Chem. Res., 1997, 7, 335.

2. Balaban, A.T., From chemical Graphs to 3D molecular modeling. In From chemical

topology to Three Dimensional geometry, Balaban, A. T., Ed.: Plenum, New York,

1997,1.

3. Balaban, A.T.; Motoc, I.; Bonchev, D.; Mekenyan, O., Topological indices for

Structure-Activity correlation, Top. Curr. Chem. 1983, 114, 21.

4. Seydel, J. K.; Schaper, K.J., Chemische Struktur und biologische Aktivität von

Wirkstoffen: Methoden der Quantitativen Struktur-Wirkung-Analyse, Verlag Chemie,

Weinheim, 1979, 1.

5. Kubinyi, H., Quant. Struct.-Act. Relat., 2002, 21, 348.

6. Johnson, M.; Maggiora, G.M. Concepts and applications of molecular similarity,

John Wiley & Sons: New York, 2006.

7. Willett, P.; Barnard, J.M.; Downs, G.M. J. Chem. Inf. Comput. Sci. 1998, 38, 983.

8. Andrew G.C.; Graham R.W. Perspectives in Drug Discovery and Design, Stevenage

Netherlands: UK. 1998, 321.

9. Engel, T., J. Chem. Inf. Model. 2006, 46, 2267.

10. Willett, P., J. Med. Chem., 2005, 48, 4183.

11. Willett, P., Drug Discov. Today, 2006, 11, 1046.

12. Medina-Franco, J. L.; Maggiora, G. M.; Giulianotti, M. A.; Pinilla, C.; Houghten, R.

A. Chem. Biol. Drug Des., 2007, 70, 393.

13. Martínez-Mayorga, K.; Medina-Franco, J.L.; Giulianotti, M.A.; Pinilla, C.; Dooley,

C.T.; Appel, J. R.; Houghten, R.A. Bioorg. Med. Chem. 2008, 16, 5932.

14. Breneman C. M.; Bennett, K. P.; Embrechts, M. J.; Bi, J.; Demiriz, A.; Lockwood,

L.; Momma, M.; Sukumar, N., 21st National Meeting, American Chemical Society,

San Diego, 2001.

15. Winkler, D. A. Mol. Biotechnol., 2004, 27(2), 138.

16. Hopfinger, A.; Wang, S.; Tokarski, J.; Jin, B.; Albuquerque, M.; Madhav, P.;

Duraiswami, C. J. Am. Chem. Soc., 1997, 119, 1050.

17. Livingstone, D.J., Predicting Chemical Toxicity and Fate, CRC Press LLC: Boca

Raton, FL, 2004, 151.

18. Stuper, A. J.; Jurs, P. C., J. Chem. Inf. Comput. Sci. 1976, 16, 99.

19. Todeschini, R.; Consonni, V., Handbook of Molecular Descriptors, Wiley-VCH,

Weinheim (Germany), 2000.

20. http://research.chem.psu.edu/pcjgroup/ADAPT.html

21. Mekenyan, O.; Bonchev, D., Acta Pharm Jugosl., 1986, 36, 225.

22. Katritzky, A.R.; Lobanov, V.S., CODESSA, version 5.3, University of Florida,

Gainesville, 1994.

23. Molconn-Z, Ver. 4.05, Hall consult., Quincy, MA, 2003.

24. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M., DRAGON-Software for the

calculation of molecular descriptors, version 5.4 for Windows, 2006, Talete srl, Milan,

Italy.

25. Devillers, J.; Balaban, A. T.; Topological indices and related descriptors in QSAR

& QSPR, Amsterdam, Gordon and Breach Sci. Pub. 1999, 130, 138, 210.

26. Karelson, M., Molecular Descriptors in QSAR/QSPR, New York, Wiley-

Interscience, 2000.

27. Fauchère, J. L.; Boutin, J.A.; Henlin, J. M.; Kucharczyk, N.; Ortuno, J. C.,

Chemom. Intell. Lab. Syst., 1998, 43, 43.

28. Ivanciuc,O.; Balaban, A. T., The Graph Description of Chemical Structures, in:

Devillers, J.; Balaban, A. T., (Eds.), Topological Indices and Related Descriptors

in QSAR and QSPR, Gordon and Breach Science Publishers, Amsterdam, 1999, pp.

59.

29. Estrade, A. E.; Patlewicz, G.; Uriarte, E., Ind. J. Chem., 42A 2003 1315.

30. Natarajan, B.R.; Kamalakanan, P.; Nirdosh, I., Ind. J. Chem., 42A, 2003, 1330.

31. Ivanciuc, O.; Klein, D. J., Croat. Chem. Acta 2002, 75 (2), 577.

32. Wiener, H., J. Am. Chem. Soc., 1947, 69, 2636.

33. Wiener, H., J. Am. Chem. Soc., 1947, 69, 17.

34. Randic, M., J. Am. Chem. Soc. 1975, 97, 6609.

35. Kier, L. B.; Hall, L. H., "Molecular Connectivity in Structure-Activity Analysis",

Research Studies Press, Wiley, Chichester, 1986.

36. Kier,L.B.; Hall, L.H. “Molecular-Connectivity and Drug Research” Academic

Press, New York, 1976.

37. Balaban, A. T., Chem. Phys. Lett. 89, 1982 , 399.

38. Balaban, A.T., Pure Appl. Chem. 55, 1983, 199.

39. Winkler, D. A. , Brief Bioinform, 2002, Mar 3(1),73.

40. Crum-Brown, A.; Fraser, T.R., Trans. R. Soc. Edinburgh, 1868, 25, 151, 693.

41. Hammett, L.P., Chem. Rev. 1935, 17 (1), 125.

42. Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M., Nature, 1962, 194, 178.

43. Taft, R.W., J. Am. Chem. Soc. 1952, 74, 3126.

44. Hansch, C.; Fujita, T., J. Am. Chem. Soc. 1964, 86, 1616.

45. Hansch, C.; Leo, A.; Taft, R.W., Chem. Rev. 1991, 91(2), 165.

46. McFarland, J.W.; J. Med Chem. 1970, 13, 1092.

47. Hansch, C.; Leo, A.; Exploring QSAR, Fundamentals and applications in chemistry

and biology, ACS Professional Reference book, American Chemical Society,

Washington, D.C. 1995.

48. Könemann, H., Toxicology, 1981, 19, 209.

49. Veith, G. D.; Call, D. J.; Brooke, L. T., Can. J. Fish. Aquat. Sci. 1983, 40, 743.

50. Scior et al. Current medicinal chemistry, 2009, 16, (32), 4298.

51. Supuran, C.T., Curr. Pharm Des. 2008, 14 (7) 603.

52. Maren, T. H., J. Glaucoma 1995, 4, 49.

53. Gray, W.D.; Maren, T.H.; Sisson, G.M.; Smith, F.H., J. Pharmacol. Exp. Ther.,

1957, 121, 160.

54. Meldrum, N. U., Roughton, F. J .W., J. Physiol. 1933, Dec.5, 80 (2) 113.

55. Moroney, J. V.; Bartlett, S. G.; Samuelson, G., Plant, Cell and Environment, 2001,

24, 141.

56. Parkkila, S., Springer, 2000, 90, 79.

57. Tripp, B. C.; Smith, K.; Ferry, J. G., J. Biol. Chem., 2001, 276, 48615.

58. Hewett-Emmett, D.; Tashian, R. E., Mol. Phylogenet. Evol., 1996, 5, 50.

59. Bradfield, J. R. G., Nature, 1947, 467.

60. Supuran, C. T.; Scozzafava, A.; Conway, J., Carbonic Anhydrase: Its inhibitors and

activators, CRC Press, New York, 2004.

61. Chegwidden, W. R.; Carter, N. D.; Edwards, Y. H., Carbonic Anhydrase: New

Horizons, Birkhäuser, Basel, Switzerland, 2000.

62. Ray, W. J., Biochemistry 1983, 22, 4625.

63. Lindskog S., Pharmacology and Therapeutics, 1997, 74 (1) , 1.

64. Duffel, M. W.; Ing, I. S.; Segarra, T. M.; Dixon, J. A.; Barfknechl, C. F.;

Schoehwold, R. D., J. Med . Chem 1986, 29, 1488.

65. Hies, M. A.; Masereel, B.; Rolin, S.; Scozzafava, A.; Calpaeneu, G.; Cimpeaneu,

V.; Supuran, C.T., Bioorg. Med. Chem. 2004, 12, 2772.

66. Supuran, C. T.; Scozzafava, A.; Menabuoni, L.; Minicion, F.; Bribanys, C. T., Eu. J.

Pharm Sci., 1999, 8, 317.

67. Thakur, A.; Thakur, M.; Khadikar, P.V.; Supuran, C.T., Bioorg. Med. Chem. 2004,

12, 789.

68. Agrawal, V.K.; Khadikar, P.V., Bioorg. Med Chem.. Lett. 2003, 13, 447.

69. Verpoorte, J. A.; Mehta, S. I.; Edsau, J., J .T. Biol .Chem., 1967, 244, 1421.

70. Supuran, C. T., Nature Rev. Drug Discov. 2008, 7, 168.

71. Galton, F. , Presidential address, SectionH, Anthropology , 1885.

72. Pearson, K.; Yule, G. U.; Blanchard, N.; Lee, A., Biometrika (Biometrika Trust)

1903, 2(2), 211.

73. Fisher, R. A., J. R. Stat. Soc. (Blackwell Publishing) 1922, 85 (4), 597.

74. 3D-QSAR in Drug-Design: Vol 3 Recent Advances , Edited by Kubinyi, H.;

Folkers, G.; Martin, Y. C., Kluwer Academic publishers, New York, Boston,

Dordrecht. 2002.

75. Kubinyi, H. Quant-Struct-Act-Relat, 1994, 13, 285, .

76. Uriarte L., Curr . Top. In Med Chem, 2007, 7 (10) 1015.

77. Mc Farland, J.W.; Gans, D. J., Quant-Struct-Act-Relat., 1994, 13, 11, .

78. Livingstone, D., Data Analysis for Chemist: Application to QSAR and Chemical

Product design , Oxford university Press, 1996.

79. Golbraikh, A.; Shen, M.; Xiao, Z.; Xiao, Y. D.; Lee, K. H.; Tropsha, A.,

J. Comput.-Aided Mol. Des., 2003, 17, 241.

80. Armstrong, J. S., Int. J. Forecasting 2012, 28 (3), 689.

81. Chatterjee, S.; Hadi, A. S.; Price, B., 2000, Regression Analysis by Example (3rd ed.),

John Wiley and Sons.
