Molecular Descriptors

INTRODUCTION

• Molecular descriptors are numerical values that characterize properties of molecules

• Examples:

– Physicochemical properties (empirical)

– Values from algorithms, such as 2D fingerprints

• Vary in complexity of encoded information and in compute time

Descriptors for Large Data Sets

• Descriptors representing properties of complete molecules

– Examples: LogP, Molar Refractivity

• Descriptors calculated from 2D graphs

– Examples: Topological Indexes, 2D fingerprints

• Descriptors requiring 3D representations

– Example: Pharmacophore descriptors

DESCRIPTORS CALCULATED FROM 2D STRUCTURES

• Simple counts of features

– Lipinski Rule of Five (H bonds, MW, etc.)

– Number of ring systems

– Number of rotatable bonds

• Not likely to discriminate sufficiently when used alone

• Combined with other descriptors for best effect

Physicochemical Properties

• Hydrophobicity

– LogP – the logarithm of the partition coefficient between n-octanol and water

• ClogP (Leo and Hansch) – based on a small set of fragment values derived from a small set of simple molecules

– BioByte: http://www.biobyte.com/

– Daylight’s MedChem Help page

– http://www.daylight.com/dayhtml/databases/medchem/medchem-help.html

– Isolating carbon: one not doubly or triply bonded to a heteroatom

ACD Labs Calculated Properties

• http://www.acdlabs.com

• ACD Labs values now incorporated into the CAS Registry File for millions of compounds

• I-Lab: http://ilab.acdlabs.com/

– Name generation

– NMR prediction

– Physical property prediction

Molar Refractivity

• Molar refractivity (MR) is defined as

$$\mathrm{MR} = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{\mathrm{MW}}{d}$$

where n is the refractive index, d is density, and MW is molecular weight.

• Measures the steric bulk of a molecule.
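As a rough illustration, the definition can be evaluated directly. The function name and the approximate values for liquid water (n ≈ 1.333, d ≈ 0.997 g/cm³, MW ≈ 18.015) are assumptions chosen for this sketch and are not taken from the slides.

```python
def molar_refractivity(n: float, mw: float, d: float) -> float:
    """Return (n^2 - 1) / (n^2 + 2) * MW / d for refractive index n."""
    n2 = n * n
    return (n2 - 1.0) / (n2 + 2.0) * (mw / d)


# Approximate values for liquid water, assumed here for illustration only.
mr_water = molar_refractivity(n=1.333, mw=18.015, d=0.997)
print(round(mr_water, 2))
```

With MW in g/mol and d in g/cm³ the result is in cm³/mol, which is why MR is usually read as a volume-like (steric bulk) quantity.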

Topological Indexes

• Single-valued descriptors calculated from the 2D graph of the molecule

• Characterize structures according to size, degree of branching, and overall shape

• Example: Wiener Index – counts the number of bonds between pairs of atoms and sums the distances between all pairs

Wiener Index

• Add up all the off-diagonal elements of the distance matrix and divide by 2 (because the matrix is symmetrical)

• The Wiener index correlates well with the boiling points of alkanes
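A minimal sketch of this calculation, assuming the distance matrix is already available as a list of lists. The two hydrogen-suppressed alkane skeletons are examples added here, not taken from the slides.

```python
def wiener_index(dist):
    """Half the sum of the off-diagonal entries of a symmetric distance matrix."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return total // 2


# Carbon skeletons only; distances are counted in bonds.
n_butane = [[0, 1, 2, 3],
            [1, 0, 1, 2],
            [2, 1, 0, 1],
            [3, 2, 1, 0]]
isobutane = [[0, 1, 1, 1],   # atom 0 is the central carbon
             [1, 0, 2, 2],
             [1, 2, 0, 2],
             [1, 2, 2, 0]]

print(wiener_index(n_butane), wiener_index(isobutane))
```

The branched isomer gets the smaller value, which is the sense in which the index encodes branching.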

Zagreb Index

• For each non-hydrogen atom, add up the squares of the number of connections to other non-hydrogen atoms (regardless of bond order)
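The rule can be sketched as follows, assuming the molecule is given as a list of bonds between numbered non-hydrogen atoms. This input format and the function name are choices made for the example.

```python
def zagreb_index(bonds, n_atoms):
    """Sum of squared heavy-atom degrees; bond order is ignored."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sum(d * d for d in degree)


# Hydrogen-suppressed skeletons, atoms numbered from 0.
n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

print(zagreb_index(n_butane, 4), zagreb_index(isobutane, 4))
```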

Topological Indexes: Others

• Molecular Connectivity Indexes

– Randić (et al.) branching index

• Defines a “degree” of an atom as the number of adjacent non-hydrogen atoms

• Bond connectivity value is the reciprocal of the square root of the product of the degree of the two atoms in the bond.

• Branching index is the sum of the bond connectivities over all bonds in the molecule.

– Chi indexes – introduces valence values to encode sigma, pi, and lone pair electrons
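The branching index described above can be sketched in a few lines. The bond-list input and the function name are assumptions for this example; the valence-modified chi indexes are not covered.

```python
import math


def randic_index(bonds, n_atoms):
    """Sum over bonds of 1 / sqrt(degree(a) * degree(b))."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in bonds)


# Hydrogen-suppressed skeletons, atoms numbered from 0.
n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

print(round(randic_index(n_butane, 4), 4))
print(round(randic_index(isobutane, 4), 4))
```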

Kappa Shape Indexes

• Characterize aspects of molecular shape

– Compare the molecule with the “extreme shapes” possible for that number of atoms

• Range from linear molecules to completely connected graph

2D Fingerprints

• Two types:

– One based on a fragment dictionary

• Each bit position corresponds to a specific substructure fragment

• Fragments that occur infrequently may be more useful

– Another based on hashed methods

• Not dependent on a pre-defined dictionary

• Any fragment can be encoded

• Originally designed for substructure searching, not for molecular descriptors
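The difference between the two fingerprint types can be sketched as below. Fragments are represented as plain strings and the hash is CRC32 from the standard library; both are simplifications assumed for illustration, and real fingerprint software uses its own fragment enumeration and hashing schemes.

```python
import zlib


def dictionary_fingerprint(fragments, dictionary):
    """One bit per dictionary entry: 1 if that fragment occurs in the molecule."""
    present = set(fragments)
    return [1 if frag in present else 0 for frag in dictionary]


def hashed_fingerprint(fragments, n_bits=64):
    """No dictionary: each fragment string is hashed to a bit position."""
    bits = [0] * n_bits
    for frag in fragments:
        bits[zlib.crc32(frag.encode("utf-8")) % n_bits] = 1
    return bits


# Hypothetical fragment strings for one molecule.
fragments = ["C-C", "C-O", "C=O", "c:c"]
print(dictionary_fingerprint(fragments, ["C-C", "C-N", "C-O"]))
print(sum(hashed_fingerprint(fragments)), "bits set")
```

In the hashed form two different fragments can collide on the same bit, which is the price paid for not needing a predefined dictionary.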

Topological indexes

another type of numerical descriptor that can be calculated from a 2D structure diagram

there are many different topological indexes

• some are designed to represent structural features such as branching or shape

they can be calculated from connection tables, or closely-related formats

• e.g. the distance matrix

o an N x N table showing the distance (in bonds) between each pair of atoms

“Redundant” Connection Table

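Atom  Element  H  Neighbours: atom (bond order)
1.    O        1  2 (1)
2.    C        0  1 (1), 3 (2), 4 (1)
3.    O        0  2 (2)
4.    C        1  2 (1), 5 (1), 6 (1)
5.    N        2  4 (1)
6.    C        2  4 (1), 7 (1)
7.    C        0  6 (1), 8 (2), 12 (1)
8.    C        1  7 (2), 9 (1)
9.    C        1  8 (1), 10 (2)
10.   C        0  9 (2), 11 (1), 13 (1)
11.   C        1  10 (1), 12 (2)
12.   C        1  11 (2), 7 (1)
13.   O        1  10 (1)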

Distance Matrix

      1  2  3  4  5  6  7  8  9 10 11 12 13
 1.   O  1  2  2  3  3  4  5  6  7  6  5  8
 2.   1  C  1  1  2  2  3  4  5  6  5  4  7
 3.   2  1  O  2  3  3  4  5  6  7  6  5  8
 4.   2  1  2  C  1  1  2  3  4  5  4  3  6
 5.   3  2  3  1  N  2  3  4  5  6  5  4  7
 6.   3  2  3  1  2  C  1  2  3  4  3  2  5
 7.   4  3  4  2  3  1  C  1  2  3  2  1  4
 8.   5  4  5  3  4  2  1  C  1  2  3  2  3
 9.   6  5  6  4  5  3  2  1  C  1  2  3  2
10.   7  6  7  5  6  4  3  2  1  C  1  2  1
11.   6  5  6  4  5  3  2  3  2  1  C  1  2
12.   5  4  5  3  4  2  1  2  3  2  1  C  3
13.   8  7  8  6  7  5  4  3  2  1  2  3  O
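One way to obtain such a distance matrix from a connection table is a breadth-first search from every atom. The sketch below transcribes only the neighbour lists of the 13-atom example; the dictionary layout and function name are choices made here.

```python
from collections import deque

# Neighbour lists for atoms 1-13 of the example connection table.
# Hydrogen counts and bond orders are not needed for bond-count distances.
NEIGHBOURS = {
    1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5, 6], 5: [4], 6: [4, 7],
    7: [6, 8, 12], 8: [7, 9], 9: [8, 10], 10: [9, 11, 13],
    11: [10, 12], 12: [11, 7], 13: [10],
}


def distance_matrix(neighbours):
    """Shortest path length (in bonds) between every pair of atoms."""
    atoms = sorted(neighbours)
    matrix = []
    for start in atoms:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            atom = queue.popleft()
            for nxt in neighbours[atom]:
                if nxt not in dist:
                    dist[nxt] = dist[atom] + 1
                    queue.append(nxt)
        matrix.append([dist[a] for a in atoms])
    return matrix


matrix = distance_matrix(NEIGHBOURS)
for row in matrix:
    print(" ".join(f"{d:2d}" for d in row))
```

The printed rows carry 0 on the diagonal, where the slide shows the element symbol instead.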

Kier Shape Indexes

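Several indexes based on the number of atoms (N) and the number of paths of length m in the graph (mP, where 1P is simply the number of bonds)

$$\kappa_1 = \frac{N (N-1)^2}{({}^{1}P)^2}$$

$$\kappa_2 = \frac{(N-1) (N-2)^2}{({}^{2}P)^2}$$

$$\kappa_3 = \frac{(N-1) (N-3)^2}{({}^{3}P)^2} \quad \text{(if N is odd)}$$

$$\kappa_3 = \frac{(N-3) (N-2)^2}{({}^{3}P)^2} \quad \text{(if N is even)}$$

“alpha-modified” kappa indexes can be generated where N is adjusted to take into account the sizes of atoms, relative to sp3-hybridised carbon

a “molecular flexibility index” is derived from these

$$\phi = \frac{\kappa_{1\alpha} \, \kappa_{2\alpha}}{N}$$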
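A small sketch of the unmodified kappa formulas, assuming the path counts have already been obtained. The parameter names p1, p2 and p3 (numbers of paths of length 1, 2 and 3 bonds) are introduced here; the alpha modification and the flexibility index are left out.

```python
def kappa_indexes(n_atoms, p1, p2, p3):
    """Return (kappa1, kappa2, kappa3) from the atom count and path counts."""
    n = n_atoms
    k1 = n * (n - 1) ** 2 / p1 ** 2
    k2 = (n - 1) * (n - 2) ** 2 / p2 ** 2
    if n % 2 == 1:                       # odd number of atoms
        k3 = (n - 1) * (n - 3) ** 2 / p3 ** 2
    else:                                # even number of atoms
        k3 = (n - 3) * (n - 2) ** 2 / p3 ** 2
    return k1, k2, k3


# Linear carbon chains: a chain of N atoms has N-1, N-2 and N-3 paths
# of length 1, 2 and 3 respectively.
print(kappa_indexes(4, 3, 2, 1))   # n-butane skeleton
print(kappa_indexes(5, 4, 3, 2))   # n-pentane skeleton
```

For a linear chain the first index equals N and the second equals N - 1, which is the "extreme shape" reference the kappa indexes are scaled against.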

Molecular Connectivity Indexes

a whole series of indexes, developed by Kier and Hall in the late 1970s, following earlier work by Randić

involves identifying all possible subgraphs of different sizes in the molecule

size of subgraph determines the order of the index

• 0-bond subgraphs give the ⁰χ index

• 1-bond subgraphs give the ¹χ index

• 2-bond subgraphs give the ²χ index

• 3-bond subgraphs give the ³χ indexes etc.


At higher orders the subgraphs are divided into

“path” subgraphs (only 1 and 2-connected nodes)

“cluster” subgraphs (no 2-connected nodes)

“path-cluster” subgraphs (any sort of node)

“chain” subgraphs (involving rings)


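For each subgraph order m and type t the index is calculated as

$$ {}^{m}\chi_t = \sum_{s} \prod_{i \in s} \delta_i^{-1/2} $$

where the sum runs over all subgraphs s of that order and type, and δi is the number of connections (to non-hydrogen atoms) of each node i in the subgraph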

molecular connectivity indexes also exist in a “valence-modified” form that takes into account the heteroatoms present
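For path-type subgraphs the calculation can be sketched by enumerating all paths of a given length and summing the product of the inverse square roots of the node degrees. This covers only the simple (not valence-modified) path indexes; the neighbour-list input and the function name are assumptions for the example.

```python
def chi_path(neighbours, order):
    """Simple path connectivity index for paths containing `order` bonds."""
    delta = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    total = 0.0

    def extend(path):
        nonlocal total
        if len(path) == order + 1:
            # Every path with at least one bond is reached from both ends,
            # so it is only counted in one direction.
            if path[0] <= path[-1]:
                product = 1.0
                for atom in path:
                    product *= delta[atom] ** -0.5
                total += product
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for atom in neighbours:
        extend([atom])
    return total


# Hydrogen-suppressed n-butane skeleton.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for m in range(3):
    print(m, round(chi_path(n_butane, m), 4))
```

With order 1 this reduces to the Randić branching index.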


many experiments have been done to find correlations between them (and other indexes) and measured physico-chemical or biological properties

this uses a statistical technique called multiple regression analysis to build an equation of the form

$$\text{Property} = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + c_5 x_5 + \dots$$

where x_1, x_2 etc. are topological indexes and c_0, c_1, c_2 etc. are constants

good correlations have often been obtained
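A minimal sketch of fitting such an equation by least squares, using only the standard library (normal equations solved by Gauss-Jordan elimination). The descriptor values and the "true" coefficients are made up for illustration and do not correspond to any real property.

```python
def fit_linear(xs, ys):
    """Least-squares coefficients [c0, c1, ...] for y = c0 + c1*x1 + ..."""
    rows = [[1.0] + list(x) for x in xs]          # intercept column first
    p = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(p)]
    for col in range(p):                          # Gauss-Jordan elimination
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


# Hypothetical index values (x1, x2) for five molecules.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
# Synthetic property generated from known constants, so the fit is exact.
ys = [2.0 + 0.5 * x1 - 1.5 * x2 for x1, x2 in xs]

coefficients = fit_linear(xs, ys)
print([round(c, 6) for c in coefficients])
```

Real data are noisy, so in practice the quality of the fit (for example the correlation coefficient) has to be reported alongside the constants.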

What do topological indexes mean?

Good question!

it is often difficult to assign some chemical meaning to, e.g. the order-6 path-cluster, valence-modified Kier index

topological indexes effectively encode the same information as fingerprint fragments

• in a less obvious way

• but one which can be processed numerically

Atom-Pair Descriptors

• Encode all pairs of atoms in a molecule

• Include the length of the shortest bond-by-bond path between them

• Each atom is described by its elemental type plus the number of non-hydrogen atoms attached to it and its number of π-bonding electrons
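A simplified sketch of atom-pair generation. Here the atom type is only the element plus the number of non-hydrogen neighbours (written as "X" followed by the count); the π-electron count is omitted, and the type notation is an assumption made for this example.

```python
from collections import Counter, deque


def atom_pairs(symbols, neighbours):
    """Count (type, shortest path length in bonds, type) over all atom pairs."""
    def atom_type(i):
        return f"{symbols[i]}X{len(neighbours[i])}"

    pairs = Counter()
    for start in range(len(symbols)):
        dist = {start: 0}
        queue = deque([start])
        while queue:                       # breadth-first search
            atom = queue.popleft()
            for nxt in neighbours[atom]:
                if nxt not in dist:
                    dist[nxt] = dist[atom] + 1
                    queue.append(nxt)
        for other, d in dist.items():
            if other > start:              # count each pair once
                a, b = sorted((atom_type(start), atom_type(other)))
                pairs[(a, d, b)] += 1
    return pairs


# Heavy atoms of ethanol: C0 - C1 - O2.
ethanol = atom_pairs(["C", "C", "O"], [[1], [0, 2], [1]])
print(ethanol)
```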

BCUT Descriptors

• Designed to encode atomic properties that govern intermolecular interactions

• Used in diversity analysis

• Encode atomic charge, atomic polarizability, and atomic hydrogen bonding ability

BCUT descriptors

• A type of topological index with a complex history

• B = Frank Burden

• C = Chemical Abstracts Service

• UT = University of Texas (Bob Pearlman)

• based on 3D structure of molecule

• 6 different indexes generated for each molecule

• often used as descriptors for cell-based partitioning of chemical space

• 6 descriptors = 6 dimensions

DESCRIPTORS BASED ON 3D REPRESENTATIONS

• Require the generation of 3D conformations

– Can be computationally time consuming with large data sets

– Usually must take into account conformational flexibility

– 3D fragment screens encode spatial relationships between atoms, ring centroids, and planes

Pharmacophore Keys & Other 3D Descriptors

• Based on atoms or substructures thought to be relevant for receptor binding

• Typically include hydrogen bond donors and acceptors, charged centers, aromatic ring centers and hydrophobic centers

• Others: 3D topographical indexes, geometric atom pairs, quantum mechanical calculations for HOMO and LUMO

DATA VERIFICATION AND MANIPULATION

• Data spread and distribution

– Coefficient of variation (standard deviation divided by the mean)

• Scaling (standardization): making sure that each descriptor has an equal chance of contributing to the overall analysis

• Correlations

• Reducing the dimensionality of a data set: Principal Components Analysis
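The first three checks can be sketched with the standard library. Scaling is shown as autoscaling (subtract the mean, divide by the standard deviation), which is one common choice and an assumption here; the descriptor values are made up for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def autoscale(values):
    """Rescale a descriptor to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two descriptors."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical descriptor columns for four molecules.
mol_weight = [180.2, 250.3, 310.4, 420.5]
logp = [1.2, 2.5, 3.1, 4.8]

print(round(coefficient_of_variation(mol_weight), 3))
print([round(v, 3) for v in autoscale(mol_weight)])
print(round(pearson(mol_weight, logp), 3))
```

Without scaling, a descriptor with a large numerical range such as molecular weight would dominate any distance-based analysis.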

Chemical Structure Representation and Search Systems

Topics to be Covered

Clustering

• identifying classes of molecules similar to each other, but different to those in other classes

Topological indexes

• numbers that can be calculated from connection tables

Property prediction

• predicting physicochemical or biological properties directly from connection tables

The Drug Discovery Process

• virtual screening

Cluster Analysis

process of putting molecules (or other objects) into classes, based on similarity

molecules in the same cluster are similar to each other

molecules in different clusters are different from each other

many different methods and algorithms

• different clustering methods will result in different clusters, with different relationships between them

• different algorithms can be used to implement the same method (some may be more efficient than others)

Downs, G. M., Barnard, J. M., Rev. Comput. Chem., 18 (2002)

Hierarchical and non-hierarchical

A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not

Hierarchical Agglomerative

the hierarchy is built from the bottom upwards

several different methods and algorithms

basic Lance-Williams algorithm (common to all methods) starts with table of similarities between all pairs of items

• at each step the most similar pair of molecules (or previously-formed clusters) are merged together

• until everything is in one big cluster

• methods differ in how they determine the similarity between clusters

o “single link” chooses clusters whose closest members are most similar

o “complete link” chooses clusters whose furthest members are most similar

o other methods (e.g. Group-average method and Ward’s method) use some sort of “average” member
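The merge loop can be sketched as below for single link and complete link. It works on distances rather than similarities (so "most similar" becomes "smallest distance") and recomputes cluster distances at every step instead of using the Lance-Williams update, so it is a naive illustration and not an efficient implementation.

```python
def agglomerate(dist, n_clusters, linkage="single"):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    # Single link uses the closest members, complete link the furthest.
    pick = min if linkage == "single" else max
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pick(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical one-dimensional descriptor values for five molecules.
values = [0.0, 0.2, 0.5, 5.0, 5.1]
dist = [[abs(a - b) for b in values] for a in values]

print(agglomerate(dist, 2, "single"))
print(agglomerate(dist, 2, "complete"))
```

Stopping at a chosen number of clusters corresponds to taking one slice across the hierarchy; running down to one cluster would build the whole tree.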

Hierarchical Agglomerative

Lance-Williams algorithm is slow

• O(N²) to generate pairwise similarity table initially

• this table must be updated N times, once for each merge (agglomeration) of clusters

• overall time requirements are O(N³)

more efficient algorithms can be used for some methods

• single link can be O(N log N) with k-D trees algorithm

• Ward’s method and Group-Average method can be O(N²) using Murtagh’s Reciprocal Nearest-Neighbour algorithm

Hierarchical Divisive

the hierarchy is built from the top downwards

at each step a cluster is chosen to divide, until each cluster has only one member

various ways of choosing next cluster to divide

• one with most members

• one with least similar pair of members

• etc.

various ways of dividing it

• using a single descriptor (e.g. a fingerprint bit) [“monothetic”]

• using all descriptors (based on similarities between pairs of members) [“polythetic”]

most polythetic methods are slow

Non-hierarchical methods

usually faster than hierarchical

several different methods

e.g. Leader algorithm

• make a single pass through the dataset (O(N))

o if molecule is similar enough (need to define threshold) to an existing cluster, it joins that cluster

o otherwise it starts (leads) a new cluster

• results depend on order of processing
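A minimal sketch of the single pass, assuming that "similar enough to an existing cluster" means similar enough to that cluster's first member (its leader). The numeric similarity function is a stand-in chosen for the example; fingerprint similarities would be used in practice.

```python
def leader_clustering(items, similarity, threshold):
    """One pass: join the first cluster whose leader is similar enough."""
    clusters = []                      # first element of each list is the leader
    for item in items:
        for cluster in clusters:
            if similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:                          # no existing leader was close enough
            clusters.append([item])
    return clusters


def similarity(a, b):
    """Toy similarity for numbers between 0 and 1 (illustration only)."""
    return 1.0 - abs(a - b)


print(leader_clustering([0.0, 0.4, 0.8], similarity, 0.5))
print(leader_clustering([0.4, 0.0, 0.8], similarity, 0.5))
```

The same three values give two clusters in one order and a single cluster in the other, which is the order dependence noted above.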

Nearest neighbour methods

non-hierarchical

best-known example is the Jarvis-Patrick method

• identify top k (e.g. 20) nearest neighbours for each molecule

• two molecules join same cluster if they have at least kmin of their top k nearest neighbours in common

very popular for chemical applications from mid 1980s

rather less popular now

tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)

some variations have been tried

• variable-length nearest-neighbour lists (threshold similarity)

• reclustering of singletons
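The basic shared-neighbour rule can be sketched as follows, working from a full distance matrix. Only the rule as stated above is implemented; some formulations additionally require the two molecules to appear in each other's neighbour lists, and that extra test is not included here. The union-find bookkeeping and the example data are choices made for this sketch.

```python
def jarvis_patrick(dist, k, k_min):
    """Cluster items that share at least k_min of their k nearest neighbours."""
    n = len(dist)
    nearest = [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]
    parent = list(range(n))            # union-find forest over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if len(nearest[i] & nearest[j]) >= k_min:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# Hypothetical one-dimensional descriptor values forming two tight groups.
values = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
dist = [[abs(a - b) for b in values] for a in values]
print(jarvis_patrick(dist, k=3, k_min=2))
```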

Relocation methods

non-hierarchical

• clusters are initialised (sometimes randomly)

• iterative refinement then relocates molecules between clusters to improve some objective function

simplest and most common example is K-means

• select k random molecules to act as cluster seeds

o k is required number of clusters

• assign each remaining molecule to closest seed

• calculate “centroid” (mean) of each cluster

• relocate molecules to nearest cluster centroid if necessary

• recalculate centroids and repeat until no further changes
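The steps above can be sketched in plain Python. The seeds are passed in as indices so that the run is reproducible; choosing them at random, as described, would be a one-line change. The two-dimensional points are made-up descriptor values.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, seed_indices, max_iter=100):
    """Return (cluster index per point, final centroids)."""
    centroids = [points[i] for i in seed_indices]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:      # no molecule moved: converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                       # keep old centroid if cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(points, seed_indices=[0, 3])
print(assignment)
print(centroids)
```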

K-means clustering

K-means has the advantage of being fast (O(Nk)) and is popular with statisticians

however it has several disadvantages

• sensitive to the initial choice of seeds

o can try non-random sets of seeds

• can converge to a local (rather than global) optimum

• tends to produce only “spherical” clusters of similar size

• difficult to decide what value of k to choose

Overlapping and fuzzy clusters

some clustering methods produce overlapping clusters, in which some molecules are members of more than one cluster

in fuzzy clustering, each molecule has partial membership of all clusters

• degree of membership in each cluster is in range 0.0 to 1.0

• sum of membership over all clusters is 1.0

fuzzy clustering is arguably a better representation of the “real world” but makes it difficult to make decisions

Which method is best?

as with similarity measures and structure descriptors, there is no definite agreement

• this is probably why there are so many methods

empirical property-prediction experiments have been done to evaluate different methods

• predicted property value is average of other members of same cluster (Sheffield University work)

o calculate correlation coefficient between observed and predicted properties

• active and inactive molecules should be in separate clusters (Abbott Laboratories work)
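The property-prediction style of evaluation can be sketched as follows: each molecule's value is predicted from the other members of its cluster, and the predictions are then correlated with the observed values. The cluster labels and property values are made up for illustration.

```python
def predict_from_clusters(cluster_ids, values):
    """Predict each value as the mean of the other members of its cluster."""
    predictions = []
    for i, cid in enumerate(cluster_ids):
        others = [v for j, (c, v) in enumerate(zip(cluster_ids, values))
                  if c == cid and j != i]
        # A singleton has no other members, so no prediction is possible.
        predictions.append(sum(others) / len(others) if others else None)
    return predictions


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


cluster_ids = [0, 0, 0, 1, 1, 1]
observed = [1.0, 1.2, 1.1, 3.0, 3.3, 2.9]     # hypothetical property values
predicted = predict_from_clusters(cluster_ids, observed)
r = pearson(observed, predicted)
print([round(p, 2) for p in predicted], round(r, 3))
```

A clustering that groups molecules with similar property values gives a high correlation; a poor clustering pulls it towards zero.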

Which method is best?

• Sheffield University work (mid-1980s) showed Ward’s (hierarchical agglomerative) and Jarvis-Patrick method gave best predictions

o at that time Jarvis-Patrick was significantly faster

• Joint CAS/Sheffield/BCI study in early 1990s showed Ward’s and “minimum diameter” (hierarchical divisive) significantly better than Jarvis-Patrick

• similar conclusions in Abbott study (mid 1990s)

• more recent work at Eli Lilly recommended K-means

o certainly better for very large datasets, because of speed

• still a very active area of research

How many clusters to choose?

Hierarchical methods allow user to choose any slice across the hierarchy

but what level is the best one to choose?

there are methods that give a “score” to each level

• get the fewest and “tightest” clusters

How many clusters to choose?

Non-hierarchical methods

• Jarvis-Patrick method decides for itself on basis of user-selected k and kmin

• with other methods (e.g. k-means) it is more difficult

o what is the “natural” number of clusters?

The “natural number” of clusters

What is clustering used for?

• compound acquisition

o purchase compounds from clusters that contain no compounds from existing collections

• high-throughput screening

o choose one compound per cluster in first round

o test other compounds from clusters where hits are found

• homogeneous subsets for QSAR

• diverse subset selection from combinatorial libraries

o maximise different clusters represented; penalise over-representation of individual clusters

• classification of new compounds

o which existing cluster is a new compound closest to?

A clustering of clustering methods

Descriptor calculation

various numerical descriptors can be calculated for chemical structures

• molecular weight

• counts of features

o hydrogen bond donors/acceptors

o aromatic rings

o rotatable bonds

o etc

these can be used in similarity searching and clustering

Property Prediction

it is often useful to be able to calculate a physico-chemical property for a compound from its structure

• regression equations have been used to do this from topological indexes, but usually only for limited sets of molecules

• it would be better to have a more general method

some important properties have had a lot of attention in this respect

logP

octanol-water partition coefficient

• has been found very useful in predicting the bioavailability of a drug

o it needs to be soluble enough in lipid to be able to cross cell membranes

o but soluble enough in water not to get stuck there

• many methods have been proposed for calculating a good estimate from the structure

Leo, A. J. Chemical Reviews, 1993, 93, 1281-1306

logP calculation

fragment-based methods (ClogP)

• pioneered by Corwin Hansch and Al Leo (Pomona College)

• identify large fragments, whose contribution to logP value is known from their occurrence in other compounds with measured logP

• large “training set” of compounds with accurately-measured logP (the “Starlist”)

• works very well if test compound has the right fragments

o problems arise if test compound contains fragments that are “missing” from the training set

logP calculation

atom-based methods (AlogP, XlogP, SlogP)

• pioneered by Gordon Crippen (Univ. Michigan)

• based on identifying a series of “atom types” in the molecule

o essentially, small atom-centred fragments

o usually 60-200 such fragments are involved

• each atom-type is assigned a numerical value

• logP is obtained by adding values for the atom types present in the test molecule

• atom-type values are obtained by regression analysis, based on a set of compounds with measured logP

• sometimes some extra correction factors are used too
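The additive scheme can be sketched as a simple table lookup. The atom-type names and contribution values below are invented purely for illustration; they are not the parameters of AlogP, XlogP, SlogP or any published method, and assigning atom types from a structure is not shown.

```python
# Hypothetical atom-type contributions (illustration only, not real parameters).
ATOM_TYPE_VALUES = {
    "C.aliphatic": 0.30,
    "C.aromatic": 0.25,
    "O.hydroxyl": -0.40,
    "N.amine": -0.60,
}


def additive_logp(atom_types, correction=0.0):
    """Sum the contributions of the atom types present, plus any correction."""
    return sum(ATOM_TYPE_VALUES[t] for t in atom_types) + correction


# Atom types for a hypothetical two-carbon alcohol.
estimate = additive_logp(["C.aliphatic", "C.aliphatic", "O.hydroxyl"])
print(round(estimate, 2))
```

In a real method the table values come from regression against measured logP, so every atom type in a test molecule must have been seen in the training set.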

Atom-based property calculations

atom-based principle has also been used for other properties

• molar refractivity

• charged partial surface area

• intestinal absorption

• etc.

The Drug Discovery Process

pharmaceutical companies are in the business of identifying compounds that may be useful new drugs

• tens or hundreds of thousands of compounds are made and tested every year (“screening”)

o tests are usually simple binding assays (does the molecule bind to a target protein?)

• testing is done in two stages

o Lead Generation (find a compound that binds)

o Lead Optimisation (find a compound that binds better)

• chemical informatics techniques are important at both these stages

Drug development

Patents will be applied for as soon as a good compound (or class of compounds) is identified

• need to get in before the competition

• patent life (20 years) starts counting down from here

Much development work has still to be done

• animal tests

• clinical trials (several phases)

• regulatory requirements

• many drugs may “fail” during the process

Patent may have only 10 years left to run by the time a new drug is marketed

The need for early attrition

Only a tiny proportion of compounds make it all the way through this process

If a potential new drug is going to “fail” it is better that it fail early

• before too much money has been spent on it

If you can identify the failures before you even synthesise them, so much the better

• “virtual screening”

Three stages of screening

in silico (“in silicon”)

• virtual screening

• entirely in the computer

in vitro (“in glass”)

• uses test tube models of biological systems

• enzyme assays etc

• requires real compounds

in vivo (“in life”)

• compounds tested in living organisms

Virtual Screening

Often based on concept of “drug-likeness”

• do these compounds actually look like drugs?

• need to calculate appropriate properties

o Is compound likely to have suitable properties for

• Absorption

• Distribution

• Metabolism

• Excretion

• Toxicity

o ADMET or ADME/Tox

• suitable property ranges identified by analysing databases of existing drugs

Lipinski Rule of Five

Widely used set of property criteria for virtual screening

Developed at Pfizer, 1997

• molecular weight ≤ 500

• logP ≤ 5.0

• ≤ 5 hydrogen bond donors

o number of –OH and –NH groups

• ≤ 10 hydrogen bond acceptors

o number of O and N atoms
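As a filter, the rule is usually applied by counting violations. The sketch below uses the limits as commonly stated (a violation is a molecular weight over 500, logP over 5, more than 5 donors or more than 10 acceptors); the function name and the two sets of property values are made up for illustration.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Number of Rule of Five criteria that a compound breaks (0 to 4)."""
    checks = [
        mol_weight > 500,
        logp > 5.0,
        h_donors > 5,
        h_acceptors > 10,
    ]
    return sum(checks)


# Hypothetical property values for two compounds.
small = rule_of_five_violations(180.2, 1.2, 1, 4)
large = rule_of_five_violations(650.0, 6.1, 6, 12)
print(small, large)
```

How many violations to tolerate is a screening policy decision and is left to the caller.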

Lead generation

when testing a large number of compounds to identify a new “lead”, it is obviously desirable to have them as different from each other as possible

• pharmaceutical companies purchase large numbers of compounds from 3rd party suppliers (often Eastern European) to test

• they also synthesise combinatorial libraries of compounds

chemical “diversity” is important feature of such compound collections and libraries

• the idea is to cover as much of “chemical space” as possible

Lead optimisation

when a “lead” compound has been identified, the next stage is to find compounds that are similar to it, which might bind even better

• this can involve similarity searching to find compounds previously made, or available commercially for purchase

in later stages, as activity of compound becomes better understood, medicinal chemists will make specific changes to the molecule which they hope will improve its binding affinity

Conclusions

Clustering is a useful technique for identifying classes of molecules in a dataset

• there are many different methods and algorithms

• some are faster or more effective than others

Topological indices are numbers that can be calculated from structures represented as connection tables

• there are many different indices available, some of which are designed to represent gross features like shape and branching

Topological indices can be used in regression equations to predict properties of a structure

• other methods are available for property prediction, based on summing scores for different fragments or atom types

Calculated properties can be used in “virtual screening”

Conclusions

Many computer techniques are available to manipulate chemical structure representations

• some have inherent limitations but are nonetheless useful

Structure and substructure search algorithms are among the most important and useful

There are useful techniques for calculating estimates of physico-chemical and other properties

Identifying structurally similar molecules can lead to identifying molecules with similar biological activities

Chemoinformatics is now a vital part of the drug discovery process in the pharmaceutical industry
