Data-analytic sins in property-based molecular design Peter Kenny [email protected] | http://fbdd-lit.blogspot.com


This harangue combines critiques of correlation inflation and ligand efficiency metrics. Who will be summoned to the headmaster's study?


Page 1: Data-analytic sins in property-based molecular design

Data-analytic sins in property-based molecular design

Peter Kenny

[email protected] | http://fbdd-lit.blogspot.com

Page 2: Data-analytic sins in property-based molecular design

Target engagement potential (TEP): a basis for molecular design?

$$\mathrm{TEP} = \frac{[\mathrm{Drug}(\mathbf{X}, t)]_{\mathrm{free}}}{K_d}$$

Page 3: Data-analytic sins in property-based molecular design

Property-based design as search for ‘sweet spot’

Page 4: Data-analytic sins in property-based molecular design

Correlation

• Strong correlation implies good predictivity

– I have observed a correlation so you must use my rule

• Multivariate data analysis (e.g. PCA) usually involves transformation to orthogonal basis

• Applying cutoffs (e.g. MW restriction) to data can distort correlations

• Noise and range limits in data
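A minimal sketch of the cutoff effect, assuming a hypothetical data set that is not taken from the slides (Y = X plus Gaussian noise of SD 3, with X uniform on 0 to 10). Restricting X to a narrow range lowers the Pearson R even though the underlying relationship is unchanged.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: same trend and same noise everywhere in X
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(5000)]
ys = [x + rng.gauss(0.0, 3.0) for x in xs]

# Apply a cutoff on X (an illustrative stand-in for, e.g., an MW restriction)
kept = [(x, y) for x, y in zip(xs, ys) if x < 3.0]

r_full = pearson_r(xs, ys)
r_cut = pearson_r([x for x, _ in kept], [y for _, y in kept])
print(f"all data: R = {r_full:.2f}; X < 3 only: R = {r_cut:.2f}")
```

The noise level is identical in both cases; only the range of X sampled has changed.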

Page 5: Data-analytic sins in property-based molecular design

Quantifying strengths of relationships between continuous variables

• Correlation measures

– Pearson product-moment correlation coefficient (R)

– Spearman's rank correlation coefficient (ρ)

– Kendall rank correlation coefficient (τ)

• Quality of fit measures

– Coefficient of determination (R²) is the fraction of the variance in Y that is explained by the model

– Root mean square error (RMSE)
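The measures listed above can be sketched in plain Python as follows. The function names and the five-point example are illustrative assumptions, and the Kendall coefficient shown is the simple tau-a form without tie corrections.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient (R)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson R computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def r2_and_rmse(y_obs, y_pred):
    """Coefficient of determination and root mean square error of a model."""
    n = len(y_obs)
    my = sum(y_obs) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - my) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]           # monotonic in x but not linear
y_fit = [6 * a - 7 for a in x]  # least-squares line for these five points

print(pearson_r(x, y), spearman_rho(x, y), kendall_tau(x, y))
print(r2_and_rmse(y, y_fit))
```

The rank-based measures only see the ordering of the points, whereas R, R² and RMSE respond to how far the points sit from a straight line.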

Page 6: Data-analytic sins in property-based molecular design

Preparation of synthetic data sets

Kenny & Montanari (2013) JCAMD 27:1-13 DOI

Add Gaussian noise (SD=10) to Y

Page 7: Data-analytic sins in property-based molecular design

Correlation inflation by hiding variation

See Hopkins, Mason & Overington (2006) Curr Opin Struct Biol 16:127-136 DOI

Leeson & Springthorpe (2007) NRDD 6:881-890 DOI

Data is naturally binned (X is an integer) and mean value of Y is calculated for each value of X. In some studies, averaged data is only presented graphically and it is left to the reader to judge the strength of the correlation.

R = 0.34 R = 0.30 R = 0.31

R = 0.67 R = 0.93 R = 0.996
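A rough sketch of how averaging inflates correlation. Only the noise level (Gaussian, SD = 10) comes from the slides; the underlying trend (Y = X), the integer X values (0 to 10) and the numbers of points per X value are assumptions made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def make_data(n_per_x, sd=10.0, seed=0):
    """Integer X (assumed 0..10), Y = X (assumed trend) plus Gaussian noise."""
    rng = random.Random(seed)
    xs, ys = [], []
    for x in range(0, 11):
        for _ in range(n_per_x):
            xs.append(x)
            ys.append(x + rng.gauss(0.0, sd))
    return xs, ys

def bin_means(xs, ys):
    """Mean Y for each distinct value of X (the 'natural' bins)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    keys = sorted(sums)
    return keys, [sums[k] / counts[k] for k in keys]

def raw_and_binned_r(n_per_x):
    xs, ys = make_data(n_per_x)
    bx, by = bin_means(xs, ys)
    return pearson_r(xs, ys), pearson_r(bx, by)

for n_per_x in (10, 100, 1000):
    r_raw, r_binned = raw_and_binned_r(n_per_x)
    print(f"{n_per_x:5d} points per X: raw R = {r_raw:.2f}, R of bin means = {r_binned:.3f}")
```

The relationship between individual Y values and X is equally weak in every run; the correlation of the bin means rises simply because more points are averaged in each bin.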

Page 8: Data-analytic sins in property-based molecular design

Correlation Inflation in Flatland

See Lovering, Bikker & Humblet (2009) JMC 52:6752-6756 DOI

Statistic   N = 1202                        N = 8
R           0.247 (95% CI: 0.193 | 0.299)   0.972 (95% CI: 0.846 | 0.995)
ρ           0.215 (P < 0.0001)              0.970 (P < 0.0001)
τ           0.148 (P < 0.0001)              0.909 (P = 0.0018)

Page 9: Data-analytic sins in property-based molecular design

Masking variation with standard error

See Gleeson (2008) JMC 51:817-834 DOI

Partition by value of X into 4 bins with equal numbers of data points and display 95% confidence interval for mean (green) and mean ± SD (blue) for each bin.

R = 0.12 R = 0.29 R = 0.28
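The binning described above can be sketched as follows, assuming a hypothetical data set (X uniform on 0 to 10, Y = X plus Gaussian noise of SD 10) and a normal approximation for the 95% confidence interval. The SD in each bin stays near the noise level whatever the sample size, while the confidence interval for the mean shrinks as N grows.

```python
import math
import random
import statistics

def quartile_bins(xs, ys):
    """Sort by X and split the Y values into 4 bins with equal numbers of points."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    size = len(ordered) // 4
    return [ordered[i * size:(i + 1) * size] for i in range(4)]

def summarize(bin_ys):
    """Return (mean, SD, half-width of the 95% CI for the mean)."""
    sd = statistics.stdev(bin_ys)
    half_width = 1.96 * sd / math.sqrt(len(bin_ys))  # normal approximation
    return statistics.fmean(bin_ys), sd, half_width

def simulate(n, seed=1):
    """Hypothetical data: Y = X + noise (SD 10), X uniform on 0..10."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 10.0) for x in xs]
    return [summarize(b) for b in quartile_bins(xs, ys)]

for n in (40, 400, 4000):
    for mean, sd, hw in simulate(n):
        print(f"N={n:5d} mean={mean:6.2f} SD={sd:5.2f} 95% CI half-width={hw:5.2f}")
```

Plotting only the confidence intervals would make the N = 4000 trend look tight even though individual Y values scatter as widely as they do for N = 40.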

Page 10: Data-analytic sins in property-based molecular design

The error of standard error

ANOVA for binned data sets

N      Bins   Degrees of freedom   F        P
40     4      3                    0.2596   0.8540
400    4      3                    12.855   < 0.0001
4000   4      3                    115.35   < 0.0001
4000   2      1                    270.91   < 0.0001
4000   8      7                    50.075   < 0.0001

“In each plot provided, the width of the errors bars and the difference in the mean values of the different categories are indicative of the strength of the relationship between the parameters.” Gleeson (2008) JMC 51:817-834 DOI
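A sketch of the one-way ANOVA F statistic for binned data, assuming the same kind of hypothetical data set (X uniform on 0 to 10, Y = X plus Gaussian noise of SD 10) rather than the data behind the table. P values are not computed here because they need the F distribution, which the Python standard library does not provide.

```python
import random

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of Y values."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

def binned_f(n, n_bins, seed=2):
    """Generate hypothetical data, bin by X into equal-sized bins, run ANOVA."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    pairs = sorted((x, x + rng.gauss(0.0, 10.0)) for x in xs)
    size = n // n_bins
    groups = [[y for _, y in pairs[i * size:(i + 1) * size]] for i in range(n_bins)]
    return one_way_anova_f(groups)

for n, n_bins in ((40, 4), (400, 4), (4000, 4), (4000, 2), (4000, 8)):
    f, df_b, df_w = binned_f(n, n_bins)
    print(f"N={n:5d} bins={n_bins} df=({df_b}, {df_w}) F={f:8.2f}")
```

F depends on both the sample size and the binning scheme for a fixed underlying relationship, so it says how confident we can be that a trend exists, not how strong the trend is.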

Page 11: Data-analytic sins in property-based molecular design

Know your data

• Assays are typically run in replicate making it possible to estimate assay variance

• Every assay has a finite dynamic range and it may not always be obvious what this is for a particular assay

• Dynamic range may have been sacrificed for throughput but this, by itself, does not make the assay bad

• We need to be able to analyse in-range and out-of-range data within a single unified framework

– See Lind (2010) QSAR analysis involving assay results which are only known to be greater than, or less than some cut-off limit. Mol Inf 29:845-852 DOI

Page 12: Data-analytic sins in property-based molecular design

Depicting variation with percentile plots

This graphical representation of data makes it easy to visualize variation and can be used with mixed in-range and out-of-range data. See Colclough et al (2008) BMCL 16:6611-6616 DOI
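One way percentiles can cope with out-of-range results is sketched below. This is an illustration and not necessarily the procedure used in the cited paper. It assumes a single lower limit and a single upper limit for the assay, with every in-range value lying between them; the (qualifier, value) representation and the pIC50 values are made up.

```python
import math

def percentile_censored(data, pct):
    """Nearest-rank percentile for (qualifier, value) pairs.

    Qualifier is '<', '=' or '>'. Returns the pair found at the requested
    rank, so an out-of-range answer is reported as out-of-range.
    """
    tie_break = {"<": -1, "=": 0, ">": 1}
    ordered = sorted(data, key=lambda item: (item[1], tie_break[item[0]]))
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical pIC50 results with assay limits of 4.0 and 9.0
results = [("<", 4.0), ("=", 5.2), ("=", 6.1), ("=", 6.8),
           ("=", 7.3), (">", 9.0), ("<", 4.0), ("=", 5.9)]

for pct in (10, 50, 90):
    print(pct, percentile_censored(results, pct))
```

A mean or SD cannot be computed for these results without making up values for the out-of-range compounds, whereas the median is still well defined.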

Page 13: Data-analytic sins in property-based molecular design

Binning continuous data restricts your options for analysis and places burden of proof on you to show that your conclusions are independent of the binning scheme. Think before you bin!

Averaging the binned data was your idea so don’t try blaming me this time!

Page 14: Data-analytic sins in property-based molecular design

Correlation inflation: some stuff to think about

• Model continuous data as continuous data

– RMSE is most relevant to prediction but you still need R²

– Fitted parameters may provide insight (e.g. solubility is more sensitive than potency to lipophilicity)

• When selecting training data think in terms of Design of Experiments (e.g. evenly spaced values of X)

• Try to achieve normally distributed Y (e.g. use pIC50 rather than IC50)

• Never make statements about the strength of a relationship when you’ve hidden or masked variation in the data (unless you want a starring role in Correlation Inflation 2)

• To be meaningful, a measure of the spread of a distribution must be independent of sample size

• Reviewers/editors, mercilessly purge manuscripts of statements like, “A negative correlation was observed between X and Y” or “A and B are correlated/linked”

Page 15: Data-analytic sins in property-based molecular design

Ligand efficiency metrics (LEMs) considered harmful

• We use LEMs to normalize activity with respect to risk factors such as molecular size and lipophilicity

• What do we mean by normalization?

• We make assumptions about underlying relationship between activity and risk factor(s) when we define an LEM

• LEM as measure of extent to which activity beats a trend?

Kenny, Leitão & Montanari (2014) JCAMD 28:699-710 DOI

Page 16: Data-analytic sins in property-based molecular design

Ligand efficiency metrics

Scale activity/affinity by risk factor

LE = −ΔG/HA

Offset activity/affinity by risk factor

LipE = pIC50 − ClogP

There is no reason that the dependence of activity on risk factor should be restricted to one of these two linear models
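The two kinds of metric can be sketched as below. LE is taken here as −ΔG/HA so that tighter binding gives a larger value, with ΔG = RT ln(Kd/C°); the temperature, the default standard concentration of 1 M and the example compound are assumptions for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)

def delta_g(kd_molar, temp_k=298.15, c_std_molar=1.0):
    """Free energy of binding in kcal/mol: RT ln(Kd / C_std)."""
    return R_KCAL * temp_k * math.log(kd_molar / c_std_molar)

def ligand_efficiency(kd_molar, n_heavy, temp_k=298.15, c_std_molar=1.0):
    """Scaling: divide (minus) the free energy by the heavy atom count."""
    return -delta_g(kd_molar, temp_k, c_std_molar) / n_heavy

def lipophilic_efficiency(pic50, clogp):
    """Offsetting: subtract the lipophilicity measure from the activity."""
    return pic50 - clogp

# Hypothetical compound: Kd = 1 micromolar, 20 heavy atoms, pIC50 7, ClogP 3
print(ligand_efficiency(1e-6, 20))
print(lipophilic_efficiency(7.0, 3.0))
```

Note that `c_std_molar` appears explicitly in the LE calculation but has no counterpart in LipE.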

Page 17: Data-analytic sins in property-based molecular design

Use trend actually observed in data for normalization rather than some arbitrarily assumed trend

Page 18: Data-analytic sins in property-based molecular design

There’s a reason why we say standard free energy of binding…

$$\Delta G = \Delta H - T\Delta S = RT \ln\left(\frac{K_d}{C^{\circ}}\right)$$

• Adoption of 1 M as standard concentration is arbitrary

• A view of a chemical system that changes with the choice of standard concentration is thermodynamically invalid

Page 19: Data-analytic sins in property-based molecular design

Effect on LE of changing standard concentration

NHA   Kd/M     C/M    −(1/NHA) log10(Kd/C)
10    10^-3    1      0.30
20    10^-6    1      0.30
30    10^-9    1      0.30
10    10^-3    0.1    0.20
20    10^-6    0.1    0.25
30    10^-9    0.1    0.27
10    10^-3    10     0.40
20    10^-6    10     0.35
30    10^-9    10     0.33
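The tabulated values can be reproduced with a few lines of Python, reading the tabulated quantity as −(1/NHA) log10(Kd/C). The function name is an illustrative choice.

```python
import math

def scaled_affinity(n_heavy, kd_molar, c_std_molar):
    """-(1/NHA) * log10(Kd / C): LE expressed in log10 units per heavy atom."""
    return -math.log10(kd_molar / c_std_molar) / n_heavy

compounds = [(10, 1e-3), (20, 1e-6), (30, 1e-9)]  # (NHA, Kd in M) from the table

table = {}
for c_std in (1.0, 0.1, 10.0):
    table[c_std] = [scaled_affinity(n, kd, c_std) for n, kd in compounds]
    print(c_std, [f"{v:.2f}" for v in table[c_std]])
```

The three compounds look equally efficient at 1 M, the largest looks most efficient at 0.1 M, and the smallest looks most efficient at 10 M, so the ranking is set by an arbitrary choice of concentration unit.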

Page 20: Data-analytic sins in property-based molecular design

Scaling transformation of parallel lines by dividing Y by X

(This is how ligand efficiency is calculated)

Size dependency of LE is consequence of non-zero intercept
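The algebra behind this statement is short. Writing a straight line with intercept $a$ and slope $b$ (symbols chosen here for illustration), the scaling transformation gives:

```latex
Y = a + bX
\quad\Longrightarrow\quad
\frac{Y}{X} = b + \frac{a}{X}
```

The scaled quantity is independent of $X$ only when $a = 0$. Parallel lines share $b$ but differ in $a$, so after scaling they become distinct curves in $1/X$ that converge as $X$ grows.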

Page 21: Data-analytic sins in property-based molecular design

Affinity plotted against molecular weight for minimal binding elements against various targets in inhibitor deconstruction study showing variation in intercept term

Hajduk PJ (2006) J Med Chem 49:6972–6976 DOI

Is it valid to combine results from different assays in LE analysis?

Page 22: Data-analytic sins in property-based molecular design

Offsetting transformation of lines with different slope and common intercept by subtracting X from Y

(This is how lipophilic efficiency is calculated)

Thankfully (hopefully?) nobody has ‘discovered’ lipophilicity-dependent lipophilic efficiency yet
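The corresponding algebra for offsetting, again with intercept $a$ and slope $b$ as illustrative symbols:

```latex
Y = a + bX
\quad\Longrightarrow\quad
Y - X = a + (b - 1)X
```

The offset quantity is independent of $X$ only when $b = 1$. Lines with a common intercept and different slopes still fan out after the transformation, so the offset metric keeps a dependence on $X$ for every line whose slope is not unity.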

Page 23: Data-analytic sins in property-based molecular design

Linear fit of ΔG for published data set

Mortenson & Murray (2011) JCAMD 25:663-667 DOI

Page 24: Data-analytic sins in property-based molecular design

Ligand efficiency, group efficiency and residuals plotted for published data set

Page 25: Data-analytic sins in property-based molecular design

Some more stuff to think about

• Normalize activity using trend actually observed in data (this means you have to model the data)

• Residuals are invariant with respect to choice of standard concentration

• Residuals can be used with other functional forms (e.g. non-linear and multi-linear)
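A minimal sketch of normalization by residuals, assuming a straight-line fit of affinity (as −log10(Kd/C)) against heavy atom count. The five compounds are hypothetical and the fit is ordinary least squares.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def residuals(xs, ys):
    """Observed minus fitted: how far each compound beats (or trails) the trend."""
    slope, intercept = linear_fit(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical compounds: (heavy atom count, Kd in M)
data = [(10, 2e-3), (15, 1e-4), (20, 5e-7), (25, 3e-7), (30, 1e-9)]
n_heavy = [n for n, _ in data]

def affinity(c_std_molar):
    return [-math.log10(kd / c_std_molar) for _, kd in data]

res_1M = residuals(n_heavy, affinity(1.0))
res_1mM = residuals(n_heavy, affinity(1e-3))
le_1M = [a / n for a, n in zip(affinity(1.0), n_heavy)]
le_1mM = [a / n for a, n in zip(affinity(1e-3), n_heavy)]

print([f"{r:+.2f}" for r in res_1M])
print([f"{r:+.2f}" for r in res_1mM])
```

Changing the standard concentration shifts every affinity by the same constant, which the fitted intercept absorbs; the residuals are unchanged while the scaled values in `le_1M` and `le_1mM` differ by an amount that depends on molecular size.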