Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

Matthew Segall, Peter Hunt, Ed Champness

ACS Spring National Meeting, COMP, March 16th 2016

© 2016 Optibrium Ltd. Optibrium™, StarDrop™, Auto-Modeller™, Card View™ and Glowing Molecule™ are trademarks of Optibrium Ltd.



“We demand rigidly defined areas of doubt and uncertainty”

Douglas Adams, The Hitchhiker’s Guide to the Galaxy

Overview

• Uncertainties in model predictions

− Domain of applicability

• Introduction to Gaussian Processes (GPs)

• Simple illustration of GPs

− Uncertainty related to ‘domain of applicability’

− Dealing with data variability as a source of uncertainty

− Handling ‘missing’ or sparse data

• Automatic relevance determination

− Identifying most important descriptors

• Practical example of GPs applied to QSAR

• Conclusions

Uncertainty in Model Predictions


Quantitative Structure-Activity Relationships

$y = f(x_1, x_2, x_3, \ldots)$

• Data

− Quality data is essential

− Public data needs very careful curation* (and may not be good enough)

• Descriptors, e.g.

− Whole molecule properties, e.g. logP, MW, PSA…

− Structural descriptors, SMARTS, fingerprints…

• Statistical fitting or machine learning method, e.g.

− Multiple linear regression, partial least squares

− Artificial neural networks, support vector machines, random forest, Gaussian processes…

*Waldman et al. JCAMD (2015) DOI: 10.1007/s10822-015-9865-0

Statistical uncertainty


Sources of Uncertainty in Model Predictions

• Experimental noise in training data

• Descriptors may not capture all sources of variation

− Modelled property may not be ‘smooth’ in descriptor space, limiting ability to interpolate

• New compound may be ‘different’ from those used to train the model

− ‘Domain of applicability’

− Models often have a limited ability to extrapolate beyond the descriptor space represented by the training set


Assessing Predictive Ability: Domain of Applicability

• The diversity of the training set defines the domain of applicability of the model

• The position of a new compound relative to the domain of applicability should be reflected in the reported confidence in the prediction

• Can we do better than an ‘in’ or ‘out’ indication?

[Figure: axes Descriptor 1 and Descriptor 2]

Assessing Predictive Ability: Dealing with ‘gaps’ in coverage

• Distribution of the training set may not be uniform and there may be ‘gaps’ or sparsely sampled regions

• Is this compound ‘in’ or ‘out’ of the domain of applicability?

[Figure: axes Descriptor 1 and Descriptor 2]

Introduction to Gaussian Processes (GPs)

Modelling Techniques: Gaussian Processes

• A machine learning method based on a Bayesian approach

• Advantages:

− Does not require a priori determination of model parameters

− Nonlinear relationship modelling

− Built-in tool to prevent overtraining - no need for cross-validation

− Inherent ability to select important descriptors

− Provides uncertainty estimate for each prediction

• Sufficiently robust to enable automatic model generation

Obrezanova et al. J. Chem. Inf. Model. (2007) 47(5) pp. 1847-57

Modelling Techniques: Gaussian Processes

• Define a prior distribution over functions (controlled by hyperparameters and the covariance function, an ARD function)

• Posterior distribution: retain functions which fit the experimental data

• Prediction is the mean of the posterior distribution

• Standard deviation of the posterior distribution provides an estimate of the uncertainty in the prediction

[Figure: two panels of f(x) against x: functions from the prior distribution; functions from the posterior]

Obrezanova et al. J. Chem. Inf. Model. (2007) 47(5) pp. 1847-57
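The prior-to-posterior step described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method of Obrezanova et al.: the squared-exponential kernel, the fixed hyperparameter values (`length_scale`, `signal_var`, `noise_var`) and the function names are all chosen here for illustration only.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_predict(x_train, y_train, x_new, noise_var=0.01, **kernel_args):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = sq_exp_kernel(x_train, x_train, **kernel_args)
    K += noise_var * np.eye(len(x_train))        # assumed noise on training data
    K_s = sq_exp_kernel(x_train, x_new, **kernel_args)
    K_ss = sq_exp_kernel(x_new, x_new, **kernel_args)
    L = np.linalg.cholesky(K)                    # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                         # prediction = posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0) # posterior variance
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical training set drawn from a sine function
x_train = np.linspace(0.0, 10.0, 25)
y_train = np.sin(x_train)
x_new = np.array([5.0, 20.0])  # one point inside, one far outside the training range
mean, sd = gp_predict(x_train, y_train, x_new)
```

With these assumed settings the standard deviation at x = 20 returns to the prior value, while at x = 5 it stays small, which is the behaviour the slide describes as an uncertainty estimate for each prediction.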


Gaussian Processes: Hyperparameters

• Learning the Gaussian Process ~ finding hyperparameters

− Optimise the marginal log-likelihood (prevents overtraining)

− Fits a parameter corresponding to an estimate of the noise in the input data (assumed to be normally distributed)

• Techniques for finding hyperparameters

− “Fixed” values for length scales. Search for noise parameter

− Forward variable selection provides feature selection

− Optimisation by conjugate gradient methods

o Length scales show which descriptors are most relevant

− Nested sampling

o Search in the full hyperparameter space

o Search does not get trapped in local maxima

[Annotation alongside the list of techniques: computational demand]

Obrezanova et al. J. Chem. Inf. Model. (2007) 47(5) pp. 1847-57
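For reference, the marginal log-likelihood that the hyperparameter search optimises can be written out as below. The slides do not give the formula, so this is the standard textbook form for a zero-mean GP with Gaussian noise. The symbols are assumptions introduced here: $K_\theta$ is the covariance matrix of the $N$ training points, $\sigma_n^2$ the fitted noise variance, $\sigma_f^2$ a signal variance and $\ell_d$ the length scale of descriptor $d$. The ARD covariance shown is one common choice and may differ in detail from the one used in the cited paper.

```latex
% Marginal log-likelihood of N training observations y under a zero-mean GP prior
\log p(\mathbf{y} \mid X, \theta)
  = -\tfrac{1}{2}\, \mathbf{y}^{\top} \left(K_\theta + \sigma_n^2 I\right)^{-1} \mathbf{y}
    - \tfrac{1}{2} \log \left| K_\theta + \sigma_n^2 I \right|
    - \tfrac{N}{2} \log 2\pi

% One common ARD covariance function (assumed form): one length scale per descriptor d
k(\mathbf{x}, \mathbf{x}')
  = \sigma_f^2 \exp\!\left( -\tfrac{1}{2} \sum_{d} \frac{(x_d - x'_d)^2}{\ell_d^2} \right)
```

The first term rewards fit to the data and the second penalises model complexity, which is why maximising this quantity guards against overtraining. In the ARD form a large $\ell_d$ means descriptor $d$ has little influence on the covariance, so the fitted length scales indicate descriptor relevance.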


Gaussian Processes: Nested Sampling

• Illustration for 2 variables

• Find maximum of likelihood:

[Figure: axes variable 1 and variable 2]

Obrezanova et al. J. Chem. Inf. Model. (2007) 47(5) pp. 1847-57
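The idea of nested sampling for two variables can be sketched as follows. This is a simplified illustration, not the implementation in the cited paper: the likelihood function, the uniform prior box, the number of live points and the use of plain rejection sampling are all assumptions made here, and the evidence calculation that full nested sampling also performs is omitted because the slide only concerns locating the maximum.

```python
import numpy as np

def log_likelihood(theta):
    """Hypothetical 2-variable log-likelihood with a single peak at (1, -2)."""
    return -0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2)

def nested_sampling_max(log_like, low, high, n_live=50, n_iter=300, seed=0):
    """Shrink a set of live points towards the likelihood maximum."""
    rng = np.random.default_rng(seed)
    live = rng.uniform(low, high, size=(n_live, len(low)))
    live_ll = np.array([log_like(p) for p in live])
    for _ in range(n_iter):
        worst = np.argmin(live_ll)
        threshold = live_ll[worst]
        # Replace the worst live point with a prior draw of higher likelihood
        while True:
            candidate = rng.uniform(low, high)
            cand_ll = log_like(candidate)
            if cand_ll > threshold:
                break
        live[worst] = candidate
        live_ll[worst] = cand_ll
    best = np.argmax(live_ll)
    return live[best], live_ll[best]

low = np.array([-5.0, -5.0])
high = np.array([5.0, 5.0])
best_theta, best_ll = nested_sampling_max(log_likelihood, low, high)
```

Because every replacement is drawn from the whole prior box, the search is not tied to a single starting point, which is the property the slides contrast with gradient-based optimisation. The rejection step becomes slower as the likelihood threshold rises, which illustrates the higher computational demand.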


Simple Illustration of GPs

‘Toy’ Example: Training set from sin function in 1 dimension

[Figure: y against x. Legend: sin function; training set; lower boundary of training set; upper boundary of training set]

Partial Least Squares: Linear model not appropriate

[Figure: y against x. Legend: sin function; PLS prediction]

Domain of applicability based on Hotelling’s T² test with 95% confidence limit

Error bars based on RMSE inside and outside of domain of applicability
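The slides do not state how the Hotelling’s T² test was applied. A common formulation, given here as an assumption, treats a new compound with descriptor vector $\mathbf{x}$ as inside the domain when its T² distance from the training set falls below a limit derived from the F distribution. Here $\bar{\mathbf{x}}$ and $S$ are the mean and sample covariance of the $N$ training compounds and $p$ is the number of descriptors ($p = 1$ in the toy example).

```latex
T^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^{\top} S^{-1} (\mathbf{x} - \bar{\mathbf{x}})

\text{inside the domain if}\quad
T^2(\mathbf{x}) \le \frac{p\,(N-1)(N+1)}{N\,(N-p)}\; F_{0.95;\,p,\,N-p}
```

This yields a single ‘in’ or ‘out’ decision, so the error bars take only two values, one for each side of the boundary.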

Radial Basis Function Model

[Figure: y against x. Legend: sin function; RBF prediction]

Domain of applicability based on Hotelling’s T² test with 95% confidence limit

Error bars based on RMSE inside and outside of domain of applicability

Gaussian Processes (Nested Sampling)

[Figure: y against x. Legend: sin function; GPNEST prediction]

Error bars estimated by standard deviation of Gaussian process for each predicted value

Training Set with Noise: Normally distributed error with standard deviation of 2

[Figure: y against x. Legend: sin function; training set]

RBF Model of Noisy Data

[Figure: y against x. Legend: sin function; RBF prediction]

Greater error in prediction in domain of applicability, so error bars increase accordingly

GP Model of Noisy Data

[Figure: y against x. Legend: sin function; GPNEST prediction]

GP infers underlying functional form, fits model of noise and corrects error bars accordingly

More difficult to extrapolate, but this is accounted for outside domain of applicability

Training Set with Missing Data

[Figure: y against x. Legend: sin function; training set]

GP Model Built with Missing Data

[Figure: y against x. Legend: sin function; GPNEST prediction]

Where data is sparse or missing, uncertainty in prediction is higher, as reflected by larger error bars

Automatic Relevance Determination: Identifying most important descriptors

Experiment: Detecting relevant descriptors

• 100 training data points

• One descriptor (x) with perfect linear correlation with property (y)

• Hide this descriptor in a data set containing Nrandom randomly generated descriptors

• Can a method find the relevant descriptor?

Identifier     y    x     Rnd 1     Rnd 2     Rnd 3     Rnd 4     …         Rnd Nrandom
Compound 1     0    2     1.252494  4.741985  2.14597   9.457343  3.958759  4.780421
Compound 2     1    2.1   8.64592   1.747653  6.76429   3.99527   7.626024  8.251326
Compound 3     2    2.2   7.064635  5.553097  0.355306  9.588649  2.791829  9.871042
Compound 4     3    2.3   4.783329  5.126768  3.525646  2.39005   0.392087  7.550868
…              4    2.4   7.077723  1.938555  2.028159  7.487378  9.672227  9.300353
Compound 100   99   11.9  0.588886  7.136536  9.538188  1.295742  3.522841  9.480185

[Figure: y against x]
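A data set of this shape can be generated as sketched below. The relation x = 2 + 0.1y is read off the table rows; drawing the random descriptors uniformly between 0 and 10 is an assumption based on the range of the values shown, and the function name and seed are illustrative.

```python
import numpy as np

def make_dataset(n_random, n_compounds=100, seed=0):
    """One relevant descriptor hidden among n_random random descriptors."""
    rng = np.random.default_rng(seed)
    y = np.arange(n_compounds, dtype=float)   # property: 0, 1, ..., 99
    x = 2.0 + 0.1 * y                         # relevant descriptor, as in the table
    # Assumed distribution for the irrelevant descriptors
    noise_cols = rng.uniform(0.0, 10.0, size=(n_compounds, n_random))
    descriptors = np.column_stack([x, noise_cols])
    return descriptors, y

descriptors, y = make_dataset(n_random=20)

# Absolute correlation of each descriptor column with the property
corr = [abs(np.corrcoef(descriptors[:, j], y)[0, 1])
        for j in range(descriptors.shape[1])]
```

Column 0 is the relevant descriptor; the remaining columns correlate with y only by chance, so a method that handles irrelevant descriptors well should keep its accuracy as `n_random` grows.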

Assessing Predictive Ability: Validation of Regression Models

• Coefficient of Determination, R²

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i^{obs} - y_i^{pred}\right)^2}{\sum_{i=1}^{N} \left(y_i^{obs} - \bar{y}^{obs}\right)^2}$$

− Measure of fit to identity line y=x

− N.B. Not the same as the squared correlation coefficient r²corr, which is a measure of fit to the best fit line; R² is a stricter test

• Root mean square error, RMSE

[Figure: example plot with identity line and linear best fit line y = 0.71x + 2.3; r²corr = 0.86, R² = 0.74]
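The difference between the two statistics is easy to check numerically. The sketch below uses a small made-up example in which the predictions are perfectly correlated with the observations but biased; the function names and data are illustrative only.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Coefficient of determination: fit to the identity line y = x."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_corr(y_obs, y_pred):
    """Squared correlation coefficient: fit to the best fit line."""
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

def rmse(y_obs, y_pred):
    """Root mean square error of the predictions."""
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

# Hypothetical data: predictions lie exactly on a line, but not the identity line
y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = 0.5 * y_obs + 2.0

r2 = r_squared(y_obs, y_pred)     # 0.625
r2c = r2_corr(y_obs, y_pred)      # 1.0
err = rmse(y_obs, y_pred)         # sqrt(0.75)
```

Here r²corr is 1 because the points lie on a straight line, while R² is only 0.625 because that line is not y = x, which is why R² is the stricter test.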

Results: Detecting relevant descriptors

[Figure: R² against Nrandom (0 to 100). Legend: PLS; RF; RBF; GP Fixed; GP 2D Search; GP Opt; GP Nest]

Number of Descriptors Used in Model

• Often quoted rule of thumb… at least 5 compounds in training set per descriptor

• This is relevant for simple models where the only complexity control is the number of descriptors in the model

• But, GP Nest model includes all descriptors

− Influence of random descriptor on model is negligible

− Posterior probability of complex models is low

• Including additional non-influential descriptors in the model can be valuable

− Detect when new compound differs significantly from training set

− Uncertainty in prediction will increase


Practical Application to QSAR Modelling


hERG pIC50

• Diverse data set of 168 compounds

− All manual patch clamp measurements in mammalian cells

• Divided into training (135) and external test (33) sets

• Descriptors including

− Whole molecule properties: logP, Vx, TPSA, MW, flexibility…

− 156 structural descriptors expressed as SMARTS

Method                 R²     RMSE
GP (Nested sampling)   0.72   0.64
RF                     0.68   0.68
RBF                    0.70   0.66

hERG pIC50 GP Nested Sampling Results

[Figure: predicted hERG pIC50 against observed hERG pIC50]

64% within 1 SD, 94% within 2 estimated SD (assuming 0.5 log units uncertainty in experimental data)
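A coverage check of this kind can be sketched as follows. The slides do not say how the predicted standard deviation and the experimental uncertainty were combined; adding them in quadrature is an assumption made here. The data are synthetic and are not the hERG set, and all names are illustrative.

```python
import numpy as np

def coverage(y_obs, y_pred, sd_pred, sd_expt):
    """Fraction of observations within 1 and 2 combined standard deviations."""
    total_sd = np.sqrt(sd_pred ** 2 + sd_expt ** 2)  # assumed: add in quadrature
    z = np.abs(y_obs - y_pred) / total_sd
    return np.mean(z <= 1.0), np.mean(z <= 2.0)

# Synthetic example in which the predicted uncertainties are exactly right
rng = np.random.default_rng(0)
n = 20000
sd_pred = rng.uniform(0.3, 1.0, size=n)
y_pred = rng.uniform(4.0, 8.0, size=n)
y_obs = y_pred + rng.normal(0.0, np.sqrt(sd_pred ** 2 + 0.5 ** 2))

within_1, within_2 = coverage(y_obs, y_pred, sd_pred, sd_expt=0.5)
```

For well-calibrated, normally distributed errors the two fractions should be close to 68% and 95%, which is the benchmark against which the reported 64% and 94% can be read.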


Aqueous Solubility (logS)

• Diverse data set of 3313 compounds

− Log of intrinsic, thermodynamic, aqueous solubility in μM, measured between 20 and 30 °C

• Divided into training (2650) and external test (663) sets

• Descriptors including

− Whole molecule properties: logP, Vx, TPSA, MW, flexibility…

− 164 structural descriptors expressed as SMARTS

Obrezanova et al. J. Comp.-Aided Mol. Des. (2008) 22(6-7) pp. 431-440

Method       R²     RMSE
GP (Fixed)   0.81   0.80
RF           0.78   0.87
RBF          0.84   0.73
PLS          0.75   0.91

Aqueous Solubility (logS)

Obrezanova et al. J. Comp.-Aided Mol. Des. (2008) 22(6-7) pp. 431-440

[Figure: predicted logS (μM) against observed logS (μM)]

51% within 1 SD, 80% within 2 estimated SD (assuming 0.3 log units uncertainty in experimental data)


Conclusions

• Gaussian Processes

− Bayesian non-linear modelling technique

− Generates a probability distribution over possible functions (models)

− Explicitly calculates uncertainties for each prediction

− Similar performance to methods such as random forests, radial basis functions…

− Can also be used for classification

• Limitations

− The most exhaustive optimisation methods are computationally expensive (e.g. nested sampling, O(N⁴))

− Can’t deal with potential sources of variability (e.g. structural features) not captured by descriptors

• More information (www.optibrium.com/community)

− Obrezanova et al. JCIM (2007) 47(5) pp. 1847-57

− Obrezanova et al. JCAMD (2008) 22(6-7) pp. 431-440

− Obrezanova et al. JCIM (2010) 50 (6), pp. 1053-1061


Acknowledgements

• Olga Obrezanova

• Iskander Yusof

• Chris Leeding

• All our other colleagues at Optibrium
