Application in Cheminformatics Kristin P. Bennett Mathematical Sciences Department Rensselaer Polytechnic Institute

Kristin P. Bennett - rpi.edubennek/class/mds/lecture/lecture9-06b.pdf · every bootstrap on average. molecule size. ... Partition Training Data Training Set Validation Set Linear


Application in Cheminformatics

Kristin P. Bennett

Mathematical Sciences Department
Rensselaer Polytechnic Institute

Regression Case Study

Given for each molecule i:
    Descriptor vector $x_i$
    Bioresponse $y_i$

Bioresponse is a real-valued measurement.

Use SVM Regression.

Construct a function $f$ to predict bioresponse: $f(x_i) \approx y_i$

Kernel Regression

Assume function is linear:

$f(x) = x \cdot w + b$

Pick a loss, e.g. Least Squares, LAD, or ε-insensitive.

Least squares loss:

$\mathrm{loss}(f(x), y) = (y - f(x))^2$

Support Vector Regression (SVR)

Points in the ε-tube are treated as having no error.
Robust least absolute deviation is used outside of the tube.

ε-insensitive loss function:

$L_\varepsilon(y - f(x)) := \max(0, |y - f(x)| - \varepsilon)$

[Figure: loss plotted against y - f(x), with -ε, ε and slack ξ* marked]
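The three losses named on the Kernel Regression slide can be compared numerically. The sketch below is illustrative only; the function names and the sample residuals are assumptions, not part of the lecture.

```python
def squared_loss(y, fx):
    """Least squares loss: (y - f(x))^2."""
    return (y - fx) ** 2


def lad_loss(y, fx):
    """Least absolute deviation loss: |y - f(x)|."""
    return abs(y - fx)


def eps_insensitive_loss(y, fx, eps):
    """Epsilon-insensitive loss: zero inside the tube, LAD minus eps outside."""
    return max(0.0, abs(y - fx) - eps)


# Illustrative residuals (made up); the prediction is fixed at 0.0,
# so y plays the role of the residual y - f(x).
eps = 0.5
for y in (-2.0, -0.5, 0.0, 0.3, 1.5):
    print(y, squared_loss(y, 0.0), lad_loss(y, 0.0), eps_insensitive_loss(y, 0.0, eps))
```

Residuals with magnitude at most ε cost nothing, and larger residuals grow linearly rather than quadratically, which is the source of the robustness mentioned on the slide.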

Primal Problem with Regularization

$\min_{w,b} \; C \sum_{i=1}^{\ell} \max(0, |y_i - x_i \cdot w - b| - \varepsilon) + \frac{1}{2}\|w\|^2$

Convert to Quadratic Program:

$\min_{w,b,\xi,\xi^*} \; C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) + \frac{1}{2}\|w\|_2^2$

s.t.
$y_i - (x_i \cdot w + b) - \xi_i \le \varepsilon$
$y_i - (x_i \cdot w + b) + \xi_i^* \ge -\varepsilon$
$\xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, \ell$

Construct Dual Problem

Primal:

$\min_{r} \; f(r)$
s.t. $g_i(r) \le 0, \quad i = 1, \dots, m$

where $f: R^n \to R$ is differentiable and convex, and each $g_i: R^n \to R$ is differentiable and convex.

Dual:

$\max_{r,\alpha} \; L(r, \alpha) = f(r) + \sum_{i=1}^{m} \alpha_i g_i(r)$
s.t. $\nabla_r L(r, \alpha) = \nabla f(r) + \sum_{i=1}^{m} \alpha_i \nabla g_i(r) = 0$
$\alpha_i \ge 0, \quad i = 1, \dots, m$

Math Magic requiring only Plug and Chug

Final Regression Problem

The Dual SVR with kernel $K(x_i, x_j)$:

$\min_{\alpha,\alpha^*} \; \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*) + \varepsilon \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*)$

s.t.
$\sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0$
$C \ge \alpha_i, \alpha_i^* \ge 0, \quad i = 1, \dots, \ell$

Looks nasty but just standard Convex Quadratic Program
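Once a quadratic programming solver has returned the dual variables, predictions in standard SVR come from the kernel expansion $f(x) = \sum_i (\alpha_i - \alpha_i^*) K(x_i, x) + b$. The sketch below evaluates that expansion with an RBF kernel. It is a minimal illustration under assumptions: the kernel parameterization $\exp(-\|x - z\|^2 / (2\rho^2))$, the function names, and the toy coefficients are not taken from the slides, and no QP is solved here.

```python
import math


def rbf_kernel(x, z, rho):
    """RBF kernel; the 1/(2*rho^2) scaling is an assumed parameterization."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * rho ** 2))


def svr_predict(x, train_x, coef, b, rho):
    """Evaluate f(x) = sum_i coef_i * K(x_i, x) + b,
    where coef_i stands for (alpha_i - alpha_i^*)."""
    return sum(c * rbf_kernel(xi, x, rho) for xi, c in zip(train_x, coef)) + b


# Toy values (hypothetical): two training points and made-up dual coefficients
# that satisfy the equality constraint sum_i coef_i = 0.
train_x = [[0.0, 0.0], [1.0, 1.0]]
coef = [0.7, -0.7]
print(svr_predict([0.5, 0.5], train_x, coef, b=0.1, rho=1.0))
```

Training points with $\alpha_i = \alpha_i^*$ contribute nothing to the sum, so only the support vectors need to be stored.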

Intuition behind dual and capacity control?

Why minimize error + $\|w\|^2$?

[Figure: plot of y against x]

Regression Using SVM Classification

[Figure: plot against x showing y, y+ε and y-ε]


Regression using SVM Classification


Final Regression Function

Regularization Shrinks (Soft) Tube
(like nu-SVM, Schoelkopf et al 1998)

[Figure labels: Margin; Original Tube, 2ε; New Tube]

CACO-2 Data

Human intestinal cell line
Predicts drug absorption
27 molecules with tested permeability
718 descriptors generated:
    Electronic TAE
    Shape/Property (PEST)
    Traditional (MOE)

Electron Density-Derived TAE-wavelet Descriptors

1) Surface properties are encoded on the 0.002 e/au^3 surface.
   Breneman, C.M. and Rhem, M., J. Comp. Chem., 1997, 18(2), p. 182-197

2) Histograms or wavelet encodings of surface properties give TAE property descriptors.

[Figure: PIP (Local Ionization Potential) surface with its histograms and wavelet coefficients]

PEST-Shape Descriptors: Surface Property-Encoded Ray Tracing

[Figure: TAE internal ray reflection, low resolution scan; isosurface (portion removed) with 750 segments]

Shape-Aware Molecular Descriptors from Property/Segment-Length Distributions

Segment length and point-of-incidence value form a 2D histogram.
Each bin of the 2D histogram becomes a hybrid descriptor.

36 descriptors per hybrid length-property

[Figure: PIP vs Segment Length]

Benzodiazepine structure, TAE surface reconstruction and PEST shape/property signatures

[Figure: chemical structure with atom labels N, N, Cl, O]

Practical Issues

Overfitting/Lack of data
Feature selection
Difficult validation
Model/parameter selection
Very high model variance
No confidence in any one model

Robust SVM Methodology
Bagged feature selection via sparse linear SVM
Bagged RBF SVM for final model
Model selection via pattern search
Model mining for more information

SVM Methodology

[Flowchart; several box labels are truncated in the source: Constru…; Select…; Optimiz… Model; Select Parameters C, ε, ρ; Bag Mode…; Final Model]

Model Selection

To choose SVM model parameters:
    Objective: C; Tube: ε; RBF Kernel: ρ

Select evaluation function: $Q^2$ = (mean square error)/(true variance)

Evaluate on out-of-sample data:
    Validation set or leave-one-out

Optimize using grid search or pattern search
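The evaluation function defined above, mean square error divided by the variance of the observed responses, is a few lines of code. In this sketch the function name is an assumption, and so is the choice to divide both quantities by n (the slides do not say whether the sample or population variance is meant; with the same divisor in both terms it cancels anyway).

```python
def q_squared(y_true, y_pred):
    """Q^2 = (mean square error) / (variance of the observed values).

    Smaller is better: 0 is a perfect fit, and 1 matches a model that
    always predicts the mean of the observed values.
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    variance = sum((t - mean_y) ** 2 for t in y_true) / n
    return mse / variance


# Illustrative values only (not data from the lecture).
observed = [1.0, 2.0, 3.0]
print(q_squared(observed, [1.0, 2.0, 3.0]))  # perfect predictions
print(q_squared(observed, [2.0, 2.0, 2.0]))  # always predicting the mean
```

Because this Q2 is an error ratio, lower values indicate better out-of-sample prediction, unlike an R2-style score.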

Pattern or Direct Search

Repeat
    Evaluate neighbors in grid
    If better neighbor then go to neighbor
    Else reduce grid size
Until grid size is small enough
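The loop above can be sketched as a coordinate pattern search. This is a minimal illustration, not the lecture's implementation: the neighbor set (one step up or down in each coordinate), the halving factor, the stopping tolerance, and the toy objective are all assumptions. In the setting of the slides, the objective would be the out-of-sample Q2 as a function of the model parameters (C, ε, ρ).

```python
def pattern_search(objective, start, step, min_step=1e-3, shrink=0.5):
    """Minimize objective by moving to a better grid neighbor when one exists,
    and shrinking the grid otherwise."""
    point = list(start)
    best = objective(point)
    while step > min_step:
        improved = False
        # Neighbors: +/- step along each coordinate (an assumed pattern).
        for i in range(len(point)):
            for delta in (step, -step):
                candidate = list(point)
                candidate[i] += delta
                value = objective(candidate)
                if value < best:
                    point, best, improved = candidate, value, True
                    break
            if improved:
                break
        if not improved:
            step *= shrink  # no better neighbor: reduce grid size
    return point, best


# Toy objective with its minimum at (1, -2); it stands in for a validation error.
def toy_objective(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


print(pattern_search(toy_objective, start=[0.0, 0.0], step=1.0))
```

The search only needs objective values, not gradients, which suits a validation error that is noisy and not differentiable in the parameters. It can still stop in a local minimum, depending on the starting point.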

Boosting and Bagging

Problems:
    Out-of-sample results don’t guarantee good generalization.
    Different validation sets give different models.
    Many local minima in pattern search.

Solution = Bagging:
    Create several models.
    Average results.
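Bagging as described here (create several models, average the results) can be sketched with bootstrap resampling. To stay self-contained, the example swaps in a one-variable least-squares line as the base model in place of an SVM; that substitution, the function names, and the toy data are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (stand-in for an SVM)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate bootstrap sample: fall back to a constant model
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def bagged_models(xs, ys, n_models, seed=0):
    """Fit one model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Final prediction is the average over all bagged models."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)


# Toy data (made up), not the Caco-2 set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
models = bagged_models(xs, ys, n_models=10)
print(bagged_predict(models, 2.5))
```

Averaging reduces the variance that comes from any single choice of training and validation split, which is the problem the slide lists.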

Bagged SVM (RBF)
CACO2 - 718 Variables

[Figure: predicted RT (min) against observed RT (min), both axes from -8 to -3]

Test Q2 = .7073

Feature Selection

Using a subset of descriptors can greatly improve results.
Use your favorite selection method.

Linear SVM with 1-norm regularization

1-norm is sparse

$\|(1,0)\|_2 = \|(1,0)\|_1 = 1$

$\|(\tfrac{1}{2},\tfrac{1}{2})\|_2 = \tfrac{1}{\sqrt{2}} < \|(\tfrac{1}{2},\tfrac{1}{2})\|_1 = 1$

[Figure: the points (1,0) and (1/2,1/2)]
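The comparison between the sparse vector (1,0) and the dense vector (1/2,1/2) can be checked directly. The sketch below only evaluates the two norms; the helper names are assumptions.

```python
import math


def norm1(v):
    """1-norm: sum of absolute values."""
    return sum(abs(a) for a in v)


def norm2(v):
    """2-norm: Euclidean length."""
    return math.sqrt(sum(a * a for a in v))


sparse = (1.0, 0.0)
dense = (0.5, 0.5)
for v in (sparse, dense):
    print(v, "1-norm:", norm1(v), "2-norm:", norm2(v))
```

Both vectors have 1-norm equal to 1, but the dense vector has the smaller 2-norm. A 2-norm penalty therefore prefers spreading weight across variables, while a 1-norm penalty does not reward spreading, so it tolerates solutions with many exactly zero weights.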

Feature Selection via Sparse SVM/LP

Construct linear ν-SVM using 1-norm LP:

$\min_{w,b,z,z^*,\varepsilon} \; \|w\|_1 + C \sum_{i=1}^{\ell} (z_i + z_i^*) + C \nu \varepsilon$

s.t.
$x_i \cdot w + b - y_i + z_i \ge -\varepsilon$
$x_i \cdot w + b - y_i - z_i^* \le \varepsilon$
$z_i, z_i^*, \varepsilon \ge 0, \quad i = 1, \dots, \ell$

Pick best C, ν for SVM
Keep descriptors with nonzero coefficients: $|w_i| > 0$

Bagged Feature Selection

Partition Training Data: Training Set, Validation Set
Linear SVM Algorithm For Feature Selection
A Linear Regression Model
Repeat B times
Bag B Models and Obtain Subset of Features

Random Variable - r

Make 20 models of the form

$w \cdot x + b = w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_{718} x^{(718)} + w_{719} r + b$

with only a few $w_i \ne 0$

Bagged SVM (RBF)
CACO2 - 31 Variables

[Figure: predicted RT (min) against observed RT (min), both axes from -8 to -3]

Test Q2 = .134

Model Mining

Generate many equally valid models.
Models are data.
Mine the model data for trends.
Visualize models for chemist: chemist can interact with modeling.
Generate hypotheses from model data: descriptor rankings and interpretations

Star Plot of ABSDRN6

ABSDRN6 is most weighted … every bootstrap on average.
… molecule size.

Negatively weighted.

INTERPRETATION: Large molecules do not absorb well.

• Each radius represents the weight in one bootstrap.
• Length is magnitude of weight.

Starplot Caco2 - 31 Variables

[Figure: one star plot per descriptor: ABSDRN6, a.don, KB54, SMR.VSA2, BNP8, DRNB10, KB11, PEOE.VSA.FPPOS, ANGLEB45, PIPB53, DRNB00, PEOE.VSA.4, SlogP.VSA6, apol, ABSFUKMIN, PIPB04, PEOE.VSA.FPOL, PIPMAX, BNPB50, BNPB21, PEOE.VSA.FHYD, PEOE.VSA.PPOS, EP2, SlogP.VSA9, ABSKMIN, PEOE.VSA.FNEG, BNPB31, FUKB14, pmiZ, SIKIA, SlogP.VSA0]

Chemistry In/Out Modeling

[Flowchart labels: Data + Descriptors; Feature Selection; Visualize Features; Assess Chemistry; Construct SVM Nonlinear model; SVM Model; Test Data; Predict bioactivities; Chemistry Interpretation]

The flipped rule

To investigate the relative importance of selected descriptors and their consistency

If $w_{11} > 0$ and $w_{21} < 0$, it doesn’t make sense. So eliminate flipped variables.
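One way to read the flipped rule is that a descriptor is "flipped" when its weight is positive in some bagged linear models and negative in others. The sketch below filters such descriptors. That reading, the treatment of zero weights as neutral, the function names, and the toy weights are all assumptions.

```python
def has_consistent_sign(weights, tol=0.0):
    """True if all weights larger than tol in magnitude share one sign.
    Zero weights are ignored, since a sparse model may simply drop the variable."""
    nonzero = [w for w in weights if abs(w) > tol]
    return all(w > 0 for w in nonzero) or all(w < 0 for w in nonzero)


def drop_flipped(weight_table):
    """Keep only descriptors whose weights never change sign across models."""
    return {name: ws for name, ws in weight_table.items() if has_consistent_sign(ws)}


# Hypothetical descriptor names and weights, one weight per bootstrap model.
weight_table = {
    "desc_A": [-0.9, -0.7, -0.8],  # always negative: kept
    "desc_B": [0.5, 0.0, 0.4],     # positive or dropped: kept
    "desc_C": [0.3, -0.2, 0.1],    # sign flips: eliminated
}
print(sorted(drop_flipped(weight_table)))
```

A descriptor with a stable sign supports a single chemical interpretation, whereas a flipped one would be read as helpful in one model and harmful in another.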

Bagged SVM (RBF)
CACO2 - 15 Variables

[Figure: predicted RT (min) against observed RT (min), both axes from -8 to -3]

Test Q2 = .166

Visualization of feature selection results

To investigate the relative importance of selected descriptors and their consistency

CACO2 – 15 Variables

[Figure: one star plot per descriptor: a.don, KB54, SMR.VSA2, ANGLEB45, DRNB10, ABSDRN6, PEOE.VSA.FPPOS, DRNB00, PEOE.VSA.FNEG, ABSKMIN, SIKIA, pmiZ, BNPB31, FUKB14, SlogP.VSA0]

Star Plot of a.don

a.don is most weighted variable.
Measures number of hydrogen …

Negatively weighted.

INTERPRETATION: Molecules … of hydrogen bonds, bind well with … So will stay in solution instead of absorbing.

• Each radius represents …
• Length is … of weights.

Star Plot of SlogP.VSA0

SlogP.VSA0 is 2nd most weighted.
Reflects hydrophobicity of …

Positively weighted.

INTERPRETATION: Hydrophobic molecules absorb more easily.

Chemical Insights

Hydrophobicity - a.don
Size and Shape - ABSDRN6, SMR.VSA2, ANGLEB45, PmiZ
    Large is bad. Flat is bad. Globular is good.
Polarity - PEOE.VSA.FPPOS, PEOE.VSA.FNEG: negative partial charge good.

Correspond to conventional wisdom: rule of 5.

Hybrid TAE/SHAPE

Shape important overall factor
DRNB10, DRNB00: del rho dot N
BNP31: bare nuclear potential
KB54: kinetic energy descriptors
very large lipophilic molecules don’t work
FUKB14: Fukui Surface

Interpretations difficult
Point to chemistry challenges/hypotheses

Final SVM Approach

Construct large set of descriptors.
Perform feature selection:
    Sensitivity Analysis or SVM-LP
Construct many SVM models:
    Optimize using QP or LP
    Evaluate by Validation Set or Leave-one-out
    Select best models by grid or pattern search
Bag best 9 models to create final function

Drug Discovery Results (LOO)

Data      # Sample   # Var. Full   # Var. FS (Avg)   Q2 Full   Q2 FS
Caco2     27         713           31                0.33      0.29
Barrier   62         569           11                0.31      0.28
HIV       64         561           12                0.41      0.40
Cancer    46         362           16                0.50      0.16
LCCK      66         350           22                0.40      0.37
Aquasol   197        525           41                0.08      0.06

Conclusions

Defined robust modeling methodology for QSAR type problems.
Generates many valid models.
Mine models for additional information.
Model visualization allows “chemistry in/out”.
Can substitute your favorite feature selection/inference methodology.
Generalizable to many inference/modeling tasks.

Bagged Predictive Model

Achieve better generalization performance:
    construct a series of non-linear SVM models
    use the average of all models as the final prediction to reduce variance

Bagged SVM (RBF)
CACO2 - 718 Variables
Average of 10 Models

[Figure: predicted RT (min) against observed RT (min), both axes from -8 to -3]

Test Q2 = .7073

Q2 is MSE scaled by variance

Feature Selection

Using a subset of descriptors can greatly improve results.
Do feature selection using Linear SVM with 1-norm regularization.

Feature Selection via Sparse SVM/LP (Bi et al 2003)

Construct linear ν-SVM using 1-norm LP:

$\min_{w,b,z,z^*,\varepsilon} \; \|w\|_1 + C \sum_{i=1}^{\ell} (z_i + z_i^*) + C \nu \varepsilon$

s.t.
$x_i \cdot w + b - y_i + z_i \ge -\varepsilon$
$x_i \cdot w + b - y_i - z_i^* \le \varepsilon$
$z_i, z_i^*, \varepsilon \ge 0, \quad i = 1, \dots, \ell$

Pick best C, ν for SVM
Keep descriptors with nonzero coefficients: $|w_i| > 0$

Bagged Variable Selection

Partition Training Data: Training Set, Validation Set
Linear SVM Algorithm For Feature Selection
A Linear Regression Model
Repeat B times
Bag B Models and Obtain Subset of Features

Random Variable - r

Make 20 models of the form

$w \cdot x + b = w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_{718} x^{(718)} + w_r r + b$

with only a few $w_i \ne 0$

Keep attributes with $w_i > w_r$
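The random variable r acts as a noise baseline: a descriptor that earns no more weight than pure noise is unlikely to carry signal. The sketch below applies that idea to weights from several bagged linear models. The slide states the rule only as $w_i > w_r$; comparing mean absolute weights across models is one possible reading and is an assumption here, as are the function name and the toy weights.

```python
def select_features(models, random_index):
    """Return indices of variables whose mean |weight| across the bagged
    models exceeds that of the random variable.

    models: list of weight vectors, one per bootstrap model
    random_index: position of the random variable's weight in each vector
    """
    n_models = len(models)
    n_weights = len(models[0])
    mean_abs = [sum(abs(m[j]) for m in models) / n_models for j in range(n_weights)]
    threshold = mean_abs[random_index]
    return [j for j in range(n_weights)
            if j != random_index and mean_abs[j] > threshold]


# Hypothetical weights for three descriptors plus the random variable (last slot).
models = [
    [0.50, 0.00, 0.10, 0.05],
    [0.40, 0.02, 0.00, 0.10],
]
print(select_features(models, random_index=3))
```

With these made-up weights only the first descriptor outweighs the random variable on average, so it is the only one kept.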

Bagged Variable Selection

[Flowchart labels: DATASET; Training set; Test set; Bootstrap sample k; Training; Validation; descriptors; Random Variables; Sparse Linear SVM; Tuning / Prediction; … Reduced Data; Nonlinear SVM; Predictive Model; Prediction]

Star Plot of a.don

Measures number of hydrogen …

Negatively weighted.

INTERPRETATION: Molecules … of hydrogen bonds, bind well with … So will stay in solution instead of absorbing.

Caco-2 – 14 Features (SVM)

[Figure: one star plot per descriptor: a.don, KB54, SMR.VSA2, ANGLEB45, DRNB10, ABSDRN6, PEOE.VSA.FPPOS, DRNB00, BNPB31, FUKB14, SlogP.VSA0, PEOE.VSA.FNEG, ABSKMIN, SIKIA]

Each star represents a descriptor.
Each ray is a separate bootstrap.
The area of a star represents the relative importance of that descriptor.
Descriptors shaded cyan have a negative effect.
Unshaded ones have a positive effect.

Hydrophobicity - a.don
Size and Shape - ABSDRN6, SMR.VSA2, ANGLEB45
    Large is bad. Flat is bad. Globular is good.
Polarity - PEOE.VSA...: negative partial charge good.

Bagged SVM (RBF) Caco-2

[Figure: scatter plot with both axes running from -8 to -3]

Train Rcv2 = 0.93

Blind Test R2 = 0.83

Before feature selection R2 = .66