Application in Cheminformatics
Kristin P. Bennett
Mathematical Sciences Department, Rensselaer Polytechnic Institute
Regression Case Study
Given, for each molecule i: a descriptor vector x_i and a bioresponse y_i.
The bioresponse is a real-valued measurement.
Use SVM regression: construct a function f with f(x_i) ≈ y_i to predict the bioresponse.
Kernel Regression
Assume the function is linear:  f(x) = x·w + b
Pick a loss, e.g. least squares, least absolute deviation (LAD), or ε-insensitive.
Least squares:  loss(f(x), y) = (y − f(x))²
Support Vector Regression (SVR)
Points in the ε-tube are treated as having no error; a robust least-absolute-deviation loss is used outside the tube.
ε-insensitive loss function:
L_ε(y − f(x)) = max(0, |y − f(x)| − ε)
[Figure: L_ε plotted against y − f(x); the loss is zero between −ε and +ε and grows linearly outside, with slack ξ* measuring the excess]
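As a concrete check of the definition above, here is a minimal NumPy sketch of the ε-insensitive loss (the function and variable names are my own, not from the slides):

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps=0.1):
    """L_eps(y - f(x)) = max(0, |y - f(x)| - eps).
    Residuals inside the eps-tube incur zero loss; outside the tube the
    loss grows linearly, like a least-absolute-deviation loss."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

# Residual 0.05 lies inside the tube (zero loss); residual 0.5 lies outside.
print(eps_insensitive_loss(np.array([1.05, 1.5]), np.array([1.0, 1.0]), eps=0.1))
```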
Primal Problem with Regularization

min_{w,b}  Σ_i max(0, |y_i − x_i·w − b| − ε) + ½||w||²

Convert to a quadratic program:

min_{w,b,ξ,ξ*}  C Σ_i (ξ_i + ξ_i*) + ½||w||²
s.t.  y_i − (x_i·w + b) − ξ_i ≤ ε
      y_i − (x_i·w + b) + ξ_i* ≥ −ε
      ξ_i, ξ_i* ≥ 0,   i = 1, …, ℓ
Construct Dual Problem
Primal:
min_r  f(r)          f: Rⁿ → R differentiable and convex
s.t.   g_i(r) ≤ 0    g_i: Rⁿ → R differentiable and convex,  i = 1, …, m

Dual:
max_{r,α}  L(r, α) = f(r) + Σ_i α_i g_i(r)
s.t.  ∇_r L(r, α) = ∇f(r) + Σ_i α_i ∇g_i(r) = 0
      α_i ≥ 0,  i = 1, …, m
Math Magic requiring only Plug and Chug
Final Regression Problem: The Dual SVR with Kernel

min_{α,α*}  ½ Σ_i Σ_j (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) − Σ_i y_i (α_i − α_i*) + ε Σ_i (α_i + α_i*)
s.t.  Σ_i (α_i − α_i*) = 0
      C ≥ α_i, α_i* ≥ 0,  i = 1, …, ℓ
Looks nasty but just standard Convex Quadratic Program
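In practice this dual QP is handled by off-the-shelf solvers. A hedged sketch using scikit-learn's SVR, which solves this same kernelized dual internally, with C, ε, and the RBF kernel parameter as above (the toy data and parameter values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.svm import SVR

# Toy regression data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# SVR solves the kernelized dual QP: C weights the slack terms,
# epsilon is the tube half-width, gamma parameterizes the RBF kernel.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
pred = model.predict(np.array([[0.0]]))  # should be near sin(0) = 0
```

The fitted model is a kernel expansion over the support vectors, i.e. exactly the points with nonzero (α_i − α_i*) in the dual above.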
Intuition behind the dual and capacity control?
Why minimize error + ||w||²?
[Figure: data points (x, y) with a fitted regression line]
Regression Using SVM Classification
[Figure: the data (x, y) with lines y + ε and y − ε; shifting the responses up and down by ε creates two classes, and separating them yields the regression function]
Final Regression Function
Regularization Shrinks the (Soft) Tube (like ν-SVM, Schölkopf et al. 1998)
[Figure: the original tube of width 2ε, the new shrunken tube, and the margin]
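A short sketch of this soft-tube idea using scikit-learn's NuSVR, where the parameter ν replaces a fixed ε and the tube width is set by the optimization itself (toy data and parameter values are my own):

```python
import numpy as np
from sklearn.svm import NuSVR

# Toy data: a noisy line (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = X.ravel() + 0.1 * rng.normal(size=150)

# In nu-SVR the tube width epsilon is not fixed in advance: nu bounds the
# fraction of points allowed outside the tube, and the optimization shrinks
# or widens the tube accordingly.
model = NuSVR(kernel="rbf", C=10.0, nu=0.3, gamma=0.5).fit(X, y)
frac_sv = model.support_.size / len(y)  # fraction of support vectors
```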
CACO-2 Data
Human intestinal cell line; predicts drug absorption.
27 molecules with tested permeability; 718 descriptors generated:
Electronic TAE
Shape/Property (PEST)
Traditional (MOE)
Electron Density-Derived TAE-Wavelet Descriptors
1) Surface properties are encoded on the 0.002 e/au³ isosurface (Breneman, C.M. and Rhem, M., J. Comp. Chem., 1997, 18(2), pp. 182-197).
2) Histogram or wavelet encodings of the surface properties give the TAE property descriptors.
[Figure: PIP (Local Ionization Potential) surface, encoded as histograms and as wavelet coefficients]
PEST-Shape Descriptors: Surface Property-Encoded Ray Tracing
TAE Internal Ray Reflection - low resolution scan
Isosurface (portion removed) with 750 segments
RENSSELAER
Shape-Aware Molecular Descriptors from Property/Segment-Length Distributions
Segment length and point-of-incidence property value form a 2D histogram. Each bin of the 2D histogram becomes a hybrid descriptor: 36 descriptors per hybrid length-property pair.
PIP vs Segment Length
Benzodiazepine structure, TAE surface reconstruction, and PEST shape/property signatures
[Figure: benzodiazepine molecular structure with N, Cl, and O atoms labeled]
Practical Issues
Overfitting / lack of data
Feature selection
Difficult validation
Model/parameter selection
Very high model variance: no confidence in any one model

Robust SVM Methodology
Bagged feature selection via sparse linear SVM
Bagged RBF SVM for the final model
Model selection via pattern search
Model mining for more information
SVM Methodology
[Flowchart: Select parameters C, ε, ρ → Construct/optimize model → Bag models → Final model]
Model Selection
To choose the SVM model parameters (objective: C; tube: ε; RBF kernel: ρ):
Select an evaluation function: Q² = (mean square error)/(true variance)
Evaluate on out-of-sample data: validation set or leave-one-out
Optimize using grid search or pattern search
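A minimal implementation of the Q² criterion above, assuming "true variance" means the variance of the observed responses (lower is better; predicting the mean gives Q² = 1):

```python
import numpy as np

def q2(y_true, y_pred):
    """Model-selection criterion from the slides:
    Q^2 = (mean square error) / (variance of the true responses)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Predicting the mean of y gives Q^2 = 1 exactly.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(q2(y, np.full(4, y.mean())))  # → 1.0
```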
Pattern or Direct Search
Repeat:
  Evaluate the neighbors in the grid.
  If a better neighbor exists, go to that neighbor.
  Else reduce the grid size.
Until the grid size is small enough.
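The loop above can be sketched as a small derivative-free optimizer (a generic pattern search over a shrinking grid, not the authors' exact implementation):

```python
import numpy as np

def pattern_search(f, x0, step=1.0, min_step=1e-3):
    """Derivative-free pattern (direct) search: evaluate grid neighbors,
    move to a better one if found, otherwise halve the grid size, and stop
    once the grid is small enough."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    while step > min_step:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = x.copy()
                cand[i] += delta
                fc = f(cand)
                if fc < fx:             # better neighbor: move there
                    x, fx, improved = cand, fc, True
        if not improved:                # no better neighbor: shrink grid
            step /= 2.0
    return x, fx

# Minimize a simple quadratic with its true minimum at (1, -2).
x_best, f_best = pattern_search(lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2,
                                [0.0, 0.0])
```

In the methodology above, f would be the Q² of a model trained with parameters (C, ε, ρ) and evaluated on held-out data.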
Boosting and Bagging
Problems: Out-of-sample results don't guarantee good generalization. Different validation sets give different models. Pattern search has many local minima.
Solution = bagging: create several models and average their results.
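A hedged sketch of the bagging step with scikit-learn: bootstrap the training data, fit one RBF SVR per sample, and average the predictions to reduce variance (the data and parameter values are illustrative, not the slides' settings):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.utils import resample

# Toy data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=120)

# Fit one SVR per bootstrap sample of the training data.
models = []
for b in range(10):
    Xb, yb = resample(X, y, random_state=b)
    models.append(SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(Xb, yb))

# The bagged prediction is the average over the individual models.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
bagged_pred = np.mean([m.predict(X_test) for m in models], axis=0)
```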
Bagged SVM (RBF), CACO2 - 718 Variables
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .7073
Feature Selection
Using a subset of the descriptors can greatly improve results. Use your favorite selection method; here, a linear SVM with 1-norm regularization.
The 1-norm is sparse. Compare the sparse point (1, 0) with the dense point (½, ½):
||(1,0)||₂ = ||(1,0)||₁ = 1
||(½,½)||₂ = 1/√2 < ||(½,½)||₁ = 1
The 2-norm prefers to spread weight across features, while under the 1-norm the sparse solution is equally cheap.
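The norm comparison can be verified directly:

```python
import numpy as np

# The slide's example points: a sparse vector and a dense vector.
sparse = np.array([1.0, 0.0])
dense = np.array([0.5, 0.5])

# 1-norm: both cost 1, so the sparse solution is equally cheap.
print(np.linalg.norm(sparse, 1), np.linalg.norm(dense, 1))

# 2-norm: the dense vector is cheaper (1/sqrt(2) < 1), so 2-norm
# regularization spreads weight across features, while 1-norm
# regularization permits sparse solutions.
print(np.linalg.norm(sparse, 2), np.linalg.norm(dense, 2))
```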
Feature Selection via Sparse SVM/LP
Construct a linear ν-SVM using the 1-norm LP:
Pick the best C, ν for the SVM; keep the descriptors with nonzero coefficients.
min_{w,b,z,z*,ε}  C Σ_i (z_i + z_i*) + Cνε + ||w||₁
s.t.  x_i·w + b − y_i + z_i ≥ −ε
      x_i·w + b − y_i − z_i* ≤ ε
      z_i, z_i*, ε ≥ 0,   i = 1, …, ℓ

Keep the descriptors with |w_i| > 0.
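scikit-learn has no 1-norm ε-insensitive SVR, so as an illustrative stand-in this sketch uses Lasso, whose L1 penalty likewise drives most coefficients to exactly zero; the selection rule (keep descriptors with nonzero weight) is the one from the slide. Data and parameter values are my own:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 20 descriptors, of which only two (0 and 5) are informative.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=100)

# Sparse linear model: the L1 penalty zeroes the uninformative weights.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(model.coef_) > 1e-8)
print(selected)
```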
Bagged Feature Selection
Partition the training data into a training set and a validation set.
Repeat B times: run the linear SVM algorithm for feature selection, yielding a linear regression model.
Bag the B models and obtain a subset of features.
Add a random variable r as an extra descriptor:
Make 20 models of the form  w·x + b = w₁x⁽¹⁾ + w₂x⁽²⁾ + … + w₇₁₈x⁽⁷¹⁸⁾ + w₇₁₉r + b
with only a few w_i ≠ 0.
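A sketch of the random-probe idea on toy data: append a pure-noise descriptor r, fit a sparse linear model on each bootstrap sample, and keep the features that outweigh the probe (Lasso stands in for the sparse linear SVM; all names, sizes, and thresholds here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 10 descriptors, only feature 2 is informative.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 10))
y = 2.0 * X[:, 2] + 0.1 * rng.normal(size=80)

# Bagged selection with a random probe: on each bootstrap, append a
# pure-noise column r, fit a sparse model, and count how often each
# real descriptor outweighs the probe (|w_i| > |w_r|).
B, counts = 20, np.zeros(10)
for b in range(B):
    idx = rng.integers(0, len(y), len(y))                    # bootstrap sample
    Xb = np.hstack([X[idx], rng.normal(size=(len(y), 1))])   # add probe r
    coef = Lasso(alpha=0.05).fit(Xb, y[idx]).coef_
    counts += np.abs(coef[:-1]) > np.abs(coef[-1])

# Keep descriptors that beat the probe in most bootstrap models.
selected = np.flatnonzero(counts >= 0.8 * B)
```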
Bagged SVM (RBF), CACO2 - 31 Variables
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .134
Model Mining
Generate many equally valid models. Models are data: mine the model data for trends.
Visualize the models for the chemist, so the chemist can interact with the modeling.
Generate hypotheses from the model data: descriptor rankings and interpretations.
Star Plot of ABSDRN6
ABSDRN6 is the most heavily weighted descriptor in every bootstrap, on average; it reflects molecule size.
Negatively weighted.
INTERPRETATION: Large molecules do not absorb well.
Each radius represents the weight in one bootstrap model; its length is the magnitude of the weight.
Star Plot: Caco2 - 31 Variables
ABSDRN6
a.don
KB54
SMR.VSA2
BNP8
DRNB10
KB11
PEOE.VSA.FPPOS
ANGLEB45
PIPB53
DRNB00
PEOE.VSA.4
SlogP.VSA6
apol
ABSFUKMIN
PIPB04
PEOE.VSA.FPOL
PIPMAX
BNPB50
BNPB21
PEOE.VSA.FHYD
PEOE.VSA.PPOS
EP2
SlogP.VSA9
ABSKMIN
PEOE.VSA.FNEG
BNPB31
FUKB14
pmiZ
SIKIA
SlogP.VSA0
Chemistry In/Out Modeling
[Flowchart: data + descriptors → feature selection → visualize features → assess chemistry → construct nonlinear SVM model → SVM model → test data → predict bioactivities → chemistry interpretation]
The Flipped Rule
To investigate the relative importance of the selected descriptors and their consistency:
If a descriptor's weight changes sign across models (e.g., w₁ > 0 in one model but w₁ < 0 in another), the descriptor doesn't make sense, so eliminate the flipped variables.
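The flipped rule can be coded directly; this small sketch (my own naming) drops any descriptor whose bagged weights change sign:

```python
import numpy as np

def drop_flipped(weight_matrix):
    """Keep only descriptors whose weights do not change sign across the
    bagged models; a descriptor that is positive in some models and
    negative in others has no consistent interpretation."""
    W = np.asarray(weight_matrix, dtype=float)  # shape (B models, d features)
    flipped = (W.max(axis=0) > 0) & (W.min(axis=0) < 0)
    return np.flatnonzero(~flipped)

# Feature 1 flips sign across the two models, so only 0 and 2 survive.
W = np.array([[0.5, 0.3, -0.2],
              [0.6, -0.4, -0.1]])
print(drop_flipped(W))  # → [0 2]
```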
Bagged SVM (RBF), CACO2 - 15 Variables
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .166
Visualization of Feature Selection Results
To investigate the relative importance of the selected descriptors and their consistency.
CACO2 - 15 Variables
a.don
KB54
SMR.VSA2
ANGLEB45
DRNB10
ABSDRN6
PEOE.VSA.FPPOS
DRNB00
PEOE.VSA.FNEG
ABSKMIN
SIKIA
pmiZ
BNPB31
FUKB14
SlogP.VSA0
Star Plot of a.don
a.don is the most heavily weighted variable; it measures the number of hydrogen bond donors.
Negatively weighted.
INTERPRETATION: Molecules with many hydrogen bond donors bind well with water, so they stay in solution instead of being absorbed.
Each radius represents one bootstrap model; its length is the magnitude of the weight.
Star Plot of SlogP.VSA0
SlogP.VSA0 is the 2nd most heavily weighted descriptor; it reflects the hydrophobicity of the molecule.
Positively weighted.
INTERPRETATION: Hydrophobic molecules absorb more easily
Chemical Insights
Hydrophobicity: a.don
Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45, pmiZ. Large is bad. Flat is bad. Globular is good.
Polarity: PEOE.VSA.FPPOS, PEOE.VSA.FNEG. Negative partial charge is good.
These correspond to conventional wisdom - the rule of 5.

Hybrid TAE/SHAPE
Shape is an important overall factor.
DRNB10, DRNB00: del rho dot N
BNP31: bare nuclear potential
KB54: kinetic energy descriptors - very large lipophilic molecules don't work
FUKB14: Fukui surface
Interpretations are difficult; they point to chemistry challenges/hypotheses.
Final SVM Approach
Construct a large set of descriptors.
Perform feature selection: sensitivity analysis or SVM-LP.
Construct many SVM models: optimize using QP or LP; evaluate by validation set or leave-one-out; select the best models by grid or pattern search.
Bag the best 9 models to create the final function.
Drug Discovery Results (LOO)

Data      # Sample   # Var. Full   # Var. FS (Avg)   Q² Full   Q² FS
Caco2        27         713             31            0.33     0.29
Barrier      62         569             11            0.31     0.28
HIV          64         561             12            0.41     0.40
Cancer       46         362             16            0.50     0.16
LCCK         66         350             22            0.40     0.37
Aquasol     197         525             41            0.08     0.06
Conclusions
Defined a robust modeling methodology for QSAR-type problems.
Generates many valid models; mine the models for additional information.
Model visualization allows "chemistry in/out".
You can substitute your favorite feature selection/inference methodology.
Generalizable to many inference/modeling tasks.
Bagged Predictive Model
Achieves better generalization performance:
Construct a series of nonlinear SVM models, and use the average of all models as the final prediction to reduce variance.
Bagged SVM (RBF), CACO2 - 718 Variables, Average of 10 Models
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .7073   (Q² is the MSE scaled by the variance)
Feature Selection
Using a subset of the descriptors can greatly improve results. Do feature selection using a linear SVM with 1-norm regularization.
Feature Selection via Sparse SVM/LP (Bi et al 2003)
Construct a linear ν-SVM using the 1-norm LP:
Pick the best C, ν for the SVM; keep the descriptors with nonzero coefficients.
min_{w,b,z,z*,ε}  C Σ_i (z_i + z_i*) + Cνε + ||w||₁
s.t.  x_i·w + b − y_i + z_i ≥ −ε
      x_i·w + b − y_i − z_i* ≤ ε
      z_i, z_i*, ε ≥ 0,   i = 1, …, ℓ

Keep the descriptors with |w_i| > 0.
Bagged Variable Selection
Partition the training data into a training set and a validation set.
Repeat B times: run the linear SVM algorithm for feature selection, yielding a linear regression model.
Bag the B models and obtain a subset of features.
Add a random variable r as an extra descriptor:
Make 20 models of the form  w·x + b = w₁x⁽¹⁾ + w₂x⁽²⁾ + … + w₇₁₈x⁽⁷¹⁸⁾ + w_r r + b
with only a few w_i ≠ 0. Keep attributes with |w_i| > |w_r|.
Bagged Variable Selection
[Flowchart: the dataset is split into a test set and a training set; each bootstrap sample k splits the training set into training and validation parts; a sparse linear SVM, with random variables added to the descriptors, produces reduced data; a nonlinear SVM is tuned on the reduced data to give the predictive model, which generates predictions on the test set]
Star Plot of a.don
a.don measures the number of hydrogen bond donors.
Negatively weighted.
INTERPRETATION: Molecules with many hydrogen bond donors bind well with water, so they stay in solution instead of being absorbed.
Caco-2 – 14 Features (SVM)
a.don
KB54
SMR.VSA2
ANGLEB45
DRNB10
ABSDRN6
PEOE.VSA.FPPOS
DRNB00
Each star represents a descriptor
Each ray is a separate bootstrap
The area of a star represents the relative importance of that descriptor
Descriptors shaded cyan have a negative effect
Unshaded ones have a positive effect
BNPB31
FUKB14
SlogP.VSA0
PEOE.VSA.FNEG
ABSKMIN
SIKIA
Hydrophobicity: a.don
Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45. Large is bad. Flat is bad. Globular is good.
Polarity: PEOE.VSA...: negative partial charge is good.
Bagged SVM (RBF), Caco-2
[Scatter plot: predicted vs. observed values, both axes from -8 to -3]
Train R²(cv) = 0.93;  blind test R² = 0.83  (before feature selection: R² = .66)