Application in Cheminformatics
Kristin P. Bennett
Mathematical Sciences Department, Rensselaer Polytechnic Institute
Regression Case Study
Given, for each molecule i: a descriptor vector x_i and a bioresponse y_i.
The bioresponse is a real-valued measurement.
Use SVM regression: construct a function f with f(x_i) ≈ y_i to predict the bioresponse.
Kernel Regression
Assume the function is linear:  f(x) = x·w + b
Pick a loss, e.g. least squares, least absolute deviation (LAD), or ε-insensitive.
Least squares:  loss(f(x), y) = (y − f(x))²
Support Vector Regression (SVR)
Points in the ε-tube are treated as having no error; a robust least-absolute-deviation loss is used outside the tube.
ε-insensitive loss function:
L_ε(y − f(x)) = max(0, |y − f(x)| − ε)
[Figure: L_ε plotted against y − f(x); the loss is zero between −ε and +ε and grows linearly outside, with slack ξ* measuring the excess]
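As a concrete check of the definition above, here is a minimal NumPy sketch of the ε-insensitive loss (the function and variable names are my own, not from the slides):

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps=0.1):
    """L_eps(y - f(x)) = max(0, |y - f(x)| - eps).
    Residuals inside the eps-tube incur zero loss; outside the tube the
    loss grows linearly, like a least-absolute-deviation loss."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

# Residual 0.05 lies inside the tube (zero loss); residual 0.5 lies outside.
print(eps_insensitive_loss(np.array([1.05, 1.5]), np.array([1.0, 1.0]), eps=0.1))
```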
Primal Problem with Regularization

min_{w,b}  Σ_i max(0, |y_i − x_i·w − b| − ε) + ½||w||²

Convert to a quadratic program:

min_{w,b,ξ,ξ*}  C Σ_i (ξ_i + ξ_i*) + ½||w||²
s.t.  y_i − (x_i·w + b) − ξ_i ≤ ε
      y_i − (x_i·w + b) + ξ_i* ≥ −ε
      ξ_i, ξ_i* ≥ 0,   i = 1, …, ℓ
Construct Dual Problem
Primal:
min_r  f(r)          f: Rⁿ → R differentiable and convex
s.t.   g_i(r) ≤ 0    g_i: Rⁿ → R differentiable and convex,  i = 1, …, m

Dual:
max_{r,α}  L(r, α) = f(r) + Σ_i α_i g_i(r)
s.t.  ∇_r L(r, α) = ∇f(r) + Σ_i α_i ∇g_i(r) = 0
      α_i ≥ 0,  i = 1, …, m
Math Magic requiring only Plug and Chug
Final Regression Problem: The Dual SVR with Kernel

min_{α,α*}  ½ Σ_i Σ_j (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) − Σ_i y_i (α_i − α_i*) + ε Σ_i (α_i + α_i*)
s.t.  Σ_i (α_i − α_i*) = 0
      C ≥ α_i, α_i* ≥ 0,  i = 1, …, ℓ
Looks nasty but just standard Convex Quadratic Program
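In practice this dual QP is handled by off-the-shelf solvers. A hedged sketch using scikit-learn's SVR, which solves this same kernelized dual internally, with C, ε, and the RBF kernel parameter as above (the toy data and parameter values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.svm import SVR

# Toy regression data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# SVR solves the kernelized dual QP: C weights the slack terms,
# epsilon is the tube half-width, gamma parameterizes the RBF kernel.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
pred = model.predict(np.array([[0.0]]))  # should be near sin(0) = 0
```

The fitted model is a kernel expansion over the support vectors, i.e. exactly the points with nonzero (α_i − α_i*) in the dual above.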
Intuition behind the dual and capacity control?
Why minimize error + ||w||²?
[Figure: data points (x, y) with a fitted regression line]
Regression Using SVM Classification
[Figure: the data (x, y) with lines y + ε and y − ε; shifting the responses up and down by ε creates two classes, and separating them yields the regression function]
Final Regression Function
Regularization Shrinks the (Soft) Tube (like ν-SVM, Schölkopf et al. 1998)
[Figure: the original tube of width 2ε, the new shrunken tube, and the margin]
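A short sketch of this soft-tube idea using scikit-learn's NuSVR, where the parameter ν replaces a fixed ε and the tube width is set by the optimization itself (toy data and parameter values are my own):

```python
import numpy as np
from sklearn.svm import NuSVR

# Toy data: a noisy line (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = X.ravel() + 0.1 * rng.normal(size=150)

# In nu-SVR the tube width epsilon is not fixed in advance: nu bounds the
# fraction of points allowed outside the tube, and the optimization shrinks
# or widens the tube accordingly.
model = NuSVR(kernel="rbf", C=10.0, nu=0.3, gamma=0.5).fit(X, y)
frac_sv = model.support_.size / len(y)  # fraction of support vectors
```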
CACO-2 Data
Human intestinal cell line; predicts drug absorption.
27 molecules with tested permeability; 718 descriptors generated:
Electronic TAE
Shape/Property (PEST)
Traditional (MOE)
Electron Density-Derived TAE-Wavelet Descriptors
1) Surface properties are encoded on the 0.002 e/au³ isosurface (Breneman, C.M. and Rhem, M., J. Comp. Chem., 1997, 18(2), pp. 182-197).
2) Histogram or wavelet encodings of the surface properties give the TAE property descriptors.
[Figure: PIP (Local Ionization Potential) surface, encoded as histograms and as wavelet coefficients]
PEST-Shape Descriptors: Surface Property-Encoded Ray Tracing
TAE Internal Ray Reflection - low resolution scan
Isosurface (portion removed) with 750 segments
RENSSELAER
Shape-Aware Molecular Descriptors from Property/Segment-Length Distributions
Segment length and point-of-incidence property value form a 2D histogram. Each bin of the 2D histogram becomes a hybrid descriptor: 36 descriptors per hybrid length-property pair.
PIP vs Segment Length
Benzodiazepine structure, TAE surface reconstruction, and PEST shape/property signatures
[Figure: benzodiazepine molecular structure with N, Cl, and O atoms labeled]
Practical Issues
Overfitting / lack of data
Feature selection
Difficult validation
Model/parameter selection
Very high model variance: no confidence in any one model

Robust SVM Methodology
Bagged feature selection via sparse linear SVM
Bagged RBF SVM for the final model
Model selection via pattern search
Model mining for more information
SVM Methodology
[Flowchart: Select parameters C, ε, ρ → Construct/optimize model → Bag models → Final model]
Model Selection
To choose the SVM model parameters (objective: C; tube: ε; RBF kernel: ρ):
Select an evaluation function: Q² = (mean square error)/(true variance)
Evaluate on out-of-sample data: validation set or leave-one-out
Optimize using grid search or pattern search
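A minimal implementation of the Q² criterion above, assuming "true variance" means the variance of the observed responses (lower is better; predicting the mean gives Q² = 1):

```python
import numpy as np

def q2(y_true, y_pred):
    """Model-selection criterion from the slides:
    Q^2 = (mean square error) / (variance of the true responses)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Predicting the mean of y gives Q^2 = 1 exactly.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(q2(y, np.full(4, y.mean())))  # → 1.0
```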
Pattern or Direct Search
Repeat:
  Evaluate the neighbors in the grid.
  If a better neighbor exists, go to that neighbor.
  Else reduce the grid size.
Until the grid size is small enough.
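The loop above can be sketched as a small derivative-free optimizer (a generic pattern search over a shrinking grid, not the authors' exact implementation):

```python
import numpy as np

def pattern_search(f, x0, step=1.0, min_step=1e-3):
    """Derivative-free pattern (direct) search: evaluate grid neighbors,
    move to a better one if found, otherwise halve the grid size, and stop
    once the grid is small enough."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    while step > min_step:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = x.copy()
                cand[i] += delta
                fc = f(cand)
                if fc < fx:             # better neighbor: move there
                    x, fx, improved = cand, fc, True
        if not improved:                # no better neighbor: shrink grid
            step /= 2.0
    return x, fx

# Minimize a simple quadratic with its true minimum at (1, -2).
x_best, f_best = pattern_search(lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2,
                                [0.0, 0.0])
```

In the methodology above, f would be the Q² of a model trained with parameters (C, ε, ρ) and evaluated on held-out data.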
Boosting and Bagging
Problems: Out-of-sample results don't guarantee good generalization. Different validation sets give different models. Pattern search has many local minima.
Solution = bagging: create several models and average their results.
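A hedged sketch of the bagging step with scikit-learn: bootstrap the training data, fit one RBF SVR per sample, and average the predictions to reduce variance (the data and parameter values are illustrative, not the slides' settings):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.utils import resample

# Toy data: a noisy sine curve (illustrative only).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=120)

# Fit one SVR per bootstrap sample of the training data.
models = []
for b in range(10):
    Xb, yb = resample(X, y, random_state=b)
    models.append(SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(Xb, yb))

# The bagged prediction is the average over the individual models.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
bagged_pred = np.mean([m.predict(X_test) for m in models], axis=0)
```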
Bagged SVM (RBF), CACO2 - 718 Variables
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .7073
Feature Selection
Using a subset of the descriptors can greatly improve results. Use your favorite selection method; here, a linear SVM with 1-norm regularization.
The 1-norm is sparse. Compare the sparse point (1, 0) with the dense point (½, ½):
||(1,0)||₂ = ||(1,0)||₁ = 1
||(½,½)||₂ = 1/√2 < ||(½,½)||₁ = 1
The 2-norm prefers to spread weight across features, while under the 1-norm the sparse solution is equally cheap.
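The norm comparison can be verified directly:

```python
import numpy as np

# The slide's example points: a sparse vector and a dense vector.
sparse = np.array([1.0, 0.0])
dense = np.array([0.5, 0.5])

# 1-norm: both cost 1, so the sparse solution is equally cheap.
print(np.linalg.norm(sparse, 1), np.linalg.norm(dense, 1))

# 2-norm: the dense vector is cheaper (1/sqrt(2) < 1), so 2-norm
# regularization spreads weight across features, while 1-norm
# regularization permits sparse solutions.
print(np.linalg.norm(sparse, 2), np.linalg.norm(dense, 2))
```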
Feature Selection via Sparse SVM/LP
Construct a linear ν-SVM using the 1-norm LP:
Pick the best C, ν for the SVM; keep the descriptors with nonzero coefficients.
min_{w,b,z,z*,ε}  C Σ_i (z_i + z_i*) + Cνε + ||w||₁
s.t.  x_i·w + b − y_i + z_i ≥ −ε
      x_i·w + b − y_i − z_i* ≤ ε
      z_i, z_i*, ε ≥ 0,   i = 1, …, ℓ

Keep the descriptors with |w_i| > 0.
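scikit-learn has no 1-norm ε-insensitive SVR, so as an illustrative stand-in this sketch uses Lasso, whose L1 penalty likewise drives most coefficients to exactly zero; the selection rule (keep descriptors with nonzero weight) is the one from the slide. Data and parameter values are my own:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 20 descriptors, of which only two (0 and 5) are informative.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=100)

# Sparse linear model: the L1 penalty zeroes the uninformative weights.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(model.coef_) > 1e-8)
print(selected)
```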
Bagged Feature Selection
Partition the training data into a training set and a validation set.
Repeat B times: run the linear SVM algorithm for feature selection, yielding a linear regression model.
Bag the B models and obtain a subset of features.
Add a random variable r as an extra descriptor:
Make 20 models of the form  w·x + b = w₁x⁽¹⁾ + w₂x⁽²⁾ + … + w₇₁₈x⁽⁷¹⁸⁾ + w₇₁₉r + b
with only a few w_i ≠ 0.
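A sketch of the random-probe idea on toy data: append a pure-noise descriptor r, fit a sparse linear model on each bootstrap sample, and keep the features that outweigh the probe (Lasso stands in for the sparse linear SVM; all names, sizes, and thresholds here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 10 descriptors, only feature 2 is informative.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 10))
y = 2.0 * X[:, 2] + 0.1 * rng.normal(size=80)

# Bagged selection with a random probe: on each bootstrap, append a
# pure-noise column r, fit a sparse model, and count how often each
# real descriptor outweighs the probe (|w_i| > |w_r|).
B, counts = 20, np.zeros(10)
for b in range(B):
    idx = rng.integers(0, len(y), len(y))                    # bootstrap sample
    Xb = np.hstack([X[idx], rng.normal(size=(len(y), 1))])   # add probe r
    coef = Lasso(alpha=0.05).fit(Xb, y[idx]).coef_
    counts += np.abs(coef[:-1]) > np.abs(coef[-1])

# Keep descriptors that beat the probe in most bootstrap models.
selected = np.flatnonzero(counts >= 0.8 * B)
```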
Bagged SVM (RBF), CACO2 - 31 Variables
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .134
Model Mining
Generate many equally valid models. Models are data: mine the model data for trends.
Visualize the models for the chemist, so the chemist can interact with the modeling.
Generate hypotheses from the model data: descriptor rankings and interpretations.
Star Plot of ABSDRN6
ABSDRN6 is the most heavily weighted descriptor in every bootstrap, on average; it reflects molecule size.
Negatively weighted.
INTERPRETATION: Large molecules do not absorb well.
Each radius represents the weight in one bootstrap model; its length is the magnitude of the weight.
Star Plot: Caco2 - 31 Variables
ABSDRN6
a.don
KB54
SMR.VSA2
BNP8
DRNB10
KB11
PEOE.VSA.FPPOS
ANGLEB45
PIPB53
DRNB00
PEOE.VSA.4
SlogP.VSA6
apol
ABSFUKMIN
PIPB04
PEOE.VSA.FPOL
PIPMAX
BNPB50
BNPB21
PEOE.VSA.FHYD
PEOE.VSA.PPOS
EP2
SlogP.VSA9
ABSKMIN
PEOE.VSA.FNEG
BNPB31
FUKB14
pmiZ
SIKIA
SlogP.VSA0
Chemistry In/Out Modeling
[Flowchart: data + descriptors → feature selection → visualize features → assess chemistry → construct nonlinear SVM model → SVM model → test data → predict bioactivities → chemistry interpretation]
The Flipped Rule
To investigate the relative importance of the selected descriptors and their consistency:
If a descriptor's weight changes sign across models (e.g., w₁ > 0 in one model but w₁ < 0 in another), the descriptor doesn't make sense, so eliminate the flipped variables.
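The flipped rule can be coded directly; this small sketch (my own naming) drops any descriptor whose bagged weights change sign:

```python
import numpy as np

def drop_flipped(weight_matrix):
    """Keep only descriptors whose weights do not change sign across the
    bagged models; a descriptor that is positive in some models and
    negative in others has no consistent interpretation."""
    W = np.asarray(weight_matrix, dtype=float)  # shape (B models, d features)
    flipped = (W.max(axis=0) > 0) & (W.min(axis=0) < 0)
    return np.flatnonzero(~flipped)

# Feature 1 flips sign across the two models, so only 0 and 2 survive.
W = np.array([[0.5, 0.3, -0.2],
              [0.6, -0.4, -0.1]])
print(drop_flipped(W))  # → [0 2]
```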
Bagged SVM (RBF), CACO2 - 15 Variables
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .166
Visualization of Feature Selection Results
To investigate the relative importance of the selected descriptors and their consistency.
CACO2 - 15 Variables
a.don
KB54
SMR.VSA2
ANGLEB45
DRNB10
ABSDRN6
PEOE.VSA.FPPOS
DRNB00
PEOE.VSA.FNEG
ABSKMIN
SIKIA
pmiZ
BNPB31
FUKB14
SlogP.VSA0
Star Plot of a.don
a.don is the most heavily weighted variable; it measures the number of hydrogen bond donors.
Negatively weighted.
INTERPRETATION: Molecules with many hydrogen bond donors bind well with water, so they stay in solution instead of being absorbed.
Each radius represents one bootstrap model; its length is the magnitude of the weight.
Star Plot of SlogP.VSA0
SlogP.VSA0 is the 2nd most heavily weighted descriptor; it reflects the hydrophobicity of the molecule.
Positively weighted.
INTERPRETATION: Hydrophobic molecules absorb more easily
Chemical Insights
Hydrophobicity: a.don
Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45, pmiZ. Large is bad. Flat is bad. Globular is good.
Polarity: PEOE.VSA.FPPOS, PEOE.VSA.FNEG. Negative partial charge is good.
These correspond to conventional wisdom - the rule of 5.

Hybrid TAE/SHAPE
Shape is an important overall factor.
DRNB10, DRNB00: del rho dot N
BNP31: bare nuclear potential
KB54: kinetic energy descriptors - very large lipophilic molecules don't work
FUKB14: Fukui surface
Interpretations are difficult; they point to chemistry challenges/hypotheses.
Final SVM Approach
Construct a large set of descriptors.
Perform feature selection: sensitivity analysis or SVM-LP.
Construct many SVM models: optimize using QP or LP; evaluate by validation set or leave-one-out; select the best models by grid or pattern search.
Bag the best 9 models to create the final function.
Drug Discovery Results (LOO)

Data      # Sample   # Var. Full   # Var. FS (Avg)   Q² Full   Q² FS
Caco2        27         713             31            0.33     0.29
Barrier      62         569             11            0.31     0.28
HIV          64         561             12            0.41     0.40
Cancer       46         362             16            0.50     0.16
LCCK         66         350             22            0.40     0.37
Aquasol     197         525             41            0.08     0.06
Conclusions
Defined a robust modeling methodology for QSAR-type problems.
Generates many valid models; mine the models for additional information.
Model visualization allows "chemistry in/out".
You can substitute your favorite feature selection/inference methodology.
Generalizable to many inference/modeling tasks.
Bagged Predictive Model
Achieves better generalization performance:
Construct a series of nonlinear SVM models, and use the average of all models as the final prediction to reduce variance.
Bagged SVM (RBF), CACO2 - 718 Variables, Average of 10 Models
[Scatter plot: Predicted RT (min) vs. Observed RT (min), both axes from -8 to -3]
Test Q² = .7073   (Q² is the MSE scaled by the variance)
Feature Selection
Using a subset of the descriptors can greatly improve results. Do feature selection using a linear SVM with 1-norm regularization.
Feature Selection via Sparse SVM/LP (Bi et al 2003)
Construct a linear ν-SVM using the 1-norm LP:
Pick the best C, ν for the SVM; keep the descriptors with nonzero coefficients.
min_{w,b,z,z*,ε}  C Σ_i (z_i + z_i*) + Cνε + ||w||₁
s.t.  x_i·w + b − y_i + z_i ≥ −ε
      x_i·w + b − y_i − z_i* ≤ ε
      z_i, z_i*, ε ≥ 0,   i = 1, …, ℓ

Keep the descriptors with |w_i| > 0.
Bagged Variable Selection
Partition the training data into a training set and a validation set.
Repeat B times: run the linear SVM algorithm for feature selection, yielding a linear regression model.
Bag the B models and obtain a subset of features.
Add a random variable r as an extra descriptor:
Make 20 models of the form  w·x + b = w₁x⁽¹⁾ + w₂x⁽²⁾ + … + w₇₁₈x⁽⁷¹⁸⁾ + w_r r + b
with only a few w_i ≠ 0. Keep attributes with |w_i| > |w_r|.
Bagged Variable Selection
[Flowchart: the dataset is split into a test set and a training set; each bootstrap sample k splits the training set into training and validation parts; a sparse linear SVM, with random variables added to the descriptors, produces reduced data; a nonlinear SVM is tuned on the reduced data to give the predictive model, which generates predictions on the test set]
Star Plot of a.don
a.don measures the number of hydrogen bond donors.
Negatively weighted.
INTERPRETATION: Molecules with many hydrogen bond donors bind well with water, so they stay in solution instead of being absorbed.
Caco-2 – 14 Features (SVM)
a.don
KB54
SMR.VSA2
ANGLEB45
DRNB10
ABSDRN6
PEOE.VSA.FPPOS
DRNB00
Each star represents a descriptor
Each ray is a separate bootstrap
The area of a star represents the relative importance of that descriptor
Descriptors shaded cyan have a negative effect
Unshaded ones have a positive effect
BNPB31
FUKB14
SlogP.VSA0
PEOE.VSA.FNEG
ABSKMIN
SIKIA
Hydrophobicity: a.don
Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45. Large is bad. Flat is bad. Globular is good.
Polarity: PEOE.VSA...: negative partial charge is good.
Bagged SVM (RBF), Caco-2
[Scatter plot: predicted vs. observed values, both axes from -8 to -3]
Train R²(cv) = 0.93;  blind test R² = 0.83  (before feature selection: R² = .66)