Unsupervised Forward Selection
A data reduction algorithm for use with very large data sets
David Whitley†, Martyn Ford† and David Livingstone†‡
†Centre for Molecular Design, University of Portsmouth
‡ChemQuest
Outline
• Variable selection issues
• Pre-processing strategy
• Dealing with multicollinearity
• Unsupervised forward selection
• Model selection strategy
• Applications
Variable Selection Issues
• Relevance
  – statistically significant correlation with response
  – non-small variance
• Redundancy
  – linear dependence: $\sum_i a_i v_i = 0$ with the coefficients $a_i$ not all zero
  – some variables have no unique information
• Multicollinearity
  – near linear dependence: $\sum_i a_i v_i \approx 0$
  – some variables have little unique information
Pre-processing Strategy
• Identify variables with a significant correlation with the response
• Remove variables with small variance
• Remove variables with no unique information
• Identify a set of variables on which to construct a model
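The first two pre-processing steps can be sketched as follows. This is a minimal illustration, assuming that "significant" means a two-sided t-test on the Pearson correlation and that the variance threshold is chosen by the user. The function name `prefilter`, its parameters and the toy data are hypothetical and do not come from the slides; the critical value `t_crit` has to be looked up for n - 2 degrees of freedom (2.306 for 8 degrees of freedom at the 5% level).

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prefilter(columns, y, t_crit, min_variance):
    """Return indices of columns that pass both filters (illustrative only).

    columns      : list of descriptor columns, each a list of floats
    y            : response values
    t_crit       : two-sided critical t value for n - 2 degrees of freedom
    min_variance : columns with sample variance <= this value are dropped
    """
    n = len(y)
    kept = []
    for idx, col in enumerate(columns):
        # Low-variance filter (also protects pearson_r from constant columns).
        if statistics.variance(col) <= min_variance:
            continue
        r = pearson_r(col, y)
        r2 = min(r * r, 1.0)
        # t statistic for testing zero correlation.
        t = math.inf if r2 == 1.0 else abs(r) * math.sqrt((n - 2) / (1.0 - r2))
        if t > t_crit:
            kept.append(idx)
    return kept


if __name__ == "__main__":
    y = [float(i) for i in range(1, 11)]
    x_good = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10.1]
    x_const = [3.0] * 10
    x_noise = [1.0, -1.0] * 5
    print(prefilter([x_good, x_const, x_noise], y, t_crit=2.306, min_variance=0.01))
```

The remaining two steps (removing variables with no unique information and choosing the modelling set) are what the UFS algorithm itself addresses, so they are left out of this sketch.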
Effect of Multicollinearity
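Build regression models of the form
$$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + e_i$$
where
$$x_5 = x_1 + \lambda z_i$$
and x1 - x4, y, zi and ei are random N(0,1).
Increasing $\lambda$ reduces the collinearity between x5 and x1.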
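A short simulation can illustrate how the scale factor on z controls the collinearity. This is a sketch under stated assumptions: the scale factor is called `scale` here (a name chosen for illustration), x1 and z are drawn independently from N(0,1), and the function names are hypothetical. Under those assumptions the population correlation between x1 and x5 is 1 / sqrt(1 + scale^2), which the simulated value should approach.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulated_correlation(scale, n=5000, seed=0):
    """Sample correlation between x1 and x5 = x1 + scale * z."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x5 = [a + scale * rng.gauss(0.0, 1.0) for a in x1]
    return pearson_r(x1, x5)


def expected_correlation(scale):
    """Population correlation: cov = 1, var(x1) = 1, var(x5) = 1 + scale**2."""
    return 1.0 / math.sqrt(1.0 + scale ** 2)


if __name__ == "__main__":
    for scale in (0.1, 0.5, 1.0, 2.0):
        print(scale, round(simulated_correlation(scale), 3),
              round(expected_correlation(scale), 3))
```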
[Figure: Effect of multicollinearity on Q2]
Dealing with Multicollinearity
• Examine pair-wise correlations between variables, and remove one from each pair with high correlation
• Corchop (Livingstone & Rahr, 1989) aims to remove the smallest number of variables while breaking the largest number of pair-wise collinearities
Unsupervised Forward Selection
1 Select the first two variables with the smallest pair-wise correlation coefficient
2 Reject variables whose pair-wise correlation coefficient with the selected columns exceeds rsqmax
3 Select the next variable to have the smallest squared multiple correlation coefficient with those previously selected
4 Reject variables with squared multiple correlation coefficients greater than rsqmax
5 Repeat 3 - 4 until all variables are selected or rejected
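The five steps above can be sketched in plain Python. This is an illustrative reading of the slide, not the distributed UFS software: it assumes columns are mean-centred and scaled to unit length, compares squared correlations against rsqmax in step 2, computes squared multiple correlations by projecting onto a Gram-Schmidt basis of the selected columns, and applies the rejection threshold before each new selection. The function name `ufs` and its interface are hypothetical.

```python
import math


def _standardize(col):
    """Mean-centre a column and scale it to unit length."""
    mean = sum(col) / len(col)
    centred = [v - mean for v in col]
    norm = math.sqrt(sum(v * v for v in centred))
    if norm == 0.0:
        raise ValueError("constant column: remove zero-variance variables first")
    return [v / norm for v in centred]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ufs(columns, rsqmax=0.9):
    """Return the indices of the selected columns, in order of selection."""
    Z = [_standardize(c) for c in columns]
    p = len(Z)
    if p < 2:
        raise ValueError("need at least two variables")
    basis = []  # orthonormal basis spanning the selected columns

    def add_to_basis(k):
        v = Z[k][:]
        for b in basis:
            coef = _dot(v, b)
            v = [x - coef * y for x, y in zip(v, b)]
        norm = math.sqrt(_dot(v, v))
        if norm < 1e-12:
            raise ValueError("selected columns are linearly dependent")
        basis.append([x / norm for x in v])

    # Step 1: the pair with the smallest squared pair-wise correlation.
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    i, j = min(pairs, key=lambda ij: _dot(Z[ij[0]], Z[ij[1]]) ** 2)
    selected = [i, j]
    add_to_basis(i)
    add_to_basis(j)

    # Step 2: reject variables too correlated with either selected column.
    candidates = [
        k for k in range(p)
        if k not in selected
        and max(_dot(Z[k], Z[i]) ** 2, _dot(Z[k], Z[j]) ** 2) <= rsqmax
    ]

    # Steps 3-5: repeat until every variable is selected or rejected.
    while candidates:
        # Squared multiple correlation = squared length of the projection
        # of a unit-length, centred column onto the selected subspace.
        r2 = {k: sum(_dot(Z[k], b) ** 2 for b in basis) for k in candidates}
        candidates = [k for k in candidates if r2[k] <= rsqmax]  # step 4
        if not candidates:
            break
        best = min(candidates, key=r2.get)  # step 3
        selected.append(best)
        add_to_basis(best)
        candidates.remove(best)
    return selected
```

For example, given five columns where one is a near copy of another and one is an exact multiple of a third, `ufs(columns, rsqmax=0.9)` would be expected to keep only one member of each collinear pair.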
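Continuum Regression
• A regression procedure with a generalized criterion function F of the coefficient vector c, formed as a product of powers of the covariance term (c'X'y) and the variance term (c'X'Xc), with exponents that depend on a continuous parameter α [the exponents are not legible in the source]
• Varying the continuous parameter 0 ≤ α ≤ 1.5 adjusts the balance between the covariance of the response with the descriptors and the variance of the descriptors, so that
  – α = 0 is equivalent to ordinary least squares
  – α = 0.5 is equivalent to partial least squares
  – α = 1.0 is equivalent to principal components regression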
Model Selection Strategy
• For α = 0.0, 0.1, …, 1.5 build a CR model for the set of variables selected by UFS with rsqmax = 0.1, 0.2, …, 0.9, 0.99
• Select the model with rsqmax and α maximizing Q2 (leave-one-out cross-validated R2)
  – Apply n-fold cross-validation to check predictive ability
  – Apply a randomization test (1000 permutations of the response scores) to guard against chance correlation
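The two validation ideas in this strategy, leave-one-out Q2 and a randomization test on the response, can be sketched as follows. Ordinary least squares stands in for continuum regression here, which is an assumption made only to keep the example self-contained; Q2 is computed as 1 - PRESS / SS about the mean of y, one common definition. All function names are hypothetical.

```python
import random


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ...] via normal equations."""
    A = [[1.0] + list(r) for r in rows]
    k = len(A[0])
    # Augmented normal-equation matrix [A'A | A'y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict(coef, row):
    return coef[0] + sum(b * v for b, v in zip(coef[1:], row))


def loo_q2(rows, y):
    """Leave-one-out cross-validated R2 (rows and y must be lists)."""
    n = len(y)
    mean_y = sum(y) / n
    press = 0.0
    for i in range(n):
        coef = fit_ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, rows[i])) ** 2
    return 1.0 - press / sum((t - mean_y) ** 2 for t in y)


def randomization_tail_probability(rows, y, n_perm=1000, seed=0):
    """Fraction of response permutations whose Q2 reaches the observed Q2."""
    rng = random.Random(seed)
    observed = loo_q2(rows, y)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if loo_q2(rows, shuffled) >= observed:
            hits += 1
    # Count the observed arrangement as one of the permutations.
    return (hits + 1) / (n_perm + 1)
```

In a full search, `loo_q2` would be evaluated once per (rsqmax, α) pair with the CR model in place of `fit_ols`, and the pair with the largest Q2 retained.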
Pyrethroid Data Set
• 70 physicochemical descriptors to predict killing activity (KA) of 19 pyrethroid insecticides
• Only 6 descriptors are correlated with KA at the 5% level
• Optimal models
  – 4-variable, 2-component model with R2 = 0.775, Q2 = 0.773 obtained when rsqmax = 0.7, α = 1.2
  – 3-variable, 1-component model with R2 = 0.81, Q2 = 0.76 obtained when rsqmax = 0.6, α = 0.2
Optimal Model I
• Standard errors are bootstrap estimates based on 5000 bootstraps
• Randomization test tail probabilities below 0.0003 for fit and 0.0071 for prediction
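[Equation: KA as a linear function of two A descriptors, MIZ and DVX, with bootstrap standard errors in parentheses; the numerical coefficients are garbled in the source]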
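Bootstrap standard errors of the kind quoted above can be sketched by resampling compounds with replacement and refitting. The example below is a simplified illustration, assuming a single-descriptor least-squares line instead of the multi-variable CR model; the function names and the synthetic data used to exercise it are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def bootstrap_se(x, y, n_boot=5000, seed=0):
    """Bootstrap standard errors of (intercept, slope)."""
    rng = random.Random(seed)
    n = len(x)
    intercepts, slopes = [], []
    while len(slopes) < n_boot:
        # Resample the compounds (rows) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # degenerate resample: a line cannot be fitted
        b0, b1 = fit_line(xs, [y[i] for i in idx])
        intercepts.append(b0)
        slopes.append(b1)
    # The spread of the refitted coefficients estimates their standard errors.
    return statistics.stdev(intercepts), statistics.stdev(slopes)
```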
Optimal Model II
• Standard errors are bootstrap estimates based on 5000 bootstraps
• Randomization test tail probabilities below 0.0001 for fit and 0.0052 for prediction
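[Equation: KA as a linear function of one A descriptor, MIZ and DVX, with bootstrap standard errors in parentheses; the numerical coefficients are garbled in the source]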
N-Fold Cross-Validation
[Figure: n-fold cross-validation results; panels: 3-variable model, 4-variable model]
Feature Recognition
• Important explanatory variables may not be selected for inclusion in the model
  – force some variables in, then continue UFS algorithm
• The component loadings for the original variables can be examined to identify variables highly correlated with the components in the model
Loadings for the 1-component pyrethroid model with tail probability < 0.01

variable   loading
A5          0.756
A3          0.723
A8          0.619
NS16       -0.605
DVX        -0.603
ES12       -0.584
MIZ         0.567
Steroid Data Set
• 21 steroid compounds from SYBYL CoMFA tutorial to model binding affinity to human TBG
• Initial data set has 1248 variables with values below 30 kcal/mol
• Removed 858 variables not significantly correlated with response (5% level)
• Removed 367 variables with variance below 1.0 kcal/mol
• Leaving 23 variables to be processed by UFS/CR
Optimal models
• UFS/CR produces a 3-variable, 1-component model with R2 = 0.85, Q2 = 0.83 at rsqmax = 0.3, α = 0.3
• CoMFA tutorial produces a 5-component model with R2 = 0.98, Q2 = 0.6
N-Fold Cross-Validation
[Figure: n-fold cross-validation results; panels: CoMFA tutorial model, UFS/CR model]
Putative Pharmacophore
Selwood Data Set
• 53 descriptors to predict biological activity of 31 antifilarial antimycin analogues
• 12 descriptors are correlated with the response variable at the 5% level
• Optimal models
  – 2-variable, 1-component model with R2 = 0.42, Q2 = 0.41 obtained when rsqmax = 0.1, α = 1.0
  – 12-variable, 1-component model with R2 = 0.85, Q2 = 0.5 obtained when rsqmax = 0.99, α = 0.0 (omitting compound M6)
N-Fold Cross-Validation
[Figure: n-fold cross-validation results; panels: 2-variable model, 12-variable model]
Summary
• Multicollinearity is a potential cause of poor predictive power in regression.
• The UFS algorithm eliminates redundancy and reduces multicollinearity, thus improving the chances of obtaining robust, low-dimensional regression models.
• Chance correlation can be addressed by eliminating variables that are uncorrelated with the response.
• The rsqmax threshold in UFS can be used to adjust the balance between reducing multicollinearity and including relevant information.
• Case studies show that leave-one-out cross-validation should be supplemented by n-fold cross-validation, in order to obtain accurate and precise estimates of predictive ability (Q2).
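One way to supplement leave-one-out with n-fold cross-validation is to repeat random n-fold splits and look at the spread of the resulting Q2 values. The sketch below assumes that reading of "n-fold", uses a single-descriptor least-squares line as a stand-in for the CR model, and computes Q2 as 1 - PRESS / SS about the overall mean of y. The function names are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def nfold_q2(x, y, n_folds=5, n_repeats=20, seed=0):
    """Return one Q2 value per random n-fold partition of the data."""
    rng = random.Random(seed)
    n = len(y)
    mean_y = statistics.fmean(y)
    ss_total = sum((t - mean_y) ** 2 for t in y)
    scores = []
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)
        press = 0.0
        for f in range(n_folds):
            test = order[f::n_folds]  # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
            press += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in test)
        scores.append(1.0 - press / ss_total)
    return scores
```

A narrow range of scores across repetitions suggests a stable estimate of predictive ability, while a wide range suggests that a single leave-one-out Q2 may be misleading.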
Acknowledgements
• Astra Zeneca
• GlaxoSmithKline
• MSI
• Unilever
BBSRC Cooperation with Industry Project: Improved Mathematical Methods for Drug Design
Reference
D. C. Whitley, M. G. Ford and D. J. Livingstone. Unsupervised forward selection: a method for eliminating redundant variables. J. Chem. Inf. Comp. Sci., 2000, 40, 1160-1168.
UFS software available from: http://www.cmd.port.ac.uk
CR is a component of Paragon (available summer 2001)