Unsupervised Forward Selection

A data reduction algorithm for use with very large data sets

David Whitley†, Martyn Ford† and David Livingstone†‡

†Centre for Molecular Design, University of Portsmouth
‡ChemQuest

Outline

• Variable selection issues
• Pre-processing strategy
• Dealing with multicollinearity
• Unsupervised forward selection
• Model selection strategy
• Applications

Variable Selection Issues

• Relevance
  – statistically significant correlation with response
  – non-small variance

• Redundancy
  – linear dependence: $\sum_i \lambda_i v_i = 0$
  – some variables have no unique information

• Multicollinearity
  – near linear dependence: $\sum_i \lambda_i v_i \approx 0$
  – some variables have little unique information

Pre-processing Strategy

• Identify variables with a significant correlation with the response
• Remove variables with small variance
• Remove variables with no unique information
• Identify a set of variables on which to construct a model
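The first two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the variance cut-off `min_variance`, and the use of the Fisher z-transform as an approximate 5% significance test are all assumptions made for the example.

```python
import math
from statistics import NormalDist, mean, variance


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for r (Fisher z-transform, an assumption)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def preprocess(columns, y, min_variance=1e-6, level=0.05):
    """Keep variables with non-small variance and significant correlation with y."""
    kept = []
    for name, col in columns.items():
        if variance(col) < min_variance:  # small variance: drop
            continue
        if correlation_p_value(pearson_r(col, y), len(y)) < level:
            kept.append(name)
    return kept


# Hypothetical data: one relevant, one constant and one uncorrelated variable.
y = [float(i - 10) for i in range(21)]
columns = {
    "x_rel": [2.0 * v + (-1) ** i for i, v in enumerate(y)],
    "x_const": [1.0] * 21,
    "x_sym": [v * v for v in y],  # symmetric in y, so uncorrelated with it
}
kept = preprocess(columns, y)
print(kept)
```

Only the relevant variable survives; removing variables with no unique information is the job of the UFS step itself.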

Effect of Multicollinearity

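Build regression models of the form

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon_i$

where

$x_5 = x_1 + \gamma z_i$

and $x_1$ to $x_4$, $y$, $z_i$ and $\varepsilon_i$ are random N(0,1).

Increasing $\gamma$ reduces the collinearity between $x_5$ and $x_1$.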
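A short derivation shows why a larger scale factor weakens the collinearity. It assumes the fifth variable is built as $x_5 = x_1 + \gamma z$ with $x_1$ and $z$ independent N(0,1); the symbol $\gamma$ is an assumed name for the scale factor.

```latex
\begin{align*}
\operatorname{Var}(x_5) &= \operatorname{Var}(x_1) + \gamma^2 \operatorname{Var}(z) = 1 + \gamma^2, \\
\operatorname{Cov}(x_1, x_5) &= \operatorname{Var}(x_1) = 1, \\
\rho(x_1, x_5) &= \frac{\operatorname{Cov}(x_1, x_5)}{\sqrt{\operatorname{Var}(x_1)\operatorname{Var}(x_5)}}
               = \frac{1}{\sqrt{1 + \gamma^2}}.
\end{align*}
```

Under these assumptions a factor of 0 gives a correlation of 1 (exact dependence), a factor of 1 gives about 0.71, and a factor of 3 gives about 0.32.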

Effect of Multicollinearity

[Figure: plot of Q2; image not included]

Dealing with Multicollinearity

• Examine pair-wise correlations between variables, and remove one from each pair with high correlation

• Corchop (Livingstone & Rahr, 1989) aims to remove the smallest number of variables while breaking the largest number of pair-wise collinearities
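The pair-wise approach can be illustrated with a naive greedy filter. This sketch is not Corchop, which chooses which member of each pair to drop more carefully; the function name `pairwise_filter`, the threshold `r_max` and the data are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def pairwise_filter(columns, r_max=0.9):
    """Keep a variable only if |r| with every variable kept so far is <= r_max."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson_r(col, columns[k])) <= r_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: b is almost a multiple of a, c is weakly related to both.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
kept = pairwise_filter(columns)
print(kept)
```

A filter of this kind only sees pairs, so it cannot detect a variable that is nearly a combination of several others; that gap is what the multiple correlation test in UFS addresses.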

Unsupervised Forward Selection

1 Select the first two variables with the smallest pair-wise correlation coefficient
2 Reject variables whose pair-wise correlation coefficient with the selected columns exceeds rsqmax
3 Select the next variable to have the smallest squared multiple correlation coefficient with those previously selected
4 Reject variables with squared multiple correlation coefficients greater than rsqmax
5 Repeat 3 - 4 until all variables are selected or rejected
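The steps above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the released UFS software: columns are centred and scaled to unit length, the squared multiple correlation is computed as the squared length of the projection onto an orthonormal basis of the selected columns, and the rejection steps 2 and 4 are merged into one test (the multiple correlation is never smaller than any pair-wise one). Constant columns and an exactly collinear first pair are assumed to have been removed beforehand.

```python
import math


def _unit_centered(col):
    """Centre a column and scale it to unit length."""
    m = sum(col) / len(col)
    d = [v - m for v in col]
    norm = math.sqrt(sum(v * v for v in d))
    return [v / norm for v in d]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _residual_unit(x, basis):
    """Part of x orthogonal to the basis, rescaled to unit length."""
    r = list(x)
    for q in basis:  # modified Gram-Schmidt
        c = _dot(q, r)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    norm = math.sqrt(_dot(r, r))
    return [v / norm for v in r]


def ufs(columns, rsqmax):
    """Return the names of the selected variables, in selection order."""
    cols = {name: _unit_centered(c) for name, c in columns.items()}
    names = list(cols)
    # Step 1: the pair with the smallest squared correlation.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    first, second = min(pairs, key=lambda p: _dot(cols[p[0]], cols[p[1]]) ** 2)
    selected = [first, second]
    basis = [cols[first], _residual_unit(cols[second], [cols[first]])]
    remaining = [n for n in names if n not in selected]
    while remaining:
        # Squared multiple correlation of each candidate with the selected set.
        rsq = {n: sum(_dot(q, cols[n]) ** 2 for q in basis) for n in remaining}
        # Steps 2 and 4: reject candidates above the threshold.
        remaining = [n for n in remaining if rsq[n] <= rsqmax]
        if not remaining:
            break
        # Step 3: select the candidate with the smallest value.
        best = min(remaining, key=rsq.get)
        selected.append(best)
        basis.append(_residual_unit(cols[best], basis))
        remaining.remove(best)
    return selected


# Hypothetical data: b is almost a multiple of a.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
selected = ufs(columns, rsqmax=0.9)
print(selected)
```

With rsqmax = 0.9 the near-duplicate column is rejected; raising rsqmax towards 1 lets it back in, which is the trade-off the model selection strategy explores.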

Continuum Regression

• A regression procedure with the generalized criterion function $F$, a function of the covariance term $c'X'y$ and the variance term $c'X'Xc$ [exponents not recoverable from the source]

• Varying the continuous parameter 0 ≤ α ≤ 1.5 adjusts the balance between the covariance of the response with the descriptors and the variance of the descriptors, so that
  – α = 0 is equivalent to ordinary least squares
  – α = 0.5 is equivalent to partial least squares
  – α = 1.0 is equivalent to principal components regression

Model Selection Strategy

• For α = 0.0, 0.1, …, 1.5 build a CR model for the set of variables selected by UFS with rsqmax = 0.1, 0.2, …, 0.9, 0.99

• Select the model with the rsqmax and α values maximizing Q2 (leave-one-out cross-validated R2)
  – Apply n-fold cross-validation to check predictive ability
  – Apply a randomization test (1000 permutations of the response scores) to guard against chance correlation
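The validation steps can be sketched with a one-descriptor least-squares line standing in for the CR model. Everything named here is an assumption for illustration: the helper names, the interleaved fold assignment, and the (hits + 1) / (permutations + 1) convention for the tail probability. Setting `n_folds` equal to the number of compounds gives leave-one-out.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Fitted R2 of the line."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def q2_cv(x, y, n_folds):
    """Cross-validated Q2 = 1 - PRESS / SS_total, with interleaved folds."""
    n = len(y)
    my = sum(y) / n
    press = 0.0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in test)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / ss_tot


def randomization_tail_probability(x, y, statistic, n_perm=1000, seed=0):
    """Share of response permutations scoring at least as well as the real data."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: a nearly exact linear relationship.
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 + 0.1 * (-1) ** i for i, xi in enumerate(x)]

q2_loo = q2_cv(x, y, n_folds=len(y))
q2_5fold = q2_cv(x, y, n_folds=5)
p_fit = randomization_tail_probability(x, y, r_squared)
p_pred = randomization_tail_probability(x, y, lambda a, b: q2_cv(a, b, len(b)))
print(q2_loo, q2_5fold, p_fit, p_pred)
```

Running the permutation test on both the fitted R2 and the cross-validated Q2 mirrors the separate tail probabilities for fit and for prediction.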

Pyrethroid Data Set

• 70 physicochemical descriptors to predict killing activity (KA) of 19 pyrethroid insecticides

• Only 6 descriptors are correlated with KA at the 5% level

• Optimal models
  – 4-variable, 2-component model with R2 = 0.775, Q2 = 0.773 obtained when rsqmax = 0.7, α = 1.2
  – 3-variable, 1-component model with R2 = 0.81, Q2 = 0.76 obtained when rsqmax = 0.6, α = 0.2

Optimal Model I

• Standard errors are bootstrap estimates based on 5000 bootstraps

• Randomization test tail probabilities below 0.0003 for fit and 0.0071 for prediction

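[Equation, partly recovered: the signs between terms and the subscripts on the two A descriptors are missing from the source]
Terms of the KA model: intercept 2.31; 9.564 A; 0.8044 A; 0.00024 MIZ; 0.037 DVX
Standard errors as listed: (2.80) (98) (0.000083) (0.11)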

Optimal Model II

• Standard errors are bootstrap estimates based on 5000 bootstraps

• Randomization test tail probabilities below 0.0001 for fit and 0.0052 for prediction

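[Equation, partly recovered: the signs between terms and the subscript on the A descriptor are missing from the source]
Terms of the KA model: intercept 1.80; 8.567 A; 0.00019 MIZ; 0.20 DVX
Standard errors as listed: (2.00) (0.000055) (0.08)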
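Bootstrap standard errors of the kind quoted for both models can be sketched by resampling compounds with replacement and refitting. The example below uses a single-descriptor slope as a stand-in for the model coefficients; the function names, the data, and the choice to skip degenerate resamples are assumptions, not details taken from the slides.

```python
import math
import random


def slope_of(x, y):
    """Least-squares slope for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_se(x, y, estimator, n_boot=5000, seed=0):
    """Standard deviation of the estimator over case-resampled data sets."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):  # degenerate resample: slope undefined, redraw
            continue
        estimates.append(estimator(xs, [y[i] for i in idx]))
    m = sum(estimates) / n_boot
    return math.sqrt(sum((e - m) ** 2 for e in estimates) / (n_boot - 1))


# Hypothetical data: a nearly exact linear relationship.
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 + 0.1 * (-1) ** i for i, xi in enumerate(x)]

se_slope = bootstrap_se(x, y, slope_of)
print(se_slope)
```

The same resampling loop applies to any coefficient: refit the full model on each resample and take the standard deviation of the coefficient of interest.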

N-Fold Cross-Validation

[Figure: n-fold cross-validation results; panels: 3-variable model, 4-variable model]

Feature Recognition

• Important explanatory variables may not be selected for inclusion in the model
  – force some variables in, then continue UFS algorithm

• The component loadings for the original variables can be examined to identify variables highly correlated with the components in the model

Loadings for the 1-component pyrethroid model with tail probability < 0.01

variable   loading
A5          0.756
A3          0.723
A8          0.619
NS16       -0.605
DVX        -0.603
ES12       -0.584
MIZ         0.567

Steroid Data Set

• 21 steroid compounds from SYBYL CoMFA tutorial to model binding affinity to human TBG

• Initial data set has 1248 variables with values below 30 kcal/mol

• Removed 858 variables not significantly correlated with response (5% level)

• Removed 367 variables with variance below 1.0 kcal/mol

• Leaving 23 variables to be processed by UFS/CR

Optimal models

• UFS/CR produces a 3-variable, 1-component model with R2 = 0.85, Q2 = 0.83 at rsqmax = 0.3, α = 0.3

• CoMFA tutorial produces a 5-component model with R2 = 0.98, Q2 = 0.6


[Figure panels: CoMFA tutorial model; UFS/CR model]

Putative Pharmacophore

Selwood Data Set

• 53 descriptors to predict biological activity of 31 antifilarial antimycin analogues

• 12 descriptors are correlated with the response variable at the 5% level

• Optimal models
  – 2-variable, 1-component model with R2 = 0.42, Q2 = 0.41 obtained when rsqmax = 0.1, α = 1.0
  – 12-variable, 1-component model with R2 = 0.85, Q2 = 0.5 obtained when rsqmax = 0.99, α = 0.0 (omitting compound M6)


[Figure panels: 2-variable model; 12-variable model]

Summary

• Multicollinearity is a potential cause of poor predictive power in regression.

• The UFS algorithm eliminates redundancy and reduces multicollinearity, thus improving the chances of obtaining robust, low-dimensional regression models.

• Chance correlation can be addressed by eliminating variables that are uncorrelated with the response.

Summary

• UFS can be used to adjust the balance between reducing multicollinearity and including relevant information.

• Case studies show that leave-one-out cross-validation should be supplemented by n-fold cross-validation, in order to obtain accurate and precise estimates of predictive ability (Q2).

Acknowledgements

• Astra Zeneca
• GlaxoSmithKline
• MSI
• Unilever

BBSRC Cooperation with Industry Project: Improved Mathematical Methods for Drug Design

Reference

D. C. Whitley, M. G. Ford and D. J. Livingstone. Unsupervised forward selection: a method for eliminating redundant variables. J. Chem. Inf. Comp. Sci., 2000, 40, 1160-1168.

UFS software available from: http://www.cmd.port.ac.uk

CR is a component of Paragon (available summer 2001)
