Application and Efficacy of Random Forest Method for QSAR Analysis
presented by Pavel Polishchuk
Random Forest – consensus modelling
A Random Forest model is an ensemble of single decision trees.
Rules for model construction:
1. Each tree is grown on a separate bootstrap sample of the initial training set compounds.
2. At each node, only a small, fixed number of randomly chosen descriptors is considered.
3. Each tree is grown to its maximum depth (no pruning).
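The two sources of randomness in rules 1 and 2 can be illustrated with a minimal sketch using only the Python standard library. The function names are hypothetical, and the sizes (644 compounds, 6021 descriptors, 2000 descriptors tried per node) are borrowed from the toxicity example in these slides purely for illustration.

```python
import random


def bootstrap_sample(n_compounds, rng):
    """Rule 1: draw n_compounds indices with replacement.

    Returns the in-bag indices (with repeats) and the out-of-bag
    indices, i.e. the compounds that were never drawn.
    """
    in_bag = [rng.randrange(n_compounds) for _ in range(n_compounds)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_compounds) if i not in drawn]
    return in_bag, out_of_bag


def candidate_descriptors(n_descriptors, n_tried, rng):
    """Rule 2: pick a fixed-size random subset of descriptor indices for one node."""
    return rng.sample(range(n_descriptors), n_tried)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(644, rng)
subset = candidate_descriptors(6021, 2000, rng)
print(len(in_bag), len(out_of_bag), len(subset))
```

On average roughly a third of the compounds end up out-of-bag for any given tree, which is what makes an internal estimate of predictive ability possible.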
Random Forest algorithm (diagram): Initial dataset → bootstrap samples → Tree 1, Tree 2, Tree 3, … → Combined prediction
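The flow in the diagram (bootstrap samples, one unpruned tree per sample, combined prediction) might be sketched for a regression task as follows. This is a toy illustration in plain Python, not the implementation used for the results in these slides: the function names, the squared-error split criterion, averaging as the combination rule, and the synthetic data are all assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean (assumed split criterion)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, n_tried, rng):
    """Grow an unpruned regression tree; a leaf is a number, a node is a tuple."""
    if len(set(targets)) == 1:  # pure node: nothing left to split
        return targets[0]
    best = None  # (error, descriptor index, threshold)
    for d in rng.sample(range(len(rows[0])), n_tried):  # random descriptor subset
        values = sorted(set(row[d] for row in rows))
        for threshold in values[:-1]:  # skip the maximum so both sides are non-empty
            left = [t for row, t in zip(rows, targets) if row[d] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[d] > threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, d, threshold)
    if best is None:  # every candidate descriptor is constant in this node
        return mean(targets)
    _, d, threshold = best
    left = [i for i, row in enumerate(rows) if row[d] <= threshold]
    right = [i for i, row in enumerate(rows) if row[d] > threshold]
    return (d, threshold,
            build_tree([rows[i] for i in left], [targets[i] for i in left], n_tried, rng),
            build_tree([rows[i] for i in right], [targets[i] for i in right], n_tried, rng))


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # descend until a leaf value is reached
        d, threshold, left, right = tree
        tree = left if row[d] <= threshold else right
    return tree


def build_forest(rows, targets, n_trees, n_tried, rng):
    forest = []
    for _ in range(n_trees):
        # each tree sees its own bootstrap sample of the training set
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], n_tried, rng))
    return forest


def predict_forest(forest, row):
    """Combined prediction: the average over all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Synthetic data: the target depends on descriptor 0 only.
rng = random.Random(42)
rows = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(60)]
targets = [2.0 * x0 for x0, _ in rows]
forest = build_forest(rows, targets, n_trees=25, n_tried=1, rng=rng)
prediction = predict_forest(forest, [5.0, 5.0])
print(prediction)
```

For classification, the combined prediction would typically be a majority vote over the trees rather than an average.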
Random Forest advantages:
1. RF models are robust to over-fitting.
2. No pre-selection of variables is needed.
3. RF has its own reliable procedure for estimating the predictive ability of a model (out-of-bag estimation).
4. RF models are robust to “noise” in the training dataset.
5. RF allows the importance of variables for the target property to be estimated (interpretability of the RF model).
6. RF can analyze compounds with different mechanisms of action.
7. The RF method is very fast and effective when working with huge datasets.
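The internal estimate of predictive ability in advantage 3 is presumably the out-of-bag (OOB) estimate reported later in the slides: each compound is predicted only by the trees whose bootstrap sample did not contain it. A minimal sketch of that aggregation step is shown below. The representation of a tree as an (in-bag indices, predict function) pair and the three toy "trees" are hypothetical stand-ins, not real fitted models.

```python
from statistics import mean


def oob_predictions(rows, trees):
    """Average, for each compound, the predictions of the trees that never saw it.

    trees is a list of (in_bag_indices, predict) pairs. A compound that
    was in-bag for every tree gets None.
    """
    result = []
    for i, row in enumerate(rows):
        votes = [predict(row) for in_bag, predict in trees if i not in in_bag]
        result.append(mean(votes) if votes else None)
    return result


# Toy stand-ins for fitted trees, each tagged with its in-bag compound indices.
rows = [[1.0], [2.0], [3.0]]
trees = [
    ({0, 1}, lambda r: r[0] * 2),  # compound 2 is out-of-bag here
    ({1, 2}, lambda r: r[0] + 1),  # compound 0 is out-of-bag here
    ({0, 2}, lambda r: 0.0),       # compound 1 is out-of-bag here
]
oob = oob_predictions(rows, trees)
print(oob)  # [2.0, 0.0, 6.0]
```

Comparing these OOB predictions with the observed values gives statistics such as R2(oob) without setting aside a separate validation set.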
Several examples of solutions to real QSAR tasks
Toxicity of chemical compounds to T. pyriformis#
Diverse datasets:
training set = 644 compounds
test set 1 (ts1) = 339 compounds
test set 2 (ts2) = 110 compounds
Total number of 2D simplex descriptors = 6021
Toxicity was expressed as the negative logarithm of the concentration causing 50% inhibition of Tetrahymena pyriformis growth (pIGC50).
# Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
Comparison of RF model with other consensus ones

            RF#             Consensus PLS   Consensus
            (2D simplex)    (2D simplex)    literature##
R2(ws)      0.99            0.85            0.92
R2(oob)     0.81            ---             ---
R2(ts1)     0.83            0.80            0.85
R2(ts2)     0.74            0.69            0.67
MAE(ts1)    0.30            0.33            0.29
MAE(ts2)    0.38            0.41            0.39

MAE: mean absolute error of prediction
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|

RF model (trees=500, vars=2000)#
# Polishchuk, P.G., et al., J. Chem. Inf. Model., 2009. 49: p. 2481-2488.
## Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
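The two statistics reported in the comparison, R2 and MAE, can be computed from observed and predicted values as in the sketch below. The slides do not say which definition of R2 was used; the coefficient of determination (1 - SS_res / SS_tot) is assumed here, and the four data points are invented toy values, not results from the cited studies.

```python
def mean_absolute_error(observed, predicted):
    """MAE = (1/n) * sum(|Y_i - Yhat_i|)."""
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Invented toy values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(mae, r2)
```

Applying the same two functions to the training (ws), out-of-bag, and external test predictions would give one row of statistics per set.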
Estimation of mutagenic potential of chemical compounds (Ames test)

Model                    Descriptors        Accuracy (oob)   Accuracy (5-fold CV)   Accuracy (test set)
2D RF                    Simplex + Dragon   0.827            0.823                  0.813
2D RF                    Simplex            0.823            0.810                  0.814
2D RF                    Dragon             0.815            0.803                  0.805
Consensus# (32 models)   ---                ---              0.828                  0.823

# Results of collaboration of 13 scientific groups (not published yet)
training set = 4361 compounds
test set = 2181 compounds
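Accuracy and a 5-fold cross-validation split, as used for the Ames test models, could be set up along the following lines. This is a generic sketch: the helper names are hypothetical, the shuffling scheme is an assumption (the slides do not describe how the folds were formed), and the class labels are invented.

```python
import random


def accuracy(observed, predicted):
    """Fraction of compounds whose predicted class matches the observed one."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def k_fold_indices(n_compounds, k, rng):
    """Shuffle the compound indices and deal them into k nearly equal folds."""
    indices = list(range(n_compounds))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


rng = random.Random(0)
folds = k_fold_indices(4361, 5, rng)  # training set size taken from the slide
for held_out in folds:
    held_out_set = set(held_out)
    train = [i for i in range(4361) if i not in held_out_set]
    # a model would be fitted on `train` and evaluated on `held_out` here

# Invented labels: 1 = mutagenic, 0 = non-mutagenic.
acc = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print([len(f) for f in folds], acc)
```

Each compound is held out exactly once, so the cross-validated accuracy is based on predictions for the whole training set.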
Solubility in water: QSPR task solution#
training set = 2537 compounds
test set = 301 compounds
training set R2 = 0.99
out-of-bag set R2 = 0.88
test set R2 = 0.82
# Kovdienko, N.A., et al. Molecular Informatics, 2010. 29: p.394-406
Leo Breiman (27.01.1928 – 07.07.2005) – author of Random Forest
«Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.»