
A new workflow for QSAR model development from small data sets: Integration of data curation, double cross-validation and consensus prediction tools

Pravin Ambure¹, Agnieszka Gajewicz², M. Natalia D. S. Cordeiro¹, Kunal Roy³

¹ LAQV@REQUIMTE/Department of Chemistry and Biochemistry, University of Porto, 4169-007 Porto, Portugal
² Laboratory of Environmental Chemometrics, Faculty of Chemistry, University of Gdansk, Gdansk, Poland
³ Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India

Drug Theoretics and Cheminformatics Laboratory, Jadavpur University, Kolkata, India https://sites.google.com/site/kunalroyindia/

6/12/2019 1


QSAR (Quantitative Structure-Activity Relationship)

Predictive models correlating biological activity (BA) of chemicals with descriptors representative of molecular structure and/or property by application of statistical tools.

BA = f (chemical structure or property) = f (descriptors)

Roy, Kar and Das, Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, Academic Press, 2015


Roy, Kar and Das, A Primer on QSAR/QSPR Modeling: Fundamental Concepts, Springer, 2015


QSAR in regulatory use

The experimental determination of environmental parameters (e.g., soil sorption, bioconcentration, biodegradation and biotransformation, toxic effects) of commercial chemicals is a costly and time-consuming process. Since a large number of chemicals are currently in common use (approx. 100,000) and new chemicals are registered at a very high rate (1000 per year), our human and material resources are insufficient to obtain experimentally even basic information on environmental fate and effects for all these chemicals. Thus, it is necessary to develop quantitative models that will accurately and readily predict the environmental behaviour of large sets of chemicals.

Roy K, Expert Opin Drug Discov, 2007, 2, 1567-1577


• Time and cost effective
• Avoids animal experimentation
• Supports “3Rs” Principles
• Can be applied for virtual compounds
• Supported by various organizations like
    • European Centre for the Validation of Alternative Methods (ECVAM)
    • International Organizations of Medical Sciences
    • REACH (Registration, Evaluation and Authorization of Chemicals) regulations
    • US EPA
    • Organization for Economic Cooperation and Development (OECD)

Roy K, Expert Opin Drug Discov, 2007, 2, 1567-1577


Guidelines for QSAR model development


Validation of QSAR models

Any QSAR modeling should ultimately lead to statistically robust models capable of making accurate and reliable predictions of biological activities of compounds.

The process of QSAR model development can be generally divided into three stages: data preparation, data analysis, and model validation.

The validation strategies check the reliability of the developed models for their possible application on a new set of data, and confidence of prediction can thus be judged.

Kubinyi H, Hamprecht F A, Mietzner T, J Med Chem, 1998, 41, 2553-2564.


• Internal validation
    • Leave-one-out
    • Leave-many-out
    • Bootstrapping
• External validation
• Y-randomization

Roy K, Mitra I, Comb. Chem. High Throughput Screen. 2011, 14, 450−474.


External Validation

External validation or validation using an independent test set is usually considered as the gold standard in evaluating the quality of predictions from a QSAR model. The external predictivity of QSAR models is commonly described by employing various validation metrics, which can be broadly categorized into two major classes, viz.,

• $R^2$-based metrics, namely $R^2_{\mathrm{test}}$, $Q^2_{(\mathrm{ext\_F1})}$, $Q^2_{(\mathrm{ext\_F2})}$, etc.

• Purely error-based measures like predicted residual sum of squares (PRESS), root mean square error (RMSE), mean absolute error (MAE), etc.

Tropsha A. Mol inform. 2010;29(6‐7):476-488.
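As a rough illustration of the two classes of metrics, the sketch below computes PRESS, RMSE and MAE together with two R²-type metrics for a test set. The slide does not spell out the formulas, so the definitions used here (Q²(ext_F1) referenced to the training-set mean, Q²(ext_F2) referenced to the test-set mean) are the ones commonly quoted in the QSAR literature and should be read as an assumption. The function name `external_metrics` and its arguments are hypothetical.

```python
import math


def external_metrics(y_test, y_pred, y_train_mean):
    """Error-based and R2-type metrics for an external test set.

    y_test       : observed responses of the test compounds
    y_pred       : model predictions for the same compounds
    y_train_mean : mean observed response of the training set
    """
    n = len(y_test)
    residuals = [obs - pred for obs, pred in zip(y_test, y_pred)]

    # Purely error-based measures
    press = sum(r * r for r in residuals)
    rmse = math.sqrt(press / n)
    mae = sum(abs(r) for r in residuals) / n

    # R2-type measures (assumed literature definitions)
    test_mean = sum(y_test) / n
    ss_about_train_mean = sum((obs - y_train_mean) ** 2 for obs in y_test)
    ss_about_test_mean = sum((obs - test_mean) ** 2 for obs in y_test)

    return {
        "PRESS": press,
        "RMSE": rmse,
        "MAE": mae,
        "Q2_F1": 1.0 - press / ss_about_train_mean,
        "Q2_F2": 1.0 - press / ss_about_test_mean,
    }
```

The R²-type metrics depend on the spread of the observed responses around a reference mean, whereas the error-based measures depend only on the residuals, which is why the two classes can rank models differently.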


Internal vs. External Validation

• According to some groups of scientists, cross-validation is better suited for checking the predictive ability of QSAR models, because it avoids the loss of information caused by splitting the data set into training and test sets.

• They have also argued that a test of the predictive ability of QSAR models based on a single training-test split is biased and insufficient.

• Although the relative suitability of cross-validation and external validation for judging the predictive ability of QSAR models is a matter of debate, the importance of checking the quality of predictions on an “unseen” data set cannot be neglected.

Heberger K, Racz A, Bajusz D. Which Performance Parameters Are Best Suited to Assess the Predictive Ability of Models? In Advances in QSAR Modeling (pp. 89-104). Springer; 2017.


Problems of QSAR modeling from small data sets

• The currently accepted practice of deriving a useful QSAR model is to test its predictive capability by both internal and external validation methods.

• In case of a small data set, a significant amount of information is lost due to the held out samples.

• Moreover, there may be a bias in descriptor selection due to a fixed composition of the small training set.

• In addition, presence of outliers with respect to both chemical and biological domains in the training data might seriously influence modeling of the data set.

• Thus, data set curation prior to model development is a very important step, especially for small data sets.


• The problem of model development from small data sets can be overcome to some extent through the method of double cross-validation (DCV).

• In this approach, the validation is done in two loops: in the inner loop, the training set is further divided into ‘n’ calibration and validation sets resulting in diverse compositions, which are utilized for model building and model selection, while the test set in the external loop is exclusively used for model assessment.

• This method also obviates the bias in descriptor selection due to a fixed composition of the training set.


Roy and Ambure, Chemometrics and Intelligent Laboratory Systems 159 (2016) 108–126
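The two-loop idea can be sketched in a few lines of code. The example below is only a toy illustration under stated assumptions: each candidate "model" is a one-descriptor least-squares line (not the GA-MLR procedure used by the authors), the inner splits are drawn at random, and the names `fit_line`, `mae` and `double_cv` are hypothetical. What it does show is the separation of roles: calibration and validation sets drive model building and selection, and the test set is touched only once for assessment.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def mae(x, y, model):
    """Mean absolute error of a fitted line on the given points."""
    slope, intercept = model
    return sum(abs(yi - (slope * xi + intercept)) for xi, yi in zip(x, y)) / len(y)


def double_cv(X, y, train_idx, test_idx, n_splits=20, n_val=2, seed=0):
    """Select one descriptor in the inner loop, assess it on the test set."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    val_error = [0.0] * n_desc

    # Inner loop: many calibration/validation splits of the training set
    for _ in range(n_splits):
        val = rng.sample(train_idx, n_val)
        val_set = set(val)
        cal = [i for i in train_idx if i not in val_set]
        for d in range(n_desc):
            model = fit_line([X[i][d] for i in cal], [y[i] for i in cal])
            val_error[d] += mae([X[i][d] for i in val], [y[i] for i in val], model)

    # Model selection uses validation errors only
    best = min(range(n_desc), key=lambda d: val_error[d])

    # Outer loop: refit on the whole training set, assess once on the test set
    final = fit_line([X[i][best] for i in train_idx], [y[i] for i in train_idx])
    test_mae = mae([X[i][best] for i in test_idx], [y[i] for i in test_idx], final)
    return best, test_mae
```

Because the descriptor is chosen from errors pooled over many differently composed calibration sets, the choice is less tied to one fixed training-set composition.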


Double cross-validation


• Another issue is the loss of information caused by selecting only a limited number of descriptors in the final models (especially multiple linear regression models), given the low number of data points available.

• Considering multiple models in a consensus approach might be a possible solution to this problem.

• The strength of a consensus approach is that the final result takes into account the different assumptions characterizing each model, encompassing different chemical features and their contributions allowing for a more reliable judgment in a complex situation.

• The drawback of one model can also be nullified by another model used in consensus predictions.

Roy et al., Journal of Chemometrics. 2018;32:e2992.


Present Work

• In this presentation, we report two user-friendly standalone software tools for executing the proposed workflow, which integrates small data set curation, exhaustive double cross-validation and a set of optimal model selection techniques, including consensus predictions, for model development, especially for small data sets.

• The software tools are available at https://dtclab.webs.com/software-tools


Small Dataset Curator

Small Data Set Curation

Small data set (chemically curated) → Duplicate analysis (descriptor-based) → Structural outliers removed → Response outliers removed → Activity cliffs analysis (using Student's t-test) → Final ready-to-use data set

Duplicate Analysis


• The duplicates are first identified simply based on the values of the given descriptors provided for all the data objects or compounds present in the data set.

• For a given pair of found duplicates, if their experimental response values are identical, then one of the compounds should be randomly deleted.

• However, if their experimental response values are different, then the following scenarios should be considered.

(a) If the experimental responses are nearly similar, then one of the duplicate compounds can be kept with the arithmetic average of the experimental values of the duplicates.

(b) If the experimental responses are significantly different, then such duplicates should always be removed, unless there is a possibility to confirm the correct experimental value from the original sources.
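The duplicate-handling rules above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the slide does not quantify "nearly similar", so the tolerance `tol` is a hypothetical parameter, the function name `curate_duplicates` is made up for this example, and the sketch keeps the first member of a duplicate group instead of deleting one at random.

```python
from collections import defaultdict


def curate_duplicates(descriptors, responses, tol=0.3):
    """Group compounds with identical descriptor rows and resolve duplicates.

    Returns (kept, removed) where kept is a list of (index, response) pairs
    and removed is a list of dropped indices.
    tol is an assumed cut-off for 'nearly similar' responses.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(descriptors):
        groups[tuple(row)].append(idx)  # identical descriptor values => duplicates

    kept, removed = [], []
    for members in groups.values():
        values = [responses[i] for i in members]
        if max(values) - min(values) <= tol:
            # Identical or nearly similar responses: keep one compound
            # and assign it the arithmetic average of the group
            kept.append((members[0], sum(values) / len(values)))
            removed.extend(members[1:])
        else:
            # Significantly different responses: drop the whole group
            removed.extend(members)
    return kept, removed
```

A compound without duplicates forms a group of one and passes through unchanged. Groups dropped under rule (b) are worth listing for manual checking against the original sources before they are discarded for good.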


Outlier Analysis


Response-range outliers are the few compounds whose experimental response values lie in a markedly different range from those of the rest of the data set. The ‘Small Dataset Curator’ software flags a compound as a response-range outlier if its experimental response value lies outside the range Mean ± k × SD (by default, k = 3), where the mean and standard deviation (SD) are computed from the experimental response values of the compounds.
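A minimal sketch of the Mean ± k × SD rule is given below. Whether the tool uses the sample or the population standard deviation is not stated on the slide; the sample SD (`statistics.stdev`) is assumed here, and the function name `response_outliers` is hypothetical.

```python
import statistics


def response_outliers(responses, k=3.0):
    """Indices of compounds whose response lies outside Mean +/- k * SD."""
    mean = statistics.mean(responses)
    sd = statistics.stdev(responses)  # sample SD (assumption)
    lower, upper = mean - k * sd, mean + k * sd
    return [i for i, value in enumerate(responses) if value < lower or value > upper]
```

One consequence of this assumption: with the sample SD, no value can lie more than (n − 1)/√n standard deviations from the mean, so with k = 3 the rule cannot flag anything in a set of 10 or fewer compounds. A smaller k may be worth considering for very small sets.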


Structural outliers are compounds that are extremely dissimilar in terms of chemical features from most compounds in the data set. The software computes a mean Euclidean distance score for every compound in the data set and considers a compound a structural outlier if its score lies outside the range Mean_ED ± k × SD_ED (by default, k = 3), where Mean_ED and the standard deviation SD_ED are computed from the mean distance scores of the compounds.
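The structural-outlier rule can be sketched in the same way. The example assumes that the descriptor rows are already on comparable scales and that the score of a compound is its mean Euclidean distance to all other compounds; neither detail is confirmed by the slide. The function names are hypothetical, and the sample SD is again an assumption.

```python
import math
import statistics


def mean_distance_scores(X):
    """Mean Euclidean distance of each compound to all other compounds."""
    n = len(X)
    scores = []
    for i in range(n):
        total = sum(math.dist(X[i], X[j]) for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores


def structural_outliers(X, k=3.0):
    """Indices whose mean distance score lies outside Mean_ED +/- k * SD_ED."""
    scores = mean_distance_scores(X)
    mean_ed = statistics.mean(scores)
    sd_ed = statistics.stdev(scores)  # sample SD (assumption)
    return [i for i, s in enumerate(scores) if abs(s - mean_ed) > k * sd_ed]
```

Here `X` is a list of descriptor rows, one per compound, and at least two compounds are required.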


Activity Cliff Analysis


• Activity cliffs are the regions where large changes in activity are observed for relatively small changes in the chemical structure. Thus, compounds showing activity cliffs have high structural similarity but a large difference in their biological activity or response values.

• The Small Dataset Curator version 1.0.0 software identifies compounds showing activity cliffs by employing Student’s t-test and, in some cases, simply by using the Euclidean distance as a similarity index. The analysis is performed in the following way:

• For each compound in the whole data set, identify the structurally closest compounds in the same set using the Euclidean distance score (normalized to the range 0 to 1) with a user-defined cut-off (default: < 0.10). Then, depending on the number of similar compounds found for each compound in the data set, one of the following three criteria is applied to identify the activity cliffs.


• Criterion 1: For a compound ‘i’, if the number of similar compounds found (N_s) is ≥ 3, then the software performs Student’s t-test to identify the activity cliffs. If the t_i value for compound ‘i’ is greater than the critical t-value at the 95% confidence level (degrees of freedom equal to N_s – 1), then the corresponding probability (p-value) is < 0.05. If p-value < 0.05 AND |Mean_similarCompounds – Response_i| ≥ threshold, then there is a possibility of the presence of an activity cliff.

• Criterion 2: For a compound ‘i’, if the number of similar compounds found (N_s) is equal to 2, then the tool calculates the mean of the response values (say, Response_similarComp1 and Response_similarComp2) of the two similar compounds. If |Mean_similarCompounds – Response_i| ≥ threshold (default = 1 log unit) AND |Mean_similarCompounds – Response_i| > |Response_similarComp1 – Response_similarComp2|, then there is a possibility of the presence of an activity cliff.

• Criterion 3: For a compound ‘i’, if only one similar compound is found, then the tool simply checks the following condition: |Response_similarComp1 – Response_i| ≥ threshold (default = 1 log unit). If true, then there is a possibility of the presence of an activity cliff.
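The three criteria can be sketched as a single decision function. Several details are assumptions, because the slide text does not give them: the t statistic is taken to be the one-sample form t_i = |Mean_similarCompounds − Response_i| / (SD / √N_s), the critical values are two-tailed 95% values from a standard t-table (the value for 10 degrees of freedom is reused for larger sets, which is slightly conservative), and the names `is_possible_cliff` and `similar_indices` are hypothetical.

```python
import math
import statistics

# Two-tailed critical t-values at the 95% confidence level, keyed by
# degrees of freedom (standard table values)
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
             7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def similar_indices(norm_dist_row, i, cutoff=0.10):
    """Neighbours of compound i given its row of normalized (0-1) distances."""
    return [j for j, d in enumerate(norm_dist_row) if j != i and d < cutoff]


def is_possible_cliff(response_i, similar_responses, threshold=1.0):
    """Apply criterion 1, 2 or 3 depending on the number of similar compounds."""
    ns = len(similar_responses)
    if ns == 0:
        return False  # no structurally similar compound, nothing to compare

    diff = abs(statistics.mean(similar_responses) - response_i)

    if ns >= 3:  # Criterion 1: t-test plus threshold
        sd = statistics.stdev(similar_responses)
        if sd == 0.0:
            return diff >= threshold  # identical neighbours: any shift is significant
        t_i = diff / (sd / math.sqrt(ns))  # assumed form of the t statistic
        t_crit = T_CRIT_95.get(ns - 1, T_CRIT_95[10])
        return t_i > t_crit and diff >= threshold

    if ns == 2:  # Criterion 2: threshold plus spread of the two neighbours
        first, second = similar_responses
        return diff >= threshold and diff > abs(first - second)

    return diff >= threshold  # Criterion 3: single neighbour
```

A flagged compound indicates only a possible activity cliff, in line with the wording of the criteria.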


Small Dataset Modeler


• ‘Small Dataset Modeler version 1.0.0’ implements exhaustive double cross-validation. In the inner loop, the modeling set is repeatedly split into calibration and validation sets. Because the modeling set is small (say, ‘n’ compounds), the best way is to derive all possible combinations (k) of a validation set (comprising ‘r’ compounds) and a calibration set (comprising ‘n–r’ compounds). In the tool, the user defines the number of compounds in the validation set (i.e., ‘r’), based on which all possible combinations of calibration and validation sets are generated.
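The exhaustive split can be illustrated with the standard library. The number of combinations is k = n! / (r! (n − r)!), which is what `math.comb(n, r)` returns. The generator name `exhaustive_splits` is hypothetical, and this is only a sketch of the enumeration step, not of the tool itself.

```python
from itertools import combinations
from math import comb


def exhaustive_splits(modeling_ids, r):
    """Yield every (calibration, validation) split with r validation compounds."""
    ids = list(modeling_ids)
    for validation in combinations(ids, r):
        held_out = set(validation)
        calibration = [c for c in ids if c not in held_out]
        yield calibration, list(validation)


# Example: a modeling set of 5 compounds with 2 held out per split
splits = list(exhaustive_splits(range(5), 2))
n_splits = comb(5, 2)  # number of splits expected from the formula above
```

The count grows quickly with n and r (for instance, `comb(30, 5)` is 142506), which is one reason the exhaustive scheme is practical mainly for small modeling sets.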


• The calibration sets are employed to develop multiple linear regression (MLR) models using the Genetic Algorithm-MLR (GA-MLR) technique, whereas the respective validation sets are utilized to estimate the prediction quality of the developed MLR models.

• For each developed model, the software provides all the important internal validation metrics, namely $R^2$, $Q^2_{\mathrm{LOO}}$, $Q^2_{\mathrm{LMO}}$ and $\mathrm{MAE}_{\mathrm{LOO}}$, and external validation metrics, namely $Q^2_{\mathrm{extF1}}$, $Q^2_{\mathrm{extF2}}$, $Q^2_{\mathrm{extF3}}$, the concordance correlation coefficient (CCC) and $\mathrm{MAE}_{\mathrm{test}}$, employing both 100% and 95% of the data points (the latter after removing the 5% of data with the highest prediction errors) for all the selected models.

• Note that the above-mentioned validation metrics are calculated for the models developed in the inner loop (using the calibration sets) as well as in the outer loop (using the corresponding training/modeling set and test set) of the exhaustive double cross-validation technique.

• Optionally, the software also generates a partial least squares (PLS) model corresponding to each MLR model.
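Two of the listed metrics are easy to illustrate: the MAE on 100% versus 95% of the data, and the concordance correlation coefficient. In the sketch below, the number of dropped points is rounded down (so nothing is dropped for fewer than 20 compounds), which is an assumption about a detail the slide leaves open. The CCC follows Lin's standard formula. The function names `mae_trimmed` and `ccc` are hypothetical.

```python
def mae_trimmed(y_obs, y_pred, drop_percent=5):
    """MAE after discarding the given percentage of largest absolute errors.

    drop_percent=0 gives the ordinary MAE on 100% of the data.
    """
    errors = sorted(abs(o - p) for o, p in zip(y_obs, y_pred))
    n_drop = len(errors) * drop_percent // 100  # rounded down (assumption)
    kept = errors[:len(errors) - n_drop]
    return sum(kept) / len(kept)


def ccc(y_obs, y_pred):
    """Concordance correlation coefficient (Lin's formula)."""
    n = len(y_obs)
    mean_o, mean_p = sum(y_obs) / n, sum(y_pred) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    s_oo = sum((o - mean_o) ** 2 for o in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_pred)
    return 2.0 * s_op / (s_oo + s_pp + n * (mean_o - mean_p) ** 2)
```

Comparing `mae_trimmed(y, y_hat, 0)` with `mae_trimmed(y, y_hat, 5)` shows how much of the error is carried by a few poorly predicted compounds.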


Finally, there are at least 5 recommended ways to select the top model (functionality also available in the software), which are as follows:

i) A model (MLR/PLS) with the lowest MAE (95%) in the validation set is selected.

ii) A model (MLR/PLS) with the lowest MAE (95%) (modeling set).

iii) A model (MLR/PLS) with the highest $Q^2_{\mathrm{LMO}}$ (modeling set).

iv) Consensus predictions employing the top-ranking models, which should be selected based on the MAE (95%) values in the respective validation sets. There are two variants of consensus predictions: a) a simple arithmetic average of the predictions from all the selected top models, and b) a weighted average of the predictions, assigning appropriate weights to the selected top models based on the mean absolute error obtained from leave-one-out cross-validation, MAEcv(95%). The model showing the lowest MAEcv(95%) gets the highest weight.

v) A pooled set of unique descriptors appearing in the top-ranking models (again selected based on the MAE (95%) values in the respective validation sets) is used for further model development. In the case of MLR, the best descriptor combinations out of the unique descriptors present in the ‘nm’ top-ranking models are determined. In the case of the PLS option, all descriptors selected in the top models are pooled together for a PLS run.
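The two consensus variants in option iv) can be sketched as below. The slide says only that the weights are based on MAEcv(95%) and that a lower error earns a higher weight; the inverse-MAE weighting used here is one simple scheme with that property and is an assumption, as are the function names.

```python
def consensus_average(predictions):
    """Simple arithmetic average over models.

    predictions: one list of predicted values per model, all of equal length.
    """
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


def consensus_weighted(predictions, mae_cv):
    """Weighted average with weights proportional to 1 / MAEcv (assumed scheme)."""
    inverse = [1.0 / m for m in mae_cv]
    total = sum(inverse)
    weights = [w / total for w in inverse]  # weights sum to 1
    return [sum(w * p for w, p in zip(weights, column))
            for column in zip(*predictions)]
```

For two models with MAEcv values of 0.5 and 1.5, the weights come out as 0.75 and 0.25, so the more accurate model dominates without silencing the other.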


Case Studies


Data set 1: 35 (Benzo-)Triazoles with reported toxicity values towards Pseudokirchneriella subcapitata
Data set 2: 49 (Benzo-)Triazoles with reported solubility in water
Data set 3: 29 esters with reported short-term toxicity towards Daphnia magna
Data set 4: 30 esters with reported acute toxicity towards Pimephales promelas
Data set 5: 48 nitrated polycyclic aromatic hydrocarbons with reported mutagenicity potency in TA100 (without the S9 activation system)
Data set 6: 92 cyclic sulfone (or sulfoxide) hydroxyethylamine derivatives as beta-secretase 1 (BACE1) inhibitors
Data set 7: 224 4-amino-5-methyl-4H-1,2,4-triazole-3-thiol derivatives as cyclin-dependent kinase 5/p25 (CDK5/p25) inhibitors

Gramatica et al., Journal of Computational Chemistry 2014, 35, 1036-1044.
Ambure and Roy, RSC Advances 2016, 6, 28171-28186.
Ambure and Roy, RSC Advances 2014, 4, 6702-6709.

[Result slides for Data set 1, Data set 2, Data set 3, Data set 4, Data set 5, Data set 6, and Data set 7 (Divisions 1 to 8); the tables and figures were not preserved in the text.]

Discussion


• After analyzing the results, we found that our proposed integrated approach was more consistent in the selection of an optimal model when compared to the conventional approach.

• We have simply compared the predictive performance of the QSAR models that were obtained from conventional techniques with the QSAR models obtained from our proposed approach.

• In this case study, a total of fourteen optimal models were expected to be obtained, including six models from data sets 1-6 and eight models from the 8 different divisions of data set 7.

• The predictive performance for all the models was evaluated based on the MAE (95% data) value for the respective true external set.


• Here, we have found that 13 out of 14 times the optimal model was obtained from our proposed approach, while only in one case (i.e. for data set 7, division 1) the optimal model was obtained from a conventional technique.

• Further, the fourteen optimal models obtained in this case study comprise six single MLR/PLS models, six consensus MLR/PLS models and two pooled PLS models. Moreover, data curation involving duplicate analysis, structural and response outlier determination, and activity-cliff analysis was found to be crucial in improving the quality of modeling, as observed with data sets 1, 2, and 4.

• Note that outliers and/or activity cliffs were found only in data sets 1, 2, 4 and 5, while no outliers or activity cliffs were found in data sets 3, 6, and 7.

[Figure: Small Dataset Curator and Small Dataset Modeler]

Conclusion


• In the present study, we have developed two user-friendly standalone software tools, Small Dataset Curator version 1.0.0 and Small Dataset Modeler version 1.0.0, which are freely available to download at https://dtclab.webs.com/software-tools.

• The Small Dataset Curator software helps with duplicate analysis, detection of response-range and structural outliers, and activity cliff analysis, which are highly recommended before performing QSAR modeling of small data sets.

• The Small Dataset Modeler software is designed to handle QSAR modeling of small data sets, and it provides functionalities including “exhaustive” double cross-validation and a set of optimal model selection techniques for model development.

• Moreover, we have performed seven case studies to find out which scheme, the conventional approaches or the proposed integrated approach, performs better in the selection of an optimal (MLR/PLS) model in terms of predictive performance checked on the true external set.

• After analyzing the results from all seven different data sets, we conclude that our proposed scheme is more consistent in the selection of an optimal model when compared to the conventional approaches.


Acknowledgement


University Grants Commission, India for financial assistance under the UPE II scheme

FCT/MCTES, Portugal

Polish National Science Center
