
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemoinformatics Summer School. Strasbourg, France 25 – 29 June 2012. And ESOF EuroScience Open Forum, Dublin,


QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS

Kamel Mansouri, Tine Ringsted, Viviana Consonni, Davide Ballabio, Roberto Todeschini

Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, P.za della Scienza 1 – 20126 Milano, Italy

Persistent organic pollutants are highly bioaccumulative, with toxic effects on humans, wildlife and the environment. Their persistence has been studied experimentally and theoretically in the evaluation of new chemicals, with the aim of avoiding Persistent, Bioaccumulative and Toxic (PBT) compounds. To fill data gaps, QSARs are increasingly used by the scientific community as an alternative to animal testing and have been implemented in legislation (REACH).

The goal of this study was to predict the ready biodegradability of chemicals by QSAR modeling. The dataset used for this purpose was produced by the Japanese Ministry of International Trade and Industry (MITI), with experimental results obtained according to OECD test guideline 301C. Molecular descriptors were calculated with Dragon 6. Variable selection coupled with classification methods was applied to find the most predictive models with a low cross-validation error rate. The best models were then validated on the preselected test set to check their prediction reliability and for further analysis.

Ready biodegradation data (MITI-I test) were collected for 1314 compounds [1].

A molecule was removed if:

+ it had a disconnected structure;
+ its experimental value did not agree with the classification at the BOD threshold of 60% (Fig1);
+ its replicate values differed by more than 20%;
+ its classification would change if nitrification was taken into account.

After removal, 1055 molecules remained (356 ready biodegradable / 699 not ready biodegradable) (Fig2).
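The removal criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Record` fields are hypothetical, the replicate mean is compared with the threshold, "more than 20% difference" is read as 20 percentage points between the highest and lowest replicate, and a disconnected structure is detected by the `.` fragment separator in a SMILES string. The poster does not specify any of these details.

```python
from dataclasses import dataclass

BOD_THRESHOLD = 60.0         # classification threshold in % (from the text)
MAX_REPLICATE_SPREAD = 20.0  # assumed to mean percentage points between replicates


@dataclass
class Record:
    # All field names are hypothetical; the poster does not describe a data layout.
    smiles: str
    bod_replicates: list             # replicate BOD values, in %
    labelled_ready: bool             # class reported in the source database
    nitrification_flips_class: bool  # True if accounting for nitrification changes the class


def keep(rec):
    """Return True if the record passes all four removal criteria."""
    # '.' separates disconnected fragments in SMILES notation.
    if "." in rec.smiles:
        return False
    # Replicates that disagree too much are discarded.
    if max(rec.bod_replicates) - min(rec.bod_replicates) > MAX_REPLICATE_SPREAD:
        return False
    # The experimental value must agree with the reported class (assumption: mean, >=).
    mean_bod = sum(rec.bod_replicates) / len(rec.bod_replicates)
    if (mean_bod >= BOD_THRESHOLD) != rec.labelled_ready:
        return False
    # Records whose class depends on nitrification are discarded.
    if rec.nitrification_flips_class:
        return False
    return True
```

A cleaned data set would then be `[r for r in records if keep(r)]` for some list `records` of such entries.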

Descriptors: Different blocks of molecular descriptors were initially calculated using Dragon6 [3]: 2D atom pairs, topological indices, ring descriptors, constitutional indices, functional groups, 2D matrix-based descriptors, atom-centered fragments and atom-type E-state indices.

Highly correlated, constant and near-constant descriptors were removed automatically using the same software.
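The pruning of constant, near-constant and highly correlated descriptors was done inside Dragon6, whose exact rules are not given here. The sketch below only illustrates the general idea in plain Python: the variance cutoff used as a stand-in for "near constant" and the correlation cutoff are hypothetical values, not the software's settings.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def filter_descriptors(columns, min_variance=1e-4, max_abs_corr=0.95):
    """Return the names of descriptors kept after pruning.

    columns: dict mapping descriptor name -> list of values (one per molecule).
    Both thresholds are illustrative assumptions.
    """
    kept = []
    for name, values in columns.items():
        # Drop constant and near-constant descriptors (variance used as a proxy).
        if pvariance(values) <= min_variance:
            continue
        # Drop a descriptor that is highly correlated with one already kept.
        if any(abs(pearson(values, columns[k])) >= max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept
```

With this greedy scheme, the first descriptor of each correlated pair is the one retained, so the result depends on column order.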

Variable selection: Genetic algorithms (GA) [4] were run in Matlab for each classification method (SVM, KNN, PLSDA). Two filters were applied to select the best descriptors:

+ GA selection was performed first on each descriptor block separately, then on the merged set of descriptors retained from all blocks.
+ The frequency of selection over 100 GA runs was used to rank the descriptors by importance, and only the 100 most relevant ones were kept for the final modeling step.
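The frequency-based ranking can be illustrated as follows. The GA itself is not reproduced; the sketch assumes each GA run simply yields a list of selected descriptor names, and the function name and tie-breaking by name are illustrative choices.

```python
from collections import Counter


def rank_by_selection_frequency(runs, top_n=100):
    """Rank descriptors by how many GA runs selected them; keep the top_n.

    runs: list of iterables, each holding the descriptor names chosen in one run.
    """
    # Count each descriptor at most once per run.
    counts = Counter(name for selected in runs for name in set(selected))
    # Most frequently selected first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]
```

For example, `rank_by_selection_frequency([["nN", "BLI"], ["nN", "MW"], ["nN", "BLI"]], top_n=2)` keeps `nN` (selected in three runs) and `BLI` (selected in two).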

Validation of models: 5-fold cross-validation, together with a test set chosen by randomly splitting the initial data set into 20% test and 80% training while keeping the balance between ready biodegradable and not ready biodegradable compounds. The training set contained 837 molecules and the test set 218 molecules.
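A class-balanced random split followed by 5-fold partitioning of the training set might look like the sketch below. It is not the authors' procedure: applied to the 356/699 class counts it yields 844 training and 211 test molecules rather than the reported 837/218, so the original split evidently used a slightly different rule. Function names and the seed are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test, sampling the same fraction from each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k=5, seed=0):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]
```

In cross-validation, each fold serves once as the validation part while the remaining four folds are used for fitting; the test set stays untouched until the final check.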

[Fig1 histogram: y-axis "Number of molecules" (0 to 80); series: 28 days, <28 days]

QSAR classification methods: SVM, KNN, PLS-DA

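| Model ID | Descriptors | ER cv (5f-CV) | Spec. (5f-CV) | Sens. (5f-CV) | ER test | Spec. (test) | Sens. (test) |
|---|---|---|---|---|---|---|---|
| SVM_1 | 20 | 0.151 | 0.775 | 0.924 | 0.135 | 0.806 | 0.925 |
| SVM_2 | 23 | 0.153 | 0.785 | 0.910 | 0.131 | 0.806 | 0.932 |
| SVM_3 | 24 | 0.156 | 0.775 | 0.913 | 0.131 | 0.819 | 0.918 |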

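| Model ID | Descriptors | LVs | ER fit | Spec. (fit) | Sens. (fit) | ER cv (5f-CV) | Spec. (5f-CV) | Sens. (5f-CV) | ER test | Spec. (test) | Sens. (test) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PLSDA_1 | 26 | 9 | 0.140 | 0.887 | 0.834 | 0.141 | 0.891 | 0.826 | 0.145 | 0.861 | 0.849 |
| PLSDA_2 | 28 | 9 | 0.144 | 0.891 | 0.821 | 0.142 | 0.887 | 0.828 | 0.145 | 0.847 | 0.863 |
| PLSDA_3 | 23 | 5 | 0.144 | 0.880 | 0.832 | 0.141 | 0.884 | 0.834 | 0.148 | 0.833 | 0.870 |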

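| Model ID | Descriptors | Distance | K | ER cv (5f-CV) | Spec. (5f-CV) | Sens. (5f-CV) | ER test | Spec. (test) | Sens. (test) |
|---|---|---|---|---|---|---|---|---|---|
| KNN_1 | 17 | Euclidean | 6 | 0.136 | 0.859 | 0.870 | 0.121 | 0.847 | 0.911 |
| KNN_2 | 17 | City block | 6 | 0.139 | 0.852 | 0.870 | 0.138 | 0.847 | 0.877 |
| KNN_3 | 15 | City block | 8 | 0.141 | 0.849 | 0.870 | 0.142 | 0.806 | 0.911 |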
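The poster does not define the error rate (ER). The tabulated values are, however, consistent to within rounding with ER = 1 − (Spec. + Sens.)/2, that is, one minus the mean of specificity and sensitivity. The short sketch below encodes that inferred relation; treat it as a reading of the numbers, not as the authors' stated definition.

```python
def error_rate(specificity, sensitivity):
    """Class-averaged error rate: one minus the mean of specificity and sensitivity.

    This relation is inferred from the tabulated values, not defined on the poster.
    """
    return 1.0 - (specificity + sensitivity) / 2.0


# Compare with two tabulated rows (SVM_1 and KNN_1, 5-fold cross-validation).
for model, spec, sens, reported in [("SVM_1", 0.775, 0.924, 0.151),
                                    ("KNN_1", 0.859, 0.870, 0.136)]:
    print(model, round(error_rate(spec, sens), 4), "reported:", reported)
```

Because the two classes are unbalanced (356 versus 699 molecules), an average over classes weighs errors on the smaller ready biodegradable class as heavily as errors on the larger one.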


Acknowledgements: The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 238701 of the project Marie Curie ITN Environmental Chemoinformatics (ECO-ITN). http://www.eco-itn.eu

References:

[1] Chemical Risk Information Platform (CHRIP), National Institute of Technology and Evaluation, Japan. http://www.safe.nite.go.jp/english/kizon/KIZON_start_hazkizon.htm

[2] Chang, C.-C., Lin, C.-J. LIBSVM 3.1. http://www.csie.ntu.edu.tw/~cjlin/libsvm

[3] Dragon6. Talete srl, Milano, Italy. http://www.talete.mi.it

[4] Leardi, R., Lupianez, A., 1998. Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemometr. Intell. Lab. 41, 195–207.

\[
\mathrm{BOD} = \frac{\text{mg O}_2 \text{ uptake by test substance} - \text{mg O}_2 \text{ uptake by blank}}{\text{mg test substance in vessel}}
\]
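As a worked illustration of the BOD expression, the sketch below computes BOD in mg O2 per mg of test substance. The conversion to a percentage by dividing by the theoretical oxygen demand (ThOD) follows the OECD 301 guidelines and is not stated on the poster; the function and argument names are hypothetical.

```python
def bod(o2_uptake_test_mg, o2_uptake_blank_mg, substance_mg):
    """BOD in mg O2 per mg test substance, corrected for the blank."""
    if substance_mg <= 0:
        raise ValueError("amount of test substance must be positive")
    return (o2_uptake_test_mg - o2_uptake_blank_mg) / substance_mg


def percent_degradation(bod_value, thod):
    """Degradation in % of the theoretical oxygen demand (ThOD, mg O2 per mg).

    Assumption taken from the OECD 301 guidelines, not from the poster.
    """
    return 100.0 * bod_value / thod


# Example with made-up numbers: 30 mg O2 uptake, 6 mg in the blank, 20 mg substance.
value = bod(30.0, 6.0, 20.0)             # 1.2 mg O2 per mg
print(percent_degradation(value, 2.0))   # compare against the 60% threshold
```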

Table2: Selected best models using GA-SVM

Table1: Selected best models using GA-KNN

Table3: Selected best models using GA-PLSDA

Fig2: Multidimensional scaling plot

Fig1: Distribution of BOD values in ready biodegradable compounds.

The number of nearest neighbors (K) was optimized during the GA calculations to reach the lowest cross-validation error rate (ER cv). The most frequently selected descriptors are: Kier benzene-likeliness index (BLI), number of atoms of type 'sssN', sum of 'dssC' E-states, number of substituted benzene C(sp2) and number of ring tertiary C(sp3).
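A K-nearest-neighbors classifier with a selectable distance (Euclidean or city block, the two metrics appearing in the KNN table) can be sketched as below. This is a generic majority-vote version and not the authors' Matlab implementation; in particular, the poster does not say how ties are resolved for even K, so the tie-breaking here is an assumption.

```python
from collections import Counter


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=6, distance=euclidean):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # Assumption: on a tie, the class met first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]
```

Optimizing K, as described above, amounts to repeating the cross-validation for each candidate K and keeping the value with the lowest ER cv. Descriptors would normally be scaled before computing distances, although the poster does not mention the scaling used.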

The number of PLSDA latent variables (LVs) was optimized during the GA calculations to reach the lowest cross-validation error rate (ER cv). The most frequently selected descriptors are: R-CX-R, number of atoms of type 'sssN', spectral mean absolute deviation from the Laplace matrix, presence of C-Cl at topological distance 1, eccentricity, number of N atoms, number of aliphatic (thio-)carbamates, average Randic index from the Burden matrix weighted by mass, and Cl attached to C1(sp3).

The SVM results were obtained using the LIBSVM 3.1 C library compiled in Matlab [2]. The kernel used was the radial basis function, with the default parameters defined in the library. The most frequently selected descriptors are: average molecular weight (MW), number of terminal primary C(sp3), mean first ionisation potential, number of N atoms, sum of 'aasC' E-states, number of heteroatoms, number of aromatic esters, intrinsic state pseudoconnectivity index and frequency of C-P at topological distance 2.
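For reference, the radial basis function kernel in LIBSVM has the form exp(−γ‖u − v‖²), and the library's default γ is 1 divided by the number of features. The sketch below evaluates that kernel in plain Python; it only illustrates the kernel and does not train an SVM or reproduce the reported models.

```python
import math


def rbf_kernel(u, v, gamma=None):
    """Radial basis function kernel exp(-gamma * ||u - v||^2).

    If gamma is not given, 1 / number_of_features is used, matching the
    LIBSVM default.
    """
    if gamma is None:
        gamma = 1.0 / len(u)
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

With the default γ, a model built on 20 descriptors (as for SVM_1) would use γ = 0.05.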