QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemoinformatics Summer School. Strasbourg, France 25 – 29 June 2012. And ESOF EuroScience Open Forum, Dublin,

Kamel Mansouri, Tine Ringsted, Viviana Consonni, Davide Ballabio, Roberto Todeschini

Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, P.za della Scienza 1 – 20126 Milano, Italy

Abstract: Persistent organic pollutants are highly bioaccumulative and have toxic effects on humans, wildlife and the environment. Their persistence has been studied experimentally and theoretically so that new chemicals can be evaluated and Persistent, Bioaccumulative and Toxic (PBT) compounds avoided. To fill data gaps, QSARs are increasingly used by the scientific community as an alternative to animal testing and are implemented in legislation (REACH).

The goal of this study was to predict the ready biodegradability of chemicals by QSAR modeling. The dataset used for this purpose was produced by the Japanese Ministry of International Trade and Industry (MITI), with experimental results obtained according to OECD test guideline 301C. Molecular descriptors were calculated with Dragon 6. Variable selection coupled with classification methods was applied to find the most predictive models, i.e. those with a low cross-validation error rate. The best models were then validated on the preselected test set to check their prediction reliability and for further analysis.

Experimental ready biodegradation data (MITI-I test) were collected for 1314 compounds [1].

A molecule was removed if:

+ it had a disconnected structure;

+ the experimental value did not agree with the classification given the BOD threshold of 60% (Fig1);

+ replicate values differed by more than 20%;

+ the classification would change if nitrification was taken into account.

After removal, 1055 molecules remained (356 ready biodegradable / 699 not ready biodegradable) (Fig2).
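The removal criteria above can be illustrated with a minimal Python sketch. It is not the authors' curation procedure: the record fields, the use of a "." in a SMILES string to detect disconnected structures, the use of the mean of the replicates, the ">= 60%" convention, and the reading of "20% difference" as 20 percentage points are all assumptions made for illustration. The nitrification criterion is not covered.

```python
from dataclasses import dataclass
from typing import List

BOD_THRESHOLD = 60.0         # % BOD threshold stated in the text
MAX_REPLICATE_SPREAD = 20.0  # assumption: difference in percentage points


@dataclass
class Record:
    name: str                    # hypothetical identifier
    smiles: str                  # hypothetical structure field
    bod_replicates: List[float]  # % BOD of each replicate measurement
    labelled_ready: bool         # class reported in the source database


def keep(record: Record) -> bool:
    """Return True if the record passes the three checks sketched here."""
    # In SMILES notation a '.' separates disconnected fragments.
    if "." in record.smiles:
        return False
    # Replicates that disagree too much are treated as unreliable.
    spread = max(record.bod_replicates) - min(record.bod_replicates)
    if spread > MAX_REPLICATE_SPREAD:
        return False
    # The reported class must match the class implied by the mean BOD
    # (assumption: the mean of the replicates is the reference value).
    mean_bod = sum(record.bod_replicates) / len(record.bod_replicates)
    return (mean_bod >= BOD_THRESHOLD) == record.labelled_ready


records = [
    Record("a", "CCO", [75.0, 80.0], True),
    Record("b", "[Na+].[Cl-]", [0.0, 0.0], False),
    Record("c", "c1ccccc1", [30.0, 65.0], False),
    Record("d", "CCCl", [70.0, 72.0], False),
]
curated = [r.name for r in records if keep(r)]
```

In this toy input only record "a" survives: "b" is disconnected, "c" has replicates that differ too much, and "d" is labelled not ready although its BOD exceeds the threshold.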

Descriptors: Different blocks of molecular descriptors were initially calculated using Dragon 6 [3]: 2D atom pairs, topological indices, ring descriptors, constitutional indices, functional groups, 2D matrix-based descriptors, atom-centered fragments and atom-type E-state indices. Highly correlated, constant and near-constant descriptors were removed automatically using the same software.
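The descriptor pre-filtering described above was done inside Dragon, and its exact rules and thresholds are not given here. The following Python sketch only illustrates the general idea; the function names and both threshold values are assumptions.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_descriptors(
    columns: Dict[str, List[float]],
    max_same_fraction: float = 0.8,  # assumed near-constant threshold
    max_abs_corr: float = 0.95,      # assumed correlation threshold
) -> List[str]:
    """Return the names of descriptors kept after both filters."""
    kept: List[str] = []
    for name, values in columns.items():
        # Constant or near-constant: one value covers most molecules.
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) >= max_same_fraction:
            continue
        # Highly correlated with a descriptor that was already kept.
        if any(abs(pearson(values, columns[k])) >= max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept


toy = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with d1
    "d3": [0.0, 0.0, 0.0, 0.0, 0.0],   # constant
    "d4": [0.0, 0.0, 0.0, 0.0, 1.0],   # near constant
    "d5": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept_names = filter_descriptors(toy)
```

Of each correlated pair the sketch keeps the descriptor that comes first in the input. This is an arbitrary choice, and other rules are possible.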

Variable selection: In Matlab, genetic algorithms (GA) [4] were coupled with each classification method (SVM, KNN, PLSDA). Two filtering steps were performed to select the best descriptors:

+ GA selection was first run on each descriptor block separately, then on the merged set of the descriptors retained from all blocks;

+ the frequency of selection over 100 GA runs was used to rank the descriptors by importance, and only the 100 most relevant ones were kept for the final modeling step.
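The frequency-based ranking in the second step can be sketched as follows. The genetic algorithm itself is not reproduced. The sketch assumes that each GA run returns the list of descriptor names it selected; the function name, the tie-breaking rule and the descriptor labels in the example are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def rank_by_selection_frequency(
    runs: Iterable[Sequence[str]], n_keep: int
) -> List[str]:
    """Rank descriptors by how many runs selected them and keep the top n_keep."""
    counts: Counter = Counter()
    for selected in runs:
        counts.update(set(selected))  # count a descriptor once per run
    # Most frequently selected first; ties are broken alphabetically
    # (an arbitrary choice, made only for reproducibility).
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return ranked[:n_keep]


# Three hypothetical GA runs; the study used 100 runs and kept 100 descriptors.
ga_runs = [
    ["nN", "BLI"],
    ["nN", "AMW"],
    ["nN", "BLI", "SssN"],
]
top_two = rank_by_selection_frequency(ga_runs, n_keep=2)
```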

Validation of models: Models were validated by 5-fold cross-validation and on a test set. The test set was chosen by randomly splitting the initial data set into 20% test and 80% training molecules while keeping the balance between ready biodegradable and not ready biodegradable compounds. The training set contained 837 molecules and the test set 218 molecules.
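A class-balanced split followed by 5-fold cross-validation can be sketched in plain Python as shown below. This illustrates the validation scheme and is not the procedure used in the study: the function names, the seeds and the rounding rule are assumptions. With an exact 20% fraction per class the sketch yields 211 test molecules, not the 218 reported above, so the original split evidently followed a slightly different rule.

```python
import random
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Split indices into train and test, preserving the class ratio."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train: List[int] = []
    test: List[int] = []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(
    labels: Sequence[int], indices: Sequence[int], n_folds: int, seed: int
) -> List[List[int]]:
    """Distribute the given indices over n_folds class-balanced folds."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal the members of each class round-robin over the folds.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds


# 1 = ready biodegradable (356), 0 = not ready biodegradable (699)
labels = [1] * 356 + [0] * 699
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
cv_folds = stratified_folds(labels, train_idx, n_folds=5, seed=0)
```

In each cross-validation round one fold is held out for prediction and the remaining four are used for fitting.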

[Figure residue. Plot axis: "Number of molecules" (0 to 80); legend: "28 days", "<28 days". Diagram labels: QSAR, SVM, KNN, PLS-DA.]

Model ID  Descriptors  5-fold CV                 Test
                       ER cv   Spec.   Sens.     ER test  Spec.   Sens.
SVM_1     20           0.151   0.775   0.924     0.135    0.806   0.925
SVM_2     23           0.153   0.785   0.910     0.131    0.806   0.932
SVM_3     24           0.156   0.775   0.913     0.131    0.819   0.918

Model ID  Descriptors  LVs  Fit                       5-fold CV                 Test
                            ER fit  Spec.   Sens.     ER cv   Spec.   Sens.     ER test  Spec.   Sens.
PLSDA_1   26           9    0.140   0.887   0.834     0.141   0.891   0.826     0.145    0.861   0.849
PLSDA_2   28           9    0.144   0.891   0.821     0.142   0.887   0.828     0.145    0.847   0.863
PLSDA_3   23           5    0.144   0.880   0.832     0.141   0.884   0.834     0.148    0.833   0.870

Model ID  Descriptors  Distance    K  5-fold CV                 Test
                                      ER cv   Spec.   Sens.     ER test  Spec.   Sens.
KNN_1     17           Euclidean   6  0.136   0.859   0.870     0.121    0.847   0.911
KNN_2     17           City block  6  0.139   0.852   0.870     0.138    0.847   0.877
KNN_3     15           City block  8  0.141   0.849   0.870     0.142    0.806   0.911
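The poster does not define the error rate (ER). The tabulated values are, however, consistent with ER = 1 - (Sens. + Spec.) / 2, i.e. one minus the average of the two class-wise accuracies. The short Python check below makes that assumption explicit and compares it with the first test-set row of each table.

```python
def error_rate(sensitivity: float, specificity: float) -> float:
    """Assumed definition: one minus the mean of sensitivity and specificity."""
    return 1.0 - (sensitivity + specificity) / 2.0


# (Sens., Spec., reported ER test) copied from the test-set columns
test_rows = {
    "SVM_1": (0.925, 0.806, 0.135),
    "PLSDA_1": (0.849, 0.861, 0.145),
    "KNN_1": (0.911, 0.847, 0.121),
}

# Largest gap between the recomputed and the reported error rate
max_gap = max(
    abs(error_rate(sens, spec) - reported)
    for sens, spec, reported in test_rows.values()
)
```

The gap stays within the rounding of the three-decimal table entries. This supports the assumed definition but does not prove it.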

Acknowledgements: The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 238701 of the project Marie Curie ITN Environmental Chemoinformatics (ECO-ITN). http://www.eco-itn.eu

References:

[1] Chemical Risk Information Platform (CHRIP), National Institute of Technology and Evaluation, Japan. http://www.safe.nite.go.jp/english/kizon/KIZON_start_hazkizon.htm

[2] Chih-Chung Chang and Chih-Jen Lin, LIBSVM 3.1. http://www.csie.ntu.edu.tw/~cjlin/libsvm

[3] Dragon 6. Talete srl, Milano, Italy. http://www.talete.mi.it

[4] Leardi, R., Lupianez, A., 1998. Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemometr. Intell. Lab. 41, 195–207.

$$\mathrm{BOD} = \frac{\text{mg O}_2\ \text{uptake by test substance} - \text{mg O}_2\ \text{uptake by blank}}{\text{mg test substance in vessel}}$$

Table1: Selected best models using GA-KNN

Table2: Selected best models using GA-SVM

Table3: Selected best models using GA-PLSDA

Fig1: Distribution of BOD values in ready biodegradable compounds.

Fig2: Multidimensional scaling plot

The number of nearest neighbors (K) was optimized during the GA calculations to reach the lowest cross-validation error rate (ER cv). The most frequently selected descriptors are: Kier benzene-likeliness (BLI), nb. of atoms of type 'sssN', sum of 'dssC' E-states, nb. of subst. benzene C(sp2) and nb. of ring tertiary C(sp3).
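A K-nearest-neighbors classifier with the two distance measures listed for the KNN models (Euclidean and city block) can be sketched as below. This is a generic illustration, not the Matlab code used in the study: the function names, the class labels "RB" and "NRB", and the tie-breaking rule for even K are assumptions.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(
    train_x: List[Vector],
    train_y: List[str],
    query: Vector,
    k: int,
    distance: Callable[[Vector, Vector], float],
) -> str:
    """Assign the majority class among the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote the class that reached the tied count via the
    # nearest neighbors comes first (assumed tie-breaking rule).
    return votes.most_common(1)[0][0]


# Toy two-descriptor data; "RB" = ready biodegradable, "NRB" = not ready
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["RB", "RB", "RB", "NRB", "NRB", "NRB"]
near_origin = knn_predict(train_x, train_y, (0.5, 0.5), k=3, distance=euclidean)
far_corner = knn_predict(train_x, train_y, (5.0, 5.5), k=3, distance=city_block)
```

In practice descriptors are usually scaled before distances are computed. Whether and how this was done in the study is not stated here.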

The number of PLSDA latent variables (LVs) was optimized during the GA calculations to reach the lowest cross-validation error rate (ER cv). The most frequently selected descriptors are: R-CX-R, nb. of atoms of type 'sssN', spectral mean absolute deviation from Laplace matrix, presence of C-Cl at Topo. Dist. 1, eccentricity, nb. of N atoms, nb. of (thio-) carbamates (aliphatic), average Randic index from Burden matrix weighted by mass and Cl attached to C1(sp3).

The SVM results were obtained using the LIBSVM 3.1 C library compiled in Matlab [2]. The kernel used was the radial basis function (RBF) with the default parameters defined in the library. The most frequently selected descriptors are: average MW, nb. of terminal primary C(sp3), mean first ionisation pot., nb. of N atoms, sum of 'aasC' E-states, nb. of heteroatoms, nb. of esters (aromatic), intrinsic state pseudoconnectivity index and freq. of C-P at Topo. Dist. 2.
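For reference, the radial basis function kernel has the form K(x, y) = exp(-gamma * ||x - y||^2). The Python sketch below evaluates it. It assumes, following the LIBSVM documentation, that the default gamma is 1 divided by the number of features. The function names are hypothetical, and the SVM training itself is not reproduced.

```python
from math import exp
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Radial basis function kernel exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


def default_gamma(n_features: int) -> float:
    """Default gamma as documented for LIBSVM: 1 / number of features."""
    return 1.0 / n_features


# Model SVM_1 uses 20 descriptors, which would give gamma = 0.05
gamma_svm_1 = default_gamma(20)
similarity = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5)
```

The kernel equals 1 for identical descriptor vectors and decays towards 0 as they move apart; gamma controls how fast it decays.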