QSAR AND CHEMOMETRIC APPROACHES TO THE SCREENING OF POPs FOR ENVIRONMENTAL PERSISTENCE AND LONG RANGE TRANSPORT

Paola Gramatica (a), Ester Papa (a) and Stefano Pozzi (b)

a) Department of Structural and Functional Biology, University of Insubria - Varese (Italy)

b) Laboratory of Environmental Studies (SPAA) - Lugano (Switzerland)

e-mail: [email protected] http://fisio.dipbsf.uninsubria.it/qsar/

QSAR Research Unit

D 13

[Figure 2. Principal Component Analysis, PC1 vs. PC2. Points: chemical IDs 1-81; loadings: mp, logKow, logKoa, logVP, logH, logSol, MW.]

Global Mobility Index

The inherent tendency of compounds towards global mobility is regulated mainly by volatility, water solubility, Kow and Koa. A Global Mobility Index is obtained from the linear combination, by PCA, of the physico-chemical properties: the PC1 score (EV% = 74.6%) in Fig. 2. The chemicals on the right side of the plot are those with the greatest tendency towards mobility.

Introduction

The need for a scientific foundation for the criteria used to evaluate the persistence and long-range transport (LRT) potential of POPs (Persistent Organic Pollutants) in the environment has recently been highlighted [1]. Persistence is a necessary condition for long-range transport; however, persistent chemicals are not necessarily subject to long-range transport: the inherent tendency of compounds towards global mobility must also be taken into account. The half-life of organic pollutants in various compartments is among the most commonly used criteria for studying persistence, but such studies are severely hindered by the limited availability of experimental degradation half-life data; there is thus an incentive to develop reliable procedures, such as QSAR/QSPR, to estimate the missing data. The same is true for the physico-chemical properties particularly relevant in determining mobility potential [2]. Since the LRT potential of POPs results from the simultaneous influence of their persistence in the environment and their inherent tendency to mobility, finding the best combination of chemical properties minimizing LRT is a multicriteria problem that can be approached through MultiCriteria Decision-Making (MCDM) techniques [3]: procedures for combining the magnitude of several properties into a single quantitative measure of overall quality.

For modelling and predicting half-life we used a data set of 141 organic compounds for which experimental half-life values in different compartments are available from Howard [4], Mackay [5] and Rodan [6]. The molecular structure was represented by a wide set of molecular descriptors [7] calculated with software developed by R. Todeschini [7,8]: Constitutional descriptors (56), Topological descriptors (69), Walk counts (20), Bcut descriptors (64), Galvez indices (21), 2D Autocorrelations (96), Charge descriptors (7), Aromaticity descriptors (4), Molecular profiles (40), Geometrical descriptors (18), 3D MoRSE descriptors (160), WHIM descriptors [9] (99), GETAWAY descriptors (196), Empirical descriptors (3).

The best subset of variables for modelling half-life was selected by a Genetic Algorithm (GA-VSS) approach, in which the response is obtained by ordinary least squares (OLS) regression. All calculations were performed using the leave-one-out (LOO) and leave-more-out (LMO) procedures, together with scrambling of the responses, for the validation of the models (MOBY-DIGS package) [10].
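The response-scrambling check mentioned in the Introduction can be illustrated with a minimal sketch. It is not the MOBY-DIGS implementation: the data are synthetic, the function names `ols_r2` and `scrambled_r2` are hypothetical, and the genetic-algorithm variable selection itself is not reproduced. The idea is that an OLS model refitted on randomly permuted responses should lose almost all of its fit; if it does not, the original correlation is likely due to chance.

```python
import numpy as np

def ols_r2(X, y):
    """R^2 of an ordinary least squares fit with an intercept term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def scrambled_r2(X, y, n_rounds=200, seed=0):
    """R^2 values obtained after randomly permuting the responses."""
    rng = np.random.default_rng(seed)
    return np.array([ols_r2(X, rng.permutation(y)) for _ in range(n_rounds)])

# Synthetic illustration only: 60 "chemicals", 3 "descriptors" (assumed values).
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.3, size=60)

r2_true = ols_r2(X, y)
r2_scrambled = scrambled_r2(X, y)
print(f"R2 with true responses:      {r2_true:.3f}")
print(f"R2 with scrambled responses: {r2_scrambled.mean():.3f} (mean of 200 rounds)")
```

A model is considered free of chance correlation when the scrambled values stay far below the value obtained with the true responses.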

[Figure 1. Principal Component Analysis, PC1 vs. PC2 (Cum. E.V.: 74.7%; PC1: 48.6%). Points: chemical IDs; loadings: Soil, Groundwater, Air, Surface water.]

General Persistence Index

The Principal Component Analysis (PCA) of the experimental and predicted half-lives of 141 pollutants in various media allows the chemicals to be ranked according to their overall half-life and their relative persistence in different media. A general Persistence Index is obtained from the linear combination of half-life data in four environmental media (PC1 in Fig. 1). The chemicals on the right are the most globally persistent in the various compartments.
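As a rough illustration of how a PC1 score can serve as a single index, the sketch below autoscales a small table and projects it onto the first principal component. The half-life values are invented, the use of log-transformed data is an assumption, and `pc1_index` is a hypothetical helper, not the software used for the poster.

```python
import numpy as np

def pc1_index(data):
    """Return PC1 scores, PC1 loadings and explained variance (%) of autoscaled data."""
    Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)           # correlation matrix of the raw data
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, 0]
    if loadings.sum() < 0:                   # orient PC1 so high values give high scores
        loadings = -loadings
    scores = Z @ loadings
    explained = 100.0 * eigvals[0] / eigvals.sum()
    return scores, loadings, explained

# Hypothetical log half-lives; columns: soil, groundwater, air, surface water.
log_half_life = np.array([
    [1.0, 1.3, 0.2, 0.9],
    [2.1, 2.4, 0.8, 2.0],
    [3.0, 3.2, 1.1, 2.8],
    [1.5, 1.9, 0.5, 1.4],
    [2.6, 2.9, 1.4, 2.5],
])

scores, loadings, explained = pc1_index(log_half_life)
ranking = np.argsort(scores)[::-1]  # most persistent chemical first
print("PC1 explained variance (%):", round(explained, 1))
print("Row indices, most to least persistent:", ranking)
```

Fixing the sign of the loadings matters because an eigenvector is only defined up to sign; without it, "right side of the plot" could mean either the most or the least persistent chemicals.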

PERSISTENCE MOBILITY

Screening of Long Range Transport Potential

Finding the best combination of chemical properties minimizing LRT can be approached by MultiCriteria Decision-Making (MCDM) techniques: procedures for combining the magnitude of several properties into a single quantitative measure of overall quality. The utility function is chosen here as the best combined criteria function and is applied to the properties most relevant in determining LRT, taken as the criteria f(x), all expressed as the minimum: the general Persistence Index (Fig. 1), derived from the PCA combination of half-lives in four environmental compartments; the Mobility Index (Fig. 2), derived from the cited physico-chemical properties; and the Air Half-life, which is considered particularly relevant in determining LRT. The k = 3 properties, equally weighted (by the weights λi) and summed in the utility function according to the reported formula, allow the studied chemicals to be ranked according to their LRT potential, giving an LRT index, F(x).

The chemicals with the lowest utility (F(x) near 0), highlighted in Fig. 3, exhibit the highest LRT potential, while those with F(x) near 1 have the lowest potential for LRT.

$$F(x) = \sum_{i=1}^{k} \lambda_i \, f_i(x)$$

[Figure 3. Utility (LRT index) versus chemical ID for the 81 chemicals. Classes: High LRT potential (Class 1), Medium LRT potential (Class 2), Low LRT potential (Class 3).]
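The utility function can be sketched in a few lines. The poster states only that the three criteria are expressed as the minimum and equally weighted; the linear rescaling of each criterion to [0, 1] used here (`desirability_min`), the function names and the example values are assumptions made for illustration.

```python
def desirability_min(values):
    """Scale a criterion to [0, 1] so that the smallest value scores 1 (optimum = minimum)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(hi - v) / (hi - lo) for v in values]

def utility(criteria, weights=None):
    """Weighted sum of scaled criteria, F(x) = sum_i w_i * f_i(x), one value per chemical."""
    k = len(criteria)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights, as in the poster
    scaled = [desirability_min(c) for c in criteria]
    n = len(criteria[0])
    return [sum(w * f[j] for w, f in zip(weights, scaled)) for j in range(n)]

# Hypothetical values for four chemicals (not taken from the poster).
persistence_index = [-2.0, 0.5, 1.0, 3.0]
mobility_index = [-1.0, 0.0, 2.0, 4.0]
air_half_life = [0.1, 1.0, 2.0, 3.0]

lrt_index = utility([persistence_index, mobility_index, air_half_life])
for chem_id, f_x in enumerate(lrt_index, start=1):
    print(f"chemical {chem_id}: F(x) = {f_x:.3f}")
```

With this scaling, a chemical that is lowest on all three criteria gets F(x) = 1 (lowest LRT potential) and one that is highest on all three gets F(x) = 0 (highest LRT potential), matching the interpretation given for Fig. 3.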

The QSPR (Quantitative Structure-Property Relationships) approach is applied here in two steps: first, to fill the gaps in the experimental data for the studied properties and, second, to model the scores of the MCDM function, the LRT index (Fig. 3). Different kinds of theoretical molecular descriptors were used to obtain OLS regression models (Fig. 4) and CART classification models (Fig. 5) with good predictive power (Q²LOO = 86.8% and Q²LMO = 86.2% for regression; cross-validated misclassification risk = 6.2% for classification).

[Figure 4. LRT potential for 81 chemicals: LRT index from MCDM (utility function) versus LRT potential predicted; both axes from 0.0 to 1.0.]
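The cross-validated Q² values quoted for the regression model can be understood through the following sketch, which computes Q² = 1 - PRESS/SS_tot for an OLS model under any partition of the data into left-out groups. The data are synthetic, `q2_cv` is a hypothetical helper, and the random five-group split used for the leave-more-out case is an assumption; the exact LMO scheme of the MOBY-DIGS package is not described in the poster.

```python
import numpy as np

def q2_cv(X, y, folds):
    """Cross-validated Q^2 = 1 - PRESS / SS_tot for OLS, given left-out index groups."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        press += float(((y[test] - A[test] @ coef) ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - press / ss_tot

# Synthetic stand-in for 81 chemicals and 3 descriptors (assumed values).
rng = np.random.default_rng(1)
X = rng.normal(size=(81, 3))
y = 0.5 + 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.03, size=81)

loo_folds = [np.array([i]) for i in range(len(y))]  # leave-one-out
lmo_folds = np.array_split(np.random.default_rng(0).permutation(len(y)), 5)  # leave-more-out

q2_loo = q2_cv(X, y, loo_folds)
q2_lmo = q2_cv(X, y, lmo_folds)
print(f"Q2 LOO = {100 * q2_loo:.1f}%")
print(f"Q2 LMO = {100 * q2_lmo:.1f}%")
```

Similar LOO and LMO values, as reported in the poster (86.8% and 86.2%), suggest that the model's predictive power does not depend strongly on any single chemical in the training set.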

Conclusions

The ranking of the studied chemicals according to their LRT potential, obtained by the utility function of MCDM, can be proposed as an alternative to other approaches based on characteristic travel distance (CTD) [11]. An additional advantage of this approach is that applying the QSPR models (both regression and classification) to the scores of the MCDM utility function (defined as the LRT index) allows a fast pre-screening of existing and new chemicals for their inherent tendency to LRT, based simply on the knowledge of their molecular structure.

References

1- Klecka, G.M., Ed. (1999). SETAC Pellston Workshop. Environ. Toxicol. Chem. (Suppl.), 18, 8.

2- Gramatica, P., Pozzi, S., Consonni, V. and Di Guardo, A. (2001). SAR and QSAR in Environ. Res., in press.

3- Hendriks, M.M.W.B., De Boer, J.H., Smilde, A.K. and Doornbos, D.A. (1992). Chemom. Intell. Lab. Syst. 16, 175.

4- Howard, P.H. et al. (1991). Handbook of environmental degradation rates. http://esc.syrres.com/interkow/PhysProp.htm

5- Mackay, Shiu, Ma (2000). Illustrated handbook of physical-chemical properties and environmental fate for organic chemicals.

6- Rodan, B.D. et al. (1999). Environ. Sci. Technol. 33, 3482-3488.

7- Todeschini, R. and Consonni, V. (2000). Handbook of molecular descriptors. Wiley.

8- Todeschini, R. (2000). DRAGON ver. 1.0, Milano. Free download from http://www.disat.unimib.it/chm

9- Todeschini, R. and Gramatica, P. (1997). Quant. Struct.-Act. Relat. 16, 113-119.

10- Todeschini, R. (1999). MOBY DIGS - Software for multilinear regression analysis and variable subset selection by Genetic Algorithm, rel. 2.1, Milan (Italy).

11- Beyer, A., Mackay, D., Matthies, M., Wania, F. and Webster, E. (2000). Environ. Sci. Technol. 34, 699-703.

[Figure 5. Classification Tree. Split variables and thresholds: nC 14.50, nC 7.00, E1u 0.40; terminal nodes give the assigned class (1, 2 or 3).]