Use of Rough Sets and Spectral Data for Building Predictive Models of Reaction Rate Constants


TIMOTHY W. COLLETTE* and ADAM J. SZLADOW Environmental Research Laboratory,† U.S. Environmental Protection Agency, Athens, Georgia 30605-2720 (T.W.C.); and REDUCT Systems Inc., Regina, Saskatchewan S4P 3L7, Canada (A.J.S.)

A model for predicting the log of the rate constants for alkaline hydrolysis of organic esters has been developed with the use of gas-phase mid-infrared library spectra and a rule-building software system based on the mathematical theory of rough sets. A diverse set of 41 esters was used as training compounds. The model is an advance in the development of a generalized system for predicting environmentally important reactivity parameters based on spectroscopic data. By comparison to a previously developed model using the same training set with multiple linear regression (MLR), the rough-sets model provided better predictive power, was more widely applicable, and required less spectral data manipulation. [For the previous MLR model, a standard error of prediction (SEP) of 0.59 was calculated for 88% of the training set data under leave-one-out cross-validation. In the present study using rough sets, an SEP of 0.52 was calculated for 95% of the data set.] More importantly, analysis of the decision rules generated by rough-sets analysis can lead to a better understanding of both the reaction process under study and important trends in the spectral data, as well as underlying relationships between the two.

Index Headings: IR spectroscopy; Environmental analysis; Rough sets.

INTRODUCTION

The number of chemicals for which the U.S. Environmental Protection Agency may need to assess potential risk to humans and the environment is limited only by the number of chemicals that have been (or will be) manufactured in sizable quantities. When a chemical is released into the environment, risk can be assessed only after the fate of the chemical (transport and transformation) has been determined. Such determinations can require more than a dozen chemical-specific kinetic and equilibrium constants, few of which are available in the literature. Laboratory measurement of the needed constants is always expensive and often requires more time and resources than are available. Thus, a pressing need exists for reliable, economical, and rapid methods for predicting these rate constants.

Vibrational spectra (often in conjunction with sophisticated statistical techniques) have been used extensively for the prediction of constituent concentration in various types of complex samples [e.g., prediction of fat content of foods using principal component regression (PCR) analysis with near-infrared data]. 1 However, only a few reports on the use of spectra for the challenging problem of predicting chemical activity can be found in the literature, 2 even though such application logically follows classical works of vibrational spectroscopy and physical organic chemistry. Infrared group frequencies and intensities have long been known to depend on the polar and steric nature of chemical substructure 3 and have been shown to correlate with Hammett σ constants. 4 In turn, many derived parameters such as Hammett σ constants have been used extensively for correlation with chemical activity. 5

Received 21 January 1994; accepted 16 August 1994.
* Author to whom correspondence should be sent.
† Mention of trade names or commercial products does not constitute endorsement or recommendation for use by the U.S. Environmental Protection Agency.

We have recently reported on the use of infrared spectral data for the prediction of the log of alkaline hydrolysis rate constants (kOH) of organic esters 6 as a prototype of a generalized spectroscopic-based method for predicting pollutant transport and transformation parameters for use in environmental risk assessment. In that report, data points from interferograms of the IR spectra of 41 esters were used to build a simple multiple linear regression (MLR) model. One of the principal limitations stated in that study 6 was the lack of a more sophisticated method for ranking the predictive power of the individual interferometric points in order to determine which ones to include in the model. In response to this problem, and also to gain additional insight into spectra-activity relationships, we report here on an improved method for predicting log kOH from IR spectra based on the mathematical theory of rough sets.

There are many chemometric methods that have been shown to outperform MLR in terms of accuracy of prediction. 7 However, for our application, accuracy of prediction is not the only consideration. Although alkaline hydrolysis is a well-understood process, the mechanisms of many environmentally important processes are not known. Some transformations of chemicals by microorganisms are important examples. This lack of mechanistic information is often the biggest impediment to accurately forecasting the fate of chemicals in the environment. After more complete development, we want to apply predictive techniques of this type to important environmental processes for which mechanisms and pathways are not well understood, with a view toward delineating mechanistic information. The approach would be to gather empirical training data, develop a predictive model, and then examine the components of the spectra that are chosen as being related to activity. By doing so, and by applying group frequency analysis, we hope to infer mechanistic information about the process. In other words, we want to use predictive models not only to predict rates, but also as a tool by which fundamental process information can be gained. With this knowledge in hand, remediation strategies could be developed by which risk from toxic chemicals can be reduced.

Volume 48, Number 11, 1994   0003-7028/94/4811-1379$2.00/0   APPLIED SPECTROSCOPY 1379   © 1994 Society for Applied Spectroscopy

FIG. 1. Graphical representation of an approximation space. (Figure labels: subset X1; dividing lines d and d'; positive (+), negative (-), and boundary regions.)

The types of predictive tools that are perhaps best suited for these objectives are rule-building expert systems, which are capable of identifying patterns and developing understandable decision rules from examples. Software based on the mathematical theory of rough sets provides a rule-building system that is capable of this type of empirical learning. (Other types of rule-building expert systems have been successfully applied to other chemical problems. 8,9) As opposed to statistical methods, for rough sets, the discovery and description of structural set relationships (rather than probability distributions) are the primary objective in the rule-building process. A fundamental aspect of the rough-sets model is the formal recognition of uncertainty in the process of reasoning and decision making, which is advantageous when one is dealing with imprecise data. The rough-sets model presents an alternative approach to dealing with uncertainty, which has also been addressed by several other methods including fuzzy-set theory. 10

THEORY

A complete presentation of rough-sets theory is not feasible in this article; details are available in the literature. 11,12 The intent here is to present the basic underlying concepts and to give some insight into the way predictive models can be built with the use of spectral data. The methodological approach that underlies the technique is based on the intuitive observation that lowering the degree of precision in the representation of objects (for our application, by replacing discrete log kOH values with ranges) makes data regularities more visible, as well as easier to characterize in terms of rules. The central issue is the analysis of limits of discernibility of a subset of objects, Xi, as defined by decision outcomes (for example, X1 may be those esters with log kOH > 3.0) belonging to the domain, or universe, U (all esters in the model development database). In other words, we want to determine how well a subset defined by decision outcomes can be characterized in terms of attributes, where attributes are the pieces of information available to represent objects (e.g., spectral data points). In rough-sets theory, classification knowledge is incorporated into the set description by means of an equivalence relation, R, which corresponds to partitioning the universe into a number of small classes based on attributes (e.g., spectral data points).

Let U denote a finite universe of objects, for example, all esters in our database. The pair A = (U,R), consisting of the universe of objects U and the associated equivalence relation R, is called an approximation space.

Let R* = {x1, x2, ..., xn} denote the partition induced by the equivalence relation R, where xi is an equivalence class of R* (an elementary set of A). The classes of R* represent the resolution of our system, i.e., our ability to describe or characterize subsets of U, such as X1. [For example, x1 may be the class whose members all have C=O stretching peak frequencies (νC=O) = 1730 ± 5 cm-1 and C-O stretching peak frequencies (νC-O) = 1270 ± 5 cm-1.] One can distinguish among esters in different equivalence classes, but not among esters that make up one equivalence class. Thus, an equivalence class is the smallest group of esters discernible by the chosen classification knowledge. Because, in general, it is impossible to characterize any subset of U such as X1 completely within the resolution limits, the best we can do is to provide an approximate characterization. This limitation leads to the concepts of lower and upper approximations.

For the subset X1 ⊆ U, we can define the lower [$\underline{A}(X_1)$] and upper [$\overline{A}(X_1)$] approximations of X1 in the approximation space A = (U,R) using two expressions:

$$\underline{A}(X_1) = \bigcup \{ x_i : x_i \subseteq X_1 \}$$

$$\overline{A}(X_1) = \bigcup \{ x_i : x_i \cap X_1 \neq \emptyset \}$$

That is, $\underline{A}(X_1)$ is the union of those equivalence classes (xi) in A that are individually contained by X1, whereas $\overline{A}(X_1)$ is the union of all those xi each of which has a nonempty intersection with X1.

On the basis of the notions of the lower and upper approximations, we can characterize the concept X1 in the approximation space A with three distinct regions:

(1) $\underline{A}(X_1)$ is called the positive region POS(X1) of X1 in A.

(2) $\overline{A}(X_1) - \underline{A}(X_1)$ is called the boundary region BND(X1) of X1 in A.

(3) $U - \overline{A}(X_1)$ is called the negative region NEG(X1) of X1 in A.
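The definitions above can be illustrated with a minimal Python sketch. The six "esters" and their two binned spectral features are hypothetical toy data, and the function names are ours, not part of any rough-sets package; the code only restates the set definitions of the lower and upper approximations and the three regions.

```python
from collections import defaultdict


def partition(attributes):
    """Group objects into equivalence classes.

    Objects that share the same attribute tuple are indiscernible and
    therefore fall into the same class.
    """
    classes = defaultdict(set)
    for obj, values in attributes.items():
        classes[values].add(obj)
    return list(classes.values())


def approximations(classes, target):
    """Return the (lower, upper) approximations of `target`."""
    lower, upper = set(), set()
    for cls in classes:
        if cls <= target:   # class entirely contained in the target subset
            lower |= cls
        if cls & target:    # class has a nonempty intersection with it
            upper |= cls
    return lower, upper


# Hypothetical toy data: two binned spectral features per ester.
attributes = {
    "e1": ("high", "low"),
    "e2": ("high", "low"),
    "e3": ("high", "high"),
    "e4": ("high", "high"),
    "e5": ("low", "low"),
    "e6": ("low", "high"),
}
universe = set(attributes)
x1 = {"e1", "e2", "e3"}  # hypothetical subset, e.g. the fastest-reacting esters

classes = partition(attributes)
lower, upper = approximations(classes, x1)
positive = lower
boundary = upper - lower
negative = universe - upper
print(sorted(positive), sorted(boundary), sorted(negative))
```

In this toy case e3 and e4 are indiscernible but only e3 belongs to the subset, so both fall in the boundary region.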

The concept of an approximation space for a single subset (X1) is depicted graphically in Fig. 1. The large rectangle is U, the universe of esters. The subdivision of U into smaller rectangles represents a certain indiscernibility relationship. Each smaller rectangle represents an equivalence class xi described by some combination of a few spectral features. The portion enclosed by the curve represents the subset X1 of U (e.g., esters with log kOH > 3.0). The approximation space provides an approximate characterization of X1. For objects in the positive region, it can be determined without ambiguity that they belong to X1. For objects in the negative region, it can be determined without ambiguity that they do not belong to X1. However, the boundary region consists of objects whose membership status with respect to X1 cannot be determined because some objects in the boundary region are contained by X1, and some are not. Recall that objects in the same equivalence class are not distinguishable from each other in the chosen knowledge representation system. The boundary region is the way in which uncertainty is formally recognized in the rough-sets approach. The accuracy of the set approximation is measured as the ratio of the size of the lower approximation to the size of the upper approximation. If BND(X1) = ∅, the accuracy = 1, and we say that set X1 is completely definable in A; otherwise, X1 is said to be not completely definable, or a rough set.

To understand rough-sets theory it is instructive to contrast it with the better known technique of fuzzy sets, which is an alternative way of modeling uncertainty. With fuzzy sets, the underlying idea is the adoption of a "soft" set membership function, which assumes values from 0 to 1. This value is interpreted as a degree of membership, reflecting partial association of a particular element with the set concept. With rough sets, the underlying idea is entirely different. Partial membership is not used at all. In rough sets, an element may or may not belong to a set. However, knowledge of the membership function is restricted only to the negative and positive regions of the universe. The membership function in the boundary region is simply unknown. This incomplete membership function is obtained from the training data by applying the definitions of set approximations. Developers of the theories of rough sets and fuzzy sets refer to the uncertainty in fuzzy sets as vagueness and the uncertainty in rough sets as indiscernibility. The two theories are considered to be distinctly different and complementary. 12

The introduction of the rough-sets model leads to a system for reasoning from data. To build a predictive model, the user breaks the data into a number of disjoint ranges based on decision outcomes (e.g., log kOH values). (Each range is a subset Xi of the universe, or a decision concept.) The software system then attempts to generate a minimal set of decision rules based on selected spectral features that can best discriminate all decision concepts. In the process of decision rule generation, the software system partitions example cases by means of attribute values into a number of disjoint equivalence classes (elementary sets xi), forming an approximation space for each subset Xi. That is, the positive, negative, and boundary regions for all Xi are generated. The approximation space is then dynamically refined until the most accurate approximation, based on a minimal collection of the strongest attributes, is reached for each decision concept. When optimized, the universal set will be partitioned in such a way that no dividing line between equivalence classes can be eliminated without decreasing the accuracy of the approximation. (Eliminating a dividing line corresponds to deleting spectral features that differentiate one equivalence class from another.) Note that, in Fig. 1, the dividing line marked d can be eliminated without affecting the accuracy of approximation; however, deleting the line marked d' would increase the size of the boundary region and therefore decrease the accuracy. The model is refined by eliminating redundant attributes that do not contribute to the accuracy of the approximation. The underlying concept is that nonredundant descriptions characterize important patterns in the data.
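The idea of discarding attributes that do not contribute to the accuracy of the approximation can be sketched as follows. This is only an illustration under stated assumptions: the greedy, one-pass elimination shown here is our simplification and is not claimed to be the procedure used by the commercial software; the feature names (I1730, I1270, I2095) and the binned values are hypothetical, and a single decision concept is treated instead of all concepts at once.

```python
from collections import defaultdict


def accuracy(objects, features, target):
    """Size of the lower approximation divided by size of the upper one,
    using only the attributes named in `features`."""
    classes = defaultdict(set)
    for name, attrs in objects.items():
        key = tuple(attrs[f] for f in features)
        classes[key].add(name)
    lower = sum(len(c) for c in classes.values() if c <= target)
    upper = sum(len(c) for c in classes.values() if c & target)
    return lower / upper if upper else 1.0


def drop_redundant(objects, features, target):
    """Greedy sketch: remove each feature whose removal does not lower
    the accuracy obtained with the full feature set."""
    kept = list(features)
    base = accuracy(objects, kept, target)
    for f in features:
        trial = [g for g in kept if g != f]
        if accuracy(objects, trial, target) >= base:
            kept = trial
    return kept


# Hypothetical binned intensities for five esters.
objects = {
    "e1": {"I1730": "high", "I1270": "low", "I2095": "low"},
    "e2": {"I1730": "high", "I1270": "low", "I2095": "high"},
    "e3": {"I1730": "high", "I1270": "high", "I2095": "low"},
    "e4": {"I1730": "low", "I1270": "low", "I2095": "high"},
    "e5": {"I1730": "low", "I1270": "high", "I2095": "low"},
}
target = {"e1", "e2"}  # hypothetical decision concept

reduced = drop_redundant(objects, ["I1730", "I1270", "I2095"], target)
print(reduced)
```

Here the feature I2095 plays the role of the dividing line d in Fig. 1: removing it leaves the accuracy at 1, whereas removing either of the other two features enlarges the boundary region.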

Because the classes of the final partition are all definable, we can associate a set of decision rules with each decision concept. For our application, rules are "information" about the presence or absence of peaks that are characteristic for each decision concept. Specifically, rules are logical combinations of elementary conditions, where the conditions are intensity thresholds at certain frequencies. For example, the rule for inclusion in a particular decision concept may be

I1730 > 0.7 AU and I1270 < 0.6 AU or

0.3 AU < I1755 < 0.8 AU and I1185 > 0.6 AU and

I2095 < 0.2 AU

where Ix is the intensity at the frequency x cm-1 in arbitrary units (AU).
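A rule of this form is a disjunction of conjunctions of intensity thresholds, which can be encoded directly as data. The sketch below assumes that "and" binds more tightly than "or" (as the line layout of the example suggests), that the condition on the 1755 cm-1 intensity is a range between 0.3 and 0.8 AU, and that the fourth frequency is 1185 cm-1; the three test spectra are hypothetical.

```python
import math

# A rule is an OR of alternatives; each alternative is an AND of
# (wavenumber in cm-1, lower bound, upper bound) conditions with
# exclusive bounds, in arbitrary units (AU).
RULE = [
    [(1730, 0.7, math.inf), (1270, -math.inf, 0.6)],
    [(1755, 0.3, 0.8), (1185, 0.6, math.inf), (2095, -math.inf, 0.2)],
]


def matches(rule, intensity):
    """Return True if the spectrum satisfies at least one alternative.

    `intensity` maps wavenumber (cm-1) to absorbance in AU.
    """
    return any(
        all(low < intensity[freq] < high for freq, low, high in alternative)
        for alternative in rule
    )


# Hypothetical spectra sampled at the five frequencies used by the rule.
first = {1730: 0.9, 1270: 0.4, 1755: 0.1, 1185: 0.1, 2095: 0.5}
second = {1730: 0.2, 1270: 0.9, 1755: 0.5, 1185: 0.7, 2095: 0.1}
neither = {1730: 0.2, 1270: 0.9, 1755: 0.9, 1185: 0.7, 2095: 0.1}

print(matches(RULE, first), matches(RULE, second), matches(RULE, neither))
```

Keeping the rule as a plain data structure also makes it easy to edit a threshold and retest the model, which is the kind of user modification discussed later in the text.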

Ultimately, the rules of the model can be used to make predictions of the range of log kOH for a "new" ester, which has not been used in generating the model, but whose spectrum is available. The system analyzes the new spectrum and determines into which decision concept the unknown fits. If no equivalence class can be matched, the system reports that no decision can be made. This outcome often indicates that new and relevant information is contained in the spectrum of the unknown. If more than one equivalence class is matched, the spectrum can be described by more than one rule and, possibly, by more than one range of log kOH. This outcome indicates that more data are required in the training set, or that the partition of the approximation space should be less coarse.

EXPERIMENTAL

One objective of this work was to benchmark the performance of the rough-sets approach against that of a commonly used, well-understood method. Therefore, the same database of 41 organic esters (same spectra and experimental values for log kOH) that was used in our previously published study employing MLR 6 was used in the present study. Briefly, the spectra were extracted from the EPA vapor-phase IR library (8-cm-1 resolution) and then crudely normalized to represent a constant concentration based on the number of aliphatic C-H units and the observed intensities of aliphatic C-H stretching peaks (3025-2800 cm-1). The rate constants for most of these esters were extracted from the literature. 6

The modeling software used for the rough-sets analysis is a commercially available package called DataLogic/R (REDUCT Systems Inc., Regina, Saskatchewan, Canada). The software ran on a PC AT class computer. 13

TABLE I. Partition of 41 esters into six decision concepts (A-F) based on log kOH values.

Decision   Concept definition           Midpoint    Range         No. of
concept                                 (log kOH)   (log units)   members
A          -3.04 ≤ log kOH ≤ -2.00      -2.52       1.04           6
B          -2.00 < log kOH ≤ -1.00      -1.50       1.00          14
C          -1.00 < log kOH ≤  0.00      -0.50       1.00           7
D           0.00 < log kOH ≤  1.00       0.50       1.00           2
E           1.00 < log kOH ≤  2.00       1.50       1.00          10
F           2.00 < log kOH ≤  3.41       2.71       1.41           2

RESULTS AND DISCUSSION

For our MLR model the most advantageous approach that was tested (which is described fully in a previous report 6) was to first divide spectra into peak-rich and baseline-only regions, and then take Fourier transforms only for each peak-rich region. Then, after elimination of centerbursts, a preliminary regression analysis was undertaken to preselect a number of information-rich points from the interferograms on the basis of the magnitude of their associated regression coefficients. These information-rich points were then included in the final MLR predictive model. This ranking system was crude and cumbersome, but it gave the best MLR results of the approaches tested, including several other ways of generating and handling interferograms, in addition to using spectra directly.

The rough-sets analysis reported herein uses only spectra (as opposed to interferograms). Limited investigations indicate that single-sided, real components of interferograms (with only the centerburst rejected) produce models with predictive power comparable (but not superior) to the one presented here with the use of spectra. When predictive power is comparable, we prefer models based on spectra (over those based on interferograms) because their rules can be more easily interpreted.

As stated previously, for rough-sets analysis, the user must partition the training compounds into a number of disjoint sets based on decision outcomes (log kOH values). Table I lists the partitioning used for the present study. When predictions are made for "new" esters, the model predicts which of the six decision concepts the new compound best fits. However, for our application, the prediction of a discrete value, as opposed to a range of values, is required. For this reason, we have let the midpoint of the range defined by a decision concept represent the predicted discrete value for the rough-sets analysis. The midpoints are also listed in Table I.
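The partitioning in Table I amounts to a simple lookup from a log kOH value to a decision concept and its midpoint. The following sketch uses the range limits from Table I; the function names are hypothetical, and the midpoints are computed from the limits instead of being copied from the table (Table I reports them rounded to two decimals).

```python
# (label, lower limit, upper limit) of log kOH, taken from Table I.
CONCEPTS = [
    ("A", -3.04, -2.00),
    ("B", -2.00, -1.00),
    ("C", -1.00, 0.00),
    ("D", 0.00, 1.00),
    ("E", 1.00, 2.00),
    ("F", 2.00, 3.41),
]


def concept_of(log_k):
    """Return the decision concept containing log_k, or None if the value
    lies outside the range spanned by the training set."""
    for index, (label, low, high) in enumerate(CONCEPTS):
        # Only the first range is closed at its lower end.
        above_low = log_k >= low if index == 0 else log_k > low
        if above_low and log_k <= high:
            return label
    return None


def midpoint(label):
    """Return the midpoint of the range that defines a decision concept."""
    for name, low, high in CONCEPTS:
        if name == label:
            return (low + high) / 2
    raise KeyError(label)


for value in (-1.26, 1.00, 3.41):
    label = concept_of(value)
    print(value, label, midpoint(label))
```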

Table II summarizes the results of leave-one-out cross-validation analysis, listing the names of the compounds, their experimental log kOH values, their actual decision concept, their predicted decision concept, and the predicted discrete log kOH value that would be used in a true predictive situation.

The first 35 entries in Table II [for which the predicted decision concept(s) column contains an arrow] are predicted as members of the correct concept. Although these predictions are "correct", the experimental and predicted values differ by varying amounts, depending on the distance of the experimental number from the decision concept midpoint.

TABLE II. Results of leave-one-out cross-validation analysis of the 41 esters with the predictive model based on rough sets.

Compound name                  Experimental   Correct   Predicted     Predicted
                               log kOH        concept   concept(s)    log kOH
Ethyl n-butyrate               -1.26          B         ←             -1.50
Benzyl acetate                 -0.71          C         ←             -0.50
n-Butyl acetate                -1.06          B         ←             -1.50
Ethyl isobutyrate              -1.49          B         ←             -1.50
Ethyl acetate                  -0.96          C         ←             -0.50
Ethyl benzoate                 -1.50          B         ←             -1.50
n-Propyl acetate               -1.06          B         ←             -1.50
Isopropyl formate               1.04          E         ←              1.50
Ethyl bromoacetate              1.70          E         ←              1.50
Ethyl iodoacetate               1.21          E         ←              1.50
Ethyl formate                   1.41          E         ←              1.50
Methyl benzoate                -1.10          B         ←             -1.50
Ethyl chloroacetate             1.56          E         ←              1.50
Methyl methacrylate            -1.25          B         ←             -1.50
Benzyl benzoate                -2.10          A         ←             -2.52
Isopropyl acetate              -1.52          B         ←             -1.50
n-Butyl formate                 1.34          E         ←              1.50
n-Propyl formate                1.36          E         ←              1.50
Ethyl acrylate                 -1.11          B         ←             -1.50
2-Chloroethyl acetate          -0.41          C         ←             -0.50
Ethyl p-fluorobenzoate         -1.41          B         ←             -1.50
Methyl p-fluorobenzoate        -1.15          B         ←             -1.50
Ethyl dibromoacetate            2.31          F         ←              2.71
Methyl p-hydroxybenzoate       -1.52          B         ←             -1.50
Methyl p-aminobenzoate         -2.35          A         ←             -2.52
Isopropyl p-hydroxybenzoate    -2.23          A         ←             -2.52
Ethyl p-nitrobenzoate          -0.13          C         ←             -0.50
Ethyl p-aminobenzoate          -2.59          A         ←             -2.52
Methyl m-aminobenzoate         -1.47          B         ←             -1.50
Ethyl pivalate                 -2.77          A         ←             -2.52
Isopropyl p-aminobenzoate      -3.04          A         ←             -2.52
Methyl 2,4-D(a)                 1.06          E         ←              1.50
Ethyl aminoacetate             -0.19          C         ←             -0.50
2-Butoxyethyl 2,4-D(a)          1.48          E         ←              1.50
n-Octyl 2,4-D(a)                0.57          D         ←              0.50
2-Bromoethyl n-propionate       1.00          D         C             -0.50
2-Methoxyethyl acetate         -0.69          C         B             -1.50
sec-Butyl acetate              -1.76          B         B, C          -1.00
Methyl formate                  1.56          E         A, B, E       -0.52
Ethyl trichloroacetate          3.41          F         A, B          -2.00
Methyl acetate                 -0.74          C         No decision    ---

(a) 2,4-D is the pesticide (2,4-dichlorophenoxy)acetate.

The next two compounds (2-bromoethyl propionate and 2-methoxyethyl acetate) are predicted to be members of an incorrect decision concept. Although these outcomes are "wrong", these two compounds are actually predicted to be members of the decision concept "adjacent" to the correct concept. Thus, their predicted discrete values do not differ greatly from the experimental ones.

The next three compounds (sec-butyl acetate, methyl formate, and ethyl trichloroacetate) are predicted to be members of more than one concept (i.e., they fall in boundary regions shared by more than one concept). In order to assign a discrete value for these outcomes, we take the midpoint of the range defined by the two most extreme endpoints of the multiple concepts predicted. Although these outcomes are "wrong", it is possible that the predicted discrete value may be reasonably close to the experimental value if the multiple concepts contain the correct concept, and the incorrectly predicted concept (or concepts) is adjacent to the correct one. This is the case with sec-butyl acetate.

For the last compound (methyl acetate), the model yields no decision. In other words, the compound falls in an equivalence class that is a negative region with respect to all concepts. For this outcome, we assign no value.
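The way a discrete value is assigned for each kind of outcome (one concept, several concepts, or no decision) can be summarized in a short sketch. The range limits are those of Table I; the function name and the use of None for "no decision" are our assumptions for illustration.

```python
# Range limits (log kOH) of the six decision concepts, from Table I.
RANGES = {
    "A": (-3.04, -2.00),
    "B": (-2.00, -1.00),
    "C": (-1.00, 0.00),
    "D": (0.00, 1.00),
    "E": (1.00, 2.00),
    "F": (2.00, 3.41),
}


def discrete_prediction(concepts):
    """Map the predicted concept(s) to a single log kOH value.

    One concept: the midpoint of its range.
    Several concepts: the midpoint of the range spanned by the two most
    extreme endpoints of the predicted concepts.
    No concept (no decision): None, i.e. no value is assigned.
    """
    if not concepts:
        return None
    low = min(RANGES[c][0] for c in concepts)
    high = max(RANGES[c][1] for c in concepts)
    return (low + high) / 2


print(discrete_prediction(["B"]))            # single concept
print(discrete_prediction(["B", "C"]))       # e.g. sec-butyl acetate
print(discrete_prediction(["A", "B", "E"]))  # e.g. methyl formate
print(discrete_prediction([]))               # e.g. methyl acetate
```

For sec-butyl acetate (concepts B and C) and methyl formate (concepts A, B, and E) this reproduces the values of -1.00 and -0.52 listed in Table II.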

To evaluate the utility of a predictive model, one needs to consider the question, How good does a prediction have to be in order to be acceptable for my purposes? We believe that, for environmental risk assessment, the ability to estimate the rate of any one of many competing transformation processes to within about 1 log unit may be sufficient for many situations. Our MLR model 6 was found to make predictions that were acceptable under this inexact criterion for only 36 of these same 41 compounds. For those 36, a plot of MLR-predicted log kOH values vs. the experimental values gave a correlation coefficient, r², of 0.89, with a standard error of prediction (SEP) of 0.59. SEP was calculated as

$$\mathrm{SEP} = \left[ \frac{\sum \left( \log k_{\mathrm{OH,exp}} - \log k_{\mathrm{OH,pred}} \right)^2}{n} \right]^{1/2}$$

where log kOH,exp and log kOH,pred are the experimental and predicted log kOH values, respectively, and n is the number of compounds included in the calculation of the SEP, in this case 36. If we included any more of the 41 compounds in the calculation of r² or SEP, they were both significantly degraded.

For comparison, in Fig. 2 the predicted log kOH values from the rough-sets model are plotted against the experimental log kOH values. Certainly, the model's performance with regard to methyl acetate is inadequate because no decision is rendered. Furthermore, the prediction for ethyl trichloroacetate is unacceptable. However, for the remaining 39 compounds, the plot in Fig. 2 yields an r² value of 0.88, with an SEP of 0.52. If methyl formate is also excluded, the r² value (for 38 of the 41 compounds) rises to 0.93 and the SEP decreases to 0.41.
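The SEP values quoted for the rough-sets model can be recomputed from the experimental and predicted values in Table II. The sketch below is an arithmetic check only; the variable names are hypothetical, and the SEP is taken as the square root of the mean squared difference between experimental and predicted log kOH, as defined for the MLR comparison.

```python
import math


def sep(pairs):
    """Standard error of prediction for (experimental, predicted) pairs."""
    return math.sqrt(sum((e - p) ** 2 for e, p in pairs) / len(pairs))


# (experimental, predicted) log kOH for the first 35 entries of Table II,
# all predicted in the correct concept.
CORRECT = [
    (-1.26, -1.50), (-0.71, -0.50), (-1.06, -1.50), (-1.49, -1.50),
    (-0.96, -0.50), (-1.50, -1.50), (-1.06, -1.50), (1.04, 1.50),
    (1.70, 1.50), (1.21, 1.50), (1.41, 1.50), (-1.10, -1.50),
    (1.56, 1.50), (-1.25, -1.50), (-2.10, -2.52), (-1.52, -1.50),
    (1.34, 1.50), (1.36, 1.50), (-1.11, -1.50), (-0.41, -0.50),
    (-1.41, -1.50), (-1.15, -1.50), (2.31, 2.71), (-1.52, -1.50),
    (-2.35, -2.52), (-2.23, -2.52), (-0.13, -0.50), (-2.59, -2.52),
    (-1.47, -1.50), (-2.77, -2.52), (-3.04, -2.52), (1.06, 1.50),
    (-0.19, -0.50), (1.48, 1.50), (0.57, 0.50),
]
# 2-Bromoethyl n-propionate and 2-methoxyethyl acetate (adjacent concept).
ADJACENT = [(1.00, -0.50), (-0.69, -1.50)]
SEC_BUTYL_ACETATE = (-1.76, -1.00)
METHYL_FORMATE = (1.56, -0.52)

# Methyl acetate (no decision) and ethyl trichloroacetate are excluded.
pairs_38 = CORRECT + ADJACENT + [SEC_BUTYL_ACETATE]
pairs_39 = pairs_38 + [METHYL_FORMATE]

print(len(pairs_39), round(sep(pairs_39), 2))  # 39 0.52
print(len(pairs_38), round(sep(pairs_38), 2))  # 38 0.41
```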

As discussed earlier, an important aspect of predictive models based on rough sets is the ability to examine rules in terms of the underlying principles. This approach is advantageous in several different ways. If a collection of data is available for a reaction process that is not well understood, rules can be built with vibrational spectra, and by interpreting the frequencies chosen, one may be able to gain insight into the mechanism of the reaction, in addition to developing a predictive model. For processes that are well understood, meaningful rules can give the user confidence that the model is not based on fortuitous correlations. Also, rules can be edited and modified by the user, and the impact on the predictive power of the model can be tested in light of the modifications. Thus, software based on rough sets may, in some cases, be more beneficial as a human knowledge acquisition tool than as a purely predictive one.

For the case of alkaline hydrolysis of organic esters, the mechanism of reaction is well known. The process involves nucleophilic attack by the hydroxyl group at the carbonyl carbon, proceeding through a negatively charged transition state. We have previously argued that variations in reactivity for this process should be correlated with variations in both νC=O and νC-O. 14 For the predictive model presented herein, for almost every decision class, the νC-O is indeed an important part of the rules developed by the rough-sets software. However, the νC=O is not. In fact, only one rule for only one of the six decision concepts involves the νC=O. Thus, analysis of decision rules suggests that, when esters are partitioned according to log kOH values, significant and consistent differences are observed for νC-O, but not for νC=O. A close examination of spectra of structurally similar esters whose reaction rate constants differ significantly supports this suggestion. For example, one half of the esters that belong to decision concept E (which is the second most rapidly reacting subset) are unsubstituted alkyl formates. By comparison, unsubstituted alkyl acetates react much more slowly, and most of them are contained in decision concept B. Obviously, the spectra of unsubstituted acetates and formates are very similar, as depicted in Fig. 3. Although there is a consistent pattern of differences in νC=O, it is only about 15 cm-1. (For a given formate, such as n-butyl formate, the νC=O is always about 15 cm-1 lower than for the analogous acetate.) However, as clearly depicted in Fig. 3, the most significant difference is that the νC-O of formates is consistently about 60 cm-1 below that of the analogous acetate. An important component of rules for decision concept E in the predictive model presented herein is the presence of absorbance above a threshold at 1163 cm-1. (Note that the presence of a frequency in a rule does not necessarily mean that any member will have a peak whose maximum occurs at this frequency. It could correspond, as in this case, to the low-frequency wing of an important peak.) Thus, if the νC-O of an ester occurs at a low enough frequency to have absorbance above a certain threshold at 1163 cm-1, in addition to the presence or absence of some other spectral feature, the ester will be predicted by the model to belong to decision concept E. However, the software system does not see the νC=O of these esters as unique to this decision concept.

FIG. 2. Discrete value predictions of the log of the alkaline hydrolysis rate constants from the rough-sets model, using leave-one-out analysis, vs. the experimental values. The middle line represents perfect agreement. Each outer line represents error in prediction of 1 log unit. The arrow indicates the experimental value for methyl acetate, for which no decision was rendered. (Horizontal axis: log(kOH)exp, -4 to 4.)

In this case, the predictive model provided instructive information about the spectra that was somewhat surprising. The importance of νC-O for describing variations in log kOH was not unexpected, but the magnitude of its importance relative to that of νC=O was. Perhaps this result should not have been surprising, because, as pointed out by Colthup et al., 15 in comparison to the νC=O, the νC-O of esters is much less stable and is more sensitive to changes in the steric and polar nature of attached groups because it is a skeletal vibration and much more strongly coupled to the rest of the molecule.

FIG. 3. Gas-phase spectra of three formates (decision concept E) and the analogous three acetates (decision concept B). All spectra have been scaled so that the most intense peak in the displayed range spans the same absorbance. The location of 1163 cm-1 has been noted on the axis. (Spectra shown: n-propyl formate, n-propyl acetate, isopropyl formate, isopropyl acetate, n-butyl formate, n-butyl acetate; horizontal axis 1900 to 1000 cm-1.)

Another interesting example is that of decision concept A, the most slowly reacting collection of esters. Four of the six members are alkyl benzoates that have an electron- releasing substituent para to the ester linkage. Two important components in the rules developed for this decision class are absorbance above a certain threshold at both 3048 and 1609 cm-L Both of these frequencies would be characteristic for all benzoates since they correspond to the aromatic C-H stretch and one of the two prominent C=C stretching vibrations, respectively. We were curious to know how rules based on these two frequencies could differentiate among these substituted benzoates and oth- ers in the data collection that react more rapidly. Also, the presence of the 1600-cm- ~ C=C stretching vibration,

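[Figure 4: spectra of four methyl p-substituted benzoates (substituents NH2, OH, F, and NO2, from top to bottom), each annotated with its Hammett σ constant and log kOH value. Left panel: aromatic and aliphatic C-H stretching region, axis labeled from 3200 to 2600 cm⁻¹, expanded ×10. Right panel: axis labeled from 1800 to 1200 cm⁻¹, with the C=O and C=C stretching peaks marked.]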

FIG. 4. Gas-phase spectra of four p-substituted benzoates. See text for details. Hammett constants (σ) are from Ref. 16; log kOH values are from Ref. 6. The locations of 3048 and 1609 cm⁻¹ have been noted on the axis.

and not the other prominent C=C stretching vibration at about 1500 cm⁻¹, is puzzling. Directly comparing spectra of benzoates that are p-substituted with groups of differing electronic characteristics gives insight into the origin of these rules. In Fig. 4, we have plotted two regions (left side, 3300-2600 cm⁻¹; right side, 1900-1000 cm⁻¹) of the spectra of four different p-substituted methyl esters of benzoic acid. Note that, for presentation in this figure, spectra of the two regions have been normalized separately. For the 1900-1000 cm⁻¹ region, all four spectra have been scaled so that the C=O stretching peaks all span the same absorbance. For the 3300-2600 cm⁻¹ region, all four spectra have been scaled so that the CH3 asymmetrical stretching peaks (at about 2960 cm⁻¹) span the same absorbance. For all four cases, it was necessary to multiply the 3300-2600 cm⁻¹ region by a factor of approximately 10 relative to the 1900-1000 cm⁻¹ region.
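The separate normalization of the two regions can be sketched as follows. This is a minimal illustration, not the plotting procedure actually used for Fig. 4: the function name, the reference windows chosen to bracket the C=O and CH3 peaks, and the toy absorbance values are all assumptions, and a baseline-corrected spectrum is assumed so that "spanning the same absorbance" reduces to dividing by the reference peak height.

```python
def scale_region(wavenumbers, absorbances, region, reference_window):
    """Return {wavenumber: scaled absorbance} for the points inside `region`.

    The scale factor is chosen so that the largest absorbance found inside
    `reference_window` becomes 1.0 (baseline assumed to be at zero).
    """
    low, high = region
    ref_low, ref_high = reference_window
    points = [(w, a) for w, a in zip(wavenumbers, absorbances) if low <= w <= high]
    reference_peak = max(
        (a for w, a in points if ref_low <= w <= ref_high), default=0.0
    )
    if reference_peak <= 0:
        raise ValueError("no positive reference peak inside the window")
    return {w: a / reference_peak for w, a in points}


# Toy spectrum (invented numbers) sampled at a few positions of interest.
wavenumbers = [1280, 1609, 1740, 2960, 3048]
absorbances = [0.60, 0.20, 0.80, 0.08, 0.02]

# 1900-1000 cm^-1 region scaled to the C=O stretch (window is an assumption).
low_region = scale_region(wavenumbers, absorbances, (1000, 1900), (1700, 1800))

# 3300-2600 cm^-1 region scaled to the CH3 asymmetrical stretch near 2960 cm^-1.
high_region = scale_region(wavenumbers, absorbances, (2600, 3300), (2940, 2980))

# Factor by which the high-frequency region was expanded relative to the other.
relative_expansion = 0.80 / 0.08
```

With these toy numbers the high-frequency region is expanded tenfold relative to the low-frequency region, the same order of magnitude as the factor quoted above. After scaling, intensities such as the one at 1609 cm⁻¹ are expressed relative to the C=O peak within each spectrum.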

On the bottom of Fig. 4 is the spectrum of the compound with a strong electron-attracting substituent, NO2. (The Hammett σ constant for p-NO2 is +0.78.¹⁶ Substituents with positive Hammett σ constants are electron attracting, and those with negative constants are electron releasing. The greater the magnitude of the constant, the greater the effect.) Above that is the spectrum of the compound with a weaker electron-attracting substituent, F (σ = +0.06). Above that, the spectrum of the compound with

1384 Volume 48, Number 11, 1994

an electron-releasing substituent, OH (σ = -0.36), is shown. On top is the spectrum of the compound with a strong electron-releasing substituent, NH2 (σ = -0.66). Notice that there is a trend in log kOH values, which decrease on going from the compound on the bottom of Fig. 4 to the one on top. For electron-releasing substituents, smaller rate constants are observed because the partial positive charge on the carbonyl carbon is reduced, thus making the compound less susceptible to nucleophilic attack.

For the 1900-1000 cm⁻¹ region (right side of Fig. 4), comparing methyl p-nitrobenzoate, methyl p-fluorobenzoate, and methyl p-hydroxybenzoate shows that the intensity of the C=C stretching vibration at about 1600 cm⁻¹ increases steadily (in comparison to other peaks in the same spectrum, for example, the C=O stretching vibration) as one moves from the more electron-attracting substituent to the more electron-releasing substituent. It is not completely clear whether this trend continues on to methyl p-aminobenzoate, because the NH2 deformation vibration overlaps with the C=C stretching vibration. Deconvolution of this spectrum suggests that the trend does indeed continue. Regardless, the intensity at 1609 cm⁻¹ (ratioed to that of the C=O) is the highest for the NH2-substituted compound, which also exhibits the lowest reactivity among the esters in Fig. 4. Thus, the software trains on absorbance above a threshold at 1609 cm⁻¹ as a partial condition for inclusion in the subset of slowly reacting esters. In the case of NH2, the software system cannot know whether this enhanced absorbance is due entirely to a trend in enhanced intensity of the 1600-cm⁻¹ C=C stretching vibration caused by the electron-releasing character of the substituent, or whether it arises from some other vibration associated with the substituent. In this case, the resulting prediction is correct nonetheless.

Katritzky and co-workers have published extensively on the variations of ring vibrations in numerous substituted benzenes, including 69 p-substituted compounds¹⁷ in dilute chloroform solution. They describe in detail this same trend that we have observed and assert that it does indeed continue with NH2. Specifically, they report that, for p-substituted benzenes, the intensity of the 1600-cm⁻¹ peak varies as the algebraic difference of the electronic effects of the two substituents. (The Hammett σ constant for p-COOCH3 is +0.31.) Thus, when substituents of like electronic characteristics (such as the electron-attracting substituents NO2 and COOCH3) are para on benzene, the intensity of the 1600-cm⁻¹ peak is small. When the substituents are of differing electronic characteristics (such as NH2 and COOCH3), the intensity is large. The rough-sets analysis uncovered this trend, which Katritzky and co-workers discovered through manual spectral interpretation more than 30 years ago. Furthermore, the fundamental relationship of this trend to the hydrolytic activity of organic esters was identified by the model, with the use of relatively few examples.
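As a rough numerical illustration of the "algebraic difference" idea, the Hammett σ values quoted in the text can be differenced against that of the ester group. Two assumptions are made here that the text does not state: that the quoted Hammett σ constants are an adequate stand-in for the "electronic effects" in question, and that the difference is taken as σ(COOCH3) minus σ(substituent). The sketch only ranks the four substituents; it does not claim to predict peak intensities.

```python
# Hammett sigma constants as quoted in the text for the para substituents.
SIGMA_PARA = {"NO2": 0.78, "F": 0.06, "OH": -0.36, "NH2": -0.66}

# Hammett sigma constant quoted in the text for p-COOCH3.
SIGMA_ESTER = 0.31


def sigma_difference(substituent):
    """Signed difference sigma(COOCH3) - sigma(substituent).

    The sign convention is an assumption made for this illustration.
    """
    return SIGMA_ESTER - SIGMA_PARA[substituent]


# Substituents ordered from the smallest to the largest signed difference.
ranked = sorted(SIGMA_PARA, key=sigma_difference)

differences = {name: round(sigma_difference(name), 2) for name in ranked}
```

Under these assumptions the ordering runs NO2, F, OH, NH2, which is the same order in which the relative intensity near 1600 cm⁻¹ is described as increasing in Fig. 4.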

Katritzky and co-workers also note that the variations in the C=C stretching peak at about 1500 cm⁻¹ are much smaller and less regular. From the right side of Fig. 4, we would conclude from the spectra of the NH2-, OH-, and F-substituted compounds that, for our data set, no clear trend in this peak is observed. (It is not possible to assign the 1500-cm⁻¹ ring vibration in the NO2-substituted compound because of interference with the NO2 stretching vibrations at 1542 and 1351 cm⁻¹.) Accordingly, the rough-sets analysis ignored this peak as well.

Next, considering the 3300-2600 cm⁻¹ region, we see clearly from the left side of Fig. 4 that, for this data set, as we go from electron-withdrawing to electron-releasing substituents, the suite of aromatic C-H stretching peaks (3100-3000 cm⁻¹) increases steadily in comparison to the CH3 stretching peaks. (Katritzky and Simmons¹⁷ did not include this region in their reports because of the poor resolution of the sodium chloride prism.) Presumably, the enhanced intensity of the aromatic C-H stretching peak due to electron-releasing substituents is a result of an increase in the small difference in the electronegativities of the C and H atoms. Upon C-H stretching, the change in molecular dipole would be expected to increase, resulting in more intense peaks for compounds with electron-releasing substituents. For our data set, this effect correlates well with changes in log kOH. We had not recognized this trend prior to inspecting the decision rules.

CONCLUSION

Models for predicting alkaline hydrolysis reaction rates of organic esters using spectral data and rough-sets theory outperform previous ones built from the same input data with MLR. The advantages of the rough-sets model, with the use of MLR as a performance benchmark, can be viewed in several ways. First, if we approach prediction with the caveat that we will tolerate a given error of prediction (e.g., SEP < 0.6), then the rough-sets model is more widely applicable, at least for the database used in this study. Additionally, the previous MLR model was optimized in a way that required considerable manipulation of input data. The rough-sets model presented herein used unaltered spectral data directly. If a specified level of accuracy is accepted, then the rough-sets approach also offers the advantage of ease of use.
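The two quantities being weighed in this comparison, the error of prediction and the fraction of compounds to which a model applies, can be computed together as in the sketch below. The SEP is taken here as the root mean square of the prediction errors over the compounds that received a prediction; that definition is an assumption, since this section does not state the formula. The function name and the example values are hypothetical and are not data from the study.

```python
import math


def sep_and_coverage(observed, predicted):
    """Return (SEP, coverage) for paired observed and predicted log k values.

    A prediction of None marks a compound the model could not handle.
    SEP is computed as sqrt(sum(error^2) / n) over the predicted compounds,
    which is an assumed definition; coverage is the fraction predicted.
    """
    pairs = [(o, p) for o, p in zip(observed, predicted) if p is not None]
    if not pairs:
        raise ValueError("no compounds received a prediction")
    sep = math.sqrt(sum((p - o) ** 2 for o, p in pairs) / len(pairs))
    coverage = len(pairs) / len(observed)
    return sep, coverage


# Invented values for illustration only.
observed_log_k = [0.0, 1.0, -1.0, 2.0]
predicted_log_k = [0.5, 0.5, None, 2.5]

sep, coverage = sep_and_coverage(observed_log_k, predicted_log_k)
```

Reporting both numbers side by side makes the trade-off explicit: a model can lower its SEP simply by declining to predict difficult compounds, so a fair comparison fixes a tolerated SEP and then asks which model covers more of the data.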

For our application, accuracy of prediction is not the only, and perhaps not always the most important, consideration. Although we have presented prediction of constants for a well-studied reaction (alkaline hydrolysis), many processes of environmental concern for which constants are required are not well understood. An important advantage of this rule-building system is that the rules are auditable and are often easy to understand and interpret. Thus, for processes that are not well understood, empirical rules developed for the predictive model may actually provide a tool for investigating process mechanism and pathways. As the examples presented herein illustrate, the rules can also point out trends that are useful in spectral interpretation that may not be anticipated, thus providing the opportunity to more fully understand the relationship between activity and spectral characteristic. Also, with well-known processes, because anticipated trends are recognized, the user has more confidence that the predictive model is meaningful and useful.

1. T. Isaksson and T. Næs, Appl. Spectrosc. 42, 1273 (1988).

2. C. J. O'Conner, D. J. McLennan, D. J. Calvert, T. D. Lomax, A. J. Porter, and D. A. Rogers, Aust. J. Chem. 37, 497 (1984).

3. L. J. Bellamy, The Infra-red Spectra of Complex Molecules (Methuen and Co., London, 1954), Chap. 1.

4. A. R. Katritzky and R. D. Topsom, "Linear Free Energy Relationships and Optical Spectroscopy", in Advances in Linear Free Energy Relationships, N. B. Chapman and J. Shorter, Eds. (Plenum Press, London, 1972), Chap. 3.

5. L. P. Hammett, Physical Organic Chemistry: Reaction Rates, Equilibria, and Mechanisms (McGraw-Hill, New York, 1970), 2nd ed., Chap. 11.

6. T. W. Collette, Environ. Sci. Technol. 24, 1671 (1990).

7. S. Sekulic, M. B. Seasholtz, Z. Wang, B. R. Kowalski, S. E. Lee, and B. R. Holt, Anal. Chem. 65, 835A (1993).

8. P. B. Harrington and K. J. Voorhees, Anal. Chem. 62, 729 (1990).

9. M. P. Derde, L. Buydens, C. Guns, D. L. Massart, and P. K. Hopke, Anal. Chem. 59, 1868 (1987).

10. M. Otto and H. Bandemer, Anal. Chim. Acta 184, 21 (1986).

11. Z. Pawlak, International J. Computer Info. Sci. 11, 341 (1982).

12. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data (Kluwer Academic Publishers, Dordrecht, 1991).

13. A. J. Szladow, PC AI 7, 25 (1993).

14. T. W. Collette, Environ. Tox. Chem. 11, 981 (1992).

15. N. B. Colthup, L. H. Daly, and S. E. Wiberley, Introduction to Infrared and Raman Spectroscopy (Academic Press, New York, 1964), Chap. 4.

16. D. H. McDaniel and H. C. Brown, J. Org. Chem. 23, 420 (1958).

17. A. R. Katritzky and P. Simmons, J. Chem. Soc. 2051 (1959).

