Application of SIMPLISMA purity function for variable selection in multivariate regression analysis: A case study of protein secondary structure determination from infrared spectra

Chemometrics and Intelligent Laboratory Systems 88 (2007) 132–142
www.elsevier.com/locate/chemolab


Andrey Bogomolov a,1, Michel Hachey b,⁎

a European Molecular Biology Laboratory, Hamburg outstation, build. 25a, 85 Notkestrasse, 22603 Hamburg, Germany
b Advanced Chemistry Development, Inc., 110 Yonge Street, 14th floor, Toronto, Ontario, Canada M5C 1T4

Received 31 March 2006; received in revised form 7 July 2006; accepted 18 July 2006
Available online 7 September 2006

Abstract

A novel approach for the pre-selection of wavelengths, to be used in combination with Partial Least Squares (PLS) or other multivariate regression techniques, is presented. This variable selection method makes use of the purity function, originally suggested in the SIMPLe-to-use Interactive Self-modeling Mixture Analysis (SIMPLISMA) algorithm, to map out the regions of potentially influential variables. The selected intervals are then individually tested in practical modeling and prediction, and an optimal subset of variables is obtained. The algorithm is simple and intuitive and does not rely on iterative variable searches. The method was tested on a set of infrared protein spectra in order to improve the quantitative determination of the fractions of two secondary structure elements, α-helices and β-strands (β-sheets), in the protein polypeptide chain. Results comparable to those obtained through interval PLS (iPLS), an exhaustive search-based algorithm, were achieved in this study. Our method was shown to be particularly beneficial in combination with variable weighting by their inverse standard deviation.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Variable selection; PLS; SIMPLISMA; Purity function; Protein secondary structure

1. Introduction

Variable selection is an important preprocessing procedure in chemometrics, which is widely used to improve the performance of various multivariate methods and algorithms, such as regression methods, factor analysis, and curve resolution. The strength of multivariate approaches is that they can exploit all variables to effectively extract necessary information in the analysis. However, not all variables or their regions are equally important for the modeling; some of them, like noise areas, may even be harmful. Data projection on the abstract factor space reduces the error but does not eliminate it entirely; it is partially projected onto the new data space, spoiling the model [1]. Therefore, removal of the variables in which the noise dominates over the relevant information usually leads to better accuracy and performance of analytical methods.

⁎ Corresponding author. E-mail address: [email protected] (M. Hachey).

1 During the period when the idea of the paper was first discussed, A. Bogomolov was located at Advanced Chemistry Development, Ltd., Moscow office, 6 Akademika Bakuleva str., 117513 Moscow, Russia.

0169-7439/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2006.07.006

In multivariate regression, variable selection is usually well worth the effort. For Multiple Linear Regression (MLR), variable selection may be compulsory if the number of variables exceeds the number of samples, which is not allowed by this method. Besides, since this regression technique is very sensitive to variable collinearity, various selection methods are used to find an optimal variable subset, for instance, stepwise MLR [2]. In contrast, factor projection methods like Partial Least Squares (PLS) [3,4] and Principal Component Regression (PCR) [5] are free of this disadvantage. For a long time, it was believed that the PCR and PLS full-spectrum methods did not need preliminary feature selection. However, it was shown that the predictive ability can be increased and the complexity of the model can be reduced by a judicious pre-selection of wavelengths [6–14]. When developing mission-critical regression models intended for routine (e.g., industrial) usage, even a subtle improvement in the performance is important.

Table 1
Studied proteins and their secondary structure: fractions of α-helices and β-sheets

| # | Protein | α-helix | β-sheet | Ref. |
|---|---|---|---|---|
| 1 | Albumin (human serum) | 0.723 | 0 | [21] |
| 2 | Ferritin (apo, horse spleen) | 0.736 | 0 | [21] |
| 3 | β-lactoglobulin A (bovine milk) | 0.167 | 0.41 | [22] |
| 4 | Carbonic anhydrase (bovine erythrocytes) | 0.162 | 0.286 | [22] |
| 5 | α-chymotrypsin (bovine pancreas) | 0.114 | 0.314 | [22] |
| 6 | α-chymotrypsinogen (bovine pancreas) | 0.078 | 0.33 | [21] |
| 7 | Concanavalin A (jack bean) | 0.038 | 0.464 | [22] |
| 8 | Cytochrome c (oxidized; equine heart) | 0.48 | 0.1 | [23] |
| 9 | Elastase (bovine pancreas) | 0.108 | 0.342 | [22] |
| 10 | Hemoglobin (aquomet; equine) | 0.717 | 0 | [21] |
| 11 | Immunoglobulin G (bovine) | 0.03 | 0.67 | [23] |
| 12 | Interferon (recombinant, human) | 0.479 | 0 | [21] |
| 13 | α-lactalbumin (Ca2+-bound, bovine milk) | 0.455 | 0.065 | [21] |
| 14 | α-lactalbumin (Ca2+-depleted, bovine milk) | 0.442 | 0.082 | [21] |
| 15 | Lysozyme (chicken egg white) | 0.419 | 0.063 | [22] |
| 16 | Ovalbumin (chicken egg) | 0.319 | 0.33 | [21] |
| 17 | Papain (papaya latex) | 0.26 | 0.169 | [22] |
| 18 | Growth hormone (recombinant, human) | 0.438 | 0 | [21] |
| 19 | Ribonuclease A (bovine pancreas) | 0.21 | 0.331 | [22] |
| 20 | Ribonuclease S (bovine pancreas) | 0.216 | 0.371 | [21] |
| 21 | Cu,Zn-superoxide dismutase (oxidized) | 0.02 | 0.384 | [21] |
| 22 | Cu,Zn-superoxide dismutase (reduced) | 0.02 | 0.384 | [21] |
| 23 | Subtilisin Carlsberg (Bacillus licheniformis) | 0.292 | 0.172 | [21] |
| 24 | Trypsin (bovine pancreas) | 0.09 | 0.56 | [23] |

Sources of the protein secondary structure information are given in the Ref. column.

Another situation where variable selection may be highly recommended is the application of data weighting. For example, variable multiplication by inverse standard deviation is commonly applied in the analysis [5]. For spectral data it may be useful to strengthen low-intensity signals of the analyte component(s). However, this procedure often leads to exaggeration of the pure noise regions, which may override the positive effects, thus leading to deterioration of the entire model. Variable selection is a natural way to avoid this. One of the most evident and straightforward approaches to the noise problem is to use a standard deviation threshold to eliminate all variables below a chosen limit, e.g., 5% [1]. This method, however, cannot distinguish between “good” and “bad” variables in regions of high variance.

There is a group of methods which apply direct testing of multiple variable combinations in the modeling, and sort them based on the resulting prediction error to find an optimal solution. Interval PLS (iPLS) [15] is a recent algorithm of this type, and one of the most efficient. In the iPLS approach, an interval width (the number of spectral channels) is chosen to divide the full spectrum into a series of local bins. Next, the number of intervals (N) that will be included in a local PLS model is selected, thus limiting the number of combinations to be tested. Local PLS models are built for all possible combinations of N local intervals. The models showing the best performance are then selected. The constraints placed on the interval width and number prevent the need for testing an astronomical number of combinations, while still providing an exhaustive search pattern. The algorithm was shown to be effective in modeling optical spectroscopic data [13,14]. However, a problem inherent to all search-based methods is a tendency to yield wavelength selection instabilities relative to sample data additions or subtractions, which is due to the susceptibility of the region selection to random noise [16,17]. So while search methods like iPLS are very good at locating wavelength regions of main component contributions, in-depth interpretation of the relevant spectroscopic features may be difficult in places. An alternative approach that avoids the potential wavelength selection instability pitfall and provides a simpler graphic representation amenable to spectroscopic interpretation is therefore worth exploring.
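To give a feeling for the size of the exhaustive search described above, the following minimal sketch enumerates all combinations of N intervals out of a set of local bins. The numbers used (20 bins, N = 3) are hypothetical and chosen only for illustration; they are not taken from the iPLS studies cited here.

```python
from itertools import combinations
from math import comb

# Hypothetical settings, for illustration only
n_bins = 20      # number of local intervals the spectrum is divided into
n_included = 3   # N: number of intervals combined in each local PLS model

# Every candidate interval set that an exhaustive search would have to model
interval_sets = list(combinations(range(n_bins), n_included))

print(len(interval_sets), "local models to build and cross-validate")
```

Each tuple in `interval_sets` would correspond to one local PLS model, which is why the interval width and N have to be constrained in practice.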

In the present work, we aim to develop a simple but effective variable selection method, avoiding intensive combinatorial screening of candidate variables. We applied the purity function introduced in the SIMPLISMA curve resolution algorithm [18,19] as a guide for selecting a relatively small number of potentially informative intervals.

The method's performance was tested using the problem of protein secondary structure analysis from infrared (IR) spectral studies using PLS regression. The effect of variable selection on the prediction accuracy of α-helical and β-stranded element contents is considered both with and without preliminary data weighting. Practical recommendations on the application of the method are formulated.

2. Data, theory, and methods

2.1. Data

The same core data set as used in the conventional PLS and iPLS studies by Navea et al. [13] was used for this study. It was downloaded from the Protein Infrared Database [20], which is accessible on the Internet. Table 1 identifies all 24 proteins belonging to this data set, as well as the measured fraction of secondary structure elements, α-helices and β-sheets in the polypeptide chain, as determined by X-ray crystallographic analysis [21–23]. A constant-offset baseline correction was applied to avoid negative values, which are detrimental to the SIMPLISMA algorithm. Further, since the infrared spectra were acquired on different instruments (Nicolet 730, Magna 550, or Bomen-MB Fourier-transform infrared spectrometer in single-beam mode) with a 4 cm−1 resolution at room temperature, the spectral series was converted to a uniform matrix by means of linear interpolation between data points and subsequent truncation of unwanted regions. The resulting data matrix ranged from 1350.21 cm−1 to 1749.72 cm−1 and had a data spacing of 1.93 cm−1. Finally, each spectrum was individually normalized to its vector norm. Fig. 1a shows the resulting spectral series.
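The preprocessing chain described above (constant offset, linear interpolation onto a common grid, vector normalization) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `preprocess_spectra`, the rule of shifting by the spectrum minimum only when it is negative, and the synthetic test spectra are all assumptions. The grid of 208 points follows from the stated range and the 1.93 cm−1 spacing.

```python
import numpy as np

def preprocess_spectra(wavenumbers, spectra, grid):
    """Offset, resample, and normalize spectra (hypothetical helper)."""
    rows = []
    for wn, y in zip(wavenumbers, spectra):
        wn = np.asarray(wn, dtype=float)
        y = np.asarray(y, dtype=float)
        # Constant offset (assumed form): shift up only if values go negative
        y = y - min(y.min(), 0.0)
        # np.interp requires increasing x, so sort by wavenumber first
        order = np.argsort(wn)
        resampled = np.interp(grid, wn[order], y[order])
        # Normalize each spectrum to unit vector (Euclidean) norm
        rows.append(resampled / np.linalg.norm(resampled))
    return np.vstack(rows)

# Common grid: 1350.21 to 1749.72 cm-1 in steps of 1.93 cm-1 (208 points)
grid = 1350.21 + 1.93 * np.arange(208)

# Two synthetic "spectra" on different instrument axes (illustration only)
wn_a = np.linspace(1300.0, 1800.0, 260)
y_a = np.exp(-((wn_a - 1650.0) / 20.0) ** 2) - 0.05   # contains negative values
wn_b = np.linspace(1800.0, 1300.0, 300)               # descending axis
y_b = np.exp(-((wn_b - 1630.0) / 25.0) ** 2)

X = preprocess_spectra([wn_a, wn_b], [y_a, y_b], grid)
print(X.shape)
```

Truncation of unwanted regions happens implicitly here, because only the grid points inside the target range are evaluated.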

2.2. Theory

SIMPLISMA (SIMPLe-to-use Interactive Self-modeling Mixture Analysis) is a soft-modeling curve resolution algorithm by Windig and Guilment [18] that decomposes a data matrix, without any prior component spectra or composition knowledge, into its pure component spectra and detectable component concentration profiles. The success of SIMPLISMA curve resolution relies on its ability to detect wavelength regions that are most characteristic of each component in the mixture, while avoiding problematic regions.

Fig. 1. IR spectra of proteins (Table 1) and SIMPLISMA-resolved components: (a) IR spectral data; and (b) resolved component spectra and corresponding pure variables: first (solid line), second (dashed line), and third (dash-dotted line).

The wavelength selection algorithm used by SIMPLISMA is based on locating the pure variables, i.e., the variables where the intensity contributions can be attributed to only one component in the mixture. Pure variables are detected for each nth component in the mixture by using a modified form of the relative standard deviation (also known as the coefficient of variation), which is referred to as a purity function (Eq. (1)),

$$p_{jn} = \frac{\sigma_{jn}}{\mu_j + \alpha} \qquad (1)$$

where pjn is the purity value of the jth variable for the nth component; σjn is the standard deviation of the jth variable corrected for the variance explained by the previously detected n−1 components (σj1 is the conventional standard deviation); μj is the variable mean; and α is a constant offset, which is defined as a percent of the mean. The offset is designed to prevent an artificially high purity value in cases where the mean value, μj, approaches zero. Like its relative standard deviation cousin, the purity function is used to compare the variance between measurements of different absolute magnitude. It was shown that the higher the purity value, the higher the relative component purity of the variable [18]. Therefore, if we plot the purity value, pj, at each variable j, then we obtain the so-called purity spectrum. The maximum intensity peak in the plot will correspond to the pure variable of the first component. All other peaks in the purity spectrum indicate regions of a higher degree of component purity relative to other regions. Since any given data set may have more than one pure variable, it is not obvious at first glance which peak correlates with which component. However, because each component is calculated stepwise in association with a pure variable, the effect of previously selected components is eliminated from each subsequent purity spectrum so that it reflects only the residual variance of the data. Because contributions from previous components are suppressed, the subsequent maximum in the new purity spectrum corresponds to a different component.
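For the first component, Eq. (1) reduces to the ordinary standard deviation divided by the offset-corrected mean, which can be sketched in a few lines. The following is a minimal illustration only: the function name is hypothetical, the population standard deviation is used, and the offset α is taken as a percentage of the largest variable mean, which is an assumption about how "percent of the mean" is applied. The correction of σ for previously detected components (n > 1) is not shown.

```python
import numpy as np

def first_purity_spectrum(X, offset_percent=10.0):
    """First purity spectrum p_j1 = sigma_j1 / (mu_j + alpha), Eq. (1).

    X has one spectrum per row and one variable (channel) per column.
    Assumption: alpha is offset_percent of the maximum variable mean.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)        # variable means
    sigma = X.std(axis=0)      # conventional (population) standard deviation
    alpha = offset_percent / 100.0 * mu.max()
    return sigma / (mu + alpha)

# Tiny example: a constant channel, a varying channel, and an empty channel
X_demo = np.array([[1.0, 2.0, 0.0],
                   [1.0, 4.0, 0.0],
                   [1.0, 6.0, 0.0]])
p_demo = first_purity_spectrum(X_demo, offset_percent=10.0)
print(p_demo)
```

The empty third channel shows the role of the offset: without α the ratio would be 0/0, whereas with it the purity is simply zero.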

Because of significant signal overlap in the present IR region, a modified SIMPLISMA algorithm was applied to the analysis of protein data. This method is essentially the same, except that it makes use of the so-called inverted second derivatives to search for the pure variables, instead of the raw spectral data [19]. The inverted derivative is defined as the smoothed second derivative of a spectrum multiplied by −1, with negative values (illegal in the SIMPLISMA approach) set to zero. The truncation of negative regions may result in the loss of some pure variables. While this is a concern, it was noted [19] that the negative regions (in the inverted 2nd derivative function) tend to suffer from more component overlap and thus were less likely to lead to viable pure variables. Furthermore, since components tend to have more than one pure variable in any data set, the method generally remains viable in spite of possible loss of information.


This assertion is supported by the fact that the 2nd derivative mode in SIMPLISMA was used for many applications [24–26] where components were highly overlapping or with significant background problems. Note that the derivative data were only used for the pure variable search; the component resolution in the next step of the algorithm is performed on the original data, as in the conventional SIMPLISMA [18].

2.3. Methods

The proposed algorithm for wavelength selection by the pure variable technique consists of the following steps.

2.3.1. Derivative calculation

The inverted second derivative was calculated by the Savitzky–Golay [27] method (Fig. 2b). In this algorithm, the order of the polynomial and smoothing window can be adjusted to take into account the data type and noise level. These values were set to 2 and 9, respectively, which are typical settings for IR spectrum treatment. The present step is optional: the 2nd derivative pretreatment may be skipped for data with well-resolved spectral peaks.
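As a rough illustration of this step, the sketch below derives the Savitzky–Golay second-derivative weights for a 9-point window and a 2nd-order polynomial from a least-squares fit, applies them by convolution, inverts the sign, and sets negative values to zero. It is a minimal NumPy sketch under stated assumptions (hypothetical function names, zero-padded edges that should be discarded), not the implementation used in the software cited in the paper.

```python
import numpy as np

def savgol_second_derivative_weights(window=9, polyorder=2):
    """Least-squares weights giving the 2nd derivative at the window center."""
    half = window // 2
    x = np.arange(-half, half + 1, dtype=float)
    A = np.vander(x, polyorder + 1, increasing=True)  # columns: 1, x, x^2, ...
    # Row 2 of the pseudo-inverse yields the x^2 coefficient; d2/dx2 = 2 * a2
    return 2.0 * np.linalg.pinv(A)[2]

def inverted_second_derivative(y, spacing=1.0, window=9, polyorder=2):
    """Smoothed 2nd derivative times -1, with negative values set to zero."""
    w = savgol_second_derivative_weights(window, polyorder)
    # The weights are symmetric, so convolution equals correlation here.
    # mode="same" zero-pads, so the first and last window//2 points are unreliable.
    d2 = np.convolve(np.asarray(y, dtype=float), w[::-1], mode="same") / spacing**2
    return np.clip(-d2, 0.0, None)

# Check on a downward parabola, whose second derivative is exactly -2
x_demo = 1.93 * np.arange(50)
inv_demo = inverted_second_derivative(-(x_demo - 48.0) ** 2, spacing=1.93)
print(inv_demo[10:15])
```

A band maximum has a negative second derivative, so after inversion it appears as a positive, narrower peak, which is what the pure variable search operates on.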

2.3.2. Curve resolution

Resolve the components by SIMPLISMA. The number of components to be retained in the model is determined in accordance with several diagnostic tools provided by the algorithm and implemented in the software [28,29]. An analyst may choose between fully automatic pure variable selection at the maxima of the purity spectra and manual setting of the pure variables. Inactive regions may be set to disable variable searches in clearly useless areas. In our case study, we applied the method of automated selection. Resolved spectra by themselves are not used for variable selection. Curve resolution is performed to obtain the purity spectra, which are of interest. As we will show below, a simplified method may be based on only the first purity spectrum; in this case, no component resolution is necessary.

2.3.3. Peak selection

Collect positions of peak maxima among all purity spectra and sort them in accordance with: first, component ordinal number and, second, peak intensity within the same purity spectrum. In this hierarchy, peaks of an earlier component are always above subsequent ones; and in the same purity spectrum, more intense peaks are at the head of the list. We only included those peaks which accounted for at least 5% of the global maximum of the purity spectrum. This threshold can be adjusted within the method framework.
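The ranking described in this step can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes that the 5% threshold is taken relative to the maximum of each purity spectrum and that a peak position already found in an earlier purity spectrum is not listed again.

```python
import numpy as np

def local_maxima(p):
    """Indices of interior local maxima of a 1-D array."""
    p = np.asarray(p, dtype=float)
    return np.where((p[1:-1] > p[:-2]) & (p[1:-1] >= p[2:]))[0] + 1

def ranked_peaks(purity_spectra, threshold=0.05):
    """Return (component number, channel index) pairs in testing order."""
    ranked, seen = [], set()
    for comp, p in enumerate(purity_spectra, start=1):
        p = np.asarray(p, dtype=float)
        idx = local_maxima(p)
        idx = idx[p[idx] >= threshold * p.max()]   # drop minor peaks
        # Within one purity spectrum, the most intense peaks come first
        for j in sorted(idx, key=lambda k: -p[k]):
            j = int(j)
            if j not in seen:                      # skip repeated positions
                seen.add(j)
                ranked.append((comp, j))
    return ranked

# Two toy purity spectra (illustration only)
p1 = [0.0, 1.0, 0.0, 0.5, 0.0, 0.01, 0.0]
p2 = [0.0, 0.2, 0.0, 0.0, 0.0, 1.0, 0.0]
peaks_demo = ranked_peaks([p1, p2])
print(peaks_demo)
```

In the toy example, the small peak at channel 5 of the first spectrum falls below the threshold but is recovered from the second spectrum, where it dominates.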

2.3.4. Define initial intervals

For each peak (wavenumber value) selected in the previous step (Section 2.3.3), an interval is defined for the initial data set by adding an equal number of adjacent points on each side so that all intervals have the same chosen width according to the current option setting. The width should be adjusted in accordance with specific data. For this protein IR data set, an interval of about 10 cm−1 was found to be optimal in the iPLS studies [13].
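A minimal sketch of the interval definition is given below, using the 7-point intervals and the 1.93 cm−1 grid of this data set. The helper name `peak_interval`, the clipping at the spectrum edges, and the channel count of 208 (derived from the stated range and spacing) are assumptions made for illustration.

```python
def peak_interval(peak_index, n_points, n_channels):
    """Symmetric interval of n_points channels centered on a peak index.

    Returns inclusive (low, high) channel indices, clipped to the data range.
    """
    half = n_points // 2
    low = max(peak_index - half, 0)
    high = min(peak_index + half, n_channels - 1)
    return low, high

start, spacing, n_channels = 1350.21, 1.93, 208

# Peak at 1628.13 cm-1, the first entry of the peak list
peak_index = round((1628.13 - start) / spacing)
low, high = peak_interval(peak_index, n_points=7, n_channels=n_channels)

low_cm = start + spacing * low
high_cm = start + spacing * high
print(f"{low_cm:.2f}-{high_cm:.2f} cm-1, width {high_cm - low_cm:.2f} cm-1")
```

Seven points span six grid steps, i.e. 6 × 1.93 = 11.58 cm−1, which reproduces the 1622.34–1633.92 cm−1 interval listed for this peak in Table 2.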

2.3.5. Interval testing

The intervals are tested one-by-one, starting from the top of the list. During the testing, the final optimal set of wavelength intervals is simultaneously compiled. On each step, the tested interval is added to the currently compiled list, and a PLS model is built. If the prediction error (defined below in this section) is reduced due to the newly added interval, then the latter is considered useful and added to the cumulative optimal set; otherwise, it is ignored and the next peak is tested. The process is repeated until the end of the list. The interval set collected by the end of the testing is considered the final solution. An additional procedure is applied for the very first interval (corresponding to the maximum peak in the first purity spectrum). Since at the first testing step there is nothing to compare the modeling error with, it is tested immediately after addition of the second successful interval as follows. The first interval is removed from the two-interval set and the model for the single second interval is built. Only if the modeling error in this case is higher than that with both intervals is the first one returned into the set.
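The testing loop above can be sketched as a simple forward pass over the ranked list. The sketch is written against a hypothetical callback `cv_error(intervals)` that is assumed to build a PLS model on the given intervals and return its cross-validation error; the toy error function used in the example is purely illustrative and has no chemical meaning.

```python
def select_intervals(candidates, cv_error):
    """Forward testing of ranked intervals (sketch of Section 2.3.5).

    candidates: intervals in ranked order.
    cv_error:   hypothetical callable, list of intervals -> prediction error.
    """
    selected = [candidates[0]]
    best = cv_error(selected)
    first_checked = False
    for interval in candidates[1:]:
        trial = selected + [interval]
        err = cv_error(trial)
        if err < best:                      # interval is useful: keep it
            selected, best = trial, err
            if not first_checked:
                # Re-test the very first interval once a second one is accepted
                first_checked = True
                alone = cv_error([interval])
                if alone <= best:           # first interval brings no benefit
                    selected, best = [interval], alone
    return selected, best

# Toy error model (illustration only): each interval changes the error additively
gain = {"A": -0.05, "B": 0.3, "C": -0.1, "D": 0.2}

def toy_error(intervals):
    return 1.0 - sum(gain[i] for i in intervals)

chosen, final_error = select_intervals(["A", "B", "C", "D"], toy_error)
print(chosen, final_error)
```

Only one model is built per candidate (plus one extra for the first-interval check), so the number of models grows linearly with the length of the peak list rather than combinatorially.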

Accuracy of the local models was characterized by the root mean square error of cross-validation, RMSECV, Eq. (2):

$$\mathrm{RMSECV} = \sqrt{\frac{\sum_{i=1}^{n} \left(\hat{y}^{cv}_{i} - y^{ref}_{i}\right)^{2}}{n}} \qquad (2)$$

where ŷcv is the PLS-predicted fraction of a secondary structure element in accordance with the cross-validation procedure; yref is a reference value (i.e., calculated from the three-dimensional protein structure model based on X-ray crystallographic studies); and n is the number of proteins in the data set. The full leave-one-out cross-validation method was used. Additionally, the linear correlation coefficient, r, between the predicted ŷcv and reference yref (Table 1) fractions of α- and β-forms was used as a criterion of the model performance.
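Eq. (2) and the leave-one-out procedure can be sketched as below. The regression model is passed in as a hypothetical callable `fit_predict(X_train, y_train, X_test)`; for the sake of a self-contained example an ordinary least-squares stand-in is used, which is not the PLS1 model of the paper.

```python
import numpy as np

def loo_predictions(X, y, fit_predict):
    """Leave-one-out predictions: each sample is predicted by a model
    trained on all the other samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        y_cv[i] = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
    return y_cv

def rmsecv(y_cv, y_ref):
    """Root mean square error of cross-validation, Eq. (2)."""
    diff = np.asarray(y_cv, dtype=float) - np.asarray(y_ref, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def least_squares_fit_predict(X_train, y_train, X_test):
    """Stand-in model (ordinary least squares with intercept), NOT PLS."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

# Noise-free linear toy data (illustration only)
X_toy = np.arange(6.0).reshape(-1, 1)
y_toy = 2.0 * X_toy[:, 0] + 1.0
y_cv_toy = loo_predictions(X_toy, y_toy, least_squares_fit_predict)
r_toy = np.corrcoef(y_cv_toy, y_toy)[0, 1]   # linear correlation coefficient r
print(rmsecv(y_cv_toy, y_toy), r_toy)
```

Replacing the stand-in with a PLS1 routine would leave the cross-validation and error functions unchanged.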

The PLS1 regression algorithm [5] was used, and hence, the testing was separately performed for α-helix and β-sheet predictions. The data were always mean-centered prior to the modeling. In some calculations, subsequent variable weighting by inverse standard deviation (auto-scaling) was applied. The number of PLS factors to be kept in the model in each case was determined as the first local minimum of RMSECV. The number of components in local PLS models could not be greater than in the global full-range model.
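The two bookkeeping rules in this paragraph, auto-scaling and the choice of the number of factors, can be sketched as follows. The function names are hypothetical, the sample standard deviation (n−1 in the denominator) is an assumption, and the RMSECV values in the example are invented for illustration.

```python
import numpy as np

def autoscale(X):
    """Mean-center each variable and weight it by its inverse standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)     # assumption: sample standard deviation
    std[std == 0.0] = 1.0           # leave constant variables unscaled
    return centered / std

def first_local_minimum(errors, max_factors=None):
    """Number of factors at the first local minimum of RMSECV.

    errors[k-1] is the RMSECV of the model with k factors; max_factors caps
    the answer, e.g. at the size of the global full-range model.
    """
    errors = list(errors)
    if max_factors is not None:
        errors = errors[:max_factors]
    for k in range(len(errors) - 1):
        if errors[k + 1] >= errors[k]:   # error stops decreasing after k+1 factors
            return k + 1
    return len(errors)

# Invented RMSECV curve for 1..4 factors (illustration only)
n_factors = first_local_minimum([0.120, 0.110, 0.115, 0.090])
print(n_factors)
```

Note that the first local minimum can differ from the global minimum, as in the invented curve above, where two factors are chosen although four would give a lower error.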

2.4. Software

SIMPLISMA curve resolution and associated calculations were carried out with the ACD/UV-IR Processor software [30] available from ACD/Labs. The PLS regression and RMSECV calculations were programmed in the MATLAB® software by MathWorks, Inc.

3. Results and discussion

First, curve resolution was performed on the data set using 2nd derivative SIMPLISMA with the offset α=10%, and three component spectra, corresponding to pure variables at 1628.13, 1655.15, and 1637.78 cm−1, were extracted (Fig. 1b). The number of components to be extracted by SIMPLISMA was determined to equal three based on the standard diagnostics of the algorithm, specifically, the first maximum of the component's Coefficient and minimum of Relative Total Sum of Purity-corrected Standard Deviation, abbreviated as TSSR [18] (not shown). A newer method based on a modified empirical Indicator Function [29] detected four components at the α offset value of 10% (Eq. (1)), but this number drops to three when α is increased to 12%. Determination of the model size in SIMPLISMA, as in many other multivariate algorithms, often cannot be done unambiguously without an informed operator's decision. Nevertheless, the present method is not very sensitive to the number of detected components since, as will be seen later, variable selection is mostly based on the first purity spectrum, which incorporates effects from all components; higher-order functions play a secondary role, and thus under- or over-determination is not that dangerous in this context.

Resolved pure component spectra and corresponding concentration (more precisely, α- or β-element fraction) profiles, not being directly involved in the variable selection itself, provide complementary qualitative information about the analyzed data. Component number 2 (pure variable at 1655.15 cm−1, Fig. 1b) can be straightforwardly assigned to α-helices based on a high correlation coefficient (r=0.88) between the resolved concentration profile of the element and its reference fraction (Table 1). Besides, the maximum peak of the component spectrum (dashed line in Fig. 1b), coinciding with the purity maximum, closely agrees with the α-helix amide I band typically observed at 1650–1660 cm−1 [20,31–34]. With regard to the β-sheets, the interpretation of results is not as evident. Resolved spectra of both components 1 and 3 have major peaks in the wavenumber interval characteristic of the β-sheet amide I band (1623–1643 cm−1 [20,31–34]). The third purity component's contribution profile is distinctly associated with the β-sheet fraction (r=0.51). However, this relation is not as strong as for α-helices and, perhaps, the first component also contributes to the β-form variance (r=0.23).

Perhaps the failure to resolve the contribution of β-sheets as a single spectrum-profile pair comes from an instability of the element's spectral shape due to the presence of multiple forms like parallel and antiparallel β-sheets (both forms were formerly reported to have closely located peaks in the region 1612–1640 cm−1), or from other system variability effects (e.g., protein size or its tertiary structure) on the IR spectrum. PLS regression manages to handle non-linearity of this kind by adding more factors into the model, but for the curve resolution it may be fatal. Nevertheless, for the variable selection, which is based on the purity calculation rather than curve resolution itself, extraction of physically interpretable components is not a necessary requirement.

Fig. 2. Candidate interval selection: (a) raw data; (b) inverted 2nd derivative data; (c) variables' standard deviation; and (d) purity spectra (solid: first; dashed: second; dash-dotted: third). Selected peak positions are shown with vertical dotted lines.

Table 2
Testing results for the non-weighted data

| n | Peak, cm−1 (# p.s.) | α-helix: Interval, cm−1 | α-helix: RMSECV | α-helix: + RMSECV | α-helix: +/− | β-sheet: Interval, cm−1 | β-sheet: RMSECV | β-sheet: + RMSECV | β-sheet: +/− |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1628.13 (1) | 1622.34–1633.92 | 0.1241 | 0.1241 | + | 1622.34–1633.92 | 0.1011 | 0.1011 | + |
| 2 | 1655.15 (1) | 1649.36–1660.94 | 0.1164 | 0.1164 | + | 1649.36–1660.94 | 0.0919 | 0.0919 | + |
| 3 | 1635.85 (1) | 1630.06–1641.64 | 0.1137 | 0.1137 | + | 1630.06–1641.64 | 0.0884 | 0.0884 | + |
| 4 | 1549.00 (1) | 1543.21–1554.79 | 0.1103 | 0.1103 | + | 1543.21–1554.79 | 0.0919 | 0.0884 | − |
| 5 | 1686.03 (1) | 1680.24–1691.82 | 0.1092 | 0.1092 | + | 1680.24–1691.82 | 0.0886 | 0.0884 | − |
| 6 | 1693.75 (1) | 1693.75–1699.54 | 0.1087 | 0.1087 | + | 1693.75–1699.54 | 0.0886 | 0.0884 | − |
| 7 | 1674.45 (1) | 1668.66–1678.31 | 0.1086 | 0.1086 | + | 1668.66–1678.31 | 0.0878 | 0.0878 | + |
| 8 | 1518.12 (1) | 1512.33–1523.91 | 0.1074 | 0.1074 | + | 1512.33–1523.91 | 0.0874 | 0.0874 | + |
| 9 | 1400.39 (1) | 1394.60–1406.18 | 0.1074 | 0.1074 | − | 1394.60–1406.18 | 0.0874 | 0.0874 | − |
| 10 | 1458.29 (1) | 1452.50–1464.08 | 0.1078 | 0.1074 | − | 1452.50–1464.08 | 0.0877 | 0.0874 | − |
| 11 | 1452.50 (1) | 1446.71–1458.29 | 0.1077 | 0.1074 | − | 1446.71–1458.29 | 0.0877 | 0.0874 | − |
| 12 | 1415.83 (1) | 1410.04–1421.62 | 0.1074 | 0.1074 | − | 1410.04–1421.62 | 0.0874 | 0.0874 | − |
| 13 | 1388.81 (1) | 1383.02–1394.60 | 0.1074 | 0.1074 | − | 1383.02–1394.60 | 0.0874 | 0.0874 | − |
| 14 | 1469.87 (1) | 1464.08–1475.66 | 0.1077 | 0.1074 | − | 1464.08–1475.66 | 0.0876 | 0.0874 | − |
| 15 | 1637.78 (2) | 1643.57–1643.57 | 0.1075 | 0.1074 | − | 1643.57–1643.57 | 0.0871 | 0.0871 | + |
| 16 | 1664.80 (2) | 1662.87–1666.73 | 0.1072 | 0.1072 | + | 1662.87–1666.73 | 0.0871 | 0.0871 | − |
| 17 | 1537.42 (3) | 1531.63–1541.28 | 0.1080 | 0.1072 | − | 1531.63–1543.21 | 0.0873 | 0.0871 | − |
| 18 | 1520.05 (3) | 1525.84–1525.84 | 0.1071 | 0.1071 | + | 1525.84–1525.84 | 0.0870 | 0.0870 | + |
| 19 | 1558.65 (3) | 1556.72–1564.44 | 0.1071 | 0.1071 | − | 1552.72–1564.44 | 0.0876 | 0.0870 | − |
| 20 | 1460.22 (3) | 1454.43–1466.01 | 0.1075 | 0.1071 | − | 1454.43–1466.01 | 0.0873 | 0.0870 | − |
| 21 | 1402.32 (3) | 1396.53–1408.11 | 0.1071 | 0.1071 | − | 1396.53–1408.11 | 0.0870 | 0.0870 | − |
| ∑ | | | | 0.1071 | | | | 0.0870 | |

The column ‘n’ is the testing step number, with the corresponding peak offset (purity spectrum number in parentheses) and tested interval given in the ‘Peak’ and ‘Interval’ columns, respectively (adjacent interval boundaries are underlined); the ‘RMSECV’ column includes the RMSECV PLS modeling error with the currently tested interval added; the testing decision on accepting (+) or rejecting (−) the current interval is marked in the column ‘+/−’; ‘+ RMSECV’ gives the resulting cumulative RMSECV of the nth testing step.

Table 3
Testing results for the inverse standard deviation-weighted data

| n | Peak, cm−1 (# p.s.) | α-helix: Interval, cm−1 | α-helix: RMSECV | α-helix: + RMSECV | α-helix: +/− | β-sheet: Interval, cm−1 | β-sheet: RMSECV | β-sheet: + RMSECV | β-sheet: +/− |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1628.13 (1) | 1622.34–1633.92 | 0.1244 | 0.1244 | + | 1622.34–1633.92 | 0.1018 | 0.1018 | + |
| 2 | 1655.15 (1) | 1649.36–1660.94 | 0.1179 | 0.1179 | + | 1649.36–1660.94 | 0.0932 | 0.0932 | + |
| 3 | 1635.85 (1) | 1630.06–1641.64 | 0.1139 | 0.1139 | + | 1630.06–1641.64 | 0.0882 | 0.0882 | + |
| 4 | 1549.00 (1) | 1543.21–1554.79 | 0.1119 | 0.1119 | + | 1543.21–1554.79 | 0.0921 | 0.0882 | − |
| 5 | 1686.03 (1) | 1680.24–1691.82 | 0.1093 | 0.1093 | + | 1680.24–1691.82 | 0.0927 | 0.0882 | − |
| 6 | 1693.75 (1) | 1693.75–1699.54 | 0.1104 | 0.1093 | − | 1687.96–1699.54 | 0.0930 | 0.0882 | − |
| 7 | 1674.45 (1) | 1668.66–1678.31 | 0.1047 | 0.1047 | + | 1668.66–1680.24 | 0.0888 | 0.0882 | − |
| 8 | 1518.12 (1) | 1512.33–1523.91 | 0.0977 | 0.0977 | + | 1512.33–1523.91 | 0.0896 | 0.0882 | − |
| 9 | 1400.39 (1) | 1394.60–1406.18 | 0.0972 | 0.0972 | + | 1394.60–1406.18 | 0.0900 | 0.0882 | − |
| 10 | 1458.29 (1) | 1452.50–1464.08 | 0.1036 | 0.0972 | − | 1452.50–1464.08 | 0.0898 | 0.0882 | − |
| 11 | 1452.50 (1) | 1446.71–1458.29 | 0.1029 | 0.0972 | − | 1446.71–1458.29 | 0.0902 | 0.0882 | − |
| 12 | 1415.83 (1) | 1410.04–1421.62 | 0.0982 | 0.0972 | − | 1410.04–1421.62 | 0.0890 | 0.0882 | − |
| 13 | 1388.81 (1) | 1383.02–1392.67 | 0.0969 | 0.0969 | + | 1383.02–1394.60 | 0.0886 | 0.0882 | − |
| 14 | 1469.87 (1) | 1464.08–1475.66 | 0.1024 | 0.0969 | − | 1464.08–1475.66 | 0.0907 | 0.0882 | − |
| 15 | 1637.78 (2) | 1643.57–1643.57 | 0.0978 | 0.0969 | − | 1643.57–1643.57 | 0.0878 | 0.0878 | + |
| 16 | 1664.80 (2) | 1662.87–1666.73 | 0.0945 | 0.0945 | + | 1662.87–1670.59 | 0.0869 | 0.0869 | + |
| 17 | 1537.42 (3) | 1531.63–1541.28 | 0.0993 | 0.0945 | − | 1531.63–1543.21 | 0.0900 | 0.0869 | − |
| 18 | 1520.05 (3) | 1525.84–1525.84 | 0.0942 | 0.0942 | + | 1514.26–1525.84 | 0.0890 | 0.0869 | − |
| 19 | 1558.65 (3) | 1556.72–1564.44 | 0.0953 | 0.0942 | − | 1552.86–1564.44 | 0.0874 | 0.0869 | − |
| 20 | 1460.22 (3) | 1454.43–1466.01 | 0.1004 | 0.0942 | − | 1454.43–1466.01 | 0.0875 | 0.0869 | − |
| 21 | 1402.32 (3) | 1408.11–1408.11 | 0.0944 | 0.0942 | − | 1396.53–1408.11 | 0.0868 | 0.0868 | + |
| ∑ | | | | 0.0942 | | | | 0.0868 | |

The column ‘n’ is the testing step number, with the corresponding peak offset (purity spectrum number in parentheses) and tested interval given in the ‘Peak’ and ‘Interval’ columns, respectively; the ‘RMSECV’ column includes the PLS modeling error with the currently tested interval added; the testing decision on accepting (+) or rejecting (−) the current interval is marked in the column ‘+/−’; ‘+ RMSECV’ gives the resulting cumulative RMSECV of the nth testing step.

The candidate intervals were calculated based on the first three purity spectra shown in Fig. 2d. Note that among 21 different peaks found in these three curves, 14 are present in the first purity spectrum and are often repeated in subsequent ones. Since the first purity spectrum accounts for the whole variance in the data set (Eq. (1)), it is natural that it repeats a number of peaks corresponding to different mixture components from subsequent purity spectra. However, some features related to higher-order components in the first purity spectrum may be masked or distorted and revealed only after elimination of the effect of previously modeled components. Therefore, using the entire purity peak list from the complete SIMPLISMA analysis for the interval testing ensures that no potentially useful region is left behind. Each interval was defined as a 7-point region with a purity peak maximum in the center, which corresponds to a width of 11.58 cm−1. Such an interval approximately covers the upper one-third to one-quarter of a typical infrared peak that is less subject to overlap with its neighbors. This value is close to the optimum interval of 7.7–9.6 cm−1 found in the iPLS studies of the same data [13].

The resulting list of purity peaks and their corresponding intervals selected for testing are provided in Tables 2 and 3. The peak positions are graphically shown in Fig. 2a–d over the raw and inverted derivative data, purity spectra, and the standard deviation. It is worth noticing that the first purity spectrum (solid curve in Fig. 2d) resembles the plot of the variables' standard deviation (Fig. 2c). This similarity follows from Eq. (1): as the offset value α increases, the purity spectrum converges to the shape of the standard deviation curve. Both curves tend to reveal underlying component signals where the variance of variables is typically higher. Weighting by the inverse mean used in the purity function leads to amplification of low-intensity regions where a pure signal of one or another individual component occurs. This effect produces some extra peaks compared to the standard deviation curve (compare Fig. 2c and d). The 2nd derivative serves for partial deconvolution of overlapped peaks [19] while the linearity and additivity of spectral responses (Beer's law) is kept. This pretreatment is often required to reveal pure variables in the IR spectra of complex systems where component peak superposition is high. The present method is based on the premise that the purity spectrum peaks point at the regions that bring information on system components specifically important for the regression. This assumption is reasonable considering that the purity spectrum is an indicator of both high variance (like the standard deviation) and purity of the underlying variables.

Table 4
Comparison of PLS results of predicting fractions of α-helices and β-sheets on different variable sets

| Method | α-helix: Intervals, cm−1 | α-helix: RMSECV (# comp.) | α-helix: Corr. coef. | β-sheet: Intervals, cm−1 | β-sheet: RMSECV (# comp.) | β-sheet: Corr. coef. |
|---|---|---|---|---|---|---|
| Spectrum range | 1350.21–1749.71 | 0.1103 (1) | 0.8699 | 1350.21–1749.71 | 0.0915 (3) | 0.8786 |
| Spectrum range /std | 1350.21–1749.71 | 0.1009 (4) | 0.8935 | 1350.21–1749.71 | 0.0941 (3) | 0.8704 |
| 3 purity variables | 1622.34–1643.57; 1649.36–1660.94 | 0.1135 (1) | 0.8620 | 1622.34–1643.57; 1649.36–1660.94 | 0.0882 (1) | 0.8866 |
| 3 purity variables /std | –"– | 0.1136 (1) | 0.8617 | –"– | 0.0878 (1) | 0.8876 |
| Present method using all peaks in the first 3 purity spectra | 1512.33–1525.84; 1543.21–1554.79; 1622.34–1641.64; 1649.36–1699.54 | 0.1071 (1) | 0.8779 | 1512.33–1525.84; 1622.34–1643.57; 1649.36–1660.94; 1668.66–1678.31 | 0.0870 (1) | 0.8896 |
| Present method using all peaks in the first 3 purity spectra /std | 1383.02–1406.18; 1512.33–1525.84; 1543.21–1554.79; 1622.34–1641.64; 1649.36–1691.82 | 0.0942 (1) | 0.9079 | 1396.53–1408.11; 1622.34–1643.57; 1649.36–1670.59 | 0.0868 (1) | 0.8902 |
| Present method using 8 peaks of the 1st purity spectrum | 1512.33–1523.91; 1543.21–1554.79; 1622.34–1641.64; 1649.36–1660.94; 1668.66–1699.54 | 0.1074 (1) | 0.8772 | 1512.33–1523.91; 1622.34–1641.64; 1649.36–1660.94; 1668.66–1678.31 | 0.0874 (1) | 0.8888 |
| Present method using 8 peaks of the 1st purity spectrum /std | 1512.33–1523.91; 1543.21–1554.79; 1622.34–1641.64; 1649.36–1660.94; 1668.66–1691.82 | 0.0977 (1) | 0.8999 | 1622.34–1641.64; 1649.36–1660.94 | 0.0882 (1) | 0.8866 |
| Initially selected intervals (all peaks in the first 3 purity spectra) | 1383.02–1421.62; 1446.71–1475.66; 1512.33–1525.84; 1531.63–1564.44; 1622.34–1643.57; 1649.36–1699.54 | 0.1085 (1) | 0.8744 | 1383.02–1421.62; 1446.71–1475.66; 1512.33–1525.84; 1531.63–1564.44; 1622.34–1643.57; 1649.36–1699.54 | 0.0898 (2) | 0.8822 |
| Initially selected intervals /std | –"– | 0.1021 (3) | 0.8897 | –"– | 0.0912 (3) | 0.9793 |
| iPLS range [13] | 1543.21–1550.93; 1659.01–1668.66; 1682.17–1691.82 | 0.1012 (1) | 0.8917 | 1639.71–1647.43; 1659.01–1668.66; 1682.17–1691.82 | 0.0748 (2) | 0.9197 |
| iPLS range [13] /std | –"– | 0.0974 (2) | 0.9006 | –"– | 0.0858 (3) | 0.8928 |

‘Method’ is the variable selection and preprocessing method (‘/std’ in the name means variable weighting by the standard deviation); the ‘Intervals’ column contains variable ranges used for PLS modeling in accordance with the method; ‘RMSECV’ is the root mean square error of cross-validation (the number of PLS components used in the model is given in brackets); and ‘Corr. coef.’ is the correlation coefficient between predicted and measured values (from the cross-validation).
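The resemblance between the first purity spectrum and the standard deviation curve (Fig. 2c and d) can be read directly from Eq. (1). As a brief worked limit, using only the symbols defined there and treating the offset α as large or small compared with the variable means:

```latex
\[
p_{j1} = \frac{\sigma_{j1}}{\mu_j + \alpha}
\;\approx\; \frac{\sigma_{j1}}{\alpha} \;\propto\; \sigma_{j1}
\quad (\alpha \gg \mu_j),
\qquad
p_{j1} \;\approx\; \frac{\sigma_{j1}}{\mu_j}
\quad (\alpha \ll \mu_j).
\]
```

For a large offset the purity spectrum is thus a scaled copy of the standard deviation, whereas for a small offset it approaches the relative standard deviation, which amplifies low-intensity regions.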

All intervals in Fig. 2a–d are grouped into six larger regions (the boundaries between adjacent intervals are not shown). Although this initial filtering does not take into account the potential use of underlying individual intervals for the modeling and prediction of α-helices or β-sheets, this variable set is already better for the prediction of either of the elements than the whole spectral range (Table 4). However, to obtain maximum performance, the intervals should be individually tested with respect to modeling both α- and β-forms (see Section 2.2 for the procedure description).
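Individual interval testing can be pictured as a simple keep-if-it-helps loop. The sketch below is one plausible forward-testing variant and not necessarily the procedure of Section 2.2, which is not shown in this part of the text. The `score` callable is a hypothetical stand-in for a PLS cross-validation run returning RMSECV, and the intervals and error values in the toy scorer are invented.

```python
def select_intervals(candidates, score):
    """Test candidate intervals one by one and keep those that lower the error.

    candidates: intervals ordered by decreasing purity-peak intensity
    score: hypothetical callable mapping a list of intervals to a
           cross-validation error such as RMSECV (lower is better)
    """
    kept = []
    best = float("inf")
    for interval in candidates:
        trial = kept + [interval]
        error = score(trial)
        if error < best:
            kept, best = trial, error
    return kept, best

def toy_score(intervals):
    """Invented stand-in for a PLS cross-validation run."""
    gain = {(10, 16): 0.010, (24, 30): 0.004}  # made-up error reductions
    error = 0.100
    for interval in intervals:
        error -= gain.get(interval, 0.0)
        error += 0.001  # made-up cost of every extra interval
    return error

# Channel-index intervals, invented for illustration
candidates = [(10, 16), (24, 30), (40, 46)]
kept, best = select_intervals(candidates, toy_score)
print(kept, best)
```

Because each candidate is scored once against the current set, the number of model evaluations grows linearly with the number of candidate intervals.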

The selected intervals were tested in PLS modeling on both the raw (non-weighted) data and after variable weighting by the inverse standard deviation. As a result, four final interval sets were produced, two for each structural element: to be used with non-weighted (Table 2 and Fig. 3) and weighted (Table 3 and Fig. 4) data. The tables follow the prediction improvement (‘+ RMSECV’ columns) during the interval testing/elimination procedure. Although we included possibly more candidate intervals into the testing procedure, it is clearly seen that for both secondary structure elements, the maximum effect of the variable selection is achieved due to the peaks of the first purity spectrum, independent of the weighting. Moreover, the importance of an interval seems to correlate with the corresponding peak intensity (it decreases with the interval number n in Tables 2 and 3), which is particularly characteristic of the non-weighted data set. This is an important observation that enables a simplified variant of the method based on the first purity spectrum only and a peak selection threshold of up to 10–20%, which in our data corresponds to the first nine (10%) to eight (20%) of the most informative peaks (Tables 2, 3 and Fig. 2d).

Fig. 3. Selected intervals for PLS modeling of the non-weighted data: (a) α-helix; and (b) β-sheet.

It is interesting that for a number of included intervals (Tables 2 and 3), PLS prediction for both α- and β-forms tends to exhibit better performance on the weighted data, although with a small number of “major” intervals it does not perform as well as without this preprocessing step.

Table 4 gives a comparative summary of PLS prediction results for different variable sets. With respect to the full-spectrum analysis, the gain in accuracy due to the present variable selection method is comparable to that of iPLS, although the latter is probably close to an ultimate solution from a statistical point of view. The advantage of our method over iPLS is that it makes use of a guided interval selection instead of the iterative exhaustive arbitrary search, which starts to noticeably slow down above 20 to 30 intervals [15] and could therefore become time-consuming on large multi-component data sets. The disadvantage is that it is not as likely to yield the same degree of data set reduction as iPLS and the associated gain in performance. Finally, while both methods offer a graphical overview of the local modeling, our approach offers a richer spectroscopic perspective due to its physically meaningful way of variable localization and the resolved component curves (spectra and concentration profiles) readily available for interpretation.
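The variable weighting by the inverse standard deviation used for the weighted data sets can be sketched as follows. This is an illustrative implementation only: whether the population or the sample standard deviation was used is not stated in this part of the text (the population form is assumed), and setting the weight of a constant variable to zero is a choice made here.

```python
import math

def weight_by_inverse_std(spectra):
    """Scale every variable (column) by 1/std; rows are spectra."""
    n = len(spectra)
    n_vars = len(spectra[0])
    weights = []
    for j in range(n_vars):
        column = [row[j] for row in spectra]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        # Assumption: a constant variable carries no information and is zeroed
        weights.append(1.0 / std if std > 0 else 0.0)
    weighted = [[x * w for x, w in zip(row, weights)] for row in spectra]
    return weighted, weights

# Invented data: an intense, a weak and a constant variable
spectra = [
    [10.0, 0.1, 5.0],
    [12.0, 1.0, 5.0],
    [14.0, 1.9, 5.0],
]
weighted, weights = weight_by_inverse_std(spectra)
print(weights)
```

After weighting, every non-constant variable has unit standard deviation, so low-intensity regions enter the regression on the same footing as the strong bands.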

A simplified variable selection method using only the first purity spectrum and a 20% relative intensity threshold for peak detection still shows quite effective results (Table 4). Due to its simplicity (Eq. (1)) and because only a few intervals need to be tested, this simplified method can almost be applied manually (in conjunction with conventional PLS software).
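The threshold-based peak detection of the simplified method can be sketched as below. The local-maximum rule and the fixed half-width used to turn a peak into an interval are assumptions made for illustration. The default of three channels mirrors intervals quoted in the text, such as 1649.36–1660.94 cm−1 around 1655.15 cm−1, which is consistent with three channels on either side at a spacing of about 1.93 cm−1, but the paper's actual interval construction is not given in this part of the text. The purity values are invented.

```python
def pick_peaks(purity, threshold=0.20, half_width=3):
    """Return (peak_index, (start, end)) for peaks above threshold * max.

    threshold: relative intensity threshold (0.20 corresponds to 20%)
    half_width: assumed number of channels kept on each side of a peak
    """
    top = max(purity)
    peaks = []
    for j in range(1, len(purity) - 1):
        is_local_max = purity[j - 1] < purity[j] >= purity[j + 1]
        if is_local_max and purity[j] >= threshold * top:
            start = max(0, j - half_width)
            end = min(len(purity) - 1, j + half_width)
            peaks.append((j, (start, end)))
    # Strongest peaks first, matching the order in which they would be tested
    peaks.sort(key=lambda item: purity[item[0]], reverse=True)
    return peaks

# Invented purity spectrum with three local maxima
purity = [0.0, 0.1, 0.5, 0.1, 0.0, 0.05, 0.08, 0.05, 0.0, 0.3, 1.0, 0.3, 0.0]
print(pick_peaks(purity, threshold=0.20, half_width=1))
```

Lowering the threshold admits weaker peaks, so the number of candidate intervals to test grows as the threshold falls.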

In an effort to study further simplification of the method, an interval selection with regions based on just the main purity variable for each of the three components was tested. The RMSECV and r-values in this case were noticeably worse than those obtained in the main method using incremental selection of peaks of three purity spectra (Table 4). Note that with the three main pure variable regions only, the results for α-helix were even slightly worse than with the full spectral range, showing that other regions are needed to improve performance.

A common advantage of variable selection in spectral data analysis is the possibility of qualitative interpretation of selected wavenumber intervals. Although the intervals optimal for calibration are not necessarily distinctive of the prominent spectral peaks of the analyte components, such trends are rather common, and indirectly support the variable choice from the chemical point of view. In other words, variable selection results may be used as an independent spectral interpretation tool. In some cases, it may reveal component-related features overlooked during traditional peak-based studies. In our approach, a more thorough interpretation can be made by considering the initially selected purity peaks and their importance for the prediction of α-helical or β-stranded element fractions, based on the interval testing results presented in Tables 2 and 3.

Fig. 4. Selected intervals for PLS modeling of the data weighted by inverse standard deviation: (a) α-helix; and (b) β-sheet.

Selected variables in the amide I band region (between 1600 and 1700 cm−1) agree with established high relevance features for both α-helical and β-stranded secondary structure elements. The α-helix amide I band, typically observed at 1650–1660 cm−1 [20,31–34] with a median of 1655 cm−1, corresponds perfectly to the selection of the 2nd high-intensity pure variable at 1655.15 cm−1 in Tables 2 and 3, which generates an interval of 1649.36–1660.94 cm−1. In contrast, the interval suggested by the best iPLS model (1658.7–1668.3 cm−1) is off-centered by about 9 cm−1 with respect to traditional interpretation. The parallel β-sheet feature in the amide I band is typically found between 1623 and 1640 cm−1 [20,31–34], which again corresponds very well with the 1st high-intensity purity variable at 1628.13 cm−1 in Tables 2 and 3 with an interval of 1622.34–1633.92 cm−1. The 3rd purity peak at 1635.85 cm−1 fits with the 1630–1636 cm−1 and 1612–1642 cm−1 regions assigned to the antiparallel β-sheets feature by Arrondo et al. [33] and Pelton and McLean [32], respectively.

The amide II band (between 1500 and 1600 cm−1) is seldom used by biochemists for interpretation, but remains interesting. The relevance of this region to the analysis of α-helices and β-sheets was first shown in Navea's iPLS studies [13], where the variable sets 1542.9–1550.6 cm−1 and 1562.2–1569.9 cm−1 were selected among the two and three most relevant intervals for α and β modeling, respectively. Table 2 presents two significant intervals in the amide II region, 1543.21–1554.79 and 1512.33–1525.84 cm−1, which correspond to the 4th and 8th purity peak regions, respectively. The α-helix feature region previously selected in the iPLS study matches our interval of 1543.21–1554.79 cm−1 and its assignment very well. In contrast, the β-sheet feature from 1562.2 to 1569.9 cm−1 suggested in the iPLS study is not unambiguously supported by our study. In a similar way, the 8th purity peak region 1512.33–1525.84 cm−1 is not supported by the iPLS study. In the non-weighted data study, the latter region helps both α and β prediction results, but with the weighted data, it only helps the α-helix statistics.

Our interval around 1400 cm−1, detected with weighted data (Fig. 4), corresponds to a low-intensity band. It has not been previously discussed in the secondary structure context and was not recognized in the iPLS data studies, where no weighting was applied. In general, the pattern of selected intervals is similar to that of the work by Navea (which serves as a benchmark), although our intervals tend to be somewhat broader. Thus, in the α-helix model (non-weighted data), the iPLS intervals fall entirely inside the present algorithm's variable set.

An interesting fact that requires clarification is that the reduced data sets for the α and β models share a significant number of variables (Figs. 3 and 4). This can be explained by the high negative correlation between α and β fraction values that is generally observed in protein molecules: r=−0.88 in the present data (Table 1). This statistical fact is not an inherent property of this specific data set alone. The same correlation has been observed for a representative collection of over 500 proteins [35]. Such a strong internal correlation between the two main secondary structural elements makes the features and variable regions due to one form indirectly related to the content of the other and vice versa, which is captured by the PLS regression model. A similar trend is also clearly seen in the iPLS studies, especially when modeling within the amide I band region only (1600.8–1801.4 cm−1), where two of three selected intervals with 5 to 6 channels were found to be the same for α and β. Although in the full-spectrum iPLS, intersections between the α and β intervals were eliminated from the final solution, the top-level candidate intervals in the extended lists of the most successful local models tended to be essentially the same for both secondary structural forms [13].

Among all considered method variations (i.e., different preprocessing routines and initial variable selection), the purity-based variable selection method using the purity spectra of three SIMPLISMA components was found to be the best for the secondary structure prediction from the studied IR spectral data. However, the performance of its simplified version using just the first purity spectrum is almost as good. This simplification may be preferred in preliminary studies to quickly estimate the possible gain from variable selection, or when SIMPLISMA curve resolution software is not readily available.

4. Conclusions

The purity function used by the SIMPLISMA curve resolution algorithm can be utilized for variable selection in multivariate regression analysis. A simple-to-use method has been developed on its basis that allows selecting the intervals of the most informative variables with regard to a component of interest. An example of PLS modeling of the contents of protein secondary structural elements, α-helices and β-sheets, from the infrared spectra of aqueous solutions is studied. It has been shown that variable selection with the present method significantly reduces the prediction error compared to the whole spectral interval. The method's performance is comparable to the results of the iPLS algorithm used on the same data by Navea et al. [13]. The suggested approach requires neither extensive calculations nor prior knowledge, and can be “manually” implemented as a preprocessing step for PLS or another regression algorithm. Another advantage is that the availability of SIMPLISMA-resolved pure component spectra enables critical evaluation of results by experienced spectroscopists. The method can also be recommended in the case of a very large number of variables and components, when the application of extensive combinatorial search becomes too time-consuming. An appropriate variable selection was shown to be particularly useful in combination with data weighting, specifically variable weighting by the inverse standard deviation.

Acknowledgements

Willem Windig is acknowledged for his help in the implementation and use of SIMPLISMA.

References

[1] E.R. Malinowsky, Factor Analysis in Chemistry, 3rd ed., Wiley-Interscience, New York, 2002.
[2] S. Weisberg, Applied Linear Regression, 2nd ed., Wiley, New York, 1985.
[3] A.E. Boardman, B.S. Hui, H. Wold, Commun. Stat., Theory Methods 10 (1981) 613–639.
[4] S. Wold, H. Martens, H. Wold, Lect. Notes Math. 973 (1983) 286–293.
[5] K.H. Esbensen, Multivariate Data Analysis–In Practice, 5th ed., CAMO Process AS, Oslo, 2001, pp. 75–76, 128–130, 142.
[6] V. Centner, D.L. Massart, O.E. de Nord, S. de Jong, B. Vandenginste, C. Sterna, Anal. Chem. 68 (1996) 3851–3858.
[7] D. Jouan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Nord, D.L. Massart, Anal. Chem. 69 (1997) 4317–4323.
[8] R. Leardi, J. Chemom. 14 (2000) 643–655.
[9] J.A. Hagerman, M. Streppel, R. Wehrens, L.M.C. Buydens, J. Chemom. 17 (2004) 427–437.
[10] T.E.M. Nording, J. Koljonen, J.T. Alander, P. Geladi, in: J.T. Alander, P. AlaSiuru, H. Hyötyniemi (Eds.), The Proceeding of the 11th Finnish Artificial Intelligence Conferences, 1–3, vol. 3, Finnish Artificial Intelligence Society, Sept. 2005, pp. 99–113.
[11] C.B. Lucasius, M.L.M. Becker, G. Kateman, Anal. Chim. Acta 286 (1994) 135–153.
[12] U. Horchner, J.H. Kalivas, Anal. Chim. Acta 311 (1995) 1–13.
[13] S. Navea, R. Tauler, A. de Juan, Anal. Biochem. 336 (2005) 231–241.
[14] S. Navea, R. Tauler, E. Goormaghtigh, A. de Juan, Proteins: Structure, Function, and Bioinformatics, vol. 63, 2006, pp. 527–541.
[15] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Appl. Spectrosc. 54 (2000) 413–419.
[16] H. Mark, Principles and Practice of Spectroscopic Calibration, Chemical Analysis, vol. 118, John Wiley and Sons, New York, 1991.
[17] I. Guyon, A. Elisseef, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[18] W. Windig, J. Guilment, Anal. Chem. 63 (1991) 1425–1432.
[19] W. Windig, D.A. Stephenson, Anal. Chem. 64 (1992) 2735–2742.
[20] A. Dong, J.F. Carpenter, W.S. Caughey, Protein Infrared Database, 1995 (http://www.unco.edu/nhs/chemistry/faculty/dong/irdata.htm).
[21] R. Laskowski, V. Chistyakov, EBI PDBsum Protein Data Bank, 2006 (http://www.biochem.ucl.ac.uk/bsm/pdbsum/).
[22] N. Sreerama, CDPRO: A Software Package for Analyzing Protein CD Spectra, 2004 (http://lamar.colostate.edu/~sreeram/CDPro/).
[23] A. Dong, P. Huang, W.S. Caughey, Biochemistry 29 (1990) 3303–3308.
[24] W. Windig, J. Mol. Struct. 23 (1993) 71–83.
[25] J. Guilment, S. Markel, W. Windig, Anal. Chim. Acta 2318 (1995) 43–45.
[26] W. Windig, B. Anatlek, J.L. Lippert, Y. Batonneau, C. Brémard, Anal. Chem. 74 (2002) 1371–1379.
[27] A. Savitsky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627–1639.
[28] W. Windig, Chemom. Intell. Lab. Syst., Lab. Inf. Manag. 36 (1997) 3–16.
[29] A. Bogomolov, M. Hachey, A. Williams, in: A.L. Pomerantsev (Ed.), Progress in Chemometrics Research, Nova Science Publishers, New York, 2005, pp. 119–135.
[30] ACD/UV-IR Manager Processor, version 9, Advanced Chemistry Development, Inc., Toronto, ON, Canada (http://www.acdlabs.com) 2006.
[31] T.F. Kumosinki, J.J. Unruh, Talanta 43 (1996) 199–219.
[32] J.T. Pelton, L.R. McLean, Anal. Biochem. 277 (2000) 167–176.
[33] J.L.R. Arrondo, A. Muga, F.M. Goni, Prog. Biophys. Mol. Biophys. 59 (1993) 23–56.
[34] L.K. Tamm, S.A. Tatulian, Q. Rev. Biophys. 30 (1997) 365–429.
[35] A. Bogomolov, G.P. Bourenkov, V.S. Lamzin, and A.N. Popov, Acta Crystallogr. Section D (2006), prepared for publication.
