Automatic Generation of Peak-Shaped Models

Applied Spectroscopy, Volume 58, Number 8, 2004, p. 986. 0003-7028/04/5808-0986$2.00/0 © 2004 Society for Applied Spectroscopy


FRANK ALSMEYER* and WOLFGANG MARQUARDT*

Lehrstuhl für Prozesstechnik, RWTH Aachen, D-52056 Aachen, Germany

We describe how parametric spectral models for analytical applications can be generated by an automatic curve-fitting algorithm. The algorithm does not require initial choices of parameters or other human intervention, in contrast to established approaches that rely on deconvolution or derivative spectroscopy. This algorithm has been applied for quantitative analysis but can potentially be used in other applications that are based on parametric representations of peak-shaped models or could benefit from using such models, such as calibration transfer.

Index Headings: Parametric models; Hard models; Curve fitting; Calibration.

INTRODUCTION

A physically sound way to model peak-shaped signals such as spectra or elution profiles in chromatography is to resolve these signals into individual bands of postulated shapes. Such parametric models can serve two main purposes: physical interpretation, e.g., for the determination of molecular structures, and quantitative analyses [1].

In practice, however, alternative modeling methods are commonly preferred for quantitative applications. Multivariate soft models, e.g., partial least squares (PLS) [2], are based on large amounts of measured calibration data instead of postulated functional forms. One reason that hard models are less common in routine analyses is the difficulty of accounting for unknown or varying interferents. Another problem, and the one we are concerned with in this paper, is the large amount of manual work and spectroscopic knowledge required for model building.

In spite of this, hard models are attractive for several reasons. Being based on physical grounds, hard modeling does not require extensive amounts of calibration data. In addition, nonlinear mixture effects such as peak shifting can be accounted for in a straightforward manner, even in situations where multivariate methods have difficulties or fail [3]. Finally, hard modeling is the only choice if it is impossible to obtain adequate calibration data sets for soft models, e.g., if there are unstable or reactive components.

Procedures for arriving at parametric models of spectra have recently been reviewed, focusing on ultraviolet–visible (UV-Vis) [1] and nuclear magnetic resonance (NMR) [4] spectroscopy. A more general treatment emphasizing the infrared (IR) case [5] dates back to 1980. Today, the modeling process is typically an interactive and iterative procedure that includes the following steps (according to Ref. 1, with some extensions):

(1) Estimation of the number of bands by visual inspection or numerical procedures such as deconvolution or derivative spectroscopy.

(2) Choice of a shape model for each band.

(3) Choice of an adequate baseline model.

(4) Nonlinear least squares optimization, usually with the Levenberg–Marquardt algorithm and with initial parameter values inferred from the above steps or from prior knowledge.

(5) Assessment of the goodness of fit, frequently based on intuitive judgment.

(6) If necessary, modification of the model (steps 1–3) and re-optimization (step 4).

Received 4 August 2003; accepted 17 October 2003.

* Authors to whom correspondence should be sent. E-mail: [email protected] and [email protected].

The key problem is the unknown model structure, i.e., deciding on the number and shape of bands and on the baseline in steps 1–3 above. All known methods for the automatic determination of the number of bands show some deficiencies. Deconvolution methods depend on the assumption of the correct instrumental response function and cannot be applied if band widths show larger variations. The use of derivative spectra is limited by measurement noise and the extent of peak overlap. Such methods require a certain detection threshold to be set, and the final decision as to the detected bands remains with the researcher.

Many authors have focused on improving individual steps of the procedure outlined above, for example, the choice of the number of bands and initial values by deconvolution [6] or the optimization step by use of alternative algorithms such as simulated annealing [7] or genetic algorithms [8]. We are, however, not aware of fully automated procedures that would not require a priori assumptions about some of the parameters involved.

The existing procedure is time-consuming, especially when many bands are involved and several alternative models are considered, because the optimization step cannot (yet) be performed quickly enough to allow for a truly interactive procedure. Calculation times are sometimes reduced by the use of local models, where only the parameters of those peaks that lie in the vicinity of the manually changed spectral feature are re-optimized. Still, it would be far more desirable to have an automatic procedure to generate such models. Besides the reduced effort, a second motivation to use a dedicated algorithm is that the resulting models no longer depend on the subjective view of a modeler.

In the following, we present an automatic modeling algorithm that aims at analytical applications. It does not require interaction with the modeler, nor does it require a priori knowledge of the model structure or approximate parameter values. It generates a sequence of increasingly complex models until a goodness-of-fit criterion based on replicate measurements is fulfilled. Before describing the algorithm, we discuss the underlying spectral models that can be treated. The algorithm is then applied to Fourier transform infrared (FT-IR) and Raman spectra.

SPECTRAL MODEL

The following discussion of the models focuses on the FT-IR case. The resulting models are, however, of more general applicability (Raman, UV-Vis, NMR, or even chromatographic elution profiles). Here, we consider absorbance spectra $A(\nu)$, where $\nu$ is the wavenumber. Measured spectra can be described as vectors of data points recorded at $m = 1, \ldots, M$ discrete wavenumbers $\nu_m$. Vectors for a single sample are denoted here as $\vec{A} = (A(\nu_1), \ldots, A(\nu_M))^T$. The measured information of $N$ replicate samples can then be described as an $(M \times N)$ spectral matrix $\mathbf{A}$.

Physically, a spectrum can be resolved into individual peak functions that are described by their position, their maximum intensity, and some parameters for the width or shape. Note that throughout this paper, we will use the term peak or peak function for the mathematical functions that are used to represent spectral bands.

Band Shape and Location. A measured spectrum is formed by both the fundamental atomic or molecular transition processes that generate it ("true spectrum") and by instrumental effects distorting it. For single spectral lines, both effects result in distributions of the measured intensity over a frequency region. The overall shape is described by the mathematical convolution of these distributions [9].

Considering single lines in gases, there are three broadening effects, namely, natural, Doppler, and collision broadening. The resulting line shapes can be obtained either by a full quantum mechanical treatment or, especially in the case of IR, to a good approximation by classical mechanics [9]. Most treatments are based on weak assumptions. Hence, the resulting line shape models can be regarded as very good approximations to reality. IR lines can be modeled by either a Lorentzian distribution:

$$L(\nu) = \alpha \, \frac{\gamma_L^2}{(\nu - \omega)^2 + \gamma_L^2} \qquad (1)$$

if the dominating effect is natural broadening attributable to the limited lifetime of a transition, or a Gaussian distribution:

$$G(\nu) = \alpha \exp\left[ -\frac{4 \ln 2 \, (\nu - \omega)^2}{\gamma_G^2} \right] \qquad (2)$$

to account for Doppler broadening [9]. Both distributions are characterized by their peak position $\omega$, their maximum intensity $\alpha$, and their half width at half height or half-value width $\gamma$, respectively. Pressure or collision broadening in gases can, in the vicinity of the central frequency $\omega$, also be approximated by a Lorentzian shape. In principle, all these effects must be combined, and the resulting shape is the convolution of Lorentzian and Gaussian distributions, known as the Voigt function:

$$V(\nu) = \frac{1}{\alpha} \int_{-\infty}^{\infty} G(t) \, L(\nu - t) \, dt \qquad (3)$$

In the liquid phase, modeling becomes more empirical. Collision broadening due to molecular interaction is the dominating effect. One observes broader spectral features, known as bands, which may contain several spectral lines. Although an exact derivation is impossible, the Lorentzian distribution is generally regarded as a good approximation to the resulting shape of a band [5]. However, even the use of the Gaussian distribution can be justified in such cases. For example, complex bands tend to acquire increased Gaussian character when they consist of a sum of Lorentzian bands whose central frequencies are distributed over the wavenumber axis [10].

Instrumental effects are described with the instrument line shape function (ILS). In Fourier transform (FT) spectroscopy, the main influence comes from the apodization function of the interferogram. However, if the instrumental resolution is sufficient, the distortion of the spectrum by the ILS is negligible. Griffiths et al. [11] define the resolution parameter:

$$\rho = \frac{\delta}{2\gamma} \qquad (4)$$

to facilitate an evaluation of instrumental effects. Here, $\delta$ is the nominal resolution, defined as the reciprocal of the maximum retardation of the FT-IR spectrometer. For a resolution of $\delta = 2$ cm$^{-1}$ and half widths larger than $\gamma = 10$ cm$^{-1}$, as they appear in our measurements, $\rho < 0.1$. This value indicates that the measured spectrum is a very close approximation of the true spectrum and that instrumental effects can be neglected [11].
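The arithmetic behind this resolution check can be reproduced in a few lines. The following is only an illustrative sketch of Eq. 4; the function name `resolution_parameter` is hypothetical and not part of the original implementation.

```python
def resolution_parameter(delta, gamma):
    """Eq. 4: nominal resolution delta divided by twice the half width gamma.

    Both arguments are expected in the same unit (here cm^-1).
    """
    return delta / (2.0 * gamma)


# Values quoted in the text: 2 cm^-1 resolution, 10 cm^-1 half width.
rho = resolution_parameter(2.0, 10.0)
print(rho)
```

Any band with a half width above 10 cm$^{-1}$ gives a smaller value, which is the condition stated in the text.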

In summary, IR bands at sufficient instrumental resolution can be adequately represented by Voigt functions. The problem with this function is that it has no closed form and the computational effort to calculate the convolution is substantial, especially for the purpose of parameter estimation. There are rational approximations [12], but we here use an even simpler approach. It has been shown [13] that the Voigt function can be approximated by a linear combination of a Gaussian and a Lorentzian function having the same widths $\gamma = \gamma_G = \gamma_L$ as:

$$V(\nu) = \beta \, G(\nu) + (1 - \beta) \, L(\nu) \qquad (5)$$

with the fractional parameter $\beta$. The error in the calculation of the line shape area is less than 0.72% for any ratio of the underlying Gaussian and Lorentzian widths. $V(\nu)$ is therefore sometimes called the pseudo-Voigt function.
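To make the three peak functions concrete, the sketch below evaluates Eqs. 1, 2, and 5 exactly as written, using only the Python standard library. The function and argument names (`omega` for position, `gamma` for width, `alpha` for maximum intensity, `beta` for the Gaussian fraction) are illustrative choices, not taken from the authors' MATLAB program.

```python
import math


def lorentzian(nu, omega, gamma, alpha):
    """Eq. 1: Lorentzian peak with maximum alpha at position omega."""
    return alpha * gamma ** 2 / ((nu - omega) ** 2 + gamma ** 2)


def gaussian(nu, omega, gamma, alpha):
    """Eq. 2: Gaussian peak with maximum alpha at position omega."""
    return alpha * math.exp(-4.0 * math.log(2.0) * (nu - omega) ** 2 / gamma ** 2)


def pseudo_voigt(nu, omega, gamma, alpha, beta):
    """Eq. 5: beta-weighted sum of a Gaussian and a Lorentzian of equal width."""
    return (beta * gaussian(nu, omega, gamma, alpha)
            + (1.0 - beta) * lorentzian(nu, omega, gamma, alpha))


# At the peak position, both components equal alpha, so the sum is alpha
# for any beta between 0 and 1.
print(pseudo_voigt(1000.0, 1000.0, 10.0, 0.8, 0.5))
```

Setting `beta` to 0 or 1 recovers the pure Lorentzian or pure Gaussian shape, respectively.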

Baseline. In the absence of a sample, the absorbance should be zero at all frequencies. In real measurements, this is not always the case. There are several causes that one might summarize as spectrometer or sampling error. Examples are detector drift, changing environmental conditions such as temperature, or variations in the spectrometer purge [14]. This is why the baseline should explicitly be included in a spectral model. Note that the common notion of a baseline, which is different in each measurement, should not be confused with the possible influence of bands lying outside the spectral range investigated [1], which is sometimes termed an artificial baseline. If such bands exist, their effect is always present in the measurements.

In most cases, baseline effects can be accounted for by including a polynomial of first or second order in the spectral model. Throughout this article, we assume a linear baseline model:

$$B(\nu, \theta_B) = \theta_{B,1} + \theta_{B,2} \, \nu \qquad (6)$$

where $\theta_B = (\theta_{B,1}, \theta_{B,2})^T$ denotes the baseline parameters. In the review paper by Maddams [5], it has been noted that one may observe a high correlation between some of the peak parameters and the baseline parameters, notably for Lorentzian bands with their extended wings. It is therefore reasonable to use the simplest baseline model that leads to a satisfactory fit.

Model Summary. The previous considerations result in the following model equations. The spectral model $A$ consists of $i = 1, \ldots, I$ Voigt distributions $V_i(\nu, \psi_i)$ with individual parameter vectors $\psi_i = (\omega_i, \gamma_i, \alpha_i, \beta_i)^T$:

$$A(\nu, \theta) = \sum_{i=1}^{I} V_i(\nu, \psi_i) \qquad (7)$$

Here, the model parameter vector $\theta = (\psi_1^T, \ldots, \psi_I^T)^T$ is a shorthand vector that comprises all peak function parameters in the model. As noted in the introduction, the number of peak functions $I$ is a parameter that is usually not known a priori but must be determined during model building.

When estimating the spectral model parameters $\theta$, we include a baseline model $B(\nu, \theta_B)$ with a baseline parameter vector $\theta_B$, because baseline effects cannot be neglected in general. The resulting overall model $M(\nu, \theta, \theta_B)$ is:

$$M(\nu, \theta, \theta_B) = A(\nu, \theta) + B(\nu, \theta_B) \qquad (8)$$

The parameters are simultaneously estimated by minimizing:

$$\min_{\theta, \theta_B} \left\| A(\nu) - M(\nu, \theta, \theta_B) \right\| \qquad (9)$$
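The model equations above translate directly into code. The sketch below evaluates the overall model of Eq. 8 (peaks plus linear baseline) and an objective in the spirit of Eq. 9. It assumes the squared Euclidean norm, which the text does not specify, and all names (`model`, `objective`, the tuple layout of a peak) are hypothetical.

```python
import math


def pseudo_voigt(nu, omega, gamma, alpha, beta):
    """Pseudo-Voigt peak (Eq. 5) built from Eqs. 1 and 2 with a shared width."""
    lorentz = alpha * gamma ** 2 / ((nu - omega) ** 2 + gamma ** 2)
    gauss = alpha * math.exp(-4.0 * math.log(2.0) * (nu - omega) ** 2 / gamma ** 2)
    return beta * gauss + (1.0 - beta) * lorentz


def model(nu, peaks, baseline):
    """Eq. 8: sum of peaks (Eq. 7) plus linear baseline (Eq. 6).

    peaks: list of (omega, gamma, alpha, beta) tuples, one per peak function.
    baseline: (offset, slope) pair.
    """
    offset, slope = baseline
    return sum(pseudo_voigt(nu, *p) for p in peaks) + offset + slope * nu


def objective(nus, spectrum, peaks, baseline):
    """Eq. 9 with an assumed squared Euclidean norm over all data points."""
    return sum((a - model(nu, peaks, baseline)) ** 2
               for nu, a in zip(nus, spectrum))


# Synthetic check: a spectrum generated by the model itself has zero misfit.
nus = [float(n) for n in range(900, 1101, 10)]
peaks = [(1000.0, 15.0, 0.6, 0.5)]
baseline = (0.01, 1e-5)
spectrum = [model(nu, peaks, baseline) for nu in nus]
print(objective(nus, spectrum, peaks, baseline))
```

An optimizer would vary the entries of `peaks` and `baseline` to reduce `objective`; the optimizer itself is outside the scope of this sketch.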

Goodness of Fit. Goodness of fit for spectral models is usually expressed in terms of the root mean squared error RMSE (sometimes also called DIS, the discrepancy index). It is defined as:

$$\mathrm{RMSE} = \left\{ \frac{1}{M} \sum_{m=1}^{M} \left[ A(\nu_m) - M(\nu_m) \right]^2 \right\}^{1/2} \qquad (10)$$

$A(\nu_m)$ is the measured and $M(\nu_m)$ the calculated absorbance at the wavenumber $\nu$ indexed by $m$. In a statistical sense, it would be more accurate to subtract the number of model degrees of freedom $f$ from $M$ in the denominator of Eq. 10, but we adopt here the common approach of neglecting this because $f \ll M$. Maddams [5] has stated that a fit to an infrared or Raman spectrum can be considered good if the RMSE value does not exceed 0.01, and more typically is about 0.005. This guideline is still quoted in more recent articles [1], although instrumental advances in FT-IR should allow for higher experimental accuracies. The values may still be useful as a rule of thumb. Regardless of the absolute value of RMSE, the criterion is sometimes used pragmatically to determine an adequate model complexity [15]. For example, one can require a certain percentage decrease of the RMSE with the inclusion of additional model terms. A drawback of such procedures is that they do not penalize large numbers of parameters.
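As a minimal sketch, Eq. 10 and the pragmatic percentage-decrease check mentioned above might look as follows. The function names are illustrative, and the numbers in the usage line are made up for demonstration only.

```python
import math


def rmse(measured, modeled):
    """Eq. 10: root mean squared error over the M data points."""
    m = len(measured)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(measured, modeled)) / m)


def relative_decrease(rmse_previous, rmse_current):
    """Fractional RMSE improvement gained by adding model terms."""
    return (rmse_previous - rmse_current) / rmse_previous


# Hypothetical values: going from RMSE 0.010 to 0.008 is a 20% decrease.
print(relative_decrease(0.010, 0.008))
```

Note that, as stated in the text, a criterion of this kind does not penalize the number of parameters.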

In most of the examples considered later, genuine replicate spectra are available. Then, an estimate of the error variance $v = (v_1, \ldots, v_M)^T$ can be obtained, and this can theoretically be utilized for a statistically sound goodness-of-fit criterion based on the F-test [16,17]. For $n = 1, \ldots, N$ replicate spectra, one calculates the mean spectrum $\vec{A}^{(\mathrm{mean})}$ with elements:

$$A_m^{(\mathrm{mean})} = \frac{1}{N} \sum_{n=1}^{N} A_{m,n} \qquad (11)$$

and the empirical variance $v$ with elements:

$$v_m = \frac{1}{N - 1} \sum_{n=1}^{N} \left[ A_{m,n} - A_m^{(\mathrm{mean})} \right]^2 \qquad (12)$$

Then, the weighted residual sum of squares error:

$$S_r = \sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\left[ A_{m,n} - M(\nu_m) \right]^2}{v_m} \qquad (13)$$

is split into two parts, the sum of weighted pure measurement errors:

$$S_e = \sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\left[ A_{m,n} - A_m^{(\mathrm{mean})} \right]^2}{v_m} \qquad (14)$$

with $f_e$ degrees of freedom, which is due to experimental insufficiencies, and the lack-of-fit error:

$$S_l = S_r - S_e \qquad (15)$$

with $f_l$ degrees of freedom, due to the inexact model. Then, the variance ratio:

$$F = \frac{S_l / f_l}{S_e / f_e} \qquad (16)$$

can be referred to a table of critical values $F_c(f_l, f_e, \alpha)$ at significance level $\alpha$, leading to rejection or confirmation of the model at hand [18]. We have considered both the F-test and the RMSE in the results presented below.
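The lack-of-fit computation of Eqs. 11 through 16 can be sketched as below. Two points are assumptions rather than statements from the text: the degrees of freedom are taken from the standard lack-of-fit partition (pure error: M(N - 1); lack of fit: M minus the number of fitted parameters), and the raw, unsmoothed empirical variance is used. The function name and data layout are hypothetical.

```python
def lack_of_fit_ratio(A, model, n_params):
    """Variance ratio F of Eq. 16.

    A: list of M rows, each holding the N replicate values at one wavenumber.
    model: list of M fitted model values.
    n_params: number of fitted model parameters (assumed to be below M).
    Every row needs a nonzero variance, otherwise the weights are undefined.
    """
    M, N = len(A), len(A[0])
    mean = [sum(row) / N for row in A]                            # Eq. 11
    var = [sum((a - mu) ** 2 for a in row) / (N - 1)
           for row, mu in zip(A, mean)]                           # Eq. 12
    S_r = sum((a - mod) ** 2 / v
              for row, mod, v in zip(A, model, var) for a in row)  # Eq. 13
    S_e = sum((a - mu) ** 2 / v
              for row, mu, v in zip(A, mean, var) for a in row)    # Eq. 14
    S_l = S_r - S_e                                               # Eq. 15
    f_e = M * (N - 1)      # assumed pure-error degrees of freedom
    f_l = M - n_params     # assumed lack-of-fit degrees of freedom
    return (S_l / f_l) / (S_e / f_e)                              # Eq. 16


# Tiny made-up data set: three wavenumbers, two replicates each.
A = [[1.0, 1.2], [2.0, 2.2], [3.0, 3.4]]
perfect = [sum(row) / len(row) for row in A]
print(lack_of_fit_ratio(A, perfect, n_params=0))
```

The returned ratio would then be compared with a tabulated critical value; looking up that value is left to the caller.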

MODELING ALGORITHM

The main obstacles in the common procedures outlined above are the unknown shape and number of peaks. The former problem can be solved for the IR case at sufficient resolution by assuming a Voigt line shape for each peak in the model, which may or may not converge to the Lorentzian or Gaussian form during the optimization procedure. The latter problem is solved by iteratively adding peaks to the model as long as, based on the model residual, there is a statistical justification to do so. In essence, this amounts to making a judgment as to whether the model residual can be attributed to noise or not. For this purpose, goodness-of-fit tests can be applied, such as the ones described in the previous section.

The algorithm is inspired by the approach that a user would likely adopt without prior knowledge of spectral features, making judgments on a statistically sound basis.

Roughly speaking, the algorithm strives to iteratively improve a model estimate $M_i(\nu, \theta_i, \theta_{B,i})$ of the form in Eq. 8, with index $i$ referring to iteration $i$ of the model-building procedure. To do so, it starts with a pure baseline model, i.e., $I = 0$, and then adds one pseudo-Voigt peak per iteration $i$ in the spectral region where the fit is the least satisfactory. The initial parameter estimates for each new peak $\psi_{0,i}$ are determined by analyzing the residual of the previous iteration. They are then optimized in a fitting step.

FIG. 1. Flow chart of the algorithm for automatic model generation.

FIG. 2. Identification of unmodeled peaks according to steps 4 and 6 of the modeling algorithm. See text.

We consider $n = 1, \ldots, N$ replicate spectra $\vec{A}_n^{(\mathrm{raw})}$, which are vectors of data points recorded at the same $m = 1, \ldots, M$ discrete frequencies $\nu_m$. This can be written as an $(M \times N)$ data matrix $\mathbf{A}^{(\mathrm{raw})}$, where the superscript (raw) denotes raw data.

In principle, $\mathbf{A}^{(\mathrm{raw})}$ can be used to estimate statistical information such as the measurement error variance to assess the goodness of fit later in the algorithm. However, as the baseline of each spectrum $\vec{A}_n^{(\mathrm{raw})}$ is different, the estimated variance in the raw data may be larger than the actual measurement noise. In classical analyses, one tries to eliminate baseline effects by manual corrections, but this is somewhat arbitrary. For variance estimation, a complete correction is, however, not necessary. It is sufficient to maximize the overlap of the replicate measurements. This is done in the first step of the following algorithm.

The flow chart in Fig. 1 gives an overview of the iterative procedure. The algorithm, in detail, works as follows. Steps 1 through 8 correspond to the boxes in the figure.

(1) Perform a correction of the raw data $\mathbf{A}^{(\mathrm{raw})}$ to minimize the influence of baseline effects on the variance estimation. To this end, the first spectrum $\vec{A}_1^{(\mathrm{raw})}$ is left unchanged and serves as a reference. The remaining $n = 2, \ldots, N$ replicates $\vec{A}_n^{(\mathrm{raw})}$ are corrected by addition of a baseline model $B(\nu, \theta_{B,n})$ each. The matrix $\mathbf{A}$ of corrected data then contains the elements:

$$A_{m,1} = A_{m,1}^{(\mathrm{raw})} \qquad (17)$$

$$A_{m,n} = A_{m,n}^{(\mathrm{raw})} - B(\nu_m, \theta_{B,n}), \quad n \in \{2, \ldots, N\} \qquad (18)$$

For estimation of the parameters $\theta_{B,n}$, we calculate the mean spectrum $\vec{A}^{(\mathrm{mean})}$ of the corrected data $\mathbf{A}$ according to Eq. 11 and the empirical variance $v^{(\mathrm{corr})}$ according to Eq. 12. Then, the latter is minimized by:

$$\min_{\theta_{B,n}} \sum_{m=1}^{M} v_m^{(\mathrm{corr})} \qquad (19)$$

Baseline effects are not entirely eliminated after this correction. They are accounted for by the model parameters $\theta_{B,i}$ later in the algorithm, which must not be confused with the parameters $\theta_{B,n}$ of this preprocessing step.

(2) Calculate a smoothed empirical variance vector $v$ from the empirical variance $v^{(\mathrm{corr})}$. For the choice of a smoothing algorithm, see the remarks following this enumeration.

(3) Set the initial model estimate $M_0(\nu, \theta_0, \theta_{B,0}) = B(\nu, \theta_{B,0})$, where $\theta_0$ is a vector of length zero and where $\theta_{B,0} = (0, 0)^T$, i.e., assuming there are no baseline effects.

(4) Calculate the mean residual spectrum at iteration $i$, the vector $\Delta\vec{A}_i$ with elements $\Delta A_{i,m} = A_m^{(\mathrm{mean})} - M_i(\nu_m, \theta_i, \theta_{B,i})$, containing both model error and measurement noise. Error bounds for the residual error due to measurement noise are estimated to be multiples $k\sigma$ of the empirical standard deviation $\sigma$ ($\sigma_m = \sqrt{v_m}$), where $k$ is a user-specified scalar factor. Subtract this error to obtain an estimated model error $\varepsilon_i = \Delta\vec{A}_i - k\sigma$ (see Fig. 2). Determine the maximum model error by finding the index $c$ (corresponding to the spectral position or wavenumber) for which $c = \arg\max_m \varepsilon_{i,m}$. This index yields initial values for the position $\omega_i = \nu_c$ and maximum $\alpha_i = \varepsilon_{i,c}$ of a peak that is missing in the model and must be added in the next iteration.

(5) If a specified stopping condition based on goodness-of-fit criteria like the F-test is fulfilled, stop.

(6) Determine an initial value $\gamma_i$ for the half width of the newly introduced peak. A simple approach is illustrated in Fig. 2. First, to both sides of $\nu_c$, find the closest indices $u$ and $l$ where the residual does not yet fall below $\alpha_i/2$. A reasonable initial value is then the smaller of the two distances $|\nu_c - \nu_l|$ and $|\nu_u - \nu_c|$.

(7) Add a peak with parameter initial values $\psi_{0,i} = (\omega_i, \alpha_i, \gamma_i, \beta_i)^T$ to the model, where $\beta_i = 0.5$, i.e., mixed Gauss–Lorentzian character: $M_i(\nu, \theta_i, \theta_{B,i}) = M_{i-1}(\nu, \theta_{i-1}, \theta_{B,i-1}) + V(\nu, \psi_{0,i})$, where $\theta_i = (\theta_{i-1}^T, \psi_{0,i}^T)^T$.

(8) Optimize all parameters $\theta_i$, including baseline parameters $\theta_{B,i}$, by minimizing the weighted least squares objective function:

$$\min_{\theta_i, \theta_{B,i}} \sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\left[ A_{m,n} - M_i(\nu_m, \theta_i, \theta_{B,i}) \right]^2}{v_m} \qquad (20)$$

which is one possible implementation of Eq. 9.

(9) Go to step 4.

FIG. 3. Model error vs. model complexity for the automatically generated sequence of EtOH models.

FIG. 4. F-test vs. model complexity for sequence of EtOH models.

FIG. 5. Ethanol model and residual. Upper diagram: gray = raw data, black = peak functions $V_i(\cdot)$. Lower diagram: gray = $\pm 2.58\,\sigma$, corresponding to 99% of a normal distribution, black = model residual.
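The peak-proposal part of the procedure (steps 4, 6, and 7) can be sketched in plain Python as follows. This is not the authors' MATLAB code. It assumes that the half-height search in step 6 runs on the estimated model error rather than on the raw residual, and it falls back to one grid spacing when the width estimate would be zero; both choices, like the function name `propose_peak`, are illustrative assumptions. The optimization of step 8 is not covered.

```python
def propose_peak(nus, mean_spectrum, model_values, sigma, k=2.58):
    """Return initial values (position, maximum, half width, Gaussian fraction)."""
    # Step 4: estimated model error = mean residual minus k times the noise level.
    eps = [a - m - k * s for a, m, s in zip(mean_spectrum, model_values, sigma)]
    c = max(range(len(eps)), key=eps.__getitem__)   # index of the largest error
    alpha = eps[c]

    # Step 6: walk outward while the error stays at or above half the maximum.
    lower = c
    while lower > 0 and eps[lower - 1] >= alpha / 2.0:
        lower -= 1
    upper = c
    while upper < len(eps) - 1 and eps[upper + 1] >= alpha / 2.0:
        upper += 1
    gamma = min(abs(nus[c] - nus[lower]), abs(nus[upper] - nus[c]))
    if gamma == 0.0:
        # Assumption: use one grid spacing if the feature is a single point wide.
        gamma = abs(nus[1] - nus[0])

    # Step 7: start with mixed Gauss-Lorentzian character.
    return (nus[c], alpha, gamma, 0.5)


# Made-up residual with one triangular feature centered at index 5.
nus = [float(i) for i in range(11)]
spectrum = [0.0, 0.0, 0.0, 0.25, 0.5, 1.0, 0.5, 0.25, 0.0, 0.0, 0.0]
print(propose_peak(nus, spectrum, [0.0] * 11, [0.0] * 11))
```

In a full loop, the proposed values would be appended to the parameter vector and all parameters re-optimized before the residual is evaluated again.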

Several remarks concerning the individual steps are in order. Smoothing of the variance vector in step 2 is important for two reasons. First, it removes some noise from the estimated model error $\varepsilon_i$ in step 4, thereby reducing the danger of detecting spectral features that are solely due to localized noise. Second, if a non-smoothed empirical variance is used in the objective function, in particular if there are not many replicates, this can lead to denominators in Eq. 20 that vary by several orders of magnitude. It is, however, reasonable to assume that adjacent frequencies suffer from similar measurement errors, justifying a filter step. We use a moving average filter with a window length of 20. For a spectral resolution of 2 cm$^{-1}$ in our IR measurements, this is on the order of the typical number of data points in a liquid-phase band.
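A moving average filter of this kind can be sketched as follows. The text only states the window length, so the treatment of the spectrum edges (truncating the window) is an assumption, as is the function name.

```python
def moving_average(values, window=20):
    """Smooth a sequence with a moving average of the given window length.

    Near the edges the window is truncated to the available points
    (an assumed choice; the text does not describe edge handling).
    """
    n = len(values)
    smoothed = []
    for m in range(n):
        start = m - window // 2
        stop = start + window
        segment = values[max(0, start):min(n, stop)]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# A single noisy spike in an otherwise flat variance estimate is spread out.
variance = [1e-6] * 50
variance[25] = 1e-3
print(max(moving_average(variance)))
```

The output has the same length as the input, so it can replace the raw variance vector element by element.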

Note that step 4 guarantees that $\alpha_i < \Delta A_{i,c}$. From our experience, convergence is more safely obtained if the added peak lies entirely below the spectral residual, i.e., if $V_i(\nu_m, \psi_{0,i}) < \Delta A_{i,m}$. To be precise, the latter condition is not exactly guaranteed with the initial estimate for the half width in step 6. We did not, however, encounter problems, because the estimate comes close enough to the optimized value. In this work, the factor $k$ is chosen such that, for an adequate model and Gaussian distributed measurement errors, 99% of the data points should lie within the error bounds defined by $\pm k\sigma$. This condition leads to $k = 2.58$.

FIG. 6. Ethyl acetate model and residual. Upper diagram: gray = raw data, black = peak functions $V_i(\cdot)$. Lower diagram: gray = $\pm 2.58\,\sigma$, corresponding to 99% of a normal distribution, black = model residual.
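The value of the factor follows from the two-sided quantile of the standard normal distribution, which can be checked with the Python standard library. The helper name below is illustrative.

```python
from statistics import NormalDist


def error_bound_factor(coverage=0.99):
    """Two-sided normal quantile: the central `coverage` fraction lies within +/- k sigma."""
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)


# For 99% coverage this rounds to the value used in the text.
print(round(error_bound_factor(0.99), 2))
```

Other coverage levels can be obtained the same way, for example 95% gives a factor close to 1.96.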

It is of course possible to use a different objective function in step 8, such as a maximum likelihood formulation for the simultaneous estimation of the covariance matrix, or unweighted least squares when no replicates are available [19].

IMPLEMENTATION

All calculations were performed with a program written in MATLAB [20], using the optimization toolbox for MATLAB [21]. It includes the automatic curve-fitting algorithm and a graphical user interface for manually changing or fixing the model parameters. The optimization routinely uses MATLAB's SQP (sequential quadratic programming) implementation. This optimization algorithm allows for the solution of constrained optimization problems, which is necessary for the methods described in the second part of this paper. In the absence of constraints, the Levenberg–Marquardt algorithm can alternatively be used in this implementation.

Both the parameters and the objective function are scaled to unity at the beginning of each optimization. All parameters are bounded within physically sensible intervals. The program allows for different objective functions (least squares, weighted least squares) and generates several diagnostic plots (e.g., RMSE, F-test as a function of model complexity) to facilitate model choice.

The generation of a sequence of pure component models of, say, up to 50 peak functions can be accomplished with this implementation overnight on current desktop computers. We did not attempt to optimize computational performance. However, there is great potential to do so, e.g., by using a compiled programming language.

EXPERIMENTAL

As examples, we have modeled the pure component spectra of two data sets.

Data Set 1 (Infrared). The first data set consists of FT-IR spectra of the quaternary system ethanol (EtOH), n-butanol (BuOH), ethyl acetate (EtOAc), and n-butyl acetate (BuOAc). The pure component spectra of the four constituents are modeled below.

All measurements for this system were performed in the laboratory of Bayer AG, Leverkusen, Germany, at 60 °C in a closed, thermostated vessel. The samples were prepared gravimetrically and introduced into the apparatus through a sampling port. The spectra were recorded in the wavenumber range of 700–4000 cm$^{-1}$, using the attenuated total reflection (ATR) technique on a BioRad Excalibur FTS 3000 IR spectrometer with a standard immersion probe Axiom DPR 240 with two reflections. The spectra were averaged from 16 scans at a resolution of 2 cm$^{-1}$ and were obtained with the internal DTGS (deuterated triglycine sulfate) detector of the spectrometer. All chemicals used in the experiments were obtained from Merck with purities exceeding 99%, at water contents of 0.2% and below.

Data Set 2 (Raman). A second data set consists of spectra for the ternary system cyclohexane/dioxane/toluene, measured at the Lehrstuhl für Technische Thermodynamik, RWTH Aachen. The set includes measurements of the three pure components, which are considered below.

The spectra were recorded at room temperature in the relative wavenumber range of 318–1647 cm$^{-1}$ on an experimental setup consisting of a 0.5 to 10 W argon ion laser (Spectra Physics) with an excitation wavelength of 514.5 nm, an ARC Spectra Pro 500 i spectrograph, and a Roper Scientific NTE/CCD-1340/400 air-cooled charge-coupled device camera. Further details of the setup are described in Ref. 22. The substances for these experiments were obtained from Fluka with nominal purities of 99.8 mass-% and were not further purified.

MODELING RESULTS

The goal of this paper is to verify the quality of the spectral models based purely on goodness-of-fit criteria. We note that, in addition, the models have also been applied successfully for mixture analyses, described in a separate paper [3].

TABLE I. Parametric pure component models for data set 1.

Component   No. of peaks   RMSE     $|\Delta A|_{\max}$   $F/F_c$
EtOH        32             0.0016   0.0086                4.41
EtOAc       50             0.0022   0.0079                4.01
BuOH        41             0.0015   0.0060                19.97
BuOAc       37             0.0019   0.0163                4.87

TABLE II. Parametric pure component models for data set 2.

Component     No. of peaks   RMSE      $|\Delta I_s|_{\max}$
Cyclohexane   32             0.00028   0.0017
Dioxane       34             0.00029   0.0011
Toluene       36             0.00023   0.0010

FIG. 7. Butanol model and residual. Upper diagram: gray = raw data, black = peak functions $V_i(\cdot)$. Lower diagram: gray = $\pm 2.58\,\sigma$, corresponding to 99% of a normal distribution, black = model residual.

Pure Component Models for Infrared Spectra. For each of the four pure components in data set 1, we generated a sequence of spectral models of the form in Eq. 8, with up to $I = 50$ peaks or until the F-test was fulfilled. As an example, Figs. 3 and 4 depict the model errors (RMSE) and the results of an F-test vs. model complexity for ethanol.

On the one hand, starting at peak function 32, the improvement with each newly added peak is only minor. Visual inspection of the respective residuals suggested that these peaks merely model localized noise components.

On the other hand, the F-test, with $F$ compared to the critical value $F_c$ at a confidence level of 95% [18], is not fulfilled for any of the generated models ($F/F_c > 1$). Therefore, it cannot be used as a criterion for model selection. The F-test works well on simulated data sets (e.g., Ref. 16) when certain requirements are fulfilled, such as normally distributed measurement errors and uncorrelated noise. These conditions may not be fulfilled for real spectral data. In addition, we may in fact have unmodeled effects. For example, one may observe periodic behavior superposed on the spectra, resulting from erroneous data in the interferogram [23]. Last but not least, we know that the Voigt line shape, even the exact one, is just an approximation to physical reality.

With respect to the RMSE, however, the model quality is excellent even when fewer than 50 peak terms are included. Figures 5 through 8 show selected models for each component. The figures depict the measured spectrum and the individual peak functions $V_i(\cdot)$ of the model in the upper diagram, and the model residual in the lower diagram. Furthermore, the lower diagram includes two gray curves that represent multiples of the smoothed empirical standard deviation $\sigma$, corresponding to the 99% region of a normal distribution, suggesting that an excellent fit is obtained in all cases.

The selection of one model per component to be discussed was done on the basis of goodness-of-fit curves like those for EtOH in Figs. 3 and 4. We included only that number of peaks resulting in the last major improvement in RMSE or the F-test (e.g., 32 for ethanol). For the resulting model complexity, the residual was, in addition, visually inspected with regard to potentially unmodeled bands. No contradiction was found. In summary, we have not applied an entirely objective criterion for model choice. We may note, however, that the algorithm aims at quantitative applications, and that model selection is not too crucial if the goal is calibration, once the main characteristics of the spectrum are captured. The key is that the area under the spectrum is adequately represented. Inclusion of additional peaks with the automatic procedure typically adds small, localized noise components to the model. If such peaks are included, their contribution to the overall spectral area is negligible.

Table I lists the number of peak functions for the selected models and some numerical details that allow assessment of the goodness of fit. The value for the RMSE is significantly lower than the value of 0.005 given by Maddams [5], and even the maximum deviations $|\Delta A|_{\max}$ are hardly larger.

Pure Component Models for Raman Spectra. The intensity values of data set 2 were scaled by $10^{-6}$ to yield band maxima around unity for the pure components, to make the results comparable to the IR results presented for data set 1. The scaled intensities are denoted by $I_s$. Sequences of models were automatically generated for the three components. The F-test could not be applied because no replicate measurements were available for this data set and, therefore, error variance estimates cannot be obtained. As an alternative criterion, the algorithm was stopped when the maximum positive residual $|\Delta I_s|_{\max}$ fell below 0.001. Details on the resulting model quality are given in Table II. According to the RMSE and the maximum model deviations $|\Delta I_s|_{\max}$, the representations are much better than for data set 1. This is partly because the spectra are simpler and partly because they are more directly measured. As an example, Fig. 9 depicts the model and residual for dioxane. Note that the scale of the residual is about one tenth of that for the IR examples, although the spectral intensity is comparable.

FIG. 8. Butyl acetate model and residual. Upper diagram: gray = raw data, black = peak functions $V_i(\cdot)$. Lower diagram: gray = $\pm 2.58\,\sigma$, corresponding to 99% of a normal distribution, black = model residual.

FIG. 9. Model and residual of dioxane Raman spectrum. Upper diagram: gray = raw data, black = peak functions $V_i(\cdot)$. Lower diagram: black = model residual.

CONCLUSION

We have presented a new algorithm that automatically generates parametric spectral models. The algorithm iteratively builds models of increasing complexity by adding peak functions at wavenumbers where the model quality in terms of the spectral residual is not yet sufficient, and then optimizing their parameters. This approach is derived from the intuitive procedure that a human modeler would apply if she or he had no prior knowledge of spectral features. There will probably be a difference in the sequence of adding peaks, as the human eye would favorably detect broader model deficiencies first. The final models, however, do not look much different.
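The iteration summarized above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: Lorentzian bands stand in for the peak functions Vi(·), each new peak starts with a fixed half-width `init_hwhm`, and a simple coordinate pattern search (`refine`) replaces the nonlinear least-squares optimizer one would use in practice. All function names are hypothetical. The stopping rule mirrors the maximum-positive-residual criterion used for data set 2.

```python
import math

def lorentzian(x, height, center, hwhm):
    """Lorentzian band (ASSUMED shape, stand-in for the paper's peak functions)."""
    return height / (1.0 + ((x - center) / hwhm) ** 2)

def model(x, peaks):
    """Sum of all peak functions evaluated on the grid x."""
    return [sum(lorentzian(xi, *p) for p in peaks) for xi in x]

def sse(x, y, peaks):
    """Sum of squared residuals between data y and the current model."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, model(x, peaks)))

def refine(x, y, peaks, sweeps=60):
    """Placeholder optimizer: coordinate pattern search over all parameters.

    A trial step is accepted only if it lowers the SSE, so the fit never
    gets worse than the starting guess. Not the optimizer of the paper.
    """
    def unpack(flat):
        return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

    flat = [v for p in peaks for v in p]
    steps = [0.1 * abs(v) + 1e-3 for v in flat]
    best = sse(x, y, unpack(flat))
    for _ in range(sweeps):
        for i in range(len(flat)):
            improved = False
            for sign in (1.0, -1.0):
                trial = flat[:]
                trial[i] += sign * steps[i]
                if i % 3 == 2 and trial[i] <= 0.0:
                    continue  # keep half-widths strictly positive
                s = sse(x, y, unpack(trial))
                if s < best:
                    flat, best, improved = trial, s, True
                    break
            if not improved:
                steps[i] *= 0.5  # shrink the step when neither direction helps
    return unpack(flat)

def fit_greedy(x, y, tol=1e-3, max_peaks=10, init_hwhm=1.0):
    """Add one peak at the largest positive residual, re-optimize, repeat."""
    peaks = []
    while len(peaks) < max_peaks:
        resid = [yi - mi for yi, mi in zip(y, model(x, peaks))]
        k = max(range(len(x)), key=lambda i: resid[i])
        if resid[k] < tol:
            break  # maximum positive residual is below the tolerance
        # Initial guess: residual height at its position, ASSUMED start width
        peaks.append((resid[k], x[k], init_hwhm))
        peaks = refine(x, y, peaks)
    return peaks
```

Calling `fit_greedy(x, y)` on a wavenumber grid `x` and a spectrum `y` returns a list of `(height, center, hwhm)` tuples. The cap `max_peaks` is a safeguard added for this sketch, because the placeholder optimizer may not drive the residual below `tol`.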

We have applied the algorithm to model FT-IR and Raman spectra. With respect to classical goodness-of-fit criteria like the RMSE, an excellent model quality could be obtained in all cases. The IR models were, in addition, assessed by an exact statistical criterion, namely, the F-test. Here, the models did not seem to be adequate. We believe that some of the prerequisites for applying the F-test were not fulfilled and that one is likely to reach a similar conclusion for most spectral models that are applied in practice. For any practical analytical application, however, it should be sufficient that the IR residuals lay almost entirely within the 99% limits of a normal distribution, based on the empirical standard deviation. One successful quantitative application of the models has been described in a separate paper.3
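The 99% limits mentioned here correspond to about ±2.58 standard deviations of a normal distribution. As a hedged illustration, the check could be carried out as follows. The function name is hypothetical, and `sigma` is assumed to be an empirical standard deviation of the measurement error supplied by the user.

```python
from statistics import NormalDist

def fraction_within_limits(residuals, sigma, coverage=0.99):
    """Return (fraction of residuals inside the limits, limit value).

    The two-sided limit is z * sigma, where z is the standard normal
    quantile for the requested coverage (about 2.58 for 99%).
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    limit = z * sigma
    inside = sum(1 for r in residuals if abs(r) <= limit)
    return inside / len(residuals), limit
```

A returned fraction close to 1 would correspond to the situation described above, where the residual stays almost entirely within the band.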


We do not claim that the resulting models are physically correct. While human judgment is deemed essential in the modeling process when it comes to a serious physical interpretation of the spectral features under consideration, the situation is less critical for analytical purposes. If certain spectral features can be associated with an absorbing component, it is sufficient to adequately represent their spectral shape and area. In the interest of a good representation, one might even make a compromise as to the physical interpretability. This allows for an automatic fitting procedure such as the one outlined in this paper.

The primary goal pursued in this paper is to simplify quantitative analyses based on parametric models by reducing the manual effort that is required in established approaches. However, with this reduced modeling effort, a whole range of further potential applications of parametric spectra comes into perspective. One example is calibration transfer, and another is (multivariate) curve resolution that could be enhanced to be no longer constrained to the validity of Beer’s law, as current approaches are.

ACKNOWLEDGMENTS

Parts of this work were performed at and funded by Bayer AG, Leverkusen. The authors thank G. Olf and M. Leckebusch from Bayer for their support. Additional financial support of the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center (SFB) 540 “Model-based Experimental Analysis of Kinetic Phenomena in Fluid Multi-Phase Reactive Systems” is also gratefully acknowledged. Special thanks to D. Hennebeil, who did some of the experimental work for data set 1, and V. Goke for providing the excellent experimental data set 2.

1. L. Antonov and D. Nedeltcheva, Chem. Soc. Rev. 29, 217 (2000).

2. H. Martens and T. Næs, Multivariate Calibration (John Wiley and Sons, New York, 1989).

3. F. Alsmeyer, H.-J. Koß, and W. Marquardt, Appl. Spectrosc. 58, 975 (2004).

4. J. Higinbotham and I. Marshall, Annu. Rep. NMR Spectrosc. 43, 59 (2000).

5. W. F. Maddams, Appl. Spectrosc. 34, 245 (1980).

6. A. Miekina, R. Z. Morawski, and A. Barwicz, IEEE Trans. Instrum. Measurement 46, 1049 (1997).

7. A. Ferry and P. Jacobsson, Appl. Spectrosc. 49, 273 (1995).

8. A. P. De Weijer, C. B. Lucasius, L. Buydens, G. Katerman, H. M. Heuvel, and H. Mannee, Anal. Chem. 66, 23 (1994).

9. A. P. Thorne, Spectrophysics (Chapman and Hall, London, 1974).

10. J. Toft, O. M. Kvalheim, F. O. Libnau, and E. Nodland, Vib. Spectrosc. 7, 125 (1994).

11. P. R. Griffiths and J. A. de Haseth, Fourier Transform Infrared Spectrometry, volume 83 of Monographs on Analytical Chemistry and its Applications (John Wiley and Sons, New York, 1986).

12. J. Humlicek, J. Quant. Spectrosc. Radiat. Transfer 27, 437 (1981).

13. S. Bruce, J. Higinbotham, I. Marshall, and P. H. Beswick, J. Magn. Reson. 142, 57 (2000).

14. J. H. Duckworth, “Spectroscopic Quantitative Analysis”, in J. Workman and A. W. Springsteen, Eds., Applied Spectroscopy. A Compact Reference for Practitioners (Academic Press, New York, 1998).

15. C. E. Alciaturi, M. E. Escobar, and C. De La Cruz, Anal. Chim. Acta 376, 169 (1998).

16. W. E. Stewart, T. L. Henson, and G. E. P. Box, AIChE J. 42, 3055 (1996).

17. Analytical Methods Committee, “Is my calibration linear?”, Analyst (Cambridge, U.K.) 119, 2363 (1994).

18. I. N. Bronstein and K. A. Semendjajew, Taschenbuch der Mathematik (Pocketbook of mathematics) (Teubner, Stuttgart, Leipzig, 1991), 25th ed.

19. Y. Bard, Nonlinear Parameter Estimation (Academic Press, New York, 1974).

20. MathWorks, Using MATLAB, Version 6 (The MathWorks, Inc., Natick, MA, 2000).

21. MathWorks, Optimization Toolbox for Use with MATLAB, User’s Guide Version 2 (The MathWorks, Inc., Natick, MA, 2000).

22. A. Bardow, W. Marquardt, V. Goke, H.-J. Koß, and K. Lucas, AIChE J. 49, 323 (2003).

23. P. Saarinen and J. Kauppinen, Appl. Spectrosc. 45, 953 (1991).
