Mutual peak matching in a series of HPLC–DAD mixture analyses

Analytica Chimica Acta 490 (2003) 41–58


Andrey Bogomolov a,∗, Michael McBrien b

a Advanced Chemistry Development, Inc., 6 Akademika Bakuleva, 117513 Moscow, Russia
b Advanced Chemistry Development, Inc., 90 Adelaide Street West, Suite 702, Toronto, Ont., Canada M5H 3V9

Accepted 21 May 2003

Abstract

One of the largest challenges in high performance liquid chromatography (HPLC) method development is the necessity for tracking the movement of peaks as separation conditions are changed. Peak increments are then used to build a mathematical model capable of minimizing the number of experiments in an optimization circuit. Method optimization for an unknown mixture is, moreover, complicated by the absence of any a priori information on component properties and retention times when direct signal assignment is not possible. On the contrary, achievement of the maximum separation becomes an important factor for successful identification or quantitation. In this case, the optimization may be based on assigning peaks of the same component chosen from different experiments to each other. In other words, mutual peak matching between the HPLC runs is required.

A new method for mutual peak matching in a series of HPLC with diode array detector (HPLC–DAD) analyses of the same unknown mixture acquired at varying separation conditions has been developed. This approach, called mutual automated peak matching (MAP), does not require any prior knowledge of the mixture composition. Applying abstract factor analysis (AFA) and iterative key set factor analysis (IKSFA) on the augmented data matrix, the algorithm detects the number of mixture components and calculates the retention times of every individual compound in each of the input chromatograms. Every candidate component is then validated by target testing for presence in each HPLC run to provide quantitative criteria for the detection of “missing” peaks and non-analyte components as well as confirming successful matches. The matching algorithm by itself does not perform full curve resolution. However, its output may serve as a good initial estimate for further modeling. A common set of UV-Vis spectra of pure components can be obtained, as well as their corresponding concentration profiles in separate runs, by means of alternating least-square multivariate curve resolution (ALS MCR), resulting in reconstruction of overlapped peaks.

The algorithms were programmed in MATLAB® and tested on a number of sets of simulated data. Possible ways to improve the stability of results, reduce calculation time, and minimize operator interaction are discussed. The technique can be used to optimize HPLC analysis of a complex mixture without preliminary identification of its components.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Peak matching; Multivariate data analysis; Self-modeling curve resolution; HPLC

∗ Corresponding author. E-mail address: [email protected] (A. Bogomolov).

1. Introduction

Overlapped peaks represent one of the most serious problems in high performance liquid chromatography (HPLC) data interpretation. Hyphenated techniques such as LC-UV with diode array detection have given rise to a number of new methods of data analysis that apply the potential of chemometrics for extracting hidden information.

0003-2670/$ – see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0003-2670(03)00667-6

Over the last two decades, a wealth of excellent work has been devoted to the curve resolution problem in the hyphenated techniques using multivariate data analysis methods. Comprehensive overviews of these methods are given in Malinowski’s book [1] and in the paper by Hamilton and Gemperline [2]. Interest in the curve resolution problem among researchers and practical analysts is constantly growing and there is clear progress in this area. However, an unambiguous solution suitable for quantitative or qualitative analysis is not always possible. Besides, these methods are still state-of-the-art, requiring intensive interaction with an experienced operator and specific software, which is not a part of the instrument integration package.

Therefore, in practical mixture analysis significant efforts are made for method optimization to achieve maximal separation of components. Typically, for optimization of a chromatographic experiment, a series of spectrochromatograms is obtained under a wide range of the main system parameters (temperature, pH, solvent composition, and gradient). The analysis of peak movements while a parameter value is changed enables the chromatographer to find appropriate separation conditions. Additional data regarding known components of the mixture under analysis can facilitate the optimization considerably and reduce the number of experimental runs. Component retention times can be forecast using experimental data [3,4], at hand or published, or based on the physicochemical properties of a molecule.

The analysis of unknown mixtures is a serious challenge for the analyst because direct identification of peaks is impossible. Finding the best conditions by arbitrary searching can be time-consuming and expensive for complex mixtures. In the case of an unknown mixture, modeling of the system behavior is still possible if mutual peak matching between chromatograms in a series is performed. This means that each suggested component in the mixture should be assigned a peak in each separate experiment. In fact, the peak matching problem consists of distinguishing a signal pattern that can be represented as a table with the rows corresponding to mixture components and experiments in columns. The predicted retention times are then inserted into the corresponding cells.

Mutual matching during the optimization cycle is complicated because the first spectrochromatograms are often taken at arbitrary conditions, so the degree of signal overlap may be rather high. The identity of two peaks referring to the same component cannot be proven by mere comparison of their spectra or peak areas if at least one of these is overlapped by another signal. This is certainly a multicomponent problem, and is, therefore, suited to a chemometrics solution.

Peak matching has much in common with the curve resolution problem. In both cases, individual peak related data is extracted from the analysis of overlapping signals. However, they have an essential difference. Curve resolution aims at achieving an ultimate solution, i.e. the reproduction of all spectra of separate components and their concentration profiles. In contrast to this, peak matching takes an interest in the reproduction of the generalized peak pattern of individual components. The isolation of each component spectrum and its profile is not generally required. Moreover, the optimization of a chromatographic experiment may not always require high accuracy for retention time evaluation. The accuracy of maxima evaluation usually needs only to be high enough to distinguish between the overlapping and standalone peaks. It may seem tempting to use the curve resolution approaches for solving the peak matching problem, for example, to apply alternating least-square multivariate curve resolution (ALS MCR) on a data matrix augmented for common spectra [5,6] or make use of another curve resolution technique. Then obtaining the retention times in profile maxima is trivial provided that the resolution is successful. However, in our opinion, this is not a straightforward approach. A huge amount of intensive calculation for obtaining unnecessary data about the form of curves is not the only issue. The main problem with curve modeling approaches is their sensitivity to incorrect evaluation of the true number of input parameters. This is especially characteristic of multivariate data analysis, and may be problematic in the case of flat curve fitting. Incorrect evaluation of the model size may cause bad convergence and result in an incorrect solution. Failed curve resolution would mean, in this case, unsuccessful peak matching in general. Therefore, an approach allowing a partial solution (that is, the matching of the main peaks in a mixture only) and less sensitive to the deduced number of factors would be preferable.

To solve this problem, we applied an approach based on defining a key set of spectra. The defined set is then subject to validation and selection. Retention times are determined using the key set in a regression step, and eventually, the results are validated. These are the major stages of the suggested approach. Let us consider the key set selection as a major part of the approach in detail.

There are several approaches to finding a key set of variables or samples: iterative key set factor analysis (IKSFA) by Schostak and Malinowski [7], orthogonal projection approach (OPA) [8,9], simple-to-use iterative self-modeling mixture analysis (SIMPLISMA) by Windig and Guilment [10]. There are also other known methods [11–13]. All of these approaches aim to define key sets or data matrix columns (or rows) that are used as axes for a new reduced space to cover data space most efficiently. Such variables (or samples) are called key or typical variables. However, the above approaches involve varying criteria for defining the key set and may yield different results. The SIMPLISMA approach is based on defining pure variables (i.e. spectral wavelength with a single-component contribution) and thus, revealing relative component concentrations. OPA is based on Gram–Schmidt orthogonalization, and on the assumption that the purest spectra are mutually more dissimilar than the corresponding mixture spectra. IKSFA is based on the same assumption as OPA and looks for a set of the most orthogonal spectra but solves the task in a slightly different manner. In IKSFA, an initial set of key rows (or columns) is first obtained by key set factor analysis (KSFA) [14] operating on scores in the abstract factor space. Because the selection process in KSFA is sequential, the procedure will not necessarily yield the most orthogonal key set. IKSFA represents a refinement of KSFA where the initially obtained key set is iteratively improved by substitutions, to produce the best key set in a space of n primary factors.

In our method, we gave preference to IKSFA. This method has been applied for spectroscopic data analysis repeatedly, and its effectiveness has been proven [15,16]. For the purposes of this research, the approach was modified to adapt it to the specific problem. The essence of the algorithm, however, remained generally the same. The IKSFA approach is appropriate for the given problem. The key spectra obtained using modified IKSFA tend to be found in the maxima of actual component concentration profiles. This capability to detect hidden peak maxima is an obvious advantage of this approach regarding the peak matching problem. However, a comparison of methods has not been made in this research, and thus the applicability of other approaches remains possible.

In this paper, we suggest a new method of peak matching in a series of spectrochromatograms of the same mixture obtained under varying separation conditions, called mutual automated peak matching (MAP). The method solves two interrelated problems simultaneously: the determination of the number of main analytes in the mixture and the evaluation of their retention times in each experiment of a series. While elaborating the method, we placed special emphasis on its reliability for detection of the main signals, rather than its sensitivity to admixtures at the limit of detection. An important property of the method is its excellent performance under poor separation conditions while peak intensities vary dramatically. Special attention was paid to the validation of the obtained solution to reduce the effect of noise and non-analyte factors. Another important feature of the method is its applicability to inconsistent data. Missing components in one or several experiments can be detected.

As this work is mainly devoted to the theoretical basis of the peak matching approach, simulated data was used for its demonstration. The method was tested on a set of simulated data modeling a chromatographic separation of a mixture of ten components of the same homology: phenanthrene and its nine monosubstituted derivatives. The potential of the method for peak matching on data with a complex pattern, highly overlapping signals, and two different noise sources is shown. Approaches to prospective optimization and performance improvement are discussed.

2. Method

2.1. General method assumptions

The following assumptions are imposed on the data by the algorithms used in the method:

(1) UV-Vis spectra of components obey Beer’s law;

(2) the spectrum of an analyte is constant during the same HPLC run as well as between different runs in one series within experimental error; and

(3) the spectra of different components are significantly different compared to experimental error; more generally, no pure component spectrum is a linear combination of the others.

Though these assumptions have an underlying theoretical basis, experimental data may not always abide by these requirements. This, however, is not a reason to automatically disregard this method. Its applicability should be considered for each individual case separately, taking into account the properties of the system under study and the degree of deviation.

Data used in this paper also have some internal allowances, that is, a constant baseline and the Gaussian shape of the concentration profiles of the components. However, these properties do not refer to the basic assumptions of the method because they do not impose any constraints over the application area of the algorithms, and thus should not exert a crucial influence over the result obtained.

The application area of MAP may be expanded to include conditions violating the above assumptions. This, however, is beyond the scope of this paper and may be elaborated during our further investigations.

2.2. Deducing the number of components

Fig. 1. Schematic illustration of creating an augmented data matrix with a common spectral axis.

The study of a series of spectrochromatograms is a three-way data analytical problem where several matrices forming a three-dimensional data array are simultaneously involved in analysis. However, because component spectra are supposed to be the same, its solution can be transformed into a two-way plane by connecting individual HPLC runs into a single augmented matrix with a joint evolutionary axis (Fig. 1). The augmented data matrix Daug is composed of individual datasets D1, D2, ..., Dk (spectra in columns) as shown in (1).

$$D_{\mathrm{aug}} = [\, D_1 \;\; D_2 \;\; \cdots \;\; D_k \,] \qquad (1)$$

Whereas spectral axes of all matrices in a dataset should be the same in order to make augmentation possible, time axes may vary. They will simply be joined together to form a new augmented time scale.
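The augmentation in (1) can be illustrated with a minimal NumPy sketch. The function name `augment` and the random test matrices are assumptions made for illustration only; the sketch merely assumes, as stated above, that every run stores spectra in columns and shares the same spectral axis.

```python
import numpy as np

def augment(runs):
    """Join runs (each w x t_j, spectra in columns) along the time axis."""
    w = runs[0].shape[0]
    if any(D.shape[0] != w for D in runs):
        raise ValueError("all runs must share the same spectral axis")
    lengths = [D.shape[1] for D in runs]      # needed later to split C_aug
    return np.hstack(runs), lengths

# Three hypothetical runs: 5 wavelengths, different numbers of spectra.
rng = np.random.default_rng(0)
runs = [rng.random((5, t)) for t in (8, 10, 6)]
D_aug, lengths = augment(runs)
```

Keeping the individual run lengths alongside the augmented matrix makes it straightforward to cut the augmented time axis back into runs at a later stage.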

The first step of the algorithm is the determination of the number of system components (factors). The number of factors should be determined on the augmented data matrix because the same matrix is used for further calculations. The total number of components is determined by abstract factor analysis (AFA, also known as principal component analysis, PCA) of Daug and based on analysis of products of data matrix decomposition (2):

$$D_{\mathrm{aug}} = R_{\mathrm{abs}} C_{\mathrm{abs}} + E \qquad (2)$$

where Rabs is an abstract row matrix (scores) with n columns and Cabs an abstract column matrix (loadings) with n rows. The matrix E is supposed to consist of the extracted error.

Numerous approaches, a comprehensive overview of which is given in the book by Malinowski [1], can be applied to determine the number of primary factors n. Methods based on experimental error such as real error (RE) and imbedded error (IE) functions are generally preferred. The RE function requires the experimental error residual standard deviation (R.S.D.) to be known, providing a cut-off level at a proper number of factors. The IE is based on detecting the function minimum corresponding to the best model. Both techniques produce excellent results when the error is represented by noise having distribution close to normal. The factor indicator function (IND) [17] also showed good ability to pick out the proper number of factors.
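As an illustration of decomposition (2) and of the error functions mentioned above, the following NumPy sketch computes AFA by singular value decomposition and evaluates RE, IE and IND. The formulas used are the commonly cited forms from Malinowski's factor analysis theory and are an assumption of this sketch, as are the function names and the small three-component test matrix; the book [1] remains the reference for the exact definitions.

```python
import numpy as np

def afa(D, n):
    """Abstract factor analysis by SVD: D ~ R_abs @ C_abs with n factors."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    R_abs = U[:, :n] * s[:n]          # scores, n columns
    C_abs = Vt[:n]                    # loadings, n rows
    return R_abs, C_abs

def error_functions(D):
    """RE, IE and IND for n = 1 .. c-1 factors (assumed standard forms)."""
    r, c = max(D.shape), min(D.shape)             # c is the smaller dimension
    lam = np.linalg.svd(D, compute_uv=False) ** 2  # eigenvalues of D.T @ D
    n = np.arange(1, c)
    tail = np.array([lam[k:].sum() for k in n])    # eigenvalues n+1 .. c
    RE = np.sqrt(tail / (r * (c - n)))
    IE = RE * np.sqrt(n / c)
    IND = RE / (c - n) ** 2
    return n, RE, IE, IND

# Hypothetical data: 3 Gaussian peaks, 30 wavelengths, 300 spectra.
rng = np.random.default_rng(1)
t = np.arange(300)
C = np.exp(-0.5 * ((t - np.array([[80], [150], [220]])) / 12.0) ** 2)
S = rng.random((30, 3))
sigma = 0.001
D = S @ C + rng.normal(0.0, sigma, (30, 300))

n, RE, IE, IND = error_functions(D)
n_found = int(n[np.argmin(IND)])      # factor count at the IND minimum
R_abs, C_abs = afa(D, n_found)
```

In this sketch the number of factors is taken at the minimum of IND; comparing RE with a known noise level, or locating the IE minimum, would be used in the same way.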

Calculating dimensionality n of the factor space is necessary since it is used by the IKSFA algorithm. AFA detects the number of components in an abstract mathematical sense and, for real world systems, it may differ from an analyst’s estimates. However, as we are going to show later, the present approach is relatively tolerant of an error in the number of factors found at the AFA stage. Moreover, the AFA result need not necessarily coincide with the final number of components found by the peak matching method. Factors attributed to non-detectable or non-analyte components as well as the noise could be detected and removed at later stages. Considering this, it may be relevant to increase n by a few extra factors in the model prior to moving to the next step.

Individual datasets D1, D2, ..., Dk can be factor analyzed as well to check the data integrity.

2.3. Finding a set of typical spectra

The matrix Daug is subjected to IKSFA to determine a set of the n most orthogonal spectra along the augmented evolutionary axis. To remain consistent with the traditional notation, the algorithm synopsis below is given for finding a key set of data rows. Applied on a transposed matrix, it will similarly produce a solution for typical columns.

IKSFA finds typical data rows by analyzing the matrix of scores for n abstract factors. Each row in this matrix is first normalized to unit length to eliminate a factor of intensity. The algorithm results in indices of the n most orthogonal rows in the normalized scores. Corresponding rows of the original data form the required key set.

For the purposes of this research, the IKSFA approach was modified to meet the specific requirements of the method. The normalization of the abstract score matrix used by the key set selection algorithm, recommended by the classic approach [7], was not performed. This change had a two-fold purpose. First, due to this alteration, the components gain greater weight at greater intensity and are reliably detected by the algorithm, which is in exact agreement with our purpose. In addition, the absence of normalization prevents the exaggeration of noise factors that introduce additional error into the model. Pure noise poses a serious problem to the IKSFA procedure because the normalization gives the noise points equal weight with real data. Therefore, preliminary screening of data with the removal of spectra with a root mean square (rms) response less than five times the real error is strongly recommended by the regular algorithm [1]. Omitting the normalization could cause some loss of method sensitivity to low-intensity components. At the same time, it provides more reliable detection of the main mixture analytes and some simplification of calculations by omitting data pre-processing stages.

As a consequence of being a combination of the most “different”, each of the key spectra is a closest approximation of a pure component spectrum. At the same time, the IKSFA algorithm with the above modification tends to select spectra in maximum positions of individual peaks even if those are significantly overlapped. In other words, spectra with the highest content of analytes are detected. The main advantage of using the augmented data matrix instead of analyzing the datasets separately is that the purest peaks of every component among all the experiments are detected in a single step. Note that finding actual pure spectra is not required by the method.
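The idea of selecting the most mutually orthogonal spectra from unnormalized scores can be sketched as follows. This is a simplified stand-in written in the spirit of the description above, not the published KSFA/IKSFA algorithm: it assumes a greedy start (largest residual after projecting out the rows already chosen) followed by one-at-a-time substitutions that are accepted whenever the absolute determinant of the selected score rows grows. The function name `key_set` and the two-component test data are hypothetical.

```python
import numpy as np

def key_set(scores, max_sweeps=20):
    """Pick n rows of an (m x n) score matrix spanning the largest volume."""
    m, n = scores.shape
    # Greedy start: take the row with the largest part orthogonal to the
    # rows already chosen. Rows are deliberately NOT normalized.
    resid = scores.astype(float).copy()
    keys = []
    for _ in range(n):
        i = int(np.argmax(np.einsum("ij,ij->i", resid, resid)))
        keys.append(i)
        q = resid[i] / np.linalg.norm(resid[i])
        resid -= np.outer(resid @ q, q)
    # Refinement: substitute one key at a time while |det| increases.
    best = abs(np.linalg.det(scores[keys]))
    for _ in range(max_sweeps):
        improved = False
        for pos in range(n):
            for cand in range(m):
                if cand in keys:
                    continue
                trial = keys.copy()
                trial[pos] = cand
                vol = abs(np.linalg.det(scores[trial]))
                if vol > best * (1 + 1e-12):
                    keys, best, improved = trial, vol, True
        if not improved:
            break
    return sorted(keys)

# Hypothetical noise-free run: 2 components, 40 wavelengths, 100 spectra.
t, wl = np.arange(100), np.arange(40)
C = np.exp(-0.5 * ((t - np.array([[30], [70]])) / 5.0) ** 2)
S = np.exp(-0.5 * ((wl[:, None] - np.array([12, 25])) / 6.0) ** 2)
D = S @ C                                     # spectra in columns
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)
scores = U[:, :2] * s[:2]                     # one row per measured spectrum
keys = key_set(scores)                        # indices along the time axis
```

Because the rows are not scaled to unit length, intense spectra near peak maxima carry the most weight, which is the behavior described above for the modified algorithm.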

2.4. Key set refinement

Typical spectra resulting from IKSFA do not necessarily represent an optimal set of factor axes spanning the space of experimental data most effectively. If the number of key spectra is less than the actual dimensionality of the data space n, the model is underdetermined, and some useful information may be lost. In case n is, to the contrary, overestimated, spectra modeling experimental error may be introduced, leading the model to degradation. Hence, key spectra have an analogy with abstract factors obtained by AFA.

Because the model size n is often task-dependent and to some extent subjective, its exact evaluation at the AFA stage is not always possible. To avoid obtaining an underdetermined model it is advisable to add one or two extra factors to the number obtained from AFA. The key set refinement procedure enables the detection and exclusion of excessive and “bad” spectra as described below. Here we present an iterative key set optimization approach based on tools provided by target factor analysis.

First, one must exclude factors modeling mostly noise as well as outlying measurements, such as experimental artifacts. Spectra of these two types are unique to the whole dataset and can be easily detected by various procedures known as target testing [1]. In the present work, we used the SPOIL function. Its value is a reflection of how much error is present in the target relative to the data matrix. In fact, the SPOIL gives us a “measure of quality” of the target for reproducing the data matrix. The larger the value, the poorer will be the data matrix reproduction. A rule-of-thumb criterion was suggested that prompts for exclusion (SPOIL > 6) or acceptance (SPOIL < 3) of a target being tested. An intermediate function value requires the attachment of additional information to make a well-grounded decision. For details on the SPOIL function definition and usage, the reader may refer to the original source [18].

Each spectrum in an initial key set is individually target tested on abstract data space with n factors (n is the number of key spectra). In cases where the SPOIL value exceeds 6, n should be decremented by 1 and the key set recalculated for n − 1 factors as described in Section 2.3. The procedure is repeated until all the remaining spectra satisfy the above condition. Since SPOIL is an empirical parameter, its upper cut-off value may be varied by an operator to suit a specific data analysis situation.
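A target test with the SPOIL criterion can be sketched as below. The definitions of the apparent error in the target (AET), the real error in the predicted vector (REP), the real error in the target (RET) and SPOIL = RET/REP are the commonly cited forms from target factor analysis error theory and are an assumption of this sketch; the original source [18] should be consulted for the exact definitions. The function name and the simulated data are hypothetical.

```python
import numpy as np

def target_test(x, R_abs, RE):
    """Target test vector x on the factor space spanned by R_abs.

    Returns (AET, REP, RET, SPOIL) under assumed standard definitions.
    """
    t, *_ = np.linalg.lstsq(R_abs, x, rcond=None)   # transformation vector
    x_hat = R_abs @ t                               # predicted target
    AET = np.sqrt(np.mean((x - x_hat) ** 2))        # apparent error in target
    REP = RE * np.linalg.norm(t)                    # error from the data matrix
    RET = np.sqrt(max(AET ** 2 - REP ** 2, 0.0))    # error in the target itself
    return AET, REP, RET, RET / REP

# Hypothetical noisy data: 3 components, 30 wavelengths, 300 spectra.
rng = np.random.default_rng(3)
tt = np.arange(300)
C = np.exp(-0.5 * ((tt - np.array([[80], [150], [220]])) / 12.0) ** 2)
S = rng.random((30, 3))
D = S @ C + rng.normal(0.0, 0.001, (30, 300))

n = 3
U, s, Vt = np.linalg.svd(D, full_matrices=False)
R_abs = U[:, :n] * s[:n]
r, c = max(D.shape), min(D.shape)
RE = np.sqrt(np.sum(s[n:] ** 2) / (r * (c - n)))    # real error for n factors

spoil_true = target_test(S[:, 0], R_abs, RE)[3]     # a true component spectrum
spoil_foreign = target_test(rng.random(30), R_abs, RE)[3]  # unrelated vector
```

Under these assumptions a spectrum that belongs to the factor space gives a small SPOIL value, while an unrelated vector gives a value far above the exclusion threshold.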

Absence of unique spectra in a key set, by itself, does not guarantee that the remaining spectra represent the optimal key set because it may still contain some redundant spectra spoiling the model. The second procedure of the key set refinement stage is direct examination for its being overdetermined in an iterative target combination cycle as follows. First, each of n spectra from Dkey is individually target tested on Rabs (the n-factor scores of Daug as defined by formula (2)), resulting in a set of transformation vectors t1, t2, ..., tn. (Note, n stands for a current number of spectra in the key set left after previous elimination steps. The number of factors retained in the score matrix used for target analysis should be the same.) Multiplication of the inverse of the transformation matrix T = [ t1 t2 ··· tn ] by the loadings of Daug transforms the abstract column matrix Cabs into the physical space, producing an augmented estimate of concentration profiles Caug, complementary with spectra in Dkey.

$$T^{-1} C_{\mathrm{abs}} = C_{\mathrm{aug}} = [\, C_1 \;\; C_2 \;\; \cdots \;\; C_k \,] \qquad (3)$$

where C1, C2, ..., Ck are the portions of the augmented profile Caug corresponding to individual experiments D1, D2, ..., Dk.

An optimal key set can be defined as one producing a minimum prediction error as expressed by formula (4).

$$\lVert D_{\mathrm{key}} C_{\mathrm{aug}} - D_{\mathrm{aug}} \rVert^{2} = \text{minimum} \qquad (4)$$

Straight double brackets designate the norm of a matrix. Based on this definition, key spectra which are “bad” for data reproduction can be detected and removed from Dkey by the following procedure. The first spectrum is excluded from Dkey and concentration profiles are calculated for this truncated set. When deleting the spectrum results in an increase of the prediction error, calculated by the left side of the expression (4), this spectrum is considered significant for the model and should be kept. In the same way the second, third, etc. spectra are consecutively deleted followed by the above check. When deletion of a spectrum of Dkey leads to an improvement of the model, decreasing the prediction error, the whole key set is recalculated for n − 1 factors. The full cycle is repeated until all n spectra become useful for the model.
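The quantities in (3) and (4) can be sketched in NumPy as follows. The sketch assumes that the key spectra are stored as columns of Dkey, that each transformation vector is obtained by least-squares target testing on Rabs, and that the trailing 2 in (4) denotes a squared norm; the function name and the two simulated runs are hypothetical. It computes the error for one given key set only and does not reproduce the elimination cycle itself.

```python
import numpy as np

def profiles_and_error(D_key, D_aug, n):
    """C_aug from formula (3) and the left side of formula (4)."""
    U, s, Vt = np.linalg.svd(D_aug, full_matrices=False)
    R_abs, C_abs = U[:, :n] * s[:n], Vt[:n]
    # Column i of T is the transformation vector t_i of key spectrum i.
    T, *_ = np.linalg.lstsq(R_abs, D_key, rcond=None)
    C_aug = np.linalg.solve(T, C_abs)                   # T^-1 C_abs
    err = np.linalg.norm(D_key @ C_aug - D_aug) ** 2    # assumed squared norm
    return C_aug, err

# Two hypothetical noise-free runs of the same 2-component mixture.
t, wl = np.arange(100), np.arange(40)
S = np.exp(-0.5 * ((wl[:, None] - np.array([12, 25])) / 6.0) ** 2)
gauss = lambda centers: np.exp(
    -0.5 * ((t - np.array(centers)[:, None]) / 5.0) ** 2)
C_true = np.hstack([gauss([30, 70]), gauss([45, 60])])  # runs 1 and 2
D_aug = S @ C_true

D_key = D_aug[:, [30, 70]]            # key spectra taken at peak maxima, run 1
C_aug, err = profiles_and_error(D_key, D_aug, n=2)
peaks_run2 = np.argmax(C_aug[:, 100:], axis=1)          # maxima inside run 2
```

Here the key spectra come from the well resolved first run, yet the rows of Caug also locate the maxima of the partially overlapped peaks in the second run.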

The target combination step also serves to confirm that the selected key set of spectra adequately reproduces the original data and the error defined by (4) is meaningful.

Eventually, it is expected that every spectrum of the refined Dkey will correspond to an individual component of an analyzed mixture. Note, however, that a drifting baseline and other non-analyte factors possibly present may result in an increase of the model size. Therefore, it is always recommended to inspect Dkey and Caug visually in order to reveal such “ghost” factors and ignore peak maxima produced by them.

In practice, the stages of finding a key set and its refinement are compiled into a single algorithmic step. Here the separation was applied to emphasize the significance of key set validation prior to starting the peak analysis.


2.5. Finding peaks

Retention times of components are obtained from Caug. For this purpose, it is split along the augmented time axis into submatrices C1, C2, ..., Cj, ..., Ck corresponding to k individual experimental runs D1, D2, ..., Dj, ..., Dk. Every Cj includes n rows related to the spectra in Dkey. The index of the maximum in the ith row of Cj represents a least-square estimate of the retention time of the ith component in the jth dataset. The result can be represented as a table with experiments for columns and retention times of a component in each row.
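Splitting Caug and collecting the row maxima can be sketched as follows. The function name `peak_table`, the run lengths and the Gaussian test profiles are hypothetical; the entries returned are indices local to each run, which would still have to be mapped onto that run's own time axis.

```python
import numpy as np

def peak_table(C_aug, lengths):
    """Rows: components; columns: runs; entries: index of the row maximum."""
    edges = np.cumsum(lengths)[:-1]
    blocks = np.split(C_aug, edges, axis=1)          # C_1, C_2, ..., C_k
    return np.column_stack([np.argmax(C_j, axis=1) for C_j in blocks])

# Hypothetical augmented profiles: 2 components, runs of 80 and 70 spectra.
t1, t2 = np.arange(80), np.arange(70)
g = lambda t, c: np.exp(-0.5 * ((t - c) / 4.0) ** 2)
C_aug = np.vstack([np.hstack([g(t1, 20), g(t2, 50)]),
                   np.hstack([g(t1, 60), g(t2, 35)])])
table = peak_table(C_aug, [80, 70])
```

Since `argmax` always returns an index, every cell of the table is filled whether or not the component is actually present in that run.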

To demonstrate the main concept behind collecting retention times as maxima of augmented concentration profiles resulting from the regression, let us first consider a simpler example of a single HPLC run. In a successful key set, every spectrum represents a component of the analyzed mixture. In an ideal case, where each of them is a pure component spectrum, the regression will result in estimates of actual concentration profiles of the analytes. We do not expect that only pure spectra will be found. However, we suggest that the spectra that were selected were close to the maxima of the actual component peaks, which may be implicit because of an overlap. In the latter case, shapes of regression-resolved concentration profiles may be distorted because of a mathematical ambiguity of the system. However, the profile maxima experience the least bias and their indices still can be taken as retention time estimates. In our system, Dkey contains spectra being a key set for the whole joint data matrix Daug. Nevertheless, it is logical to suggest that the rows in Caug reveal maxima of component peaks in every submatrix Cj representing an individual run. Consequently, Caug comprises retention time estimates of all mixture components in each run, no matter which Dj gave birth to a spectrum of Dkey responsible for a particular component.

Since a maximum point is always found, the table of peaks compiled in this way will be complete no matter whether a component is really present in a dataset or not. The next stage serves to confirm the actual components and detect missing ones.

2.6. Testing for missing components

At this stage, retention times found should be considered as candidate peaks since the same component may be present in one experiment and have no signal in another (e.g. as a result of incomplete analysis), but the peak table is always produced. To confirm the presence of a component in a certain experiment, key set spectra are individually target tested on separate datasets D1, D2, ..., Dj, ..., Dk. If experimental error is known, real error in the target (RET) [1] allows one to judge the presence or absence of a tested spectrum in the abstract space of another dataset. When a component is missing in the space of Dj the RET will be significantly higher than both the experimental error and RETs of testing the same spectrum on other datasets where the corresponding component is present. In case of unknown experimental error, other statistical criteria such as the SPOIL function or F-test [19] calculations can be applied.

Comparison of concentration profiles in the Cj portion of Caug with their target-improved analogs from individual experiments can also provide a relevant basis for detecting a missing component.
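The per-run test for missing components can be illustrated with a simplified sketch. As a stand-in for RET, it reports only the apparent misfit of each key spectrum after projection onto the factor space of each individual run, which is enough to show the contrast between present and missing components; the function name, the per-run factor counts and the simulated data are assumptions of this example.

```python
import numpy as np

def apparent_errors(D_key, runs, n_factors):
    """aet[i, j]: rms misfit of key spectrum i in the factor space of run j."""
    aet = np.empty((D_key.shape[1], len(runs)))
    for j, (D, n) in enumerate(zip(runs, n_factors)):
        U = np.linalg.svd(D, full_matrices=False)[0][:, :n]  # spectral basis
        resid = D_key - U @ (U.T @ D_key)
        aet[:, j] = np.sqrt(np.mean(resid ** 2, axis=0))
    return aet

# Hypothetical data: component 2 is absent from the second run.
rng = np.random.default_rng(2)
wl, t = np.arange(40), np.arange(100)
S = np.exp(-0.5 * ((wl[:, None] - np.array([12, 25])) / 6.0) ** 2)
C = np.exp(-0.5 * ((t - np.array([[30], [70]])) / 5.0) ** 2)
noise = lambda: rng.normal(0.0, 0.001, (40, 100))
run_1 = S @ C + noise()                   # both components present
run_2 = S[:, :1] @ C[:1] + noise()        # component 2 missing
aet = apparent_errors(S, [run_1, run_2], n_factors=[2, 1])
```

The entry for the second spectrum in the second run stands far above the noise level and above the same spectrum's misfit in the first run, which is the pattern described above for a missing component.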

2.7. Curve resolution (optional step)

Finally, the analysis may be completed by decomposition of data matrices into spectra of individual components and their concentration profiles. This can be performed by ALS MCR on an augmented data matrix with common spectra [5,6]. Obtaining quality initial estimates of spectra or concentration profiles to input into the iterative improvement process is, probably, the most problematic stage of most self-modeling curve resolution techniques. In our case, however, they are readily acquired. Initial estimates of concentration profiles are obtained using an approach proposed in [20], as a result of target transformation of uniqueness vectors. Uniqueness vectors are composed of unity values in positions corresponding to the retention times of components, produced by the present MAP method, and zeros elsewhere. Pure spectra and profiles are iteratively improved by least-square calculation while subjecting the solutions to the constraints of non-negativity and unimodality (a requirement of a single maximum in an individual concentration profile). The cycle is repeated until convergence is achieved. However, satisfactory convergence may not be attained in the case where an insufficient or excessive number of factors were accepted in the model. There may also be other hard-to-control factors impeding the solution. These are standard problems for iterative self-modeling curve resolution.
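A heavily simplified sketch of this optional step is given below. It assumes that the initial profiles are obtained by projecting the uniqueness vectors onto the n-factor loading space, enforces non-negativity by clipping (a crude substitute for a proper non-negative least-squares step), omits the unimodality constraint, and runs a fixed number of iterations instead of testing convergence. The function name and the test data are hypothetical; for an augmented matrix the uniqueness vectors would carry a unity value at the retention time found in every run.

```python
import numpy as np

def als_mcr(D, uniq, n_iter=50):
    """Alternating least squares with non-negativity by clipping (sketch)."""
    n = uniq.shape[0]
    V = np.linalg.svd(D, full_matrices=False)[2][:n]    # loadings, n x t
    # Target transformation of uniqueness vectors: project onto loadings.
    C = np.clip(uniq @ V.T @ V, 0.0, None)
    for _ in range(n_iter):
        S = np.clip(D @ np.linalg.pinv(C), 0.0, None)   # spectra update
        C = np.clip(np.linalg.pinv(S) @ D, 0.0, None)   # profile update
    return S, C

# Hypothetical noise-free run with two components.
t, wl = np.arange(100), np.arange(40)
S_true = np.exp(-0.5 * ((wl[:, None] - np.array([12, 25])) / 6.0) ** 2)
C_true = np.exp(-0.5 * ((t - np.array([[35], [65]])) / 6.0) ** 2)
D = S_true @ C_true

uniq = np.zeros((2, 100))
uniq[[0, 1], [35, 65]] = 1.0          # unity at the matched retention times
S_est, C_est = als_mcr(D, uniq)
rel_resid = np.linalg.norm(D - S_est @ C_est) / np.linalg.norm(D)
```

The resolved spectra and profiles are determined only up to a scale factor per component, so they would normally be rescaled before comparison with reference spectra.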

3. Data and Software

3.1. Data choice

Two simulated series of HPLC with diode array detector (HPLC–DAD) analyses were used for demonstration of the method performance, one of them based on real UV-Vis spectra. The choice of artificially constructed data was intentional.

In our opinion, simulated data are better able to show the basics of the new peak matching approach than experimentally acquired datasets. The main advantage of simulated data is the opportunity of comparison of calculation results with “true” values and direct evaluation of the method accuracy. On the other hand, data modeling enables control of various internal factors such as spectral noise, sampling error, etc. This makes it possible to investigate the influence of a certain factor separately, in the absence of other factors. Finally, flexibility of modeling in construction of datasets of desired complexity provides unique capabilities for testing the limits of method applicability. Of course, further testing on live experimental data is necessary for final method validation. Simulated data, nevertheless, is perfectly suited to demonstration of its algorithmic basis, which is the main subject of the present paper. At the same time, utilizing real spectra gives the test problem a fair portion of realism, since spectral factors (such as spectral similarity of homologically related substances) indeed pose the main limitations to method applicability.

3.2. Datasets

Data matrices were constructed by the formula (5):

$$D_j = S Q C_j + D_{\mathrm{err}} \qquad (5)$$

where Dj is the (w × t) matrix of HPLC–DAD data of the jth run in a series of analyses of the same mixture, S the (w × n) matrix of component absorptivity spectra, Q the (n × n) diagonal matrix of component concentrations, Cj the (n × t) matrix of component concentration profiles in the jth experiment normalized to unit area, Derr the (w × t) matrix of experimental error (noise), w the number of wavelengths, t the number of spectra, and n the number of components.

Two series of analyses were emulated: A and B. Both of them included three simulated HPLC runs of a ten-component mixture. The series differ in the complexity of the component peak patterns and in the degree of peak overlap. Series A represents a case of moderate complexity, whereas Series B is constructed to model a challenging situation of badly resolved chromatograms. Moreover, the data in Series B were constructed with real UV-Vis spectra intended to emulate a chromatographic separation of substances of the same homologous family: phenanthrene and its nine monosubstituted derivatives in positions 2, 3, and 9.

Each analysis in Series A consisted of 901 spectra (representing retention times from 0 to 900) registered at 351 wavelengths. Spectrochromatograms in Series B included 801 spectra at retention times from 0 to 800.

Typical noise added to the data matrices in Series A was a normally distributed error with the standard deviation (R.S.D.) equal to 0.001. Series B included two types of noise simultaneously present in the data. These were a constant (background) noise with R.S.D. = 0.0005 and a random error weighted by intensity (detector noise) with R.S.D. = 0.005. In some calculations the noise was varied to check its influence on the solution.
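The two-term noise model of Series B can be sketched as follows. This is only one possible reading of the description: it is assumed here that "weighted by intensity" means the detector term is multiplied by the noise-free signal and that the two terms are independent. The function names are hypothetical.

```python
import random


def add_noise(D, background_sd=0.0005, detector_sd=0.005, seed=0):
    """Add constant and intensity-weighted normal noise to matrix D."""
    rng = random.Random(seed)
    noisy = []
    for row in D:
        noisy.append([
            value
            + rng.gauss(0.0, background_sd)        # constant background term
            + value * rng.gauss(0.0, detector_sd)  # term proportional to signal
            for value in row
        ])
    return noisy


def expected_sd(value, background_sd=0.0005, detector_sd=0.005):
    """SD of the combined noise at a given signal level (independent terms)."""
    return (background_sd ** 2 + (value * detector_sd) ** 2) ** 0.5
```

Under these assumptions the background term dominates for weak signals, while the proportional term dominates near the apex of intense peaks.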

Summary chromatograms in Series A and B (as maximum intensity plots), as well as the constituent component concentration profiles, are presented in Figs. 2 and 3.

3.3. Spectra

3.3.1. Series A

Each spectrum in Series A represented a sum of 1–3 wide Gaussian peaks with randomly chosen maxima (Fig. 4) and covered the wavelength range from 200 to 900 nm with step 2 nm.

3.3.2. Series B

Spectra were obtained from the printed atlas by Lang [21]. Original spectra were acquired on a Beckman Model DU, in a cell with pathlength 1 cm. Spectra were registered by single absorbance measurements stepped between 1 and 5 nm. The substance concentration varied for different wavelength ranges to provide an accuracy of three significant digits. Ethanol was used as a solvent for substituted phenanthrenes. The spectrum of phenanthrene was registered in isooctane.

Fig. 2. Maximum intensity (absorbance) plots and component concentration profiles in Series A. Y-axis label corresponds to the number of the experiment. (Three panels, experiments 1–3; x-axis: retention time, 0–900; peaks labeled with component numbers 1–10.)

Spectra were digitized in the wavelength range from 210 to 360 nm with step 1 nm. Absorptivity spectra were calculated by dividing the measured spectral absorbance by the substance concentration. Absorptivity spectra used for constructing Series B are shown in Fig. 5a and b.

3.4. Component concentrations

3.4.1. Series A

Component concentrations in Series A varied as shown in Table 1.

3.4.2. Series B

Initial component concentrations C0 chosen to model the analyzed mixture in Series B are shown in Table 2.

Although the mixture was intended to have the same composition throughout the analysis, some error was added to bring more realism to the data. The actual concentration used in the data construction contained a normally distributed error with a standard deviation of 10% of the initial concentration value C0. The error was randomly generated and added individually in every experiment. Thus, all of the modeled analyses were somewhat different in component ratio.

3.5. Concentration profiles

Concentration profiles were modeled by the Gaussian function with unit height and half-height width chosen as a random integer number between 5 and 20. All profiles were then normalized to unit area to provide equal areas of the same profile in different analyses provided that the concentration is constant. Peak positions were set to provide the desired complexity of the pattern in each series.
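A minimal sketch of such a profile is given below. It assumes that "half-height width" is the full width at half maximum of the Gaussian and that normalization to unit area is done on the discrete retention-time grid; the function name and the peak position 300 are hypothetical.

```python
import math
import random


def unit_area_profile(times, center, fwhm):
    """Gaussian elution profile scaled so that its discrete sum equals 1."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    raw = [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in times]
    area = sum(raw)
    return [v / area for v in raw]


rng = random.Random(42)
times = list(range(901))      # Series A grid: retention times 0..900
fwhm = rng.randint(5, 20)     # random integer half-height width
profile = unit_area_profile(times, center=300, fwhm=fwhm)
apex = max(range(len(times)), key=lambda i: profile[i])
```

One consequence of unit-area normalization is that the apex height is inversely proportional to the width (close to 0.94 divided by the half-height width for a Gaussian), so narrow peaks of the same component appear taller than broad ones.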


Fig. 3. Maximum intensity (absorbance) plots and component concentration profiles in Series B. Y-axis label corresponds to the number of the experiment.

3.5.1. Series A

Peak positions were chosen so that almost every component peak is overlapped by another signal, to an extent of up to 100% (embedded peak), in at least one of the analyses (Table 1). The resulting concentration profiles are shown in Fig. 2.

Table 1
Component concentrations (C0, arbitrary units), retention times, maximum signal intensities (absorbance), peak overlap (% area), and resolution (in parentheses, lowest Rs value for the peak) in Series A

#    C0    Retention times     Maximum intensity        Overlap (Rs)
           D1    D2    D3      D1      D2      D3       D1           D2           D3
1    10    20    35    110     1.260   1.260   0.525    12 (0.50)    42 (0.25)    30 (0.18)
2    50    30    100   100     4.519   3.163   6.326    2 (0.50)     0 (2.07)     6 (0.28)
3    1     200   410   200     0.096   0.129   0.051    0 (2.08)     0 (1.33)     0 (2.57)
4    5     250   40    25      0.927   1.043   1.669    0 (1.85)     32 (0.25)    0 (3.27)
5    8     300   305   105     0.123   0.082   0.154    0 (1.85)     0 (3.54)     100 (0.18)
6    2     400   610   500     0.049   0.049   0.033    0 (3.24)     40 (0.43)    0 (3.78)
7    3     500   450   640     0.134   0.054   0.060    100 (0.30)   0 (1.33)     52 (0.29)
8    40    510   550   300     0.730   0.913   1.095    10 (0.30)    0 (1.67)     0 (3.13)
9    1     795   800   400     0.022   0.012   0.024    99 (0.21)    0 (6.79)     0 (4.00)
10   4     800   600   650     0.454   0.568   0.454    5 (0.21)     4 (0.43)     12 (0.29)

3.5.2. Series B

The peak pattern was created to provide the following conditions of complexity (Table 2):

• every component is overlapped at least once through the series;


Fig. 4. Spectra of components in Series A. (x-axis: wavelength, 200–900 nm; y-axis: arbitrary absorptivity, 0–1; curves labeled with component numbers 1–10.)

Table 2
Concentrations (C0), component retention times, maximum signal intensities (absorbance), peak overlap (% area), and resolution (in parentheses, lowest Rs value for the peak) in Series B

#    C0 (10−6 × mol/l)    Retention times     Maximum intensity        Overlap (Rs)
                          D1    D2    D3      D1      D2      D3       D1           D2           D3
1    10                   68    80    76      0.598   0.408   0.413    29 (0.17)    50 (0.11)    44 (0.13)
2    7                    75    75    81      0.166   0.226   0.312    69 (0.17)    85 (0.11)    98 (0.13)
3    6                    200   610   150     0.185   0.144   0.225    78 (0.08)    3 (0.29)     17 (0.47)
4    5                    203   200   292     0.175   0.129   0.242    100 (0.08)   0 (2.73)     43 (0.23)
5    12                   210   400   400     0.301   0.220   0.320    53 (0.18)    47 (0.20)    0 (2.63)
6    70                   300   408   166     0.232   0.390   0.243    62 (0.00)    65 (0.20)    11 (0.47)
7    50                   300   700   750     0.135   0.236   0.164    100 (0.00)   0 (2.25)     3 (0.17)
8    40                   400   420   500     0.125   0.095   0.090    0 (2.70)     99 (0.08)    0 (2.63)
9    60                   500   423   300     0.149   0.154   0.142    0 (2.70)     73 (0.08)    46 (0.23)
10   2                    700   600   755     0.007   0.003   0.004    0 (6.45)     96 (0.29)    100 (0.17)


Fig. 5. (a and b) Absorptivity (mol−1 cm−1) spectra used to build Series B: (1) 9-carboxyphenanthrene; (2) 9-cyanophenanthrene; (3) 9-bromophenanthrene; (4) phenanthrene; (5) 9-acetylaminophenanthrene; (6) 2-acetophenanthrene; (7) 2-acetylaminophenanthrene; (8) 3-acetophenanthrene; (9) 3-acetylaminophenanthrene; (10) 3-hydroxyphenanthrene. (a) Spectra 1–10; (b) spectra 6–10. (x-axis in both panels: wavelength, 220–360 nm; y-axis: absorptivity, scaled ×10^5 in panel (a) and ×10^4 in panel (b).)


Table 3
AFA of Daug in Series A: residual standard deviation (R.S.D.) and decimal logarithm of indicator function (IND) for different number of factors (NF)

NF    R.S.D. (×10^3)    lg(IND)
1     69.517            −6.246
2     34.392            −6.549
3     13.036            −6.968
4     7.363             −7.214
5     1.834             −7.815
6     1.185             −8.002
7     1.067             −8.045
8     1.008             −8.067
9     0.999             −8.068
10    0.996             −8.067
11    0.995             −8.065
12    0.994             −8.063

• five components of ten have overlapped signals in every experiment;

• there are partially overlapping groups of two, three, and four peaks;

• embedded peaks are present; and

• there is an instance of two peaks with coinciding retention times.

The resulting concentration profiles are shown in Fig. 3.

3.6. Software

All algorithms presented here were programmed in MATLAB®, Version 6.1, a product of The MathWorks, Inc.

4. Results and discussion

4.1. Test case of moderate complexity (Series A)

Abstract factor analysis of the augmented data matrix Daug, followed by comparison of the R.S.D. for different values of n with the experimental error known to be 0.001, detected nine primary factors (principal components). The ninth factor is disputable, since its R.S.D. is almost equal to the error. We preferred to keep it based on the results of the IND function method (Table 3). The fact that one of the mixture components was assigned to the secondary factors can be explained by the low intensity of the component signals, which are significantly affected by the noise. Nevertheless, the dimensionality of the factor space was increased by two as recommended (n = 11).
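The relation between the two columns of Table 3 can be checked with a short calculation. The sketch assumes Malinowski's empirical indicator function in the form IND = R.S.D./(c − n)^2, where c is the smaller dimension of the data matrix, taken here to be the 351 wavelengths of Series A. Under that assumption the tabulated lg(IND) values are reproduced to within rounding, and the minimum falls at nine factors.

```python
import math

# R.S.D. values from Table 3, converted back from the (x10^3) scale.
rsd = {
    1: 69.517e-3, 2: 34.392e-3, 3: 13.036e-3, 4: 7.363e-3,
    5: 1.834e-3, 6: 1.185e-3, 7: 1.067e-3, 8: 1.008e-3,
    9: 0.999e-3, 10: 0.996e-3, 11: 0.995e-3, 12: 0.994e-3,
}
n_wavelengths = 351  # assumed smaller dimension c of the augmented matrix


def lg_ind(nf):
    """Decimal logarithm of IND = R.S.D. / (c - nf)^2."""
    return math.log10(rsd[nf]) - 2.0 * math.log10(n_wavelengths - nf)


lg_values = {nf: lg_ind(nf) for nf in rsd}
best_nf = min(lg_values, key=lg_values.get)  # number of factors at the IND minimum
```

The minimum is shallow (the values for 8, 9, and 10 factors differ only in the third decimal place), which is consistent with the text describing the ninth factor as disputable.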

IKSFA on the columns of Daug resulted in a key set of 11 spectra collected among all the experiments. Their retention times were (the experiment number is given in parentheses): 100 (3); 300 (3); 25 (3); 20 (1); 600 (2); 410 (2); 105 (3); 500 (1); 400 (1); 400 (3); 49∗ (1).

Note that every key set spectrum except one (denoted by ∗) matches the retention time of an actual component (Table 1).
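This observation can be verified directly against Table 1. The sketch below is only a bookkeeping check on the published numbers, not part of the matching algorithm; the function name is hypothetical.

```python
# True retention times from Table 1: component -> (D1, D2, D3).
true_rt = {
    1: (20, 35, 110), 2: (30, 100, 100), 3: (200, 410, 200),
    4: (250, 40, 25), 5: (300, 305, 105), 6: (400, 610, 500),
    7: (500, 450, 640), 8: (510, 550, 300), 9: (795, 800, 400),
    10: (800, 600, 650),
}

# Key set reported by IKSFA: (retention time, experiment number).
key_set = [(100, 3), (300, 3), (25, 3), (20, 1), (600, 2), (410, 2),
           (105, 3), (500, 1), (400, 1), (400, 3), (49, 1)]


def match_key_set(key_set, true_rt):
    """Assign each key spectrum to the components eluting at that time."""
    assigned, unmatched = {}, []
    for rt, run in key_set:
        hits = [c for c, rts in true_rt.items() if rts[run - 1] == rt]
        if hits:
            assigned[(rt, run)] = hits
        else:
            unmatched.append((rt, run))
    return assigned, unmatched


assigned, unmatched = match_key_set(key_set, true_rt)
```

Each of the first ten key spectra maps to exactly one component, all ten components are covered, and only the spectrum at retention time 49 in experiment 1 is left without a counterpart.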

Visual inspection of the key set spectra provides a rationale for rejecting spectrum 11 as containing nothing but noise (Fig. 6). Spectrum 10, while also noisy, still carries some real information; the operator may decide to retain it. Target testing of the key set in the abstract space of the augmented data leads to the same conclusion. SPOIL = 48 for spectrum 11 is unambiguous proof of its uniqueness. At the same time, spectrum 10 produces SPOIL = 5.0, leaving the decision whether to retain the component up to the operator, although the key set refinement has rejected the tenth factor as non-optimal according to condition (4).

Concentration profiles were calculated and peak tables obtained for both nine- and ten-component models (Table 4). Comparing the results with the original data (Table 1) shows that the calculation accuracy (measured as the root mean square error, rms)

Table 4
Calculated retention times in nine- and ten-component models (Series A)

#     Ten components (a)      Nine components (b)
      D1     D2     D3        D1     D2     D3
1     20     35     110       20     35     112
2     30     100    100       30     100    100
3     200    410    200       200    410    201
4     250    40     25        250    40     25
5     300    305    105       300    305    105
6     400    611    502       400    611    500
7     500    450    638       500    450    640
8     510    550    300       510    550    300
9     795    800    401       –      –      –
10    800    600    649       800    600    650

(a) rms = 0.6325.
(b) rms = 0.4714.


Fig. 6. Key set spectra 10 and 11 in Series A.

of retention times is better in the model with nine components. The tenth component (its assigned number is 9) is a signal of very low intensity and introduces the noise it is mixed with into the whole model, disturbing other factors. Thus, there is a trade-off between accuracy and the potential for loss of components. Regardless, reasonable results can still be obtained.

The influence of noise level on the detection and peak matching capabilities of the algorithm was investigated. The added error in the data for the above calculation was R.S.D. = 0.001. When the noise was decreased to R.S.D. = 0.0009 or less, all ten components passed the refinement procedure followed by successful peak matching. Increasing the noise level to a standard deviation of 0.0019 led to a refined key set of eight IKSFA-produced spectra. Nevertheless, peak matching is still able to satisfactorily recognize peaks from nine components (1–8, 10) up to approximately R.S.D. = 0.004. However, the error in detecting peak positions at that level of noise reaches rms = 1.089. Under these conditions, the refinement procedure detects an optimal key set of six component spectra (1–4, 8, 10), and their retention times calculated for a six-factor model exactly fit the true values. At the same time, the error of localization of the same six components at n = 9 amounts to rms = 0.557. This result vividly demonstrates that retaining in the key set the non-optimal spectra rejected at the refinement stage introduces a large amount of error in the result. This may still produce a sensible solution, but the error significantly affects the high-intensity components that are matched accurately with an optimal key set. If the noise is increased above 0.004, a solution can only be found for eight components. Retention times of components 6 and 9, having the smallest intensity in the series, are detected incorrectly. An incorrect result, as a rule, means that a minor component peak is not detected in one or more HPLC runs of the series. A retention time that belongs to another, already matched peak (a duplicate) often appears instead. In practice, mismatch detection may be problematic; therefore, one should carefully inspect a solution obtained with a key set that does not meet the optimum condition (4).

Comparing the above results with the maximum intensities of the analytes (Table 1), one can estimate the general sensitivity of the present peak matching method in the presence of normally distributed background noise. The method is capable of detecting signals 8–10 times the noise standard deviation (in effect, a signal-to-noise ratio, SNR, of 8–10), as was shown for the lowest intensity mixture components 9 and 6. Reliable detection and matching is achieved for those component peaks whose SNR is above 15–20. Peaks below this level are detected at the expense of accuracy of the entire solution. High error in the data does not generally lead to method failure but only to loss of some minor components which, being spoiled by the noise, are detected at the refinement stage.
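The SNR estimate can be reproduced from the maximum intensities in Table 1. The sketch assumes that the SNR of a component is its lowest peak maximum across the three runs divided by the noise standard deviation; this definition and the function name are illustrative assumptions.

```python
# Maximum intensities (D1, D2, D3) of the two weakest components, Table 1.
max_intensity = {
    6: (0.049, 0.049, 0.033),
    9: (0.022, 0.012, 0.024),
}


def min_snr(component, noise_sd):
    """Lowest peak maximum over the three runs divided by the noise SD."""
    return min(max_intensity[component]) / noise_sd


snr_9_at_default_noise = min_snr(9, 0.001)  # noise level of the main calculation
snr_6_at_high_noise = min_snr(6, 0.004)     # highest noise with nine matched components
```

With these inputs, component 9 has a worst-case SNR of about 12 at the default noise level, and component 6 of about 8 at R.S.D. = 0.004, the highest level at which it was still matched.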

In order to check how missing components can be detected among the calculated retention times, we removed the signal of component 3 from the data matrix of experiment 1 and recalculated the table of peaks for 10 components. Every spectrum of the common key set was then target tested for presence in the abstract space of each individual experiment. The RET values obtained are given in Table 5. A visual demonstration of successful and failed target tests is given in Fig. 7.

Data matrices were successfully decomposed into normalized concentration profiles and corresponding spectra for all ten components. The correlation coefficients between original and reconstructed spectra were 0.9982 and higher.

Table 5
Results of target testing (RET) of key set spectra on individual experiments in Series A

#     D1       D2       D3
1     0.001    0.001    0.001
2     0.001    0.001    0.001
3     0.018    0.001    0.001
4     0.001    0.001    0.001
5     0.001    0.001    0.001
6     0.001    0.001    0.001
7     0.001    0.001    0.001
8     0.001    0.001    0.001
9     0.001    0.001    0.001
10    0.001    0.001    0.001
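The RET values in Table 5 lend themselves to a simple screening rule for missing peaks. In the sketch below, a component is flagged as absent from a run when its RET exceeds a multiple of the known noise level; the factor of 5 is a hypothetical threshold chosen for illustration and is not a criterion stated in the paper.

```python
# RET values from Table 5: component -> (D1, D2, D3).
ret = {c: (0.001, 0.001, 0.001) for c in range(1, 11)}
ret[3] = (0.018, 0.001, 0.001)

noise_sd = 0.001
threshold_factor = 5.0  # hypothetical cut-off, not taken from the paper


def missing_peaks(ret, noise_sd, factor):
    """Return (component, run) pairs whose RET exceeds factor * noise_sd."""
    return [(c, run)
            for c, values in sorted(ret.items())
            for run, value in enumerate(values, start=1)
            if value > factor * noise_sd]


flagged = missing_peaks(ret, noise_sd, threshold_factor)
```

With the tabulated values, only component 3 in experiment 1 is flagged, which is the signal that was deliberately removed.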

4.2. Test case of high complexity (Series B)

AFA on the augmented data matrix detected nine primary factors. This number was increased by 2 and 11 key spectra were calculated by the IKSFA algorithm. However, after the key refinement procedure, only nine spectra remained (Fig. 8) at the following retention times (the experiment number is given in parentheses): 408 (2); 68 (1); 82 (3); 400 (1); 400 (3); 292 (3); 150 (3); 423 (2); 700 (2).

The extracted key spectra were processed by the peak matching procedure to produce the component retention times shown in Table 6.

One can see that the minor intensity component 10 in the mixture has not been matched. However, this is an expected result considering the noise level (SNR ∼ 6) and its high degree of overlap in experiments 2 and 3. The results obtained are more than satisfactory. The maximum error in detection of component retention times was only four units, which is an acceptable error for the purpose of further optimization of the chromatographic separation. The algorithm detected overlapping peaks even when there was a great deal of mathematical ambiguity, such as embedded and co-eluting signals. The model with nine components accounted for 99.99% of the cumulative variance in the data.

Table 6
Calculated retention times in nine-component model resulting from peak matching and improved by ALS MCR curve resolution (Series B)

#     Peak matching (a)       ALS MCR (b)
      D1     D2     D3        D1     D2     D3
1     68     83     73        68     81     75
2     78     74     82        76     74     81
3     199    611    150       199    611    150
4     203    200    292       203    200    292
5     210    398    400       210    400    400
6     300    408    166       300    408    166
7     299    700    750       299    700    750
8     400    416    500       400    420    500
9     500    423    301       500    423    300
10    –      –      –         –      –      –

(a) rms = 1.401.
(b) rms = 0.5092.
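The rms values in the footnotes of Table 6 can be recomputed from the true retention times in Table 2. The sketch assumes that rms is the root mean square of the differences over all 27 matched entries (nine components in three runs); with that assumption both footnote values and the maximum error of four units are reproduced.

```python
import math

# True retention times from Table 2 (components 1-9): (D1, D2, D3).
true_rt = {
    1: (68, 80, 76), 2: (75, 75, 81), 3: (200, 610, 150),
    4: (203, 200, 292), 5: (210, 400, 400), 6: (300, 408, 166),
    7: (300, 700, 750), 8: (400, 420, 500), 9: (500, 423, 300),
}
# Calculated retention times from Table 6.
peak_matching = {
    1: (68, 83, 73), 2: (78, 74, 82), 3: (199, 611, 150),
    4: (203, 200, 292), 5: (210, 398, 400), 6: (300, 408, 166),
    7: (299, 700, 750), 8: (400, 416, 500), 9: (500, 423, 301),
}
als_mcr = {
    1: (68, 81, 75), 2: (76, 74, 81), 3: (199, 611, 150),
    4: (203, 200, 292), 5: (210, 400, 400), 6: (300, 408, 166),
    7: (299, 700, 750), 8: (400, 420, 500), 9: (500, 423, 300),
}


def errors(estimate, reference):
    """Signed differences over all components and runs."""
    return [e - r for c in estimate for e, r in zip(estimate[c], reference[c])]


def rms(estimate, reference):
    """Root mean square of the retention time errors."""
    errs = errors(estimate, reference)
    return math.sqrt(sum(x * x for x in errs) / len(errs))


rms_matching = rms(peak_matching, true_rt)
rms_als = rms(als_mcr, true_rt)
max_error = max(abs(x) for x in errors(peak_matching, true_rt))
```

The largest single deviation belongs to component 8 in experiment 2 (416 against a true value of 420), one of the two nearly co-eluting peaks at 420 and 423.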


Fig. 7. (a and b) Target test results for component 3 in Series A: (a) successful (experiment 2) and (b) failed (experiment 1).

ALS MCR curve resolution, started from initial estimates of peak retention times, successfully converged in the nine-component model (Fig. 9). The correlation coefficients between the original and reconstructed spectra were 0.9980 and higher. The improved concentration profiles resulted in noticeable improvement in the predicted retention times (Table 6).

However, ALS MCR failed to resolve the curves in the ten-component model. This is not surprising, considering the strict assumptions about peak and spectrum shapes. Distortions caused by the noise in the minor component signals may be crucial for the convergence of the method.

4.3. Possible ways to improve the method performance

In the present work, the data were not pre-treated in any way. This serves to better demonstrate the capacity of the method for solving the analytical problem, starting with raw data and applying a minimum number of operations. Nevertheless, various pre-processing techniques, such as spectral smoothing or variable selection, could be applied to partially remove noise and exclude non-informative measurements prior to the analysis. This would increase the method sensitivity and improve its accuracy in estimating retention times. In practical data analysis, a situation may occur when the initial set of several analyses does not produce a satisfactory peak matching. The problem may often be resolved by adding new datasets to span more of the space of experimental parameters, thus making the augmented data matrix better conditioned. Some effort should also be made to recognize and eliminate non-analyte factors, such as a drifting baseline, which may be a part of experimental data. Alternative methods of key set selection may be compared in their ability to produce the best solution for the peak matching problem. Thus, there is great potential for improvement of this approach, which we hope to investigate in the future.


Fig. 8. Refined key set spectra in Series B.

Fig. 9. ALS MCR curve resolution in experiment 2 (Series B). (Two panels; x-axis: retention time, 0–800; y-axis: wavelength, 220–360 nm.)


5. Conclusion

A method for mutual peak matching in a series of HPLC–DAD analyses of the same mixture at varying conditions has been developed. The approach does not require any prior knowledge of the mixture composition. Because of the effective validation of the model, the algorithm is tolerant to overestimation of the number of components produced by initial abstract factor analysis. The system was tested on simulated data of varying concentration and noise levels. The algorithm showed good performance in detecting highly overlapped peaks. An approach for detecting missing components in the case of inconsistent data has been proposed. Although the algorithm does not apply full curve resolution to obtain the result, the calculated retention times provide a good initial estimate for subsequent curve resolution if the latter is necessary.

The reliability of the algorithm in detecting the main components in a mixture provides the basis for creating an automated peak matching system. The approach can be used in optimization of HPLC–DAD analysis of complex mixtures of unknown composition.

References

[1] E.R. Malinowski, Factor Analysis in Chemistry, third ed., Wiley/Interscience, New York, 2002.
[2] J.C. Hamilton, P.J. Gemperline, J. Chemom. 4 (1990) 1.
[3] T. Wennberg, J.P. Rauha, H. Vuorela, Chromatographia 53 (Suppl.) (2001) S240.
[4] R.G. Wolcott, J.W. Dolan, L.R. Snyder, S.R. Bakalyar, M.A. Arnold, J.A. Nichols, J. Chromatogr. A 869 (2000) 211.
[5] R. Tauler, A. Izquierdo-Ridorsa, E. Casassas, Chemom. Intell. Lab. Syst. 18 (1993) 293.
[6] R. Tauler, E. Casassas, A. Izquierdo-Ridorsa, Anal. Chim. Acta 248 (1991) 447.
[7] K.J. Schostak, E.R. Malinowski, Chemom. Intell. Lab. Syst. 6 (1989) 21.
[8] F.C. Sánchez, J. Toft, B. Van den Bogaert, D.L. Massart, Anal. Chem. 68 (1996) 79.
[9] K. De Braekelleer, D.L. Massart, Chemom. Intell. Lab. Syst. 39 (1997) 127.
[10] W. Windig, J. Guilment, Anal. Chem. 63 (1991) 1425.
[11] B.W. Grande, R. Manne, Chemom. Intell. Lab. Syst. 50 (2000) 19.
[12] W.J. Krzanowski, Appl. Stat. 36 (1997) 22.
[13] G.P. McCabe, Technometrics 26 (1984) 137.
[14] O. Exner, Collect. Czech. Chem. Commun. 31 (1966) 3222.
[15] E.R. Malinowski, Anal. Chim. Acta 134 (1982) 129.
[16] D.L. Massart, A. Dijkstra, L. Kaufman, Evaluation and Optimization of Laboratory Methods and Analytical Procedures, Elsevier, Amsterdam, 1978 (Chapter 17).
[17] E.R. Malinowski, Anal. Chem. 49 (1977) 612.
[18] E.R. Malinowski, Anal. Chim. Acta 103 (1978) 339.
[19] E.R. Malinowski, J. Chemom. 3 (1988) 49; E.R. Malinowski, J. Chemom. 4 (1990) 102.
[20] B. Vandengniste, W. Derks, G. Kateman, Anal. Chim. Acta 173 (1985) 253.
[21] L. Lang (Ed.), Absorption Spectra in the Ultraviolet and Visible Region, vol. 1, Publishing House of the Hungarian Academy of Sciences, Budapest, 1963.
