


The North American Chapter of the
International Chemometrics Society
Newsletter # 22, NAmICS, January 2002

Barry M. Wise, President
Eigenvector Research, Inc.
830 Wapato Lake Road
Manson, Washington 98831
[email protected]

Patrick Wiegand, President-Elect
Union Carbide Corporation
P.O. Box 8361
South Charleston, West Virginia 25303
[email protected]

Margaret A. Nemeth, Secretary
Monsanto Company
800 N. Lindbergh Boulevard
St. Louis, Missouri 63167
[email protected]

Aaron J. Owens, Editor-in-Chief
DuPont Company
P.O. Box 80249
Wilmington, DE 19880-0249
[email protected]

Neal B. Gallagher, Treasurer
Eigenvector Research, Inc.
830 Wapato Lake Road
Manson, Washington 98831
[email protected]

Ronald E. Shaffer, WebMaster
GE Corporate R&D
Building K1, Room 3B12
P.O. Box 8
Schenectady, New York 12301
[email protected]

David Duewer, Membership Secretary
National Institute of Standards and Technology
Gaithersburg, Maryland 20899
[email protected]

Table of Contents

Guest editorial .................................................. 2
Listserv changes ................................................. 2
Chemometrics in a Network Economy ................................ 3
Genetic Algorithms in Variable Selection ......................... 7
Parsimonious models in chemometrics ............................. 10
Objective Data Alignment followed by Chemometric Analysis
    of Two-Dimensional Data ..................................... 14
Chemometric Movies Coming To a Theatre Near You ................. 17

Thanks to MATFORSK for supporting the duplication and mailing of the newsletters.

MATFORSK has a long tradition in chemometrics, starting in the late 1970s with multivariate calibration in NIR applications; some years later, multivariate analysis of sensory data was taken up. Today, the group of chemometrics and statistics consists of about 10 persons, divided among research scientists and Ph.D. students. More info can be found at

http://www.matforsk.no/web/statchem.nsf


NAmICS Newsletter # 22, January 2002 2

Guest editorial

In a recent review in Trends in Biotechnology it is claimed: “Too often, data mining activities are simply large-scale applications of poorly understood methods to poorly understood data”. Well, what is data mining anyway? A quick search on the Internet revealed that many companies within data mining rely on methods such as cluster analysis, factor analysis, neural networks and decision trees. Interestingly enough, regression is often not mentioned. Is it because this term generates bad vibrations among the “data miners” and their customers?

For two years now, I have worked with chemometrics at MATFORSK, the Norwegian Food Research Institute. The applications span from bioinformatics to analysis of consumer-related data in a variety of different projects, and the work involves planning experiments, sampling from biological systems and analyzing considerable amounts of data, often in blocks, from measurements on raw materials, processes and products. The contributions in this issue cover some aspects of the role of chemometricians as “information extractors” in the era of vast amounts of data.

What is the role of the chemometrician today? And what is a chemometrician? I am inclined to support the view that “chemometrics is what chemometricians do” (see also Jerry Workman’s contribution in this issue). At MATFORSK we often work with sensory analysis in food research, and why should we refrain from analyzing accompanying data from questionnaires related to consumers’ preferences, attitudes, eating habits etc.? As sound users of statistical methods, we should promote their proper use in other fields of science as well. Now is the time to apply the versatile methods that we have found apt for huge data sets in chemistry to other fields, such as genetic data. I think that chemometricians can contribute greatly within data mining, be it bioinformatics or CRM. Many variables, relatively few objects, missing data, outliers and many responses are situations we are used to handling within chemometrics. The use of resampling/sub-sampling methods to build models on representative sets of objects is another issue that fits into chemometric thinking.

All the best for the new chemometric year - and may your data be with you!

Frank [email protected]

Listserv changes

The University of Maryland is terminating the IBM VM/CMS computer known as umdd.umd.edu. This means, among other things, that Listserv lists will be moved to a new computer. The ICS-L Listserv list (subscriptions and list archives) has been moved. All postings should now be sent to:

[email protected] (NOT [email protected])

Listserv administrative mail (subscribe, unsubscribe, change subscription options, etc.) should be sent to:

[email protected] (NOT [email protected])

Note that mail from the list is now sent through LISTSERV.UMD.EDU, instead of UMDD.UMD.EDU. If you filter your messages based on the email address on the TO: line or REPLY-TO: line, you may need to modify your filters for this list.

The new server (operating on a Unix system) runs the current version of the Listserv system, much newer than the version we were able to use on the UMDD system. One of the nice new features for us is a web interface to the Listserv server. The Listserv server "home page" may be accessed at:

http://www.listserv.umd.edu

The web interface for this list may be accessed at: http://www.listserv.umd.edu/archives/ics-l.html


Chemometrics in a Network Economy

Jerry Workman, Jr.

Kimberly-Clark Corp., Analytical Science and Technology Group, West Research and Engineering, 2100 Winchester Road, Neenah, Wisconsin 54956, U.S.A.

This paper is an excerpt of a presentation made at the Fourth International Conference on Environmetrics and Chemometrics, Las Vegas, USA, September 2000. For a more complete discussion please refer to the paper, “The State of Multivariate Thinking for Scientists in Industry: 1980-2000,” to be published in Chemometrics and Intelligent Laboratory Systems, 2001.

1. Introduction

Chemometrics has enjoyed tremendous success in areas related to the calibration of spectrometers and spectroscopy-based measurements. Chemometric-based spectrometers have been widely applied for process monitoring and quality assurance. However, chemometrics has the potential to revolutionize the very intellectual roots of problem solving. Are there barriers to a more rapid proliferation of chemometric-based thinking, particularly in industry? What are the potential effects of chemometrics technology and the New Network Economy (NNE), or simply the network economy, working in concert? Who will be the winners in the race for faster, better, cheaper systems and products? These questions are discussed briefly in terms of the principles of the network economy and the promise of chemometrics for industry. What, then, is the state of chemometrics in modern industry? Several powerful principles are derived from an evaluation of the network economy and chemometrics which could allow chemometrics to proliferate much more rapidly as a key general problem-solving tool.

In chemistry, one’s ideas, however beautiful, logical, elegant, imaginative… are simply without value unless they are actually applicable to the one physical environment we have; in short, they are only good if they work. – R. B. Woodward

Relating the theme of chemometrics to a busy technical community

Twenty years after the term chemometrics was freshly “minted” by Bruce Kowalski and Svante Wold, the chemometrics community still seems to be searching for a universal definition and a clear identity. This paper begins by examining several definitions of chemometrics, the clarity of these definitions, and the message they communicate to the industrial community.

"Chemometrics is what chemometricians do."Anon.

"Chemometrics has been defined as theapplication of mathematical and statisticalmethods to chemical measurements [1]."

"Chemometrics is the chemical discipline that usesmathematical and statistical methods for theobtention in the optimal way of relevantinformation on material systems [2]."

"Chemometrics developments and theaccompanying realization of these developmentsas computer software provide the means to convertraw data into information, information intoknowledge and finally knowledge into intelligence[3]."

"Analytical chemistry has been called a sciencewithout a theory. Some say that the theories andprinciples of analytical chemistry have beenhanded down from other branches of science.Developments in chemometrics that are beginningto effect instrument design and specify the limitsof analysis are shaping the foundation for thisscience....research in chemometrics will contributeto the design of new types of instruments, generateoptimal experiments that yield maximuminformation, and catalog and solve calibration andsignal resolution problems. All this whilequantitatively specifying the limitations of eachinstrument as well as the quality of the data itgenerates [4]."

"Chemometrics, the application of statistical andmathematical methods to chemistry...[5]"


"Chemometrics is the discipline concerned withthe application of statistical and mathematicalmethods, as well as those methods based onmathematical logic, to chemistry [6-9]."

"Chemometrics can generally be described as theapplication of mathematical and statisticalmethods to (1) improve chemical measurementprocesses, and (2) extract more useful informationfrom chemical and physical measurement data[10]."

"Chemometrics is an approach to analytical andmeasurement science based on the idea of indirectobservation. Measurements related to the chemicalcomposition of a substance are taken, and thevalue of a property of interest is inferred fromthem through some mathematical relation [11]."

“Chemometrics (this is an international definition) is the chemical discipline that uses mathematical and statistical methods, (a) to design or select optimal measurement procedures and experiments; and (b) to provide maximum chemical information by analyzing chemical data [12].”

From these definitions we are left with a few nearly irrefutable facts: chemometrics involves chemistry and math… most probably data, and possibly sensors and measurements of processes.

Whatever the clear and present definition of chemometrics is, the industrial understanding of it is that it is complicated, it requires computers, and it could possibly be beneficial. However, we are not exactly certain how or why it would be an advantage to use chemometrics. The fact is that chemometrics allows us to take off-the-shelf data, which many institutions have been generating ad infinitum, and “wring it out” to extract all the information content. This information can further be scrutinized to obtain real knowledge of processes and measurements: knowledge for optimized and new products, processes, and intellectual property estates, at reduced cost.

Chemometrics is so often linked with Process Analytical Chemistry, again defined by Kowalski “as the discovery and development of new and sophisticated analytical methods for use in-line as an integral part of automated chemical processes [13].” Some have said that process analytical chemistry is 90% hardware and 10% chemometrics. To an engineer, that quantitative statement means one may be able to do without it. What we have then is a process – we make measurements – we collect data – we use chemometrics to obtain information – we review the information and attain real knowledge. If chemometrics is difficult to clearly define and communicate, what are its advantages and disadvantages?

2. Advantages of chemometrics

What, then, are the clear advantages of chemometrics?
(1) Chemometrics provides speed in obtaining real-time information from data;
(2) It allows high-quality information to be extracted from less resolved data;
(3) It provides clear information resolution and discrimination power when applied to second-, third-, and possibly higher-order data;
(4) It provides methodology for cloning sensors – for making one sensor take data “precisely“ as another sensor;
(5) It provides diagnostics for the integrity and probability that the information it derives is accurate;
(6) It promises to improve measurements;
(7) It improves knowledge of existing processes;
(8) It has very low capital requirements – it’s cheap.

In summary, it provides the promise of faster, cheaper, better information with known integrity. In addition, it is common sense that math is cheaper than physics: computer programs that can solve problems which traditionally required extensive hardware developments and advances represent a superior approach. We see, then, that intelligence can replace physical and material solutions, much as the digital chip replaces the mechanical clockworks. This is an important theme to develop further.

Recent reviews describing the remarkable proliferation of near-infrared, infrared, and Raman chemometrics-based analyzers for use in process analysis are given in reference [14]. The references found in this review, and many others, cite several thousand cases where chemometrics was applied to the calibration of sensors to analyze complex chemical mixtures for on-line or at-line analysis. Without chemometrics most, if not all, of these applications would not have been possible. In fact, multivariate calibration is commonly accepted for process, research, and quality applications throughout the world of spectroscopy.

3. Disadvantages of chemometrics

The perceived disadvantage of chemometrics is that there is widespread ignorance about what it is and what it can realistically accomplish. The notion persists that ‘many people talk about chemometrics, but there are relatively few actually using it for daily activities and major problem solving in industrial situations.’ This science is considered too complex for the average technician and analyst. The mathematics can be misinterpreted as esoteric and not relevant. And, most important for industry, there is a dismal lack of official practices and methods associated with chemometrics.

Chemometrics requires a change in one’s approach to problem solving from univariate to multivariate thinking, since we live in an essentially multivariate context – from pondering over spreadsheets to actually analyzing the data for its full information content. The old scientific method is passing away; a new scientific method is arising from its ashes: a method requiring not a thought ritual, but many inexpensive measurements, possibly a few simulations, and chemometric analysis. The new method looks at all the data from a multivariate approach, whereas the old method requires the scientist’s assumed powers of observation, from a univariate standpoint, to be the key data processor.

The Old Scientific Method (used for hundreds of years):
(1) Stating the problem
(2) Forming the hypothesis
(3) Observing and experimenting
(4) Interpreting data (traditionally univariate – the pondering stage)
(5) Drawing conclusions

The New Scientific Method (for routine problem solving):
(1) Measure a process (any chemical phenomenon or process)
(2) Analyze the data (multivariate analysis)
(3) Iterate if necessary
(4) Create and test a model
(5) Develop a fundamental multivariate understanding of the process

Industry relies on approved and accepted methods which can easily be defended in a court of law. The implementation of methods must involve a minimum of risk to the user and to the organization sponsoring the user. Historically, most NIR papers use Ordinary Least Squares to compare predicted NIR results against laboratory results. However, some regulatory groups have questioned the use of least squares and associated multivariate calibration for analytical methods involved with product release or compliance.

4. Lessons for chemometrics from the network economy

Rules of the NETWORK ECONOMY [15]

To start, let’s look at two provocative statements relating to the network economy: “Give it away and it becomes priceless… keep it for yourself and it becomes worthless,” and “One fax machine is worthless, two are extremely valuable, many are priceless….” In the network economy, increased complexity is the friend of confusion and chaos. The average person remembers 7 ± 2 objects per human byte, and the modes of communication provide multi-channel competition for any concept. Among other means of communication, there is mobile connection commerce, Internet commerce, direct print, television, radio, telephone and fax commerce, and direct human contact. Concepts must then be clearly formulated and communicated to have any meaningful impact.

To be fast and first requires risk taking; risk, by definition, has a high failure rate. The lesson is to expend energy reducing the cost of risk, not the rate of risk. One must make it more expensive to be slow than to be wrong. So what are the laws of the network economy?

(1) The world is moving toward connectedness
(2) Services become more valuable the more plentiful they are
(3) Networked systems grow exponentially
(4) Success becomes infectious
(5) Value explodes with membership
(6) Cost goes down the better and more valuable the services are
(7) The more of something given free, the more valuable it becomes – wealth feeds off ubiquity
(8) Allegiances move away from organizations and toward networks
(9) Devolution is essential – grassroots and bottom-up (users) are in control
(10) A move from atoms to bits – smaller, smarter electronic systems over mechanical solutions
(11) Sustainable disequilibrium – constant change is the order of things
(12) Find the right task, not how to do the wrong task better.

The assumption-to-knowledge ratio causes more problems than ever before: (1) in the network economy, experience can be your worst enemy – it can lie to you about today’s reality; (2) technologies must maximize learning rate while minimizing cost; (3) technological approaches must reduce the cost of failure, not the rate of failure. In summary, you can fail often if you fail cheaply. Human attention is one of the key resource problems in the network economy, making clear communication one of the key aspects of successful attention getting.

5. Summary: Calibration is not all that chemometrics has to offer

Calibration of infrared and near-infrared spectrometers has been far and away the most noted use of chemometrics in industry. But is this the best and most desirable use of this powerful technology? Do industrial managers get excited about calibrating a sensor using a new optimized technique, or about answering the question as to whether PLS is better than PCR in this or that case? I think not. In fact, the sensors are all supposed to make some measurement using all those optics and electronics in the box – and that is that! Most chemometric books discuss mainly the aspects of calibration, with a few miscellaneous applications; examples [16-17] are adequate to demonstrate this point.

Now, industry may be missing something. With all deference to the authors of these texts, this material is still uncodified (esoteric) and undiffused (not distributed for general consumption), and it looks as though chemometrics mostly offers tools for calibration – some for quantitative analysis, some for qualitative. Analysis techniques using the various forms of multivariate analysis and pattern recognition are certainly available. What else is there? How about codifying the concept of applying the principles of multivariate thinking to every possible problem that suffers from “univariatism”? Chemometric tools would be most powerful if used for discovery of principles where multivariate data are available to study important variables in processes and product performance, where understanding cause and effect leads to product improvements.

In conclusion, eight basic but powerful principles might be derived from the network economy that relate to chemometrics:
(1) The commodity most lacking in today’s network economy is human attention
(2) Value = Utility + Ubiquity
(3) A clear, powerful message with results receives attention and communicates utility
(4) The easiest way to get the world using chemometrics is to solve their most urgent problems using the most parsimonious solutions
(5) Give these solutions away for free over the internet
(6) Once the value is noticed, the techniques will proliferate more rapidly
(7) Then make more advanced tools and instruction available for solving data problems through standard commercial solutions
(8) Make web-based data and enhanced algorithms available for everyone (i.e., network the global PC community into chemometric – and multivariate – thinking).

The chemometrician is thus encouraged to apply multivariate thinking as a new means to routine problem solving for calibration and discovery. By applying multivariate problem-solving approaches to both analysis and discovery, an improvement in both the depth of discovery and the speed of discovery is possible. Solve urgent problems first using the simplest approach. Offer solutions to those who will cooperate in a network of discovery to provide ubiquity and synergy. Develop special tools and approaches to discovery that will lead to faster and more insightful work.


6. References

1. B. R. Kowalski, Chemometrics, Analytical Chemistry, 52 (1980) 112R-122R.
2. I. E. Frank and B. R. Kowalski, Chemometrics, Analytical Chemistry, 54 (1982) 232R-243R.
3. M. F. Delaney, Chemometrics, Analytical Chemistry, 56 (1984) 261R-277R.
4. L. S. Ramos, K. R. Beebe, W. P. Carey, E. Sanchez, B. C. Erickson, B. E. Wilson, L. E. Wangen, B. R. Kowalski, Chemometrics, Analytical Chemistry, 58 (1986) 294R-315R.
5. S. D. Brown, T. Q. Barker, R. J. Larivee, S. L. Monfre, and H. R. Wilk, Chemometrics, Analytical Chemistry, 60 (1988) 252R-273R.
6. S. D. Brown, Chemometrics, Analytical Chemistry, 62 (1990) 84R-101R.
7. S. D. Brown, R. S. Bear, Jr., and T. B. Blank, Chemometrics, Analytical Chemistry, 64 (1992) 22R-49R.
8. S. D. Brown, T. B. Blank, S. T. Sum, and L. G. Weyer, Chemometrics, Analytical Chemistry, 66 (1994) 315R-359R.
9. S. D. Brown, S. T. Sum, and F. Despagne, Chemometrics, Analytical Chemistry, 68 (1996) 21R-61R.
10. J. J. Workman, P. R. Mobley, B. R. Kowalski, R. Bro, Review of Chemometrics Applied to Spectroscopy: 1985-95, Part I, Applied Spectroscopy Reviews, 31 (1996) 73-124.
11. B. Lavine, Chemometrics, Analytical Chemistry, 70 (1998) 209R-228R.
12. B. R. Kowalski, in a formal CPAC presentation, Neenah, Wisconsin, December 12, 1997.
13. B. R. Kowalski (personal communication), Dec. 1997.
14. J. J. Workman, Interpretive Spectroscopy for Near-Infrared, Applied Spectroscopy Reviews, 31 (1996) 251-320.
15. R. G. McGrath, Lecture notes and discussion from LECO program, Columbia University School of Business, July-Aug. 2000.
16. E. R. Malinowski, D. G. Howery, Factor Analysis in Chemistry, John Wiley & Sons, Inc., New York, 1980.
17. K. R. Beebe, R. J. Pell, M. B. Seasholtz, Chemometrics: A Practical Guide, John Wiley & Sons, Inc., New York, 1998.

Genetic Algorithms in Feature Selection

Riccardo Leardi

Department of Pharmaceutical and Food Chemistry and Technology, University of Genova, Italy

1. Introduction

Since their presentation by Holland in 1975, Genetic Algorithms (GAs) have attracted a lot of curiosity. The goal (trying to simulate the evolutionary process of a living species) and the jargon (using typical biological terms such as “gene”, “chromosome”, “mutation” and “cross-over” in the description of an algorithm) of GAs have helped to create something like an aura of mystery around them.

At that time, the main limitation to the real development of GAs in terms of applicability was the fact that the huge amounts of computation they require could not be handled satisfactorily by the computers then available. For almost 20 years this was the main problem for those who would have liked to apply them to their problems but did not have access to a suitable computer: for “common size” problems a mainframe would have been required, while for complex problems the computation time would have been too long even with the most powerful computers.

Since the beginning of the 1990s this major problem has been progressively removed, and nowadays every personal computer can be used to apply GAs to easy/moderate-scale problems, while mainframes allow one to tackle very complex problems such as those typical of molecular modelling. This is the reason why, after a first period in which the interest of the scientific community was focused mainly on the theory itself, the number of papers reporting applications of GAs to real problems, and the number of scientists and of disciplines using them, have been growing exponentially.

In 1993, the journal Science [1] published a paper that gave a general presentation of genetic algorithms, some mathematical analysis of how GAs work and how best to use them, and some applications in modelling several natural evolutionary systems, including immune systems. In 1995 an article in Nature [2] described a problem of molecular dynamics that had been successfully solved by a GA where conventional techniques had failed.

Several tutorials about GAs have been published in journals devoted to different research fields. As examples we cite those by Lucasius and Kateman [3-6], Hibbert [7], Shaffer and Small [8], Wehrens and Buydens [9] and Luke [10]. The set-up of the structure of a GA is a very critical point, and a guide leading to a good architecture is highly beneficial. Wehrens et al. [11-12] proposed a set of quality criteria to evaluate the performance of a GA, considering not only the best solutions suggested by the algorithm, but also the repeatability of the optimization and the coverage of the search space.

GAs have found widespread application in several fields involving regression problems. One of the most important steps in a calibration is the selection of the relevant variables. The size of the search domain (with v variables, 2^v - 1 combinations are possible) and the presence of many local optima make GAs one of the suggested methods. It is interesting to notice that several authors have published papers about feature selection by GAs, each of them using a different GA structure, sometimes rather far from the “standard” algorithm. This demonstrates the need to modify the algorithm according to the peculiarities of the problem to be solved. In the case of feature selection, a chromosome is made up of a very high number of genes (as many as the variables), each of them being just 1 bit long (0 = variable absent, 1 = variable present). Leardi et al. [13] used a simulated data set to show that a GA can always find the global maximum of a simple problem, in a time much shorter than that required by a full search. Lucasius et al. [14] showed that a GA generally performs better than simulated annealing and stepwise regression; on the other hand, Hörchner et al. [15] demonstrated that simulated annealing can give the same results. Wise et al. also developed their own GA for feature selection [16-17].
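The bit-string encoding described above can be sketched in a few lines of Python. This is a minimal generic illustration, not any of the published algorithms: the fitness function is a toy stand-in (in a real calibration it would be a cross-validated regression score), and all names, variable indices and parameter values are invented for the example.

```python
import random

INFORMATIVE = {2, 5, 7}  # hypothetical "truly relevant" variables

def fitness(chrom):
    # Toy stand-in for a cross-validated regression score: reward the
    # informative variables, penalize every extra one selected.
    selected = {i for i, bit in enumerate(chrom) if bit}
    return len(selected & INFORMATIVE) - 0.3 * len(selected - INFORMATIVE)

def crossover(a, b):
    # Single-point crossover between two bit-string chromosomes.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    # Flip each bit with a small probability.
    return [bit ^ (random.random() < rate) for bit in chrom]

def ga_select(n_vars=10, pop_size=30, generations=40, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(n_vars)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:pop_size // 2]          # keep the better half
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children               # elitism preserves the best
    return max(pop, key=fitness)

best = ga_select()
selected = [i for i, bit in enumerate(best) if bit]
```

Replacing `fitness` with a cross-validated prediction error is what turns this sketch into actual variable selection; the GA machinery itself (selection, crossover, mutation, elitism) stays the same.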

Wallet et al. [18] solved the problem of selecting a minimal model which correctly predicts the response by applying a GA using a two-criteria population management scheme.

Broadhurst et al. [19] applied GAs to pyrolysis mass spectrometry, with the goal of determining the optimal subset of variables giving the best possible prediction, or of determining the optimal subset of variables producing a model with a predictive ability higher than or equal to a given value (both in MLR and in PLS models).

The method proposed by Bangalore et al. [20] leads to the selection of wavelengths and to the definition of the PLS model size. To do that, a “model size” gene, taking on the integer value corresponding to the number of latent variables to be used in building the calibration model, is added to the genes coding the presence or absence of each variable in the model.
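That mixed encoding can be sketched as a chromosome carrying the wavelength bits plus one trailing integer gene. The sizes below are invented for illustration; they are not the parameters used by Bangalore et al.

```python
import random

N_WAVELENGTHS = 50   # hypothetical number of spectral channels
MAX_LV = 15          # hypothetical upper bound on PLS latent variables

def random_chromosome():
    # Bit genes: 1 = wavelength used in the calibration, 0 = not used.
    bits = [random.randint(0, 1) for _ in range(N_WAVELENGTHS)]
    # Extra integer "model size" gene: number of PLS latent variables.
    n_lv = random.randint(1, MAX_LV)
    return bits + [n_lv]

def decode(chrom):
    # Split a chromosome back into the wavelength subset and model size.
    *bits, n_lv = chrom
    wavelengths = [i for i, b in enumerate(bits) if b]
    return wavelengths, n_lv

random.seed(1)
wavelengths, n_lv = decode(random_chromosome())
```

Note that crossover and mutation must then treat the last gene specially (e.g., perturb it by ±1 within bounds) rather than bit-flip it.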

Jouan-Rimbaud et al. [21] successfully applied a GA to a problem of wavelength selection for MLR calibration, while Arcos et al. [22] obtained a set of wavelengths able to support a PLS calibration of mixtures of indomethacin and acemethacin, in spite of the fact that the two compounds have almost identical spectra.

A more complex optimization was performed by Shaffer and Small 23. They apply a GA to the NIR analysis of glucose in biological matrices, optimizing five important variables at the same time: the position and width of the bandpass filter, the starting and ending points of the spectral range submitted to the PLS regression, and the number of latent variables employed in the calibration model.

In spectroscopic infrared imaging applied to discriminate between different materials, the selection of a limited number of spectroscopic wavelengths guaranteeing optimal discrimination makes the acquisition and processing time much faster. This goal has been achieved by using a GA by van den Broek et al. 24. Depczynski et al. 25 devised a method for multicomponent analysis by near infrared spectrometry by combining wavelet coefficient regression with a GA.

One of the main problems when applying feature selection to spectroscopic data is that the solution is often given by wavelengths scattered throughout the spectrum, instead of the spectral regions a spectroscopist would expect. This problem has been tackled by Leardi 26, who modified his previous algorithm in order to force it


NAmICS Newsletter # 22, January 2002 9

as much as possible toward the selection of contiguous wavelengths, in such a way that the final model can be more easily accepted by spectroscopists. Although spectral data sets are the most common field of application of GAs for feature selection, owing to the very large search domain, good results can also be obtained in the case of non-spectral variables, as reported by Aishima et al. 27.

Overfitting is the greatest risk when applying GAs to feature selection. This aspect has been taken into account especially by Leardi 28-29, Leardi and Lupiáñez González 30 and Jouan-Rimbaud et al. 31.

The ever-increasing number of papers in which GAs are applied to very different fields of chemistry and chemometrics shows the effectiveness and validity of this technique. The advantage of GAs over the “classical” techniques becomes greater as the complexity of the problem grows, especially owing to the good balance between exploration and exploitation. The results obtained by a GA can be greatly improved after “hybridization” with a standard technique, typically one having very poor exploration ability but very high exploitation potential. This allows one to define with great precision the global optimum, whose location has been effectively found by the GA. It should be highlighted, however, that it is not possible to define a “best” GA architecture to be used in all applications, since the optimal structure is extremely problem-dependent.

3. References

1. Forrest S. ‘Genetic Algorithms: principles of natural selection applied to computation’. Science, 1993, 261: 872-878.

2. Maddox J. ‘Genetics helping molecular dynamics’. Nature, 1995, 376: 209.

3. Lucasius CB, Kateman G. ‘Understanding and using genetic algorithms. Part 1. Concepts, properties and context’. Chemom. Intell. Lab. Syst., 1993, 19: 1-33.

4. Lucasius CB, Kateman G. ‘Understanding and using genetic algorithms. Part 2. Representation, configuration and hybridization’. Chemom. Intell. Lab. Syst., 1994, 25: 99-145.

5. Lucasius CB, Kateman G. ‘Gates towards evolutionary large-scale optimization: a software-oriented approach to Genetic Algorithms – I. General Perspective’. Computers Chem., 1994, 18, 2: 127-136.

6. Lucasius CB, Kateman G. ‘Gates towards evolutionary large-scale optimization: a software-oriented approach to Genetic Algorithms – II. Toolbox description’. Computers Chem., 1994, 18, 2: 137-156.

7. Hibbert DB. ‘Genetic algorithms in chemistry’. Chemom. Intell. Lab. Syst., 1993, 19: 277-293.

8. Shaffer RE, Small GW. ‘Learning optimization from nature: Simulated Annealing and Genetic Algorithms’. Anal. Chem., 1997, 69: 236A-242A.

9. Wehrens R, Buydens LMC. ‘Evolutionary optimization: a tutorial’. TrAC, 1998, 17, 4: 193-203.

10. Luke BT. ‘An overview of genetic methods’. Genetic Algorithms in Molecular Modeling, J. Devillers, Ed., Academic Press, 1996, 35-66.

11. Wehrens R, Pretsch E, Buydens LMC. ‘Quality Criteria of Genetic Algorithms for structure optimization’. J. Chem. Inf. Comput. Sci., 1998, 38: 151-157.

12. Wehrens R, Pretsch E, Buydens LMC. ‘The quality of optimization by genetic algorithm’. Anal. Chim. Acta, 1999, 388: 265-271.

13. Leardi R, Boggia R, Terrile M. ‘Genetic Algorithms as a strategy for feature selection’. J. Chemom., 1992, 6: 267-281.

14. Lucasius CB, Beckers MLM, Kateman G. ‘Genetic algorithms in wavelength selection: a comparative study’. Anal. Chim. Acta, 1994, 286: 135-153.

15. Hörchner U, Kalivas JH. ‘Further investigation on a comparative study of simulated annealing and genetic algorithm for wavelength selection’. Anal. Chim. Acta, 1995, 311: 1-13.

16. Wise BM, Gallagher NB, Eschbach PA, Sharpe SW, Griffin JW. ‘Optimization of prediction error using Genetic Algorithms and Continuum Regression: determination of the reactivity of automobile emissions from FTIR spectra’. Fourth Scandinavian Symposium on Chemometrics (SSC4), Lund, Sweden, June 1995.

17. Wise BM, Gallagher NB, Eschbach PA. ‘Application of a Genetic Algorithm to variable selection for PLS models’. EUCHEM Conference, Gothenburg, Sweden, June 3-6, 1996.

18. Wallet BC, Marchette DJ, Solka JL, Wegman EJ. ‘A Genetic Algorithm for best subset selection in linear regression’. Proceedings of the 28th Symposium on the Interface, 1996.



19. Broadhurst D, Goodacre R, Jones A, Rowland JJ, Kell DB. ‘Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry’. Anal. Chim. Acta, 1997, 348: 71-86.

20. Bangalore AS, Shaffer RE, Small GW. ‘Genetic Algorithm-based method for selecting wavelengths and model size for use with Partial Least-Squares Regression: application to Near-Infrared spectroscopy’. Anal. Chem., 1996, 68: 4200-4212.

21. Jouan-Rimbaud D, Massart DL, Leardi R, De Noord O. ‘Genetic Algorithms as a tool for wavelength selection in multivariate calibration’. Anal. Chem., 1995, 67: 4295-4301.

22. Arcos MJ, Ortiz MC, Villahoz B, Sarabia LA. ‘Genetic-algorithm-based wavelength selection in multicomponent spectrometric determinations by PLS: application on indomethacin and acemethacin mixture’. Anal. Chim. Acta, 1997, 339: 63-77.

23. Shaffer RE, Small GW. ‘Genetic Algorithm-based protocol for coupling digital filtering and Partial Least-Squares Regression: application to the Near-Infrared analysis of glucose in biological matrices’. Anal. Chem., 1996, 68: 2663-2675.

24. van den Broek WHAM, Wienke D, Melssen WJ, Buydens LMC. ‘Optimal wavelength range selection by a Genetic Algorithm for discrimination purposes in spectroscopic infrared imaging’. Applied Spectroscopy, 1997, 51, 8: 1210-1217.

25. Depczynski U, Jetter K, Molt K, Niemöller A. ‘Quantitative analysis of near infrared spectra by wavelet coefficient regression using a genetic algorithm’. Chemom. Intell. Lab. Syst., 1999, 47: 179-187.

26. Leardi R. ‘Application of genetic algorithm-PLS for feature selection in spectral data sets’. J. Chemom., 2000, 14: 643-655.

27. Aishima T, Togari N, Leardi R. ‘Selection of aroma components to predict sensory quality of Kenyan black teas using genetic algorithms for multiple regression models’. Food Sci. Technol. Int., 1996, 2: 124-126.

28. Leardi R. ‘Application of a Genetic Algorithm to feature selection under full validation conditions and to outlier detection’. J. Chemom., 1994, 8: 65-79.

29. Leardi R. ‘Genetic Algorithms in feature selection’, in ‘Genetic Algorithms in Molecular Modeling’. J. Devillers, Ed., Academic Press, 1996; 67-86.

30. Leardi R, Lupiáñez González A. ‘Genetic Algorithms applied to feature selection in PLS regression: how and when to use them’. Chemom. Intell. Lab. Syst., 1998, 41: 195-207.

31. Jouan-Rimbaud D, Massart DL, de Noord OE. ‘Random correlation in variable selection for multivariate calibration with a genetic algorithm’. Chemom. Intell. Lab. Syst., 1996, 35: 213-220.

Parsimonious models in chemometrics

Frank Westad

MATFORSK, Norwegian Food Research Institute

www.matforsk.no

1. Introduction

The need for effective and safe methods for data analysis is increasingly important. Instrumentation and data-collection systems generate huge amounts of data. Examples are 2D or 3D analytical instruments, high-throughput screening, microarrays, QSAR and imaging.

The parsimony principle [1] goes back several centuries. In a scientific context, a model with few parameters is to be preferred to other possible models if their interpretations and predictive abilities are similar. Various approaches exist for finding parsimonious models. A typical situation is some kind of best-subset approach, which is discussed below in more detail. Another approach is to compress the original data and estimate models on coefficients from the compressed representation of the data, with methods such as splines and wavelets. This is useful both for storage and analysis of the data. Yet another approach to parsimony is to work with multiway, multiblock or multi-domain methods [2][3]. In this text we will not pursue this further, but focus on the task of finding relevant variables, regardless of whether they are original or derived variables. It is also relevant to mention the newly developed o-PLSR [4] in this context. Another important issue that will not be dealt



with in this text is preprocessing of data, e.g. some kind of signal correction.

2. Some views on multivariate modelling

A vital aspect of all kinds of models is estimating uncertainties for the model parameters. Since the truth is seldom known in empirical modelling, the uncertainties have to be estimated from a sample of the system or process under observation. Leaving aside for now the discussion about what constitutes a training/calibration set and an (independent) test set in terms of validation, let us assume that the objective is to make a model from the available data. Then, later, we might want to see if the observed system/process has changed over time or is behaving differently at another location and with other equipment. I think it is fair to say that model validation in one way or another is one of the important contributions from chemometricians to multivariate modelling. This should not be interpreted as a claim that chemometrics “invented” model validation such as cross validation (CV), but it is an inherent part of most chemometric software packages.

Uncertainties may be estimated from resampling methods such as jackknifing (JK) and bootstrapping (BS). Jackknifing is closely connected to CV; the difference lies in whether the model with all objects or the mean of all individual models from the resampling should be regarded as the “reference”. Since we are applying resampling for specific practical purposes such as variable selection, we feel it is more natural to look upon the model on all objects as the “reference”. According to studies by Efron [5], the difference between these two is of order 1/N^2, N being the number of objects.

Bootstrapping can be performed in various ways. The original approach was to resample from a population sample numerous times with replacement. From theory, this will include about 63% of the objects in one bootstrap sample. Another approach is to make a preliminary model, e.g. y = Xb + e, and then draw individual pairs (x, e) randomly and generate new objects. Both these procedures are typically repeated 500 times. The results below are based on BS on original data.
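The 63% figure follows from sampling N objects with replacement: the chance that a given object is left out of one bootstrap sample is (1 - 1/N)^N, which approaches 1/e, so about 1 - 1/e ≈ 0.632 of the objects appear. A small simulation sketch (the function name is ours, not from the text):

```python
import numpy as np

def bootstrap_unique_fraction(n_objects, n_boot=500, rng=None):
    """Average fraction of distinct objects appearing in a bootstrap sample
    of size n_objects drawn with replacement."""
    rng = np.random.default_rng(rng)
    fractions = []
    for _ in range(n_boot):
        draw = rng.integers(0, n_objects, size=n_objects)  # with replacement
        fractions.append(np.unique(draw).size / n_objects)
    return float(np.mean(fractions))
```

For example, with 200 objects and 500 bootstrap replicates the returned fraction settles very close to the theoretical 1 - (1 - 1/200)^200 ≈ 0.633.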

Cross validation is intuitive in terms of the stability of a model, formulated as the question “How does the model change compared to the full model when some of the objects are taken out?” If the model changes considerably, this deserves more detailed

investigation as to why this occurred. This question is only common sense, but to quote Efron [5]: “Good simple ideas, of which the jackknife (JK) is a prime example, are our most precious intellectual commodity, so there is no need to apologize for the easy mathematical level”.

How good are our estimates? Results from situations where the “truth” in terms of ANOVA is available show that both BS and JK estimates are close to ANOVA. Since JK has an aspect of validation, these estimates tend to be more conservative. An example is shown in Table 1 for the data set “Helicopter” taken from the literature [6]. The data are from a central composite design with four design variables and a blocking variable.

Table 1. Results for the Helicopter data

Variable      ANOVA   JK PC1   BS PC1
Block         0.613   0.671    0.583
Wing area     0.960   0.970    0.962
Wing ratio    0.005   0.024    0.004
Body width    0.880   0.913    0.888
Body length   0.001   0.008    0.001

Variable selection in PCA

Variable selection is often thought of in a regression context, where one or more response variables (Y) are to be modeled by a set of predictor variables (X). However, finding the relevant variables is also vital in exploratory data analysis tools such as PCA, and for cluster analysis for that matter. In a recent review article about biotechnology it is claimed that “finding the few genes that are most responsible for the observed patterns in the data is a well-studied, but still unsolved, statistical problem”.

We believe that jackknifing in PCA has the potential to solve some aspects of this “problem”. It has now been applied to many different types of data. Below is an example from microbiology, where 26 bacteria have been tested to determine if they have the ability to ferment 28 different sugars, represented as binary data. PCA was employed, and three PCs were found to be relevant. Variables significant on either or both of PC 1 and PC 2 are marked with stars in Figure 1.



Figure 1. Loadings with significant variables marked.

Although some of the variables have high negative loadings on PC2, they were found not to be significant. The reason is that all bacteria except one have the ability to ferment these sugars. The JK estimate thus warns us that the correlation between these variables and this PC is due to one object only; this is a way to assess the stability of the model. For data arising from experimental designs, the results so far show that significant variables found from JK estimates correspond well with Analysis of Effects from ANOVA.

Variable selection in regression

The majority of applications of variable selection concern regression. Some of the most common methods for variable selection are:

- Stepwise approaches
- Genetic algorithms
- Methods based on uncertainty estimates (e.g. JK or BS)

There is a distinction between finding a model with good fit for a few variables, e.g. 3-7, and a model with all non-relevant variables removed. The JK and BS based approaches allow for removing all the non-relevant variables, but you are not restricted from taking out predictor variables because some of them are highly correlated. This of course is tied to the fact that most models have neither their best predictive nor their best interpretation abilities at full numerical rank. While we often see that JK uncertainty estimates from PLSR are higher than OLS estimates for full rank models, this is not contradictory to the general principle that the coefficients themselves are smaller for PLSR

(“PLSR shrinks”). When the validation indicates that the optimal number of components is << numerical rank, then we deliberately want the estimates to be conservative for the OLS solution when employing e.g. cross validation.

Genetic algorithms have been shown to yield good results in many different applications (see R. Leardi’s contribution), but they require tuning of a number of parameters and they are still quite computer-intensive for data with thousands of variables. An initial step where the non-relevant variables are removed by t-tests based on JK estimates of uncertainties will reduce the complexity.
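A rough sketch of such a prefiltering step follows, using OLS coefficients for simplicity (the setting discussed here would use PLSR coefficients); the leave-one-out scheme, the threshold of 2 and all names are our own illustrative choices, not the article's implementation. Note that, as argued above, the jackknife variance is referenced to the full-data model rather than to the mean of the leave-one-out models.

```python
import numpy as np

def jackknife_t_filter(X, y, t_crit=2.0):
    """Keep variables whose |b / s_jk| exceeds t_crit, where s_jk is the
    jackknife (leave-one-out) standard error of each OLS coefficient."""
    n, p = X.shape
    b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    b_loo = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        b_loo[i], *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    # Jackknife standard error, referenced to the full-data coefficients
    s_jk = np.sqrt((n - 1) / n * np.sum((b_loo - b_full) ** 2, axis=0))
    t = np.abs(b_full) / np.where(s_jk > 0, s_jk, np.inf)
    return t > t_crit
```

The boolean mask returned by the filter could then define the reduced variable set handed to a GA or a stepwise search.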

The stepwise methods have a general problem in that relevant variables may never be included because similar information is described by other variables, especially when the selection of variables is based on full rank models without proper validation. It is worth mentioning that this situation arises even when there are no numerical problems of matrix inversion due to collinearity. This is not to say that stepwise approaches do not give good models or predictions for a small number of variables, typically 3-8, but one should monitor the performance with respect to the model rank.

The term “rank” with respect to a multivariate model deserves some comments, as “rank” has various facets:

1. The numerical rank. This rank is the one based on numerical computations, e.g. the number of components that can be computed without singularity problems.

2. The statistical rank. The important issue here is to find the optimal rank from a statistical criterion, preferably based on some proper validation method.

3. The application-specific rank. This judgement is typically a combination of background knowledge, prediction ability, model complexity, and interpretation aspects. In most situations, this rank is lower than the statistical rank, i.e. the data analyst tends to be more conservative.

Comparison of thousands, not to say millions, of models to find THE best one may sound intriguing, but some questions arise in this context:

1. What is the stopping criterion for including/excluding variables and/or components?



2. How do we judge if one model is better than another?

Variable selection by employing individual t-tests based on uncertainties estimated by jackknifing/CV and bootstrapping is not error-free. There is a danger of keeping variables although they should have been removed, or of overlooking variables. Repeated random cross validation, e.g. 100 times, might be useful to see how many times a variable is included in the model. Another alternative is cross model validation [7][8]. Also, repeating the procedure iteratively helps in removing the non-relevant variables, as shown in [9]. This will in most cases also make it easier to assess the correct rank of the model.
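The repeated-random-resampling idea can be sketched as follows. The selection rule passed in is deliberately generic; the simple correlation-threshold rule used in the example is just a stand-in for whatever rule (e.g. a JK-based t-test) one actually applies, and all names and parameter values are ours.

```python
import numpy as np

def selection_frequency(X, y, select, n_rep=100, frac=0.8, rng=None):
    """Run a variable-selection rule on n_rep random subsets of the objects
    and return how often (as a fraction) each variable is retained."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_rep):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += select(X[idx], y[idx])  # select returns a boolean mask
    return counts / n_rep
```

Variables retained in nearly every repetition are good candidates to keep; variables retained only occasionally are the ones most at risk of being chance inclusions.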

Should JK or BS be employed? As the results in Table 2 indicate, JK seems to be the more conservative method, and it has an inherent model validation. A study on the beer data reported in [9] showed that the BS estimates were smaller than the JK estimates, and thereby a higher number of significant variables was found in the noise part of the spectrum. Table 2 shows the results when predicting 20 test samples. The results from calibration (not shown) were almost identical, but the JK approach gives significantly better prediction of the test samples.

Table 2: Results from modelling of beer data

Iter   #var/PC JK   RMSEP JK   #var/PC BS   RMSEP BS
0      0            2.44       0            2.44
1      926/6        0.74       926/5        0.72
2      162/6        0.20       375/6        0.58
3      90/9         0.20       300/8        0.44
4      16/7         0.20       237/8        0.46

It is a general experience that when a best-subset model of e.g. five variables is the objective, many models will have more or less the same predictive ability.

In our pursuit of finding the best prediction (or classification) model based on variable selection, the ability to interpret the underlying structure might be distorted in the final model. It might seem illogical to discard variables we have carefully been observing/collecting. One alternative is to have two models:

1. One model with all variables, where new samples are projected in the score plot, and leverage and residuals can be interpreted, etc.

2. One model for optimal classification/prediction ability.

Instead of removing the non-significant variables, it is possible to down-weight them with a factor of e.g. 10^-4 so that their (numerical) influence in the model becomes small. Then, visualize them in the correlation loadings plot [10], which shows the correlation between the original variables and the components. This measure is invariant to how the variables were weighted during modelling. We end up with a model that has good predictive ability, and at the same time all the variables can be interpreted.
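The correlation-loadings idea can be sketched as follows: after fitting a model on possibly down-weighted variables, correlate each original, unweighted variable with the component score vectors. The weighting invariance follows because correlation is scale-invariant. Function and variable names are our own, not from [10].

```python
import numpy as np

def correlation_loadings(X, scores):
    """Correlation between each original (unweighted) variable in X and
    each component score vector. Scale invariance of the correlation makes
    the result independent of how the variables were weighted in the fit."""
    Xc = X - X.mean(axis=0)
    Tc = scores - scores.mean(axis=0)
    return (Xc.T @ Tc) / np.outer(np.linalg.norm(Xc, axis=0),
                                  np.linalg.norm(Tc, axis=0))
```

The returned values lie in [-1, 1]; plotting the first two columns against circles of radius 1 and, say, 0.7 gives the usual correlation loadings plot, in which even heavily down-weighted variables remain interpretable.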

3. Conclusions

The principle of parsimony has proven to be a fruitful approach in terms of variable selection to remove non-relevant variables. Jackknifing is a conservative and less computer-intensive alternative to bootstrapping for estimating parameter uncertainties in multivariate models. The list of variables sorted by significance is a good starting point for Genetic Algorithms and stepwise methods. Whenever variable selection in some form is employed, model validation is essential.

4. References

1. Seasholtz, M-B. and Kowalski, B. Anal. Chim. Acta, 277, 165-177 (1993).

2. Reberg, J. O. and Martens, H. Patent W09629679 (1996).

3. Westad, F. and Martens, H. Chemometrics and Intelligent Laboratory Systems, 45, 361-370 (1997).

4. Trygg, J. and Wold, S. Journal of Chemometrics (in press).

5. Efron, B. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, ISBN 0-89871-179-7 (1982).

6. Box, G.E.P. and Liu, P.Y.T. Journal of Quality Technology, 31, 1 (1999).

7. Anderssen, E. and Martens, H. In prep. for Journal of Chemometrics.

8. Nørgaard, L. and Bro, R. Proceedings, Les Methodes PLS, Symposium International PLS'99, 187-202 (1999).

9. Westad, F. and Martens, H. Journal of Near Infrared Spectroscopy, 8, 117-124 (2000).

10. Martens, H. and Martens, M. Food Quality and Preference, 11, 5-16 (2000).



Objective Data Alignment followed by Chemometric Analysis of Two-Dimensional Separations, Initially with Retention Time Shifting on Both Dimensions

Carlos G. Fraga, Bryan J. Prazen and Robert E. Synovec

Center for Process Analytical Chemistry, Department of Chemistry, Box 351700, University of Washington, Seattle, WA 98195, USA

1. Abstract

A retention time alignment preprocessing algorithm is presented that objectively corrects for run-to-run retention time variation on both separation dimensions of comprehensive two-dimensional (2-D) separations, prior to application of chemometric data analysis algorithms. The alignment algorithm is easy to apply and robust. Thus, data from 2-D separation techniques such as comprehensive 2-D gas chromatography (GC x GC), liquid chromatography/liquid chromatography (LC x LC) and liquid chromatography/capillary electrophoresis (LC x CE) can be readily analyzed by various chemometric methods to increase chemical analysis capabilities. Complex samples can be more effectively studied.

2. Introduction

Comprehensive 2-D separations are ideally suited for the analysis of complex samples, and are emerging as powerful tools for chemical analysis [1-11]. Even with a large peak capacity, the probability of peak overlap in 2-D separations can be quite severe, especially for highly complex samples. Peak overlap becomes even more likely if one desires to speed up the analysis by designing a given separation method to provide a reduction in the run time. Thus, traditional methods of chromatographic and electrophoretic data analysis, such as peak height and peak area measurements, become less effective as the analyst moves into the realm of high-speed chemical analysis. The limitations brought upon by the likelihood of peak overlap can be overcome, to a large extent, by the

implementation of appropriate chemometric methods. Essentially, chemometric methods will effectively enhance the resolving power of 2-D separation methods.

The Generalized Rank Annihilation Method (GRAM) is a chemometric method that resolves and quantifies overlapped peaks in common between sample and standard runs. Several papers cover the development of the GRAM algorithm in detail [12-14]. The specific requirements for GRAM analysis of comprehensive 2-D separations have been reported [1]. In order to use GRAM, the data comprising the peaks due to analytes and interferences present in the sample and standard must be bilinear or approximately bilinear. A bilinear peak is a 2-D peak that can be mathematically represented by the outer product of its elution profiles along each column separation axis, producing a data matrix. Generally, a bilinear peak in a contour plot appears as an elliptical (or circular) zone that has its major and minor axes aligned with the two separation axes. Contour plots from published GC x GC, LC x LC, and LC x CE papers depict 2-D peaks that appear bilinear [1-4, 8-11]. Using sample and standard matrices of data for sections of the 2-D separations that contain the analyte(s) of interest, GRAM calculates the pure elution profiles of overlapped 2-D peaks. Each 2-D peak can then be individually reconstructed using its respective two separation elution profiles. In addition, GRAM provides the concentrations of analytes in the sample relative to those in the standard. The standard can be simply prepared from the original sample by the standard addition method [2].
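The bilinearity requirement can be made concrete: a bilinear 2-D peak is the outer product of its two one-dimensional elution profiles, so its data matrix has rank 1 and all but the first singular value are negligible. A small sketch with Gaussian profiles (the shapes and parameters are purely illustrative, not taken from the article):

```python
import numpy as np

def gaussian_profile(t, center, width):
    """Idealized 1-D elution profile."""
    return np.exp(-0.5 * ((t - center) / width) ** 2)

def bilinear_peak(profile_1, profile_2):
    """2-D peak as the outer product of its column-1 and column-2 profiles."""
    return np.outer(profile_1, profile_2)

# Construct a peak and confirm it is (numerically) rank 1
peak = bilinear_peak(gaussian_profile(np.arange(50.0), 25, 4),
                     gaussian_profile(np.arange(30.0), 15, 3))
sv = np.linalg.svd(peak, compute_uv=False)
# sv[0] carries essentially all the signal; sv[1]/sv[0] is near machine epsilon
```

Overlapped peaks simply add more rank-1 terms to the matrix, which is the structure GRAM exploits to recover the pure profiles.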

Not all comprehensive 2-D separation techniques provide bilinear data. In particular, temperature programming simultaneously on both columns in GC x GC will not produce bilinear data. For this reason we have been developing a high-speed valve-based comprehensive GC x GC that has independently controlled temperature programming of both columns, so the first column can be temperature programmed while the second column is held isothermal at a desired temperature.

In order to use bilinear chemometric methods such as GRAM, the unwanted shifting of a 2-D peak's retention time(s) or migration time(s) between sample and standard runs must be objectively



corrected. Indeed, run-to-run retention time shifting has been a severe impediment to the use of chemometric methods on data collected from separation techniques. We have addressed the retention time alignment problem, and have previously developed an objective rank-based alignment method to correct retention time shifts along one time axis [15]. Rank alignment has been critically developed and successfully applied to GC x GC separations that need peak alignment on the first column time axis only [2-4]. Recently, we have modified our alignment method to determine and correct the run-to-run peak shifting along both separation time axes. Here we report our initial findings. Correction of retention time shifting on both separation dimensions considerably broadens the scope for the use of chemometric methods, since most 2-D separation techniques produce run-to-run shifting on both dimensions that is significant and detrimental to chemometric applications if not corrected.

3. Experimental

The original alignment method applies an iterative routine to determine the peak shift (along one time axis) between the 2-D peaks in common between a calibration standard data matrix, N, and an "unknown" sample data matrix, M. In most cases the data matrices M and N that are being analyzed are small regions of separation data matrices which contain overlapping peaks. Regions of the 2-D separation data matrices in which the peaks are resolved are analyzed using standard peak height or volume methods.

The iterative alignment approach is based on the fact that the augmented data matrix [M|N] has a minimum rank when the 2-D peaks are aligned along the time axis in question. Hence, the alignment method shifts the 2-D peaks in M relative to N along the first column time axis until the resulting matrix [M|N] achieves a minimum rank. The alignment method uses the secondary eigenvalues from the singular value decomposition of [M|N] to find the peak shift correction associated with a minimum rank. Once the correct peak shift between M and N is determined, M can be aligned to N. This approach works fine for 2-D separations that have highly reproducible retention times along the second dimension time axis, such as is the case for most GC x GC data. Building upon our previous work, for 2-D

separations that require shift corrections along both time axes, we report an alignment method that can be used to determine and correct the peak shifts along both time axes. By applying the original alignment method on the augmented matrix [M|N] and then subsequently on the vertically augmented matrix [M; N] (M stacked on top of N), peak shifts occurring along both time axes can be independently and successively corrected. In this work we have taken 2-D data collected from a GC x GC and have simulated 2-D data from other separation techniques by randomly adding retention time shifting that is consistent with reported precision from recent reports for LC x CE [10, 11]. We report how this improved algorithm corrects both dimensions in an independent, step-wise fashion.
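The rank-minimization idea can be sketched as follows for a single bilinear component. This is a simplification of the published iterative routine: np.roll over an integer shift grid replaces the actual shifting scheme, the sum of residual singular values stands in for the secondary-eigenvalue criterion, and all names are ours.

```python
import numpy as np

def rank_alignment_shift(M, N, shifts, axis=0):
    """Return the integer shift of M along `axis` that minimizes the
    residual (secondary) singular values, i.e. the pseudorank, of the
    augmented matrix. Column augmentation [M|N] tests first-axis alignment;
    row augmentation [M; N] tests second-axis alignment."""
    best_shift, best_cost = None, np.inf
    for s in shifts:
        Ms = np.roll(M, s, axis=axis)  # edge wrap-around ignored in this sketch
        aug = np.hstack([Ms, N]) if axis == 0 else np.vstack([Ms, N])
        sv = np.linalg.svd(aug, compute_uv=False)
        cost = sv[1:].sum()  # singular values beyond the expected rank of 1
        if cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift
```

For one common peak, [M|N] collapses to rank 1 only when the first-axis profiles coincide, so the shift giving the smallest residual singular values is the alignment correction; repeating the search on the row-augmented matrix then corrects the second axis, mirroring the step-wise procedure described above.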

4. Results and Discussion

Figure 1 shows the representative 2-D separation data for a four-component mixture. Five replicate runs of the four-component mixture served as sample runs. Next, a standard addition was made to the sample, which was then run five times for the standard runs (not shown for brevity). Figure 2 shows the application of the improved alignment algorithm to the five paired combinations of sample/standard runs. Note that in Figure 2 each data trace depicts one boundary contour line demarcating the peak boundaries for one run, as in Figure 1. As shown in Figure 2, the improved alignment algorithm successfully shifted the five replicate 2-D sample runs in both dimensions. Subsequent GRAM analysis was very successful.

5. Acknowledgement

This work was funded in part by the Center for Process Analytical Chemistry (CPAC), a University/Industry Cooperative center at the University of Washington.



6. References

1. C. A. Bruckner, B. J. Prazen and R. E. Synovec, "Comprehensive Two-Dimensional Gas Chromatography with Chemometric Analysis," Anal. Chem. 1998, 70, 2796-2804.

2. C. G. Fraga, B. J. Prazen and R. E. Synovec, "Comprehensive Two-Dimensional Gas Chromatography and Chemometrics for the High-Speed Quantitative Analysis of Aromatic Isomers in a Jet Fuel Using the Standard Addition Method and an Objective Retention Time Alignment Algorithm," Anal. Chem. 2000, 72, 4154-4162.

3. C. G. Fraga, B. J. Prazen and R. E. Synovec, "Enhancing the Limit of Detection for Comprehensive Two-Dimensional Gas Chromatography (GC x GC) Data Using Bilinear Chemometric Analysis," J. High Resolut. Chromatogr. 2000, 23, 215-224.

4. C. G. Fraga, C. A. Bruckner and R. E. Synovec, "Increasing the Number of Analyzable Peaks in Comprehensive Two-Dimensional Separations through Chemometrics," Anal. Chem. 2001, 73, 675-683.

5. Z. Liu and M. L. Lee, "Comprehensive Two-Dimensional Separations Using Microcolumns," J. Microcol. Sep. 2000, 12, 241-254.

6. J. B. Phillips, R. B. Gaines, J. Blomberg, F. W. M. van der Wielen, J. M. Dimandja, V. Green, J. Granger, D. Patterson, L. Racovalis, H. J. de Geus, J. de Boer, P. Haglund, J. Lipsky, V. Sinha and E. B. Ledford Jr., "A Robust Thermal Modulator for Comprehensive Two-Dimensional Gas Chromatography," J. High Resolut. Chromatogr. 1999, 22, 3-10.

7. J. V. Seeley, F. Kramp and C. J. Hicks, "Comprehensive Two-Dimensional Gas Chromatography via Differential Flow Modulation," Anal. Chem. 2000, 72, 4346-4352.

8. M. M. Bushey and J. W. Jorgenson, "Automated Instrumentation for Comprehensive Two-Dimensional High-Performance Liquid Chromatography of Proteins," Anal. Chem. 1990, 62, 161-167.

9. M. M. Bushey and J. W. Jorgenson, "Automated Instrumentation for Comprehensive Two-Dimensional High-Performance Liquid Chromatography/Capillary Zone Electrophoresis," Anal. Chem. 1990, 62, 978-984.

10. T. F. Hooker and J. W. Jorgenson, "A Transparent Flow Gating Interface for the Coupling of Microcolumn LC with CZE in a Comprehensive Two-Dimensional System," Anal. Chem. 1997, 69, 4134-4142.

11. A. V. Lemmo and J. W. Jorgenson, "Transverse Flow Gating Interface for the Coupling of Microcolumn LC with CZE in a Comprehensive Two-Dimensional System," Anal. Chem. 1993, 65, 1576-1581.

12. E. Sánchez and B. R. Kowalski, "Generalized Rank Annihilation Factor Analysis," Anal. Chem. 1986, 58, 496-499.

13. E. Sánchez, L. S. Ramos and B. R. Kowalski, "Generalized Rank Annihilation Method: I. Application to Liquid Chromatography-Diode Array Ultraviolet Detection Data," J. Chromatogr. 1987, 385, 152-164.

14. L. S. Ramos, E. Sánchez and B. R. Kowalski, "Generalized Rank Annihilation Method: II. Analysis of Bimodal Chromatographic Data," J. Chromatogr. 1987, 385, 165-180.

15. B. J. Prazen, R. E. Synovec and B. R. Kowalski, "Standardization of Second-Order Chromatographic/Spectroscopic Data for Optimum Chemical Analysis," Anal. Chem. 1998, 70, 218-225.

[Figure 1. 2-D data obtained from a run of the four-component mixture. Axes: column 1 time (sec), column 2 time (sec), FID signal.]

[Figure 2. (A) Overlaid contour boundary plots (data as in Figure 1) from 5 runs, illustrating retention time variation in both dimensions. (B) After alignment along the column 1 axis. (C) The 5 runs after alignment along the column 2 axis. Axes: column 1 time (sec), column 2 time (sec).]

Chemometric Movies Coming Toa Theatre Near You

In our continuous effort to spread the word about the unbearable lightness of chemometrics, the Chemometric Society has plans to remake classic movies in chemometric versions. These are among the movies we expect to appear on screen this year:

• "SIMCAsablanca" starring Nouna & Svante

• "For a few loadings more" starring Harald Martens

• "Honey I shrunk the regression coefficients" starring Sijmen de Jong

• "Things to do at SSC7 when you're drunk" starring ….*

* fill in whatever name you find appropriate