
ORIGINAL PAPER

Characterization and determination of the geographical origin of wines. Part III: multivariate discrimination and classification methods

Ute Römisch • Henry Jäger • Xavier Capron • Silvia Lanteri • Michele Forina • Johanna Smeyers-Verbeke

Received: 13 February 2009 / Revised: 9 July 2009 / Accepted: 16 August 2009 / Published online: 2 October 2009

© Springer-Verlag 2009

Abstract The aim of the European wine project was to test the possibility of determining the country of origin of wines based on their chemical composition. The results of descriptive and inductive univariate methods of data analysis are discussed in part II of this series of papers. Here the results of some selected multivariate methods of discrimination and classification, namely classification and regression trees (CART), regularized discriminant analysis (RDA), and partial least squares discriminant analysis (PLS-DA), are compared and discussed. Special attention is paid to the development of models that are efficient both in terms of predictive performance and number of required variables. Using CART, South African wines could be separated very easily from those of the East European countries by only one isotopic parameter, but CART gives poorer results for the discrimination among the East European wines. The application of RDA and of PLS-DA with its uninformative variable elimination variant (PLS-UVE) leads to discriminant models that allow a correct classification of the wines from the East European countries with rates between 88 and 100%. Comparing RDA and PLS-DA, RDA models contain somewhat fewer variables than PLS-DA models, because PLS-DA is constrained to two-group comparisons (“one country against the other countries”).

Keywords Wine discrimination · Classification and regression trees (CART) · Regularized discriminant analysis (RDA) · Partial least squares-uninformative variable elimination (PLS-UVE)

Introduction

The identification of the geographical origin of wines on the basis of a minimal number of the most important chemical-analytical parameters was the main aim of the European project “Establishing of a wine data bank for analytical parameters from Third Countries” (G6RD-CT-2001-00646-WINE-DB). A wine database containing about 600 authentic and 600 commercial white and red wines from four countries was created over a period of 3 years during 2001–2004. Sixty-three chemical parameters were considered for each of those samples. An introduction to this project can be found in part I [1] of this series of papers.

The statistical data analysis involved methods of univariate descriptive and explorative data analysis as well as multivariate methods. While the results of applying univariate methods to the wine data were analyzed in part II [2], this paper mainly discusses the results of the multivariate discrimination and classification methods.

Project participants R. Wittkowski, BfR, Germany; C. Fauhl-Hassek, BfR, Germany; K. Schlesier, BfR, Germany; P. Brereton, CSL, United Kingdom; M. Baxter, CSL, United Kingdom; E. Jamin, Eurofins, France; X. Capron, VUB, Belgium; J. Smeyers-Verbeke, VUB, Belgium; C. Guillou, JRC, Italy; M. Forina, UGOA, Italy; U. Römisch, TU Berlin, Germany; V. Cotea, UIASI.VPWT.LO, Romania; E. Kocsi, NIWQ, Hungary; R. Schoula, CTL, Czech Republic; F. van Jaarsveld, ARC Infruitec-Nietvoorbij, South Africa; Jan Booysen, Winetech, South Africa.

U. Römisch (✉) · H. Jäger
Technische Universität Berlin, Fak. III, Gustav-Meyer-Allee 25, 13355 Berlin, Germany
e-mail: [email protected]

X. Capron · J. Smeyers-Verbeke
Vrije Universiteit Brussel, Farmaceutisch Instituut, Laarbeeklaan 103, 1090 Brussels, Belgium

S. Lanteri · M. Forina
Dipartimento di Chimica e Tecnologie Farmaceutiche ed Alimentari, Via Brigata Salerno 13, 16147 Genoa, Italy


Eur Food Res Technol (2009) 230:31–45

DOI 10.1007/s00217-009-1141-x


In this paper, we consider three methods of discrimination and classification of multivariate data: classification and regression trees (CART), regularized discriminant analysis (RDA), including the linear (LDA) and quadratic (QDA) cases, and partial least squares-discriminant analysis (PLS-DA) with its uninformative variable elimination variant (PLS-UVE).

At the beginning our attention was focused on discriminating the four countries in order to confirm our expectation from the univariate evaluation, PCA [2], and from the analyses of single years [3, 4], that authentic as well as commercial South African wines seem to be discriminated very easily from those from East European countries. The application of CART and RDA to the wine data of the three East European countries has confirmed that the discrimination of wines between Hungary, Czech Republic, and Romania was much more difficult because of the geographical proximity of these countries. Finally, PLS-classification methods were used to build “one versus all other” discriminant models.

Data

The data set consists of wine samples from four different countries: Hungary, Czech Republic, Romania, and South Africa. For each country authentic and commercial wine samples were collected and analyzed over a period of 3 years. The sampling strategy, the analytical methods used, and the data, including sample sizes for the different countries and types of wines, are described in part I [1] and part II [2] of this series of papers.

Description of multivariate statistical methods

Classification and regression trees

Classification and regression trees (CART) is a well-known method in statistics which is thoroughly documented in the literature [5]; hence the principle of this method is only briefly illustrated with the help of Fig. 1. In this situation three groups (circles, squares, triangles) described by two variables (x1 and x2) must be discriminated. CART is a partition method whose goal is to find critical values for x1 and x2 so that the combination of simple binary splits will form a decision tree. This tree will then be used to determine the class membership of a new incoming sample.

Looking at Fig. 1a, circles are clearly characterized by a value of x1 higher than 0.6 and hence their discrimination is straightforward. On the other hand, squares and triangles both have their x1 value below 0.6 and hence cannot be separated yet. However, triangles are easily distinguished from squares with the help of the second variable since they are characterized by a value of x2 which is lower than 0.4. Therefore, the model finally obtained by CART (Fig. 1b) is a decision tree of which the first split compares the value of x1 to 0.6 and the second split checks whether the value of x2 is lower or higher than 0.4. Following this tree, it is possible to perfectly discriminate the three classes present in the data.
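The tree of Fig. 1b amounts to two nested comparisons. As a minimal sketch only (the function name and the class labels are illustrative assumptions, and a sample lying exactly on a critical value is arbitrarily sent to the upper branch), it could be written in Python as:

```python
def classify_toy_sample(x1, x2):
    """Assign a sample to one of the three toy classes of Fig. 1."""
    # First split: circles are the samples with x1 above 0.6.
    if x1 >= 0.6:
        return "circle"
    # Second split, reached only when x1 < 0.6: triangles have x2 below 0.4.
    if x2 < 0.4:
        return "triangle"
    return "square"


print(classify_toy_sample(0.8, 0.2))  # circle
print(classify_toy_sample(0.3, 0.2))  # triangle
print(classify_toy_sample(0.3, 0.7))  # square
```

Only the two critical values need to be known to classify a new sample, which is what makes such a model easy to interpret.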

In practice, the construction of a CART model is performed in three steps. The first consists in building a tree which perfectly describes the training data. However, the predictive ability of this model might be poor due to over-fitting, and hence the next step is to “prune”, i.e., to successively cut, the last branches of this over-large tree. It is then necessary to determine which of the smaller trees obtained after pruning is optimal in terms of predictive power. This last step is usually achieved by tenfold cross-validation, a re-sampling technique which uses the training data for validation and therefore does not require additional independent samples [6]. The most attractive feature of CART certainly is the simplicity of the model obtained, which makes interpretation straightforward. Indeed, only a few original variables and their corresponding critical values have to be known to determine the class membership of a new sample.
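The tenfold cross-validation used to choose among the pruned trees can be sketched as below. This is an illustrative outline, not the Matlab program used in this work: the helper names, the interleaved assignment of samples to folds, and the `fit` and `predict` callables (which stand for building and applying any candidate model) are all assumptions.

```python
def k_fold_indices(n_samples, k=10):
    """Split the indices 0..n_samples-1 into k disjoint validation folds."""
    # Assumption: interleaved folds (0, k, 2k, ...), (1, k+1, ...), and so on.
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n_samples) if i not in held_out]
        splits.append((train, fold))
    return splits


def cross_validated_rate(samples, labels, fit, predict, k=10):
    """Percentage of samples correctly classified while they were held out."""
    correct = 0
    for train, held_out in k_fold_indices(len(samples), k):
        # The model is rebuilt without the held-out fold ...
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        # ... and then asked to classify exactly those held-out samples.
        correct += sum(predict(model, samples[i]) == labels[i] for i in held_out)
    return 100.0 * correct / len(samples)
```

Under this sketch, each candidate pruned tree would be scored with `cross_validated_rate` and the tree with the highest rate retained. In practice the samples would usually be shuffled before the folds are formed.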

Fig. 1 CART: a the initial space is divided into more pure subspaces; b the corresponding decision tree (axes x1 and x2; splits at x1 = 0.6 and x2 = 0.4)


All computations for CART were performed with a Matlab program, written in Matlab 6.5 (The MathWorks, Natick, MA), on a computer running Microsoft Windows XP.

Regularized discriminant analysis

Discriminant analysis [6, 7] is used to analyze differences between two or more groups (or classes) with respect to a set of variables measured on the objects of these groups. The influence of those independent variables on the groups is to be investigated. Discriminant functions, which contain significant variables, are estimated and objects are classified on the basis of the estimated discriminant model. “Good” discriminant models contain the most important variables for explaining differences between the groups with minimal misclassification rates.

Under the assumption of a Gaussian distribution of the p-dimensional feature vector $X_k$ in the kth group, $X_k \sim N(\mu_k, \Sigma_k)$ $(k = 1, \ldots, K)$, where $\mu_k$ denotes the group means and $\Sigma_k$ the group covariance matrices, quadratic discriminant analysis (QDA) minimizes the misclassification rate and separates the disjoint regions of the feature space corresponding to each group assignment by quadratic boundaries. Linear discriminant analysis (LDA) is used under the hypothesis of identical group covariance matrices, such that the rule that minimizes the misclassification rate leads to a linear separation of the groups.

Regularized discriminant analysis (RDA) [8] was introduced as a compromise between linear and quadratic discriminant analyses for cases in which the number of parameters to be estimated is comparable to or even larger than the sample size. In the regularization step the estimated group covariance matrix $\Sigma_k$ is stabilized by

$$\Sigma_k(\lambda) = \lambda \Sigma_k + (1 - \lambda)\Sigma$$

where $\Sigma$ is the pooled covariance matrix estimate. The regularization parameter $\lambda \in [0, 1]$ controls the degree of shrinkage of the group covariance matrix estimates toward the pooled estimate. The limiting cases correspond to LDA ($\lambda = 0$) and QDA ($\lambda = 1$). To determine the optimal value of $\lambda$, the estimated error rate has to be minimized during the model-building process. Rates of misclassification are estimated by re-substitution, cross-validation (leave-one-out), and prediction from a test set (the data are split into a learning and a test set).
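The regularization step is a simple element-wise blend of two matrices. The following Python sketch illustrates it under stated assumptions: the function name is hypothetical, the matrices are plain nested lists, and the numerical values are invented for the example rather than taken from the wine data.

```python
def regularized_covariance(group_cov, pooled_cov, lam):
    """Return lam * group_cov + (1 - lam) * pooled_cov, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * g + (1.0 - lam) * p for g, p in zip(g_row, p_row)]
        for g_row, p_row in zip(group_cov, pooled_cov)
    ]


# Invented 2 x 2 example matrices (not project data).
group = [[4.0, 0.0], [0.0, 2.0]]
pooled = [[2.0, 0.0], [0.0, 2.0]]

print(regularized_covariance(group, pooled, 0.0))  # pooled estimate (LDA case)
print(regularized_covariance(group, pooled, 1.0))  # group estimate (QDA case)
print(regularized_covariance(group, pooled, 0.5))  # halfway between the two
```

Intermediate values of the parameter trade the flexibility of group-specific covariance estimates against the stability of the pooled estimate.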

The Matlab program “ldagui” [9] was used; it allows models to be built interactively, step by step, guided by a minimal classification error (estimated by re-substitution, leave-one-out cross-validation, and simulation) and an optimal choice of the regularization parameter $\lambda$. The strategy of model building is described in more detail in [3].

Partial least squares-discriminant analysis and partial least squares-uninformative variable elimination

Partial least squares-discriminant analysis (PLS-DA) takes advantage of the fact that any regression algorithm, e.g., PLS, can be applied to a discrimination problem when the group (or class) membership of the samples is encoded as a number. However, such an approach is limited to the discrimination of two groups at a time [10], the first group being identified, e.g., as -1 while the second group is encoded as +1. Moreover, the number of models necessary to discriminate samples from three or more groups depends on whether “one group versus one group” or “one group versus all other groups” models are constructed. In the latter case, fewer models have to be built and hence only “one versus all” PLS-DA models are considered in this study.
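The bookkeeping behind these two strategies can be illustrated with a short Python sketch. The function names are hypothetical, one model per group is assumed for the one-versus-all scheme, and assigning +1 to the target country is an arbitrary choice made for the example.

```python
def n_one_vs_one_models(n_groups):
    """Number of pairwise models: one for every pair of groups."""
    return n_groups * (n_groups - 1) // 2


def n_one_vs_all_models(n_groups):
    """Number of models when each group is opposed to all the others."""
    return n_groups


def encode_one_vs_all(countries, target):
    """Encode the target country as +1 and every other country as -1."""
    return [1 if country == target else -1 for country in countries]


print(n_one_vs_one_models(4), n_one_vs_all_models(4))  # 6 4
print(encode_one_vs_all(["HU", "CZ", "RO", "SA"], "SA"))  # [-1, -1, -1, 1]
```

For the four countries considered here, six pairwise models would be needed against four one-versus-all models.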

The principle of PLS consists in finding linear combinations of the original variables which maximize the covariance between X and y. Each of those factors explains as much as possible of the linear relationship existing between the independent variables, i.e., the chemical content of a wine sample in this case, and the dependent variable y, which here represents the country of origin of the sample. As was the case with CART, over-fitting is an issue which has to be taken into account, and the PLS solution to this problem consists in incorporating only a limited number of factors in the model, the optimal number often being determined with the help of cross-validation approaches [6]. Once the optimal complexity of the model has been assessed, the vector of regression coefficients b can be determined and the prediction of the group membership of a new sample can easily be computed with the following relationship:

$$\hat{y}_{new} = x_{new}^{T} b$$

where $\hat{y}_{new}$ (1 × 1) is the predicted y value for the new sample and $x_{new}$ (p × 1) is a vector containing the measurements of the p original variables. The group membership of the new sample is therefore straightforward to determine since the decision depends only on the sign of the predicted value: if $\hat{y}_{new}$ is negative the sample belongs to the group encoded as -1, and vice versa. Contrary to CART, PLS uses a linear combination of variables and hence PLS-DA models often perform better than classification trees. However, this gain in performance is obtained at the price of simplicity since interpretation of PLS regression coefficients might be complex or misleading in some circumstances.
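The prediction rule is a dot product followed by a sign check. A minimal Python sketch is given below; the function name and the numbers are invented for illustration, the measurements are assumed to have been preprocessed in the same way as the calibration data, and a predicted value of exactly zero is arbitrarily assigned to the +1 group.

```python
def predict_group(x_new, b):
    """Return (y_hat, group) with y_hat = x_new^T b and group in {-1, +1}."""
    if len(x_new) != len(b):
        raise ValueError("x_new and b must contain the same p variables")
    y_hat = sum(x * coef for x, coef in zip(x_new, b))
    # Decision by sign; zero is sent to +1 by assumption.
    return y_hat, (-1 if y_hat < 0 else 1)


# Invented coefficients and measurements for p = 2 variables.
print(predict_group([1.0, 2.0], [0.5, -1.0]))  # (-1.5, -1)
print(predict_group([2.0, 1.0], [0.5, -0.5]))  # (0.5, 1)
```

The sketch also makes the drawback discussed in the text visible: every one of the p variables entering `b` has to be measured before a prediction can be made.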

Moreover, no variable selection is performed during the construction of a PLS model and hence prediction for a new sample requires measuring all the parameters which were used during calibration. This is a rather unattractive feature since the analysis of 63 parameters to authenticate every wine sample is economically unrealistic. Therefore, a variable selection method known as partial least squares-uninformative variable elimination (PLS-UVE) [4, 11] is applied in this work. PLS-UVE identifies the variables for which the PLS regression coefficients are low and/or unstable, meaning that the information carried is likely to be very low and therefore those parameters should not be kept in the final model. The decision whether a variable should be retained is made with the help of a criterion calculated as the ratio between the mean value of the regression coefficient of the variable and the standard deviation of this regression coefficient, both being estimated with a re-sampling approach. The ratio determined in this way is then compared to a threshold value which corresponds to the maximum value of the criterion observed for some artificially generated variables which are known to be uninformative. A variable for which the ratio is below this critical value is therefore considered to be uninformative and is discarded. Since the number of variables retained by the PLS-UVE method is often still large, an additional selection step is introduced, the aim of which is to select a small set of variables that still achieves a good discrimination [4]. This selection procedure is iterative and removes the parameters with the lowest regression coefficients.
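The retention criterion can be illustrated with a small Python sketch. It is not the program used in this work: the function and variable names are hypothetical, the coefficient values are invented, the sample standard deviation is used, and comparing absolute values of the ratio is an assumption made here so that large negative coefficients are treated like large positive ones.

```python
from statistics import mean, stdev


def stability(coefficients):
    """Mean of the re-sampled regression coefficients over their std. dev."""
    return mean(coefficients) / stdev(coefficients)


def select_informative(real_coefs, artificial_coefs):
    """Keep the real variables whose criterion exceeds that of every artificial one."""
    # Threshold: largest criterion observed among the artificial variables.
    threshold = max(abs(stability(c)) for c in artificial_coefs.values())
    return [name for name, c in real_coefs.items() if abs(stability(c)) > threshold]


# Invented coefficients from three re-sampling runs (not project data).
real = {"var_a": [0.9, 1.0, 1.1], "var_b": [-0.2, 0.3, -0.1]}
artificial = {"noise_1": [0.1, -0.1, 0.05], "noise_2": [-0.02, 0.03, 0.01]}

print(select_informative(real, artificial))  # ['var_a']
```

In this toy case `var_a` has a large and stable coefficient and is kept, whereas `var_b` fluctuates around zero and is discarded as uninformative.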

All computations for PLS were performed with a Matlab program, written in Matlab 6.5 (The MathWorks, Natick, MA), on a computer running Microsoft Windows XP.

Error rate estimation

Discriminant models are built and error rates are evaluated based on the re-substitution method as well as on cross-validation (CV), either by the classical leave-one-out method (CV-LOO) [6, 7] or by tenfold cross-validation (CV-10) [6]. The re-substitution rate is the percentage of samples used to build the model that are correctly classified. For RDA, additionally, an error rate based on the simulation of 6,000 wine samples for each country [3] is given.

The optimal model chosen using the re-substitution and CV procedures is further validated using independent samples. Indeed, the prediction error estimated by re-substitution is in general too optimistic, and the estimate obtained by cross-validation might be too optimistic as well. The performance of the discrimination model should therefore also be tested with independent samples. These can be obtained by splitting the data into a training and a test set using the duplex algorithm [6, 12]. The discrimination model is then built with the training set and used to predict the classification of the independent samples from the test set. The percentage of samples from the test set that is correctly classified is called the prediction rate. It is generally lower than the re-substitution rate and sometimes too pessimistic because splitting the data reduces the sample size of the training set.

The composition of the training and the test set is given in Table 1.

Application of multivariate methods to the wine data

Classification and regression trees

Because of the simplicity and interpretability of the model obtained, CART is the first method applied. Authentic and commercial samples are modeled separately. The same holds for red and white wines, since the discrimination of samples of different color is straightforward.

Table 1 Composition of the training and the test set

                      Training set                          Test set
                      Red wines  White wines  Total         Red wines  White wines  Total
Authentic samples
  Hungary                 30         67         97              14         33         47
  Czech Republic          25         74         99              13         39         52
  Romania                 32         67         99              16         35         51
  South Africa            27         73        100              12         37         49
  Σ                      114        281        395              55        144        199
Commercial samples
  Hungary                 34         66        100              17         33         50
  Czech Republic          26         75        101              13         38         51
  Romania                 32         62         94              16         32         48
  South Africa            26         73         99              13         38         51
  Σ                      118        276        394              59        141        200


Discrimination and classification of the four countries (HU, CZ, RO, SA)

Authentic white wines

The classification tree in Fig. 2 is built with the 281 available training samples, and 144 representative test samples are used to estimate the predictive ability of the model.

The optimal CART model retains three parameters out of the 63 present in the database, namely the isotopic ratio Ethanol(D/H)1, the rare earths ratio Yb/La and the content of U. The most interesting result, which confirms earlier observations [2], certainly is that South African samples are easily discriminated from the Eastern European wines. The authentication of South African white wines can almost perfectly be done based on a single variable, namely Ethanol(D/H)1. From the boxplot in part II, Fig. 8 it follows that South African wines are indeed characterized by a higher value of this parameter. The discrimination of Hungarian wines from Czech and Romanian wines appears to be more difficult.

The performance of this classification tree is summarized in Table 2. The prediction errors (about 15%) mainly arise from confusion between Hungarian and Czech wines.

Authentic red wines

A first classification tree (Fig. 3a) is built with the 114 calibration samples available, and 55 representative test samples are used to estimate the predictive ability of the model.

As for the white wines, the CART model selects only three of the 63 variables [V, Ethanol(D/H)2 and Ethanol(D/H)1], but South African wines are not separated after the first split and hence V and Ethanol(D/H)1 are both required to authenticate South African authentic red samples. However, when the vanadium content is not taken into account during model construction, the tree shown in Fig. 3b is obtained. It follows that with the isotopic ratio Ethanol(D/H)1 alone it is possible to discriminate South African wines from Eastern European wines. Note that in this tree V is replaced by La, making the discrimination of Romanian samples slightly more difficult, although the discrimination of Hungarian and Czech wines is still responsible for most of the classification errors observed in Table 2.

Commercial white wines

A classification tree is built with the 276 calibration samples available and the predictive power of the model is assessed with the help of 141 representative test samples. The performance of this tree is summarized in Table 2. The CART model built is relatively complex since eight variables are retained [Ethanol(D/H)1, Ti, Cl, Sr, U, Wine δ18O, Zn, and the ratio Gd/La], and the prediction errors (about 17%) are larger than for the authentic wines. As was the case for authentic wines, South African samples are easily separated using only the Ethanol(D/H)1 isotopic ratio. Notice that the splitting values of Ethanol(D/H)1 for the authentic white wines and commercial white wines are very similar, 103.5 and 103.3, respectively. This confirms earlier conclusions based on an evaluation of the samples collected during the first year of the project [2]. South African samples can be identified by comparing the Ethanol(D/H)1 value with a reference value of, for instance, 103. Therefore, for this discrimination the authentic wines seem to be a good model for the commercial wines.

Fig. 2 Classification tree developed with all parameters for the authentic white wines (splits: EtDH1 < 103.5, U < -1.76, Yb/La < -0.95). Bars in boxes stand for Hungarian, Romanian, Czech, and South African samples, respectively

Commercial red wines

The CART model is built with 118 calibration samples and tested by means of 59 representative test samples. This tree uses only three parameters, namely Ethanol(D/H)1, B and P, but the prediction ability of the model is relatively low (see Table 2). However, South African wines are again perfectly classified solely based on the analysis of the isotopic ratio Ethanol(D/H)1. Here too the splitting value of this variable (102.9) is comparable to that of the authentic red wines (103.8). This indicates that, also for the discrimination of South African red wines, authentic wines are a good model for the commercial wines.

Table 2 Performance (% re-substitution and % prediction) of CART models for the discrimination of Hungarian, Czech, Romanian and South African wines

               Number of variables   Re-substitution   Prediction   Prediction
               in the model                            (CV-10)      (CV-test)
Authentic
  White                 3                 89.0            83.3         84.8
  Red                   3                 93.0            85.1         87.3
Commercial
  White                 8                 89.9            77.9         82.9
  Red                   3                 85.6            76.2         72.9

Fig. 3 CART for the authentic red wines: a the classification tree built with the 63 parameters available (splits: V < 0.49, EtDH2 < 127.5, EtDH1 < 103.8); b the classification tree obtained when V is discarded (splits: EtDH1 < 103.8, La < -0.96, EtDH2 < 127.5). Bars in boxes stand for Hungarian, Romanian, Czech, and South African samples, respectively

Discrimination and classification of the European samples (HU, CZ, RO)

It is important to realize that, the discrimination of South African samples being straightforward, the estimated predictive ability of the previous models is likely to be optimistic concerning the classification of Eastern European wines.

Therefore, additional CART models were constructed that focus on the discrimination of Hungarian, Romanian and Czech samples only. The predictive power of those models (Table 3) is noticeably lower than what was found previously, which was expected since the classification errors observed with the first set of trees are systematically caused by misclassification of European wines.

Authentic white wines

The classification tree obtained uses two variables, U and Yb/La. The former is important to discriminate the Czech samples from the others while the latter mainly discriminates between Hungarian and Romanian wines. This confirms earlier univariate observations. The boxplot of U (part II, Fig. 8) shows that Czech samples indeed have lower U concentrations, and from the Fisher weights (part II, Table 4) it follows that Yb/La is one of the most important variables to separate Hungarian and Romanian wines.

Authentic red wines

Only two variables, V and Ethanol(D/H)2, are used in the CART model obtained. Vanadium discriminates between Romania and the two other countries while the isotopic ratio mainly separates Hungary from Czechia. This again confirms earlier observations since from the boxplots (part II, Fig. 8) it follows that the concentration of V in the Romanian samples is higher than in the other Eastern European countries. Moreover, from the Fisher weights (part II, Table 4) it follows that Ethanol(D/H)2 is important for the separation of Hungarian and Czech samples.

Commercial white wines

Eight variables (Ti, Cl, Sr, Li, Wine δ18O, Zn, Gd/La, and Na-Excess) are used in the model. The relatively low prediction rates (about 73%) confirm earlier conclusions from, among others, PCA (part II, Fig. 6), i.e., commercial samples show a larger overlap than the authentic wines.

Commercial red wines

Only two variables (B and P) are used in the CART model obtained, but the prediction rates (about 65%) are very low. They confirm the low discriminating power of these parameters observed from the Fisher weights (part II, Table 4).

Conclusions

From the CART analysis, the most obvious conclusion is that South African samples are very easily classified by means of Ethanol(D/H)1, whether authentic, commercial, red or white wines are considered. Therefore, it is important to focus on the discrimination of the three European countries, for which the performance of the classification trees is not satisfactory. Indeed, only 73% of the commercial European wines can be correctly classified in the most favorable case. The CART analysis also underlines that the classification of commercial samples is more difficult to achieve than the classification of authentic samples. This confirms earlier conclusions from PCA (see part II, Fig. 6). Finally, the most important parameters for discriminating authentic wines are V, Ethanol(D/H)1, Ethanol(D/H)2, U and Yb/La, while Ethanol(D/H)1, B, P, Ti, Cl, Sr, U, Wine δ18O, Zn and Gd/La are most important for the discrimination of the commercial samples.

Regularized discriminant analysis

Following the conclusion of the CART analysis that South African samples can be classified very easily by means of Ethanol(D/H)1, it was decided to use the RDA method for discriminating only Hungarian, Czech, and Romanian wine samples. The aim of applying RDA is to find “good” discriminant models which contain the most important variables for explaining differences between the countries with minimal misclassification rates. The interactive model-building process is described in [3]. The performance of RDA models for the discrimination of the three East European countries is summarized in Table 4.

Table 3 Performance (% re-substitution and % prediction) of CART models for the discrimination of Hungarian, Czech and Romanian wines

Wine type | Number of variables in the model | Re-substitution | Prediction (CV-10) | Prediction (CV-test)
Authentic white | 2 | 86.1 | 77.9 | 79.4
Authentic red | 2 | 93.1 | 86.2 | 83.7
Commercial white | 8 | 88.7 | 75.4 | 72.8
Commercial red | 2 | 81.6 | 68.5 | 65.2

Eur Food Res Technol (2009) 230:31–45 37


The discrimination of wines based on our preferred

models (M1), the preference being based on high-perfor-

mance rates and a minimal number of variables in the

model, is illustrated in Fig. 4.

Authentic white wines At first models were built using all

authentic white wines (N = 315) from the three East Euro-

pean countries. Five good RDA-models, including 10–15

different variables were built. The following variables were

important in the five models for discriminating the wines

from the three East European countries; the variables in bold

have shown a strong discriminating power in most (at least

four of the five) models:

M1: V, La, Cd, Si, Al, U, Tartaric acid, Li, Y, Na, Ethanol(D/H)2, Wine δ18O, Ca and Ni

M2: V, Cd, Si, K, Al, Mn, Zn, Sr, Y, U, Wine δ18O and Er/La

M3: Tartaric acid, Si, K, Li, Cr, Mn, Ni, Rb, Sr, Y, Cd, Ba, Pb, Ethanol(D/H)2 and Yb/La

M4: V, La, U, Tartaric acid, Mg, Ca, Li, Mn, Fe, As, Cs, and Wine δ18O

M5: V, Tartaric acid, Mg, Si, Mn, Fe, Cd, Ba, U and Yb/La.

The variables in the models are ordered according to their discriminating power during the interactive model-building process. In most cases Vanadium was the variable with the largest discriminating power, followed by either Cadmium, Lanthanum, Tartaric acid, Silicon or Uranium. For these models error rates between 0.3 and 1.6% (re-substitution) and between 2.8 and 5.1% (CV-LOO and simulation) were estimated. This means that, for every model, at most 5.1% of the wines were not assigned to the country in which they were produced. The correct classification rates of our preferred RDA-model M1 (λ = 0.47) with 14 variables were excellent: 99.7% (re-substitution), 95.9% (simulation, 6000 samples per country), and 95.6% (CV-LOO).
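The simulation-based rates quoted above can be pictured with a small Monte Carlo sketch. It assumes, which this section does not state, that synthetic samples are drawn from a normal distribution fitted to each country and are then classified. A single hypothetical discriminant score and a nearest-mean rule stand in for the RDA model, and the class means and spreads are invented for illustration.

```python
import random


def simulated_rate(class_params, n_per_class=6000, seed=0):
    """Percent of synthetic samples assigned to their own class by the nearest class mean.

    class_params maps a class label to (mean, standard deviation) of one score.
    """
    rng = random.Random(seed)
    hits = 0
    total = 0
    for label, (mean, sd) in class_params.items():
        for _ in range(n_per_class):
            x = rng.gauss(mean, sd)
            # Assumed decision rule: pick the class whose mean is closest to x
            guess = min(class_params, key=lambda k: abs(x - class_params[k][0]))
            hits += guess == label
            total += 1
    return 100.0 * hits / total


# Hypothetical, well-separated classes (not estimated from the wine data)
hypothetical_classes = {"Cz": (0.0, 1.0), "Hu": (10.0, 1.0), "Ro": (20.0, 1.0)}
rate = simulated_rate(hypothetical_classes)
print(rate)
```

The number of draws per class (6000) is the only value taken from the text. Everything else would have to be replaced by the fitted model to reproduce the reported rates.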

To reach the same misclassification rates, the models obtained with linear (λ = 0) and quadratic (λ = 1) discriminant analysis would require many more variables.
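The role of the regularization parameter can be illustrated with a short sketch. It assumes the simplest blend that matches the endpoints stated above (λ = 0 gives the pooled covariance of linear discriminant analysis, λ = 1 gives the class-specific covariance of quadratic discriminant analysis). The exact weighting used by the authors is described in [3] and may differ; the function name and matrices are hypothetical.

```python
def blend_covariance(class_cov, pooled_cov, lam):
    """Return (1 - lam) * pooled_cov + lam * class_cov, element by element.

    lam = 0 reproduces the pooled (linear) case, lam = 1 the class-specific
    (quadratic) case, following the convention used in this section.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie between 0 and 1")
    return [
        [(1.0 - lam) * p + lam * c for c, p in zip(class_row, pooled_row)]
        for class_row, pooled_row in zip(class_cov, pooled_cov)
    ]


# Hypothetical 2 x 2 covariance matrices for one country and for all countries pooled
class_cov = [[4.0, 1.0], [1.0, 2.0]]
pooled_cov = [[2.0, 0.0], [0.0, 2.0]]

halfway = blend_covariance(class_cov, pooled_cov, 0.5)
print(halfway)
```

Intermediate values of λ, such as the 0.47 of the preferred model, shrink each class covariance toward the pooled estimate, which stabilizes the estimate when a class has few samples relative to the number of variables.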

To estimate the error rate independently of the model, the training (N = 281) and test sets (N = 144) from Table 1 were used. Models were built again, now based on the training set, and correct classification rates were estimated by re-substitution and CV-LOO in the same way as for the whole data set. Afterwards, the independent wine samples from the test set were classified using the models of the training set (CV-test). For the authentic white wines, our preferred model M1 built on all wine data was also a good model for the training set (λ = 0.47), which means that the selected variables are very stable in this model. The performance rates (re-substitution: 100%, simulation: 96.4%, CV-LOO: 96.6% and CV-test: 98.1%) compare very well with those based on all data.

For the authentic red wines as well as for the commer-

cial white and red wines, discussed in the following sec-

tions, only the correct classification rates of our preferred

RDA-models M1 for the complete data set and for the

training set will be given. Here again the variables in bold

have shown a high discriminating power in different

models.

To compare the results of CART, RDA, and PLS-DA,

only performance rates for models built with the training

set are summarized in Table 4.

Authentic red wines Using all authentic red wines

(N = 130) the following good RDA-model with 8 vari-

ables could be estimated:

M1: V, Yb/La, Ethanol(D/H)2, Methanol, Cl, Cr, Pb and

Shikimic acid.

With this RDA-model (λ = 0.9) 100% of the wines could be classified correctly by re-substitution, 98.5% by simulation, and 93.9% by CV-LOO.

Table 4 Performance (% re-substitution, % prediction and % simulation) of the preferred RDA models M1 for the discrimination of Hungarian, Czech, and Romanian wines

Wine type | Number of variables in the model | Re-substitution | Prediction (CV-LOO) | Prediction (CV-test) | Simulation
Authentic white | 14 (10–16) | 100 (99.0–99.7) | 96.6 (93.7–98.1) | 98.1 (94.1–98.1) | 96.3 (96.3–99.1)
Authentic red | 6 (6–8) | 100 (100) | 97.7 (97.3) | 93.0 (88.4–93.0) | 99.4 (98.5–99.4)
Commercial white | 17 (17–21) | 100 (100) | 90.2 (87.7–90.2) | 89.4 (86.4–89.4) | 97.1 (97.1–98.6)
Commercial red | 10 (9–10) | 100 (97.8–100) | 94.6 (87.0–94.6) | 89.1 (82.6–95.6) | 96.3 (95.1–97.0)

The performance of all good RDA models is summarized in brackets.


A comparably good RDA-model (λ = 0.8) with only six variables could be found when only the training set (N = 87) was considered:

M1: V, Cd, Ethanol(D/H)2, S, Cr, and As.

The corresponding re-substitution, simulation, and prediction (CV-LOO and CV-test) rates of 100, 99.4, 97.7, and 93.1%, respectively, were slightly higher. In both cases V, Ethanol(D/H)2 and Cr were selected as important variables.

Commercial white wines In the case of commercial white

wines (N = 306), more variables were necessary to find a

good RDA-model in comparison to authentic white wines;

our preferred RDA-model based on all wine data contained

21 variables:

Fig. 4 Discriminating plots of the preferred RDA-models M1 for authentic and commercial white and red wines. Each panel plots canonical discriminant function 2 against canonical discriminant function 1, with samples marked by country (Czech Republic, Hungary, Romania). Panel titles: Authentic white wines, RDA-Model 1 (14 var.), parameter λ = 0.466; Commercial white wines, RDA-Model 1 (21 var.), parameter λ = 0.92; Commercial red wines, Model 1 (10 var.), parameter λ = 0.9.


M1: Cl, Li, Zn, Rb, Ca, Invert sugar, Ti, Putrescine, Mg, Er/La, Cu, Cd, Ethanol(D/H)2, Ethanolamine, Br, Ni, Shikimic acid, Na, Pb, Co and 2-Methylbutan-1-ol.

With this RDA-model (λ = 0.9) 100% of the wines could be classified correctly by re-substitution, 97.5% by simulation and 90.8% by CV-LOO. The discriminating plot of the commercial white wines in Fig. 4 shows some overlap between the wines of the three East European countries.

The chosen RDA-model (λ = 0.8) based on the learning set (N = 203) contained only 17 variables:

M1: Cd, Zn, Ca, Rb, Sr, Wine δ18O, Putrescine, Mg, Si, Invert sugar, Li, Cu, Cr, P, Ba, Shikimic acid and Tartaric acid.

For this model, classification and prediction rates (re-substitution, simulation, CV-LOO and CV-test) of 100, 97.1, 90.2 and 89.4%, respectively, were obtained, which are comparable with the rates above. Some important variables appear in both M1-models, but there are also some differences. This means that these models are less stable and show greater variability.

Commercial red wines Using all commercial red wines

(N = 138) the following good RDA-model with 10 vari-

ables could be estimated:

M1: K, B, Ti, Sr, Tartaric acid, Ethanol(D/H)2, La, Fe,

Zn, Shikimic acid

With this RDA-model (λ = 0.9) again 100% of the wines could be classified correctly by re-substitution, 97.2% by simulation and 94.9% by the CV-LOO method.

Based on the N = 92 commercial red wines of the learning set the same variables were chosen as important for our preferred RDA-model, but the best performance rates were obtained when using λ = 0.6 as regularization parameter. The correct classification rates (see Table 4) were: 100% by re-substitution, 96.3% by simulation, 94.6% by CV-LOO and 89.1% by CV-test.

Conclusions

1. While good discrimination models could be found with 10–14 variables for authentic white wines and with 6–9 variables for authentic red wines, the number of variables needed in the models for commercial wines (for comparable error rates) was larger. For commercial white wines good models should include 17–22 variables and for commercial red wines 8–10 variables.

2. The following variables had a very high (highlighted) or a high discriminating power in different model variants:

• Auth. white wines: Tartaric acid, Si, Mg, K, Ca, Li, Al, V, Mn, Fe, Ni, Sr, Y, Cd, U, La or Yb/La or Er/La and Ethanol(D/H)2 or Wine δ18O

• Auth. red wines: L-Lactic acid, Shikimic acid, Methanol, 3-Methylbutan-1-ol, S, Ca, Cl, Putrescine, Ethylamine, V, Cr, Zn, As, Rb, Cd, Pb, La or Yb/La and Ethanol(D/H)2

• Comm. white wines: Invert sugar, Shikimic acid, 2-Methylbutan-1-ol, Mg, Si, Cl, Ca, Ethanolamine, Putrescine, Li, Ti, Cr, Cu, Zn, Br, Rb, Sr, Cd, Pb, La or Er/La and Ethanol(D/H)2

• Comm. red wines: Tartaric acid, Shikimic acid, 1-Propanol, K, B, Ti, Sr, La or Er/La and Ethanol(D/H)2

3. If only the training set of authentic wines is used for building discriminant models instead of the whole data set, generally the same variables have a high discriminating power. This means that many models, especially in the case of authentic white wines, are very stable. In the case of commercial white wines some good model variants based on the training set also included other variables, such as Tartaric acid, Na, P, Ba and Wine δ18O, which seem to play an important role in the discrimination. Indeed, the “best” discriminant models based on the whole data set are not always identical with the “best” models from the training set. However, in general the performance rates (re-substitution, CV-LOO) of the “best” models in the whole set and in the training set were comparable.

In the group of commercial red wines, besides our preferred model M1 with excellent performance rates in both data sets, another model with lower rates (re-substitution: 100%, simulation: 95.1%, CV-LOO: 87.0%, CV-test: 84.6%) could be found, which included the variables B, Ti, P, V, Si, Ba, 2-Methyl-1-propanol, and Invert sugar. These variables were also identified as important by the PLS-DA method in the following section.

4. The good separation of the wines from the three East

European countries by our preferred models for authentic

as well as for commercial wines is illustrated in Fig. 4.

Partial least squares discriminant analysis (PLS-DA) and partial least squares-uninformative variable elimination (PLS-UVE)

Here too we focus only on the discrimination of the Hungarian, Czech, and Romanian samples. Each of the discrimination situations is modeled in three steps. A classical PLS-DA model is developed first and then PLS-UVE is applied. Finally, a manual variable selection is performed on the variables retained by PLS-UVE in order to decrease even further the number of parameters required to authenticate a wine sample. This final selection is done iteratively, the parameter with the smallest regression coefficient being discarded each time. This procedure is repeated until the removal of a parameter results in an unacceptable increase of the prediction error [12]. As mentioned earlier, the discrimination of one country versus the two other countries is considered here, which is reasonable since the aim of the project is to verify the origin of a given wine sample. If a wine is claimed, e.g., to be Czech, it is important to discriminate between Czech samples on the one hand and Hungarian plus Romanian samples on the other hand. For practical reasons, only the most parsimonious PLS and PLS-UVE models are discussed; their performances are summarized in Tables 5 and 6.
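The one-country-against-the-others coding and the final manual selection step can be sketched as below. This is a simplified illustration under several assumptions that are not taken from the text: a single-factor PLS regression stands in for the multi-factor models, "smallest regression coefficient" is read as smallest in absolute value, the variables are taken to be on comparable scales, the stopping tolerance is arbitrary, and all names and data are hypothetical.

```python
import math


def one_vs_rest(countries, target):
    """Code the claimed country as +1 and the other countries as -1."""
    return [1.0 if c == target else -1.0 for c in countries]


def fit_pls1(X, y):
    """Single-factor PLS regression; returns (x_means, y_mean, coefficients)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]                      # weight vector of the factor
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]  # sample scores
    q = sum(a * b for a, b in zip(t, yc)) / (sum(v * v for v in t) or 1.0)
    return x_means, y_mean, [q * v for v in w]


def predict_sign(model, row):
    """Return +1 or -1 according to the sign of the predicted response."""
    x_means, y_mean, coef = model
    score = y_mean + sum((a - m) * b for a, m, b in zip(row, x_means, coef))
    return 1.0 if score >= 0 else -1.0


def loo_error(X, y, cols):
    """Leave-one-out error rate (%) using only the columns listed in cols."""
    sub = [[row[j] for j in cols] for row in X]
    wrong = 0
    for i in range(len(y)):
        model = fit_pls1(sub[:i] + sub[i + 1:], y[:i] + y[i + 1:])
        wrong += predict_sign(model, sub[i]) != y[i]
    return 100.0 * wrong / len(y)


def backward_select(X, y, names, tolerance=0.0):
    """Discard the smallest-|coefficient| variable while the LOO error stays acceptable."""
    cols = list(range(len(names)))
    baseline = loo_error(X, y, cols)
    while len(cols) > 1:
        sub = [[row[j] for j in cols] for row in X]
        coef = fit_pls1(sub, y)[2]
        weakest = min(range(len(cols)), key=lambda k: abs(coef[k]))
        trial = cols[:weakest] + cols[weakest + 1:]
        if loo_error(X, y, trial) > baseline + tolerance:
            break  # removal would raise the prediction error too much
        cols = trial
    return [names[j] for j in cols]


# Hypothetical data: one informative variable and two uninformative ones
names = ["V", "noise1", "noise2"]
countries = ["Cz", "Cz", "Cz", "Cz", "Hu", "Hu", "Ro", "Ro"]
X = [[5.0, 0.1, 0.3], [5.2, -0.2, 0.1], [4.8, 0.2, -0.1], [5.1, -0.1, -0.3],
     [1.0, 0.0, 0.2], [1.2, 0.1, -0.2], [0.9, -0.1, 0.1], [1.1, 0.0, -0.1]]
y = one_vs_rest(countries, "Cz")

selected = backward_select(X, y, names)
print(selected)
```

In this toy case the two uninformative variables are discarded and only the informative one is retained. With real data the tolerance decides how much prediction error is traded for a smaller set of parameters.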

Authentic white wines To build discriminating PLS

models 208 training samples are available while 107 rep-

resentative samples are used to test the predictive ability of

the developed models.

Hungary versus {Romania + Czech Republic}

A first PLS model is built using all 63 parameters. This model

uses three factors and the re-substitution and prediction

(CV-LOO and CV-test) rates are 96.7, 92.3 and 91.6%,

respectively. PLS-UVE allows decreasing the number of

variables to 27. A final model is built with the 13 most

discriminating parameters. These are eight trace elements

(As, Fe, Ba, Rb, Li, Si, Cu, Mn), two classical parameters

(Tartaric Acid, Ethylacetate), two macro elements (Na, P)

and one isotopic ratio (Ethanol(D/H)2). As follows from

Table 5 this model which uses only 2 PLS factors has pre-

diction abilities which are very similar to those of the model

using all parameters. The number of variables in the model as

well as the performance summarized in Table 5 also indi-

cates that the discrimination of Hungarian wine samples

from Czech and Romanian wines, compared to the other

discriminations, is the most difficult one.

Romania versus {Hungary + Czech Republic}

The PLS model using all 63 parameters requires 7 factors

and the re-substitution and prediction (CV-LOO and

CV-test) rates are 100, 96.2 and 95.4%. The PLS-UVE

approach retains 16 variables but a further reduction is

possible since finally a 2-factor PLS-model using only 8

parameters is built with very similar performances (see

Table 5). The selected variables are Y, V, Fe, Si, Pb, Cu,

Ethylacetate, and As.

The selection of Y and V as most discriminating is

logical when looking at the boxplots (not shown) since

their concentration in the Romanian samples is consider-

ably higher than in the Hungarian and Czech samples.

Czech Republic versus {Hungary + Romania}

A final 4-factor PLS-model, built with only six parameters

(U, V, Cd, Pb, Mn, and Er) has the same performance

characteristics as the 4-factor model using the whole set of

parameters. The re-substitution, prediction (CV-LOO and

CV-test) rates for the latter are 98.6, 95.9 and 94.4% which

compares very well with the figures for the final model in

Table 5. The importance of U in the discrimination is

expected from the boxplots (part II, Fig. 8) and the CART

tree built on the authentic white samples.

Authentic red wines The PLS models were developed on

the 87 training samples available and they were tested on

43 representative samples.

Hungary versus {Romania + Czech Republic}

The PLS model built with all parameters uses three PLS

factors and the re-substitution and prediction rates (CV-

LOO and CV-test) are 100, 96.4 and 93.1%, respectively,

Table 5 Performance (% re-substitution and % prediction) of the final PLS models for the discrimination of Hungarian, Czech and Romanian authentic wines

Comparison | Number of variables in the model | Re-substitution | Prediction (CV-LOO) | Prediction (CV-test)
Authentic white: Hun. versus (Rom. + Cze.) | 13 | 94.7 | 91.9 | 88.8
Authentic white: Rom. versus (Hun. + Cze.) | 8 | 96.6 | 96.1 | 96.4
Authentic white: Cze. versus (Hun. + Rom.) | 6 | 98.1 | 96.2 | 95.3
Authentic red: Hun. versus (Rom. + Cze.) | 9 | 98.9 | 97.1 | 95.4
Authentic red: Rom. versus (Hun. + Cze.) | 5 | 100 | 99.4 | 97.7
Authentic red: Cze. versus (Hun. + Rom.) | 8 | 100 | 98.7 | 97.7


which is very similar to the performance of the final 3-factor model using nine parameters (Cd, S, Methanol, Cr, La, Ethanol(D/H)2, V, 2-Methylbutan-1-ol, and Mn) in Table 5. V and Ethanol(D/H)2 are also found in the CART tree built for the authentic red wines. The importance of Cd and S is not expected from the univariate or PCA analysis (part II, Fig. 6). As explained earlier [4], the selection of unexpected variables might be due to the fact that the PLS factors try to find a compromise between the discrimination of Hungary and the Czech Republic on the one hand and the discrimination of Hungary and Romania on the other hand.

Romania versus {Hungary + Czech Republic}

Excellent performance (re-substitution and prediction (CV-LOO and CV-test) rates of 100, 98.9 and 100%, respectively) is obtained with the 2-factor model using all 63 parameters. As follows from Table 5, the performance hardly changes with the final model based on only five parameters (V, La, Er/La, Cl and Cr). The boxplots for V and La (part II, Fig. 8) and Er/La (not shown) confirm the high discriminating power of these variables, and the selection of V is also expected from the CART tree for authentic red wines.

Czech Republic versus {Hungary + Romania}

The final 2-factor PLS model using only eight parameters

(Cd, Ethanol(D/H)2, S, 2-Methylbutan-1-ol, Pb, V, U, and

Mn) performs very well (see Table 5) although the selec-

tion of Cd as most discriminating variable is unexpected.

Commercial white wines The models were built using 203

training samples and tested on 103 representative samples.

Hungary versus {Romania + Czech Republic}

As for the authentic white wines, it follows from Table 6 that this discrimination is the most difficult one. Indeed, the final 2-factor PLS model using eight parameters has prediction (CV-LOO and CV-test) rates of 90.1 and 86.4%, respectively, which is not satisfactory. The parameters selected are Sr, Ca, Er/La, Ethylacetate, Rb, Wine δ18O, Ti, and Ethanol(D/H)2. Sr, Ti, and Wine δ18O are also retained in the CART tree. The importance of Er/La can be explained by its correlation with Yb/La.

Increasing the number of variables hardly improves the performance. Indeed, the prediction rates obtained with PLS-UVE using 26 parameters are 90.8% (CV-LOO) and 88.4% (CV-test). Including all 63 parameters does not reduce the prediction errors.

Romania versus {Hungary + Czech Republic}

The PLS model using all 63 parameters requires three factors and has prediction (CV-LOO and CV-test) rates of 91.8 and 97.1%, respectively. The final 2-factor model has comparable performance (Table 6) but still retains 20 variables: Zn, Sr, Cd, Si, Li, Invert sugar, Cu, Ba, Cl, Er/La, Cr, Mg, K, V, Pb, Rb, As, Na-Excess, 2-Methyl-1-propanol, and Y. Several of the most discriminating variables (Zn, Sr, Li, Cl) are also found in the CART tree, and the importance of Er/La can again be explained by its correlation with Yb/La. The importance of several others is logical when looking at the boxplots (not shown). The concentrations of Cd and Invert sugar, e.g., are clearly higher in the Romanian samples.

Czech Republic versus {Hungary + Romania}

The final 2-factor PLS model is built with 12 parameters (Putrescine, Cl, Zn, Cd, Ca, Ti, Wine δ18O, Cr, Gd/La, Cu, Ethanol(D/H)2, and Rb). As follows from Table 6, it performs at least as well as the 4-factor model using all 63 parameters (prediction (CV-LOO and CV-test) rates of 91.4 and 93.2%, respectively). The importance of Putrescine, Cd, Ca, and Cr is not expected from the CART tree

Table 6 Performance (% re-substitution and % prediction) of the final PLS models for the discrimination of Hungarian, Czech and Romanian commercial wines

Comparison | Number of variables in the model | Re-substitution | Prediction (CV-LOO) | Prediction (CV-test)
Commercial white: Hun. versus (Rom. + Cze.) | 8 | 82.1 | 90.1 | 86.4
Commercial white: Rom. versus (Hun. + Cze.) | 20 | 96.6 | 93.6 | 98.1
Commercial white: Cze. versus (Hun. + Rom.) | 12 | 94.1 | 93.1 | 93.1
Commercial red: Hun. versus (Rom. + Cze.) | 11 | 94.6 | 91.8 | 91.3
Commercial red: Rom. versus (Hun. + Cze.) | 10 | 100 | 98.4 | 100
Commercial red: Cze. versus (Hun. + Rom.) | 10 | 97.9 | 95.9 | 87.0


but the boxplots (not shown) indicate that they indeed have

some discriminating power.

Commercial red wines The models were developed on

the 92 training samples available and tested on 46 repre-

sentative samples.

Hungary versus {Romania + Czech Republic}

The prediction error of the 2-factor PLS-UVE model using 21 parameters is relatively high (8%). A further reduction of the number of variables to 11 (P, Si, Sr, Rb, Ethanol(D/H)2, Ca, B, Wine δ18O, Gluconic acid, Tartaric acid, and U) hardly influences the performance (see Table 6). P and B are the only two variables also retained in the CART tree built for the commercial red wines. The boxplots (not shown) indicate that several others in the final PLS model indeed have at least some discriminating power.

Romania versus {Hungary + Czech Republic}

As for the authentic red wines, excellent performance

(prediction (CV-LOO and CV-test) rates are 98.4 and

100%, respectively) is obtained with the 2-factor final PLS

model using only ten parameters (B, K, Sr, Tartaric acid,

Shikimic acid, Ethanolamine, Ba, Methylpropanol, Invert

sugar, and V). The most discriminating variable, B, is also

retained in the CART tree to separate Romania from the

two other Eastern European countries. The discriminating

power of several others in the final PLS model is logical

when looking at the boxplots (not shown): concentrations

of Sr, Shikimic acid, and Invert sugar are highest in the

Romanian samples while the concentrations of Tartaric

acid and Ethanolamine are lowest in these samples. The

boxplot of V (part II, Fig. 8) shows that this parameter

mainly separates Romanian (and Hungarian) wines from

the Czech ones.

Czech Republic versus {Hungary + Romania}

The PLS model built with all parameters uses three factors

and the prediction (CV-LOO and CV-test) rates are 93.2

and 95.7%, respectively. There is an important decrease of

the prediction (CV-test) rate when the number of variables

is reduced to ten (see Table 6). The variables retained in

this final model are P, Ethanol(D/H)2, Fe, Ba, Zn, Cu, Ti,

Original Malic acid, Invert sugar, and Al. P, which is most

discriminating, is also retained by CART. The selection of

most of the other parameters seems logical when consid-

ering the boxplots (not shown). They show that especially

Ethanol(D/H)2, Fe, Ba, Zn, Ti, and Invert sugar have some

discriminating power.

Conclusions

PLS modelling confirms that it is easier to discriminate the

authentic samples than the commercial ones. Although the

PLS models built on the 63 parameters have the best per-

formances, it is interesting to notice that models developed

with a smaller set of parameters have satisfying predictive

abilities.

The discrimination of white samples also seems more difficult than the discrimination of red samples. Indeed, the percentage of correct re-substitution and prediction is usually lower for white samples than for red ones. However, far fewer red samples are available, which may also contribute to this difference.

Romanian wines are most easily separated from the

other Eastern European wines whether authentic, com-

mercial, red or white wines are considered. This is, how-

ever, most pronounced for the commercial wines.

Concerning the importance of the variables for the dis-

crimination, there are some differences between authentic

and commercial samples, though trace elements as well as

isotopic ratios and rare earth ratios have a lot of discrimi-

nating power. The biogenic amines Putrescine and Etha-

nolamine are useful for the discrimination of commercial

samples. Finally, Ethylacetate and Tartaric acid are the

most useful classical parameters to classify authentic white

samples while Invert sugar is the most useful classical

parameter to discriminate commercial, white as well as red,

samples.

General conclusions

Considering the results for the different methods (Table 7)

the following conclusions can be drawn:

1. With only one variable, namely the isotopic ratio

Ethanol(D/H)1 (or Ethanol(D/H)2 which leads to a

similar discrimination as Ethanol(D/H)1), South Afri-

can white as well as red wines can be perfectly

separated from East European wines by CART.

Therefore, we focused on the discrimination of the

East European wines for which the results of CART

are less good. Indeed, only 65–75% of the commercial

European wines are correctly classified (see Table 7),

but it should be taken into account that these simple

CART models are based on a very reduced set of only

2–8 variables.

2. RDA and PLS-DA perform equally well. In general, RDA and PLS-DA models that include the following variables in different combinations have a high discriminating power and achieve satisfactory predictive abilities:

• Isotopic ratio: Ethanol(D/H)2

• Rare earth elements and ratios: La or Er or Yb/La

or Er/La, Y

• Trace elements: V, U, Cr, Ti, Cd, Pb, Rb, Mn, Fe,

Zn, Cu, Sr, As, B, Ba

• Macro elements: Si, P, S, Cl, Ca, Mg, Na, K

• Classical parameters: Tartaric acid, Shikimic acid, Ethylacetate, 2- or 3-Methylbutan-1-ol or 2-Methyl-1-propanol, Methanol, Invert sugar

• Biogenic amines: Putrescine, Ethanolamine.

The application of multivariate methods such as RDA

and PLS-DA leads to discriminating models, which

allow a correct classification of the wines from the

East European countries with rates between 88 and

100%. Of course, the number of variables given above

can be reduced if there is prior information about the

type and the colour of the wines.

The relationship between the geographic origin and the first four groups of parameters could be expected. In addition to these parameters, classical parameters and biogenic amines, which are influenced by the variety of the wine and the winemaking process, also play an important role in discriminating the wines of the three East European countries by origin.

3. Considering the authentic and commercial wines sep-

arately, there are some differences, e.g., rare earth

elements and the trace elements V and Mn are much

more important as discriminating variables in the

group of authentic wines. On the other hand, Titanium

and Invert sugar play an important role for differenti-

ating commercial wines.

4. Differences can also be found between the groups of white and red wines. While Lithium and Sodium (or Sodium-Excess) are important variables for separating white wines, most models for red wines include variables such as Methanol, 2- or 3-Methylbutan-1-ol or 2-Methyl-1-propanol, and Shikimic acid.

5. In RDA as well as in PLS-DA the following variables

are important in the four subgroups of wines:

• Authentic white wines: V, La or Er or Yb/La, Cd, Si, U, Tartaric acid, Li, Y, Ethanol(D/H)2, and Mn

• Authentic red wines: V, Yb/La or Er/La or La, Ethanol(D/H)2, Methanol, Cl, Cr, Pb, 2- or 3-Methylbutan-1-ol, S and Cd

Table 7 Comparison of CART, RDA, and PLS for the discrimination of wines from the three East European countries (Hu, Cz, and Ro)

Wine type | CART: variables* in models | CART: prediction rates (%) | RDA: variables* in models (a) | RDA: prediction rates (%) | PLS: variables* in models (b) | PLS: prediction rates (%)
Authentic white | U, Yb/La | 78–79 | V, La, Cd, Si, Al, U, Tartaric acid, Li, Y, Na, Ethanol(D/H)2, Wine δ18O, Ca, Mg, K, Mn, Fe, Ni, Sr, Yb/La | 95–98 | As, Fe, Ba, Rb, Li, Si, Cu, Mn, Tartaric acid, Ethylacetate, Na, P, Ethanol(D/H)2, Y, V, Pb, U, Cd, Er | 88–96
Authentic red | V, Ethanol(D/H)1 (Ethanol(D/H)2) | 84–86 | V, Yb/La, Ethanol(D/H)2, Methanol, Cl, Cr, Pb, Shikimic acid, L-Lactic acid, 3-Methylbutan-1-ol, S, Ca, Putrescine, Ethylacetate, Zn, As, Rb, Cd, La | 88–94 | Cd, S, Methanol, Cr, La, Ethanol(D/H)2, V, 2-Methylbutan-1-ol, Mn, Er/La, Cl, Pb, U | 95–99
Commercial white | Ti, Cl, Sr, Li, Wine δ18O, Zn, Gd/La, Na-Excess | 73–75 | Cl, Li, Zn, Rb, Ca, Invert sugar, Ti, Putrescine, Mg, Er/La, Cu, Cd, Ethanol(D/H)2, Ethanolamine, Br, Ni, Shikimic acid, Na, Pb, Co, 2-Methylbutan-1-ol, Tartaric acid, P, Ba, Wine δ18O | 86–91 | Sr, Ca, Er/La, Ethylacetate, Rb, Wine δ18O, Ti, Ethanol(D/H)2, Zn, Cd, Si, Li, Invert sugar, Cu, Ba, Cl, Cr, Mg, K, V, Pb, As, Na-Excess, 2-Methyl-1-propanol, Y, Putrescine | 86–98
Commercial red | B, P | 65–69 | K, B, Ti, Sr, Tartaric acid, Ethanol(D/H)2, La, Fe, Zn, Shikimic acid, 1-Propanol, Er/La, V, P, Si, Ba, 2-Methyl-1-propanol, Invert sugar | 89–95 | P, Si, Sr, Rb, Ethanol(D/H)2, Ca, B, Wine δ18O, Gluconic acid, Tartaric acid, U, K, Shikimic acid, Ethanolamine, Ba, 2-Methyl-1-propanol, Invert sugar, V, Fe, Zn, Cu, Ti, Malic acid, Al | 87–100

* Variables with a high discriminating power in all three methods are highlighted in bold and those selected as important by RDA and PLS are in bold and italic
(a) Variables in the different efficient RDA model variants
(b) For each wine type, variables in the three “one country against the other countries” comparisons


• Commercial white wines: Cl, Li, Zn, Rb, Ca, Invert sugar, Ti, Putrescine, Mg, Er/La or Gd/La, Cu, Cd, Ethanol(D/H)2, Pb, 2-Methylbutan-1-ol, Ba, and Wine δ18O

• Commercial red wines: K, B, Ti, Sr, Ethanol(D/H)2, Fe, Zn, Shikimic acid, V, P, 2-Methylbutan-1-ol, and Invert sugar.

The variables highlighted in bold were also selected with CART.

6. Comparing RDA and PLS-DA models with similar performance rates, the RDA-models contain fewer variables than the PLS-DA models, because PLS-DA allows only two-group comparisons, e.g., the comparison of each of the three East European countries with the two others. The following minimal numbers of chemical parameters are therefore necessary to differentiate the three countries:

• Authentic white wines: RDA: 10–14, PLS-DA: 19

• Authentic red wines: RDA: 6–9, PLS-DA: 13

• Commercial white wines: RDA: 17–22, PLS-DA: 26

• Commercial red wines: RDA: 8–10, PLS-DA: 24.

7. In all cases the differentiation of the commercial wines

from Hungary, Czech Republic, and Romania requires

more parameters than the authentic wines and white

wines more than red wines. These results can be taken

as a basis for analysing further problems of traceability

of wines from different countries.
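Conclusion 6 attributes the larger PLS-DA variable counts to the one-country-against-the-others design: to differentiate all three countries, the variables of the three binary models have to be pooled. The sketch below takes the final PLS model variables listed in the PLS-DA section for three of the wine types and counts their union. The dictionary layout and the names `final_models` and `pooled_counts` are illustrative only.

```python
# Final PLS model variables per binary comparison, copied from the PLS-DA section
final_models = {
    "authentic white": [
        {"As", "Fe", "Ba", "Rb", "Li", "Si", "Cu", "Mn", "Tartaric acid",
         "Ethylacetate", "Na", "P", "Ethanol(D/H)2"},           # Hungary vs rest
        {"Y", "V", "Fe", "Si", "Pb", "Cu", "Ethylacetate", "As"},  # Romania vs rest
        {"U", "V", "Cd", "Pb", "Mn", "Er"},                      # Czech Republic vs rest
    ],
    "authentic red": [
        {"Cd", "S", "Methanol", "Cr", "La", "Ethanol(D/H)2", "V",
         "2-Methylbutan-1-ol", "Mn"},
        {"V", "La", "Er/La", "Cl", "Cr"},
        {"Cd", "Ethanol(D/H)2", "S", "2-Methylbutan-1-ol", "Pb", "V", "U", "Mn"},
    ],
    "commercial red": [
        {"P", "Si", "Sr", "Rb", "Ethanol(D/H)2", "Ca", "B", "Wine delta18O",
         "Gluconic acid", "Tartaric acid", "U"},
        {"B", "K", "Sr", "Tartaric acid", "Shikimic acid", "Ethanolamine", "Ba",
         "2-Methyl-1-propanol", "Invert sugar", "V"},
        {"P", "Ethanol(D/H)2", "Fe", "Ba", "Zn", "Cu", "Ti", "Malic acid",
         "Invert sugar", "Al"},
    ],
}

# Number of distinct variables needed when all three binary models are applied
pooled_counts = {
    wine_type: len(set().union(*models))
    for wine_type, models in final_models.items()
}
print(pooled_counts)
```

For these three wine types the pooled counts agree with the PLS-DA numbers given in conclusion 6 (19, 13 and 24), which shows that those figures are unions over the three binary models rather than the size of any single model.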

Acknowledgments The authors acknowledge the financial support of the European Commission for this work, which was carried out in the framework of the specific research and technological development program “Competitive and Sustainable Growth” (Contract G6RD-CT-2001-00676). The authors are solely responsible for the content of this research article and the European Community is not responsible for any use that might be made of the data appearing therein. This article is dedicated to Professor D.L. Massart, who initially led the chemometrics/statistics group in the project. Unfortunately, he passed away during the project on 26 December 2005. Finally, the authors would like to thank all the partners of the European Wine DB project and especially those who collected the different wine samples, performed the micro-vinification of the authentic samples and carried out the analytical measurements of all the parameters necessary for this study.

References

1. Schlesier K, Fauhl-Hassek C, Forina M, Cotea V, Kocsi E, Schoula R, van Jaarsveld F, Wittkowski R (2009) Characterization and determination of the geographical origin of wines. Part I: overview. Eur Food Res Technol (submitted)

2. Smeyers-Verbeke J, Jäger H, Lanteri S, Brereton P, Jamin E, Fauhl-Hassek C, Forina M, Römisch U (2009) Characterization and determination of the geographical origin of wines. Part II: descriptive and inductive univariate statistics. Eur Food Res Technol (submitted)

3. Römisch U, Vandev D, Zur K (2006) Application of interactive regularized discriminant analysis to wine data. Aust J Stat 35(1):45–55

4. Capron X, Smeyers-Verbeke J, Massart DL (2007) Multivariate determination of the geographical origin of wines from four different countries. Food Chem 101:1608–1620

5. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Pacific Grove

6. Vandeginste BGM, Massart DL, Buydens LMC, de Jong S, Lewi PJ, Smeyers-Verbeke J (1998) Handbook of chemometrics and qualimetrics. Part B. Elsevier, Amsterdam, p 238

7. McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition. Wiley, New York

8. Friedman JH (1989) Regularized discriminant analysis. J Am Stat Assoc 84:165–175

9. Vandev D (2004) Interactive stepwise discriminant analysis in MATLAB. Pliska Stud Math Bulg 16:291–298

10. Geladi P, Kowalski BR (1986) Partial least-squares regression: a tutorial. Anal Chim Acta 185:1–17

11. Centner V, Massart DL, de Noord OE, de Jong S, Vandeginste BM, Sterna C (1996) Elimination of uninformative variables for multivariate calibration. Anal Chem 68:3851–3858

12. Snee RD (1977) Validation of regression models: method and examples. Technometrics 19:415–428
