ORIGINAL PAPER
Characterization and determination of the geographical origin of wines. Part III: multivariate discrimination and classification methods
Ute Römisch • Henry Jäger • Xavier Capron •
Silvia Lanteri • Michele Forina •
Johanna Smeyers-Verbeke
Received: 13 February 2009 / Revised: 9 July 2009 / Accepted: 16 August 2009 / Published online: 2 October 2009
© Springer-Verlag 2009
Abstract The aim of the European wine project was to
test the possibility of determining the country of origin of
wines based on their chemical composition. The results of
descriptive and inductive univariate methods of data anal-
ysis are discussed in part II of this series of papers. Here the
results of some selected multivariate methods of discrimi-
nation and classification such as classification and regres-
sion trees (CART), regularized discriminant analysis
(RDA), and partial least squares discriminant analysis
(PLS-DA) are compared and discussed. Special attention is
paid to the development of models that are efficient both in
terms of predictive performance and number of required
variables. Using CART, South African wines could be
separated very easily from those of the East European
countries by only one isotopic parameter, but CART gives
poorer results for the discrimination among the East European
wines. The application of RDA and PLS-DA, and of its
uninformative variable elimination variant (PLS-UVE), leads
to discriminant models that allow a correct classification
of the wines from the East European countries with rates
between 88 and 100%. Comparing RDA and PLS-DA,
RDA-models contain somewhat fewer variables than PLS-
DA-models, because PLS-DA is constrained to two-group
comparisons (‘‘one country against the other countries’’).
Keywords Wine discrimination · Classification and regression trees (CART) · Regularized discriminant analysis (RDA) · Partial least squares-uninformative variable elimination (PLS-UVE)
Introduction
The identification of the geographical origin of wines on the
basis of a minimal number of the most important chemical-
analytical parameters was the main aim of the European
project ‘‘Establishing of a wine data bank for analytical
parameters from Third Countries’’ (G6RD-CT-2001-
00646-WINE-DB). A wine data base containing about 600
authentic and 600 commercial white and red wines from
four countries was created over a period of 3 years during
2001–2004. Sixty-three chemical parameters were consid-
ered for each of those samples. An introduction to this
project can be found in part I [1] of this series of papers.
The statistical data analysis involved methods of univariate
descriptive and explorative data analysis as well as
multivariate methods. While the results of applying univariate
methods to the wine data were analyzed in part II [2], this
paper mainly discusses the results of the multivariate
discrimination and classification methods used.
Project participants R. Wittkowski, BfR, Germany; C. Fauhl-Hassek, BfR, Germany; K. Schlesier, BfR, Germany; P. Brereton, CSL, United Kingdom; M. Baxter, CSL, United Kingdom; E. Jamin, Eurofins, France; X. Capron, VUB, Belgium; J. Smeyers-Verbeke, VUB, Belgium; C. Guillou, JRC, Italy; M. Forina, UGOA, Italy; U. Römisch, TU Berlin, Germany; V. Cotea, UIASI.VPWT.LO, Romania; E. Kocsi, NIWQ, Hungary; R. Schoula, CTL, Czech Republic; F. van Jaarsveld, ARC Infruitec-Nietvoorbij, South Africa; Jan Booysen, Winetech, South Africa.
U. Römisch (✉) · H. Jäger
Technische Universität Berlin, Fak. III,
Gustav-Meyer-Allee 25, 13355 Berlin, Germany
e-mail: [email protected]
X. Capron · J. Smeyers-Verbeke
Vrije Universiteit Brussel, Farmaceutisch Instituut,
Laarbeeklaan 103, 1090 Brussels, Belgium
S. Lanteri · M. Forina
Dipartimento di Chimica e Tecnologie Farmaceutiche ed
Alimentari, Via Brigata Salerno 13, 16147 Genoa, Italy
Eur Food Res Technol (2009) 230:31–45
DOI 10.1007/s00217-009-1141-x
In this paper, we consider three methods of discrimi-
nation and classification of multivariate data: the classifi-
cation and regression trees (CART), the regularized
discriminant analysis (RDA) including the linear (LDA)
and quadratic (QDA) case, and the partial least squares-
discriminant analysis (PLS-DA) with its uninformative
variables elimination (PLS-UVE).
At the beginning our attention was focussed on discrim-
inating the four countries in order to confirm our expectation
from the univariate evaluation, PCA [2], and from the
analyses of single years [3, 4], that authentic as well as
commercial South African wines seem to be discriminated
very easily from those from East European countries. The
application of CART and RDA on the wine data of the three
East European countries has confirmed that the discrimina-
tion of wines between Hungary, Czech Republic, and
Romania was much more difficult because of their geo-
graphical location. Finally PLS-classification methods were
used to build ‘‘one versus all other’’ discriminant models.
Data
The data set consists of wine samples from four different
countries: Hungary, Czech Republic, Romania, and South
Africa. For each country authentic and commercial wine
samples were collected and analyzed over a period of
3 years. The sampling strategy, the analytical methods used
and the data, including sample sizes for the different
countries and types of wines, are described in part I [1] and
part II [2] of this series of papers.
Description of multivariate statistical methods
Classification and regression trees
Classification and regression trees (CART) is a well-known
method in statistics which is thoroughly documented in the
literature [5] and hence the principle of this method is only
briefly illustrated with the help of Fig. 1. In this situation
three groups (circles, squares, triangles) described by two
variables (x1 and x2) must be discriminated. CART is a
partition method whose goal is to find critical values for x1
and x2 so that the combination of simple binary splits will
form a decision tree. This tree will then be used to deter-
mine the class membership of a new incoming sample.
Looking at Fig. 1a, circles are clearly characterized by
a value of x1 higher than 0.6 and hence their discrimi-
nation is straightforward. On the other hand, squares and
triangles both have their x1 value below 0.6 and hence
cannot be separated yet. However, triangles are easily
distinguished from squares with the help of the second
variable since they are characterized by a value of x2
which is lower than 0.4. Therefore, the model finally
obtained by CART (Fig. 1b) is a decision tree of which
the first split compares the value of x1 to 0.6 and the
second split checks whether the value of x2 is lower or
higher than 0.4. Following this tree, it is possible to
perfectly discriminate the three classes present in the data.
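The two splits of this toy example can be written down directly. The following is a minimal sketch in Python (not the Matlab program used in the study); the function name `classify` and the handling of values lying exactly on a split are assumptions made for illustration.

```python
def classify(x1: float, x2: float) -> str:
    """Follow the two binary splits of the toy tree in Fig. 1b."""
    # First split (X1 < 0.6): samples on the other side are circles.
    if x1 < 0.6:
        # Second split (X2 < 0.4): triangles lie below, squares above.
        if x2 < 0.4:
            return "triangle"
        return "square"
    return "circle"


# Three hypothetical samples, one per region of Fig. 1a.
samples = [(0.8, 0.5), (0.3, 0.2), (0.3, 0.7)]
labels = [classify(x1, x2) for x1, x2 in samples]
print(labels)
```

Only two measured values and two critical values are needed to assign a new sample, which is the interpretability argument made for CART.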
In practice, the construction of a CART model is per-
formed in three steps of which the first consists in
building a tree which will perfectly describe the training
data. However, the predictive ability of the model might
be poor due to over-fitting and hence the next step is
to ‘‘prune’’, i.e., to successively cut, the last branches of
this over-large tree. It is then necessary to determine
which of the smaller trees obtained after pruning is
optimal in terms of predictive power. This last step is
usually achieved by tenfold cross-validation, a re-sam-
pling technique which uses the training data for valida-
tion and therefore does not require additional independent
samples [6]. The most attractive feature of CART cer-
tainly is the simplicity of the model obtained, which
makes interpretation straightforward. Indeed, only few
original variables and their corresponding critical values
have to be known to determine the class membership of a
new sample.
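The re-sampling idea behind tenfold cross-validation can be sketched as follows. This is an illustrative Python outline, not the authors' implementation; the helper `k_fold_indices`, the random shuffling and the seed are assumptions. Each sample is held out exactly once, and the pruned tree with the best held-out performance would then be retained.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return k (train, held_out) index pairs covering all samples once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, held_out in enumerate(folds):
        # The training part is everything outside the current fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


# 281 is the number of authentic white training samples reported in the text.
splits = k_fold_indices(281, k=10)
print([len(held_out) for _, held_out in splits])
```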
Fig. 1 CART a the initial space is divided into more pure subspaces (axes x1 and x2, split values 0.6 and 0.4); b the corresponding decision tree (splits X1 < 0.6 and X2 < 0.4)
All computations for CART were performed with a
Matlab program, written in Matlab 6.5 (The MathWorks,
Natick, MA) on a computer running Microsoft Windows
XP.
Regularized discriminant analysis
The discriminant analysis [6, 7] is used to analyze differ-
ences of two or more groups (or classes) with respect to a
set of variables measured on the objects of these groups.
The influence of those independent variables on the groups
is to be investigated. Discriminant functions, which contain
significant variables, are estimated and objects will be
classified on the basis of the estimated discriminant model.
‘‘Good’’ discriminant models contain the most important
variables for explaining differences between the groups
with minimal misclassification rates.
Under the assumption of a Gaussian distribution of the
p-dimensional feature vector $X_k$ in the kth group,
$X_k \sim N(\mu_k, \Sigma_k)$ ($k = 1, \ldots, K$), where $\mu_k$ denotes the group means
and $\Sigma_k$ the group covariance matrices, quadratic discriminant
analysis (QDA) minimizes the misclassification rate
and separates the disjoint regions of the feature space
corresponding to each group assignment by quadratic
boundaries. Linear discriminant analysis (LDA) is used
under the hypothesis of identical group covariance matrices,
such that the rule that minimizes the misclassification
rate leads to a linear separation of the groups.
Regularized discriminant analysis (RDA) [8] was
introduced as a compromise between linear and quadratic
discriminant analyses for cases in which the number of
parameters to be estimated is comparable to or even larger
than the sample size. In the regularization step the estimated
group covariance matrix $\Sigma_k$ is stabilized by

$$\Sigma_k(\lambda) = \lambda \Sigma_k + (1 - \lambda) \Sigma$$

where $\Sigma$ is the pooled covariance estimate. The regularization
parameter $\lambda \in [0, 1]$ controls the degree of shrinkage of the
group covariance matrix estimates toward the pooled estimate.
The limiting cases correspond to LDA ($\lambda = 0$) and QDA
($\lambda = 1$). To determine the optimal value of $\lambda$, the estimated
error rate has to be minimized during the model building
process. Rates of misclassification are estimated by methods
such as re-substitution, cross-validation (leave-one-out), and
prediction from a test set (data are split into learning and
test set).
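The shrinkage of a group covariance estimate toward the pooled estimate can be illustrated numerically. The sketch below is plain Python with small hypothetical 2 × 2 matrices; the function name and the example values are assumptions, and it is not the ‘‘ldagui’’ program.

```python
def regularized_covariance(group_cov, pooled_cov, lam):
    """Blend a group covariance matrix with the pooled one, element-wise."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * g + (1.0 - lam) * p for g, p in zip(g_row, p_row)]
        for g_row, p_row in zip(group_cov, pooled_cov)
    ]


# Hypothetical covariance estimates for two variables.
group = [[4.0, 1.0], [1.0, 2.0]]
pooled = [[2.0, 0.0], [0.0, 2.0]]

lda_like = regularized_covariance(group, pooled, 0.0)  # pooled estimate only
qda_like = regularized_covariance(group, pooled, 1.0)  # group estimate only
halfway = regularized_covariance(group, pooled, 0.5)
print(halfway)
```

Values of the regularization parameter between 0 and 1 interpolate between the two limiting cases, which is why a single parameter suffices to move between LDA-like and QDA-like behaviour.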
The Matlab program ‘‘ldagui’’ [9] was used. It allows models
to be built interactively, step by step, guided by a minimal
classification error (estimated by re-substitution,
leave-one-out cross-validation and simulation) and an
optimal choice of the regularization parameter $\lambda$.
The strategy of model building is described in more detail
in [3].
Partial least squares-discriminant analysis and partial
least squares-uninformative variable elimination
Partial least squares-discriminant analysis (PLS-DA) takes
advantage of the fact that any regression algorithm, e.g.,
PLS, can be applied to a discrimination problem once the
group (or class) membership of the samples is encoded as a
number. However, such an approach is limited to the
discrimination of two groups at a time [10], the first group
being encoded, e.g., as −1 and the second as +1. Moreover,
the number of models necessary to discriminate samples from
three or more groups depends on whether ‘‘one group versus
one group’’ or ‘‘one group versus all other groups’’ models
are constructed. In the latter case, fewer models have to be
built and hence only ‘‘one versus all’’ PLS-DA models are
considered in this study.
The principle of PLS consists in finding linear combi-
nations of the original variables which maximize the
covariance between X and y. Each of those factors explains
as much as possible of the linear relationship existing
between the independent variables, i.e., the chemical con-
tent of a wine sample in this case, and the dependent
variable y which here represents the country of origin of
the sample. As was the case with CART, over-fitting is
an issue which has to be taken into account, and the PLS
solution to this problem consists in incorporating only a
limited number of factors in the model, the optimal number
often being determined with the help of cross-validation
approaches [6]. Once the optimal complexity of the model
has been assessed, the vector of regression coefficients b
can be determined and the group membership of a new sample
can easily be predicted with the following relationship:

$$\hat{y}_{\mathrm{new}} = x_{\mathrm{new}}^{T} b$$

where $\hat{y}_{\mathrm{new}}$ (1 × 1) is the predicted y value for the new
sample and $x_{\mathrm{new}}$ (p × 1) is a vector containing the
measurements of the p original variables. The group membership
of the new sample is therefore straightforward to
determine since the decision depends only on the sign
of the predicted value: if $\hat{y}_{\mathrm{new}}$ is negative the sample
belongs to the group encoded as −1, and vice versa.
Contrary to CART, PLS uses a linear combination of variables
and hence PLS-DA models often perform better than
classification trees. However, this gain in performance is
obtained at the price of simplicity since interpretation of
PLS regression coefficients might be complex or misleading
in some circumstances.
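The prediction step amounts to one dot product followed by a sign check. The following Python sketch uses a hypothetical coefficient vector `b` and hypothetical measurements; any centering or scaling of the variables is assumed to have been applied beforehand, and assigning a value of exactly zero to the +1 group is an arbitrary choice made here.

```python
def predict_group(x_new, b):
    """Return the predicted y value and the group code (-1 or +1)."""
    if len(x_new) != len(b):
        raise ValueError("x_new and b must have the same length")
    # Predicted value: inner product of the measurements and coefficients.
    y_hat = sum(x * coef for x, coef in zip(x_new, b))
    # The sign of the prediction decides the group membership.
    return y_hat, (-1 if y_hat < 0 else 1)


b = [0.5, -1.0]  # hypothetical regression coefficients (p = 2)
print(predict_group([1.0, 2.0], b))
print(predict_group([2.0, 0.5], b))
```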
Moreover, no variable selection is performed during the
construction of a PLS model and hence prediction for a
new sample requires measuring all the parameters which
were used during calibration. This is a rather unattractive
feature since the analysis of 63 parameters to authenticate
every wine sample is economically unrealistic. Therefore, a
variable selection method known as partial least squares-
uninformative variable elimination (PLS-UVE) [4, 11] is
applied during this work. PLS-UVE actually identifies the
variables for which the PLS regression coefficients are low
and/or unstable, meaning that the information carried is
likely to be very low and therefore those parameters should
not be kept in the final model. The decision whether a
variable should be retained is made with the help of a
criterion calculated as the ratio between the mean value of
the variable regression coefficient and the standard devia-
tion of this regression coefficient, both parameters being
estimated with a re-sampling approach. The ratio deter-
mined in this way is then compared to a threshold value
which corresponds to the maximum value of the criterion
observed for some artificially generated variables which
are hence known to be uninformative. Therefore, a variable
of which the ratio is below the critical value is considered
to be uninformative and is thus discarded. Since the
number of variables retained by the PLS-UVE method is often
still large, an additional selection step is introduced, the
aim of which is to select a small set of variables that still
achieves a good discrimination [4]. The selection procedure is
iterative and removes the parameters with the lowest
regression coefficients.
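The elimination criterion can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the program used in the study: the re-sampled coefficients are hypothetical, the absolute value of the mean is used so that negative coefficients are treated like positive ones, and the function names are invented for this example.

```python
from statistics import mean, stdev


def reliability(coefs):
    """Ratio of the mean coefficient to its standard deviation.

    Assumes at least two re-sampled values that are not all identical.
    """
    return abs(mean(coefs)) / stdev(coefs)


def select_informative(real_coefs, noise_coefs):
    """Keep variables whose ratio exceeds the largest ratio of any noise variable."""
    threshold = max(reliability(c) for c in noise_coefs.values())
    return [name for name, c in real_coefs.items() if reliability(c) > threshold]


# Hypothetical coefficients from four re-sampling runs.
real = {"V": [0.9, 1.1, 1.0, 1.0], "Zn": [0.2, -0.3, 0.1, -0.1]}
noise = {"noise_1": [0.1, -0.1, 0.05, -0.05], "noise_2": [0.3, 0.1, -0.2, 0.0]}

kept = select_informative(real, noise)
print(kept)
```

In this toy data the coefficient of `V` is large and stable, whereas that of `Zn` changes sign between runs and falls below the threshold set by the artificial variables.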
All computations for PLS were performed with a Matlab
program, written in Matlab 6.5 (The MathWorks, Natick,
MA) on a computer running Microsoft Windows XP.
Error rate estimation
Discriminant models are built and error rates are evaluated
based on the re-substitution method as well as on cross-
validation (CV) either by the classical leave-one-out
method (CV-LOO) [6, 7] or by tenfold cross-validation
(CV-10) [6]. The re-substitution rate is the percentage of
samples used to build the model that are correctly classified.
For RDA, additionally, an error rate based on the simulation
of 6,000 wine samples for each country [3] is given.
The optimal model chosen using the re-substitution and
CV procedures is further validated using independent
samples. Indeed, the prediction error estimated by
re-substitution is in general too optimistic, and the
cross-validation estimate may be too optimistic as well. The
performance of the discrimination model should therefore also
be tested with independent samples. These can be obtained by
splitting the data into a training and a test set using the
duplex algorithm [6, 12]. The discrimination model is then built with
the training set and used to predict the classification of the
independent samples from the test set. The percentage of
samples from the test set that is correctly classified is called
the prediction rate. It is generally lower than the re-sub-
stitution rate and sometimes too pessimistic, because of the
reduced sample size of the training set, as a result of the
splitting of the data.
The composition of the training and the test set is given
in Table 1.
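Both the re-substitution rate and the prediction rate are percentages of correctly classified samples; they differ only in which samples are classified (those used to build the model, or the independent test set). A minimal Python sketch with hypothetical labels:

```python
def percent_correct(true_labels, predicted_labels):
    """Percentage of samples whose predicted label equals the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)


# Hypothetical test-set labels (HU, CZ, RO) and model predictions.
truth = ["HU", "CZ", "RO", "HU"]
predicted = ["HU", "CZ", "HU", "HU"]
print(percent_correct(truth, predicted))
```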
Application of multivariate methods to the wine data
Classification and regression trees
Because of the simplicity and interpretability of the model
obtained, CART is the first method applied. Authentic and
commercial samples are modeled separately. The same holds
for red and white wines, since the discrimination of samples
of different color is straightforward.
Table 1 Composition of the training and the test set

                     Training set                     Test set
                     Red wines  White wines  Total    Red wines  White wines  Total
Authentic samples
  Hungary               30         67          97        14         33          47
  Czech Republic        25         74          99        13         39          52
  Romania               32         67          99        16         35          51
  South Africa          27         73         100        12         37          49
  Σ                    114        281         395        55        144         199
Commercial samples
  Hungary               34         66         100        17         33          50
  Czech Republic        26         75         101        13         38          51
  Romania               32         62          94        16         32          48
  South Africa          26         73          99        13         38          51
  Σ                    118        276         394        59        141         200
Discrimination and classification of the four countries
(HU, CZ, RO, SA)
Authentic white wines The classification tree from Fig. 2
is built with the 281 available training samples, and 144
representative test samples are used to estimate the pre-
dictive ability of the model.
The optimal CART model retains three parameters out
of the 63 present in the database, namely the isotopic ratio
Ethanol(D/H)1, the rare earths ratio Yb/La and the content
of U. The most interesting result, which confirms earlier
observations [2] certainly is that South African samples are
easily discriminated from the Eastern European wines. The
authentication of South African white wines can almost
perfectly be done based on a single variable namely Eth-
anol(D/H)1. From the boxplot in part II, Fig. 8 it follows
that South African wines are indeed characterized by a
higher value of this parameter. The discrimination of
Hungarian wines from Czech and Romanian wines appears
to be more difficult.
The performance of this classification tree is summa-
rized in Table 2. The prediction errors (about 15%) are
mainly between Hungary and Czechia.
Authentic red wines A first classification tree (Fig. 3a) is
built with the 114 calibration samples available and 55
representative test samples are used to estimate the pre-
dictive ability of the model.
Similar to the discrimination of white wines, the CART
model selects only three variables [V, Ethanol(D/H)2 and
Ethanol(D/H)1] from the 63 variables, but South African
wines are not separated after the first split and hence V and
Ethanol(D/H)1 are required to authenticate South African
authentic red samples. However, when the content of
Vanadium is not taken into account during model con-
struction, the tree shown in Fig. 3b is obtained. It follows
that with the isotopic ratio Ethanol(D/H)1 it is possible to
discriminate South African wines from Eastern European
wines. Note that in the tree V is replaced by La, making the
discrimination of Romanian samples slightly more diffi-
cult, although the discrimination of Hungarian and Czech
wines is still responsible for most of the classification
errors observed in Table 2.
Commercial white wines A classification tree is built with
the 276 calibration samples available and the predictive power
of the model is assessed with the help of 141 representative
test samples. The performance of this tree is summarized in
Table 2. The CART model built is relatively complex since
eight variables are retained [Ethanol(D/H)1, Ti, Cl, Sr, U,
Wine δ18O, Zn, and the ratio Gd/La], and the prediction
errors (about 17%) are larger than for the authentic wines.
As was the case for authentic wines, South African
samples are easily separated using only the Ethanol(D/H)1
isotopic ratio. Notice that the splitting values of
Ethanol(D/H)1 for the authentic white wines and commercial
white wines are very similar, 103.5 and 103.3, respectively.
This confirms earlier conclusions based on an evaluation of
the samples collected during the first year of the project [2].
South African samples can be identified by comparison of
the Ethanol(D/H)1 value with a reference value chosen
equal to, for instance, 103. Therefore, for this discrimination
the authentic wines seem to be a good model for the
commercial wines.

Fig. 2 Classification tree developed with all parameters for the authentic white wines (splits: EtDH1 < 103.5, U < -1.76, Yb/La < -0.95). Bars in boxes stand for Hungarian, Romanian, Czech, and South African samples, respectively
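The reference-value rule for South African wines described above reduces to a single comparison. The Python sketch below assumes, in line with the statement that South African wines show higher Ethanol(D/H)1 values, that samples at or above the reference are assigned to South Africa; the function name and the treatment of a value exactly equal to the reference are assumptions.

```python
REFERENCE_DH1 = 103.0  # reference value suggested in the text


def looks_south_african(ethanol_dh1, reference=REFERENCE_DH1):
    """True if the Ethanol(D/H)1 value is at or above the reference value."""
    return ethanol_dh1 >= reference


# Hypothetical measurements for two wines.
print(looks_south_african(104.2))
print(looks_south_african(101.0))
```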
Commercial red wines The CART model is built with
118 calibration samples and it is tested by means of 59
representative test samples. This tree uses only three
parameters, namely Ethanol(D/H)1, B and P, but the
prediction ability of the model is relatively low (see Table 2).
However, South African wines are again perfectly classified
solely based on the analysis of the isotopic ratio
Ethanol(D/H)1. Here too the splitting value of this variable
(102.9) is comparable to that of the authentic red wines
(103.8). This points to the fact that also for the
discrimination of South African red wines authentic wines are a
good model for the commercial wines.

Table 2 Performance (% re-substitution and % prediction) of CART models for the discrimination of Hungarian, Czech, Romanian and South African wines

              Number of variables   Re-substitution   Prediction   Prediction
              in the model                            (CV-10)      (CV-test)
Authentic
  White              3                   89.0            83.3         84.8
  Red                3                   93.0            85.1         87.3
Commercial
  White              8                   89.9            77.9         82.9
  Red                3                   85.6            76.2         72.9

Fig. 3 CART for the authentic red wines a the classification tree built with the 63 parameters available (splits: V < 0.49, EtDH2 < 127.5, EtDH1 < 103.8); b the classification tree obtained when V is discarded (splits: EtDH1 < 103.8, La < -0.96, EtDH2 < 127.5). Bars in boxes stand for Hungarian, Romanian, Czech, and South African samples, respectively
Discrimination and classification of the European samples
(HU, CZ, RO)
Because the discrimination of South African samples is
straightforward, the estimated predictive ability of the
previous models is likely to be optimistic with respect to the
classification of Eastern European wines.
Therefore, additional CART models were constructed that
focus on the discrimination of Hungarian, Romanian and
Czech samples only. The predictive power of those models
(Table 3) is noticeably lower than that found previously,
which was expected since the classification errors observed
with the first set of trees are systematically caused by
misclassification of European wines.
Authentic white wines The obtained classification tree
uses two variables, U and Yb/La. The former is important
to discriminate the Czech samples from the others while
the latter mainly discriminates between Hungarian and
Romanian wines. This confirms earlier univariate obser-
vations. The boxplot of U (part II, Fig. 8) shows that Czech
samples indeed have lower U concentrations, and from the
Fisher weights (part II, Table 4) it follows that Yb/La is
one of the most important variables to separate Hungarian
and Romanian wines.
Authentic red wines Only two variables, V and Etha-
nol(D/H)2 are used in the obtained CART model. Vana-
dium discriminates between Romania and the two other
countries while the isotopic ratio mainly separates Hungary
from Czechia. This again confirms earlier observations
since from the boxplots (part II, Fig. 8) it follows that the
concentration of V in the Romanian samples is higher than
in the other Eastern European countries. Moreover, from
the Fisher weights (part II, Table 4) it follows that Etha-
nol(D/H)2 is important for the separation of Hungarian and
Czech samples.
Commercial white wines Eight variables (Ti, Cl, Sr, Li,
Wine δ18O, Zn, Gd/La, and Na-Excess) are used in the
model. The relatively low prediction rates (about 73%)
confirm earlier conclusions from, among others, PCA (part
II, Fig. 6), i.e., commercial samples show a larger overlap
than the authentic wines.
Commercial red wines Only two variables (B and P) are
used in the obtained CART model, but the prediction rates
(about 65%) are very low. They confirm the low discrim-
inating power of these parameters observed from the Fisher
weights (part II, Table 4).
Conclusions
From the CART analysis, the most obvious conclusion is
that South African samples are very easily classified by
means of Ethanol(D/H)1, whether authentic, commercial,
red or white wines are considered. Therefore, it is important
to focus on the discrimination of the three European
countries, for which the performance of the classification
trees is not satisfactory. Indeed, only 73% of the commercial
European wines can be correctly classified in the most
favorable case. The CART analysis also underlines that
classification of commercial samples is more difficult to
achieve than the classification of authentic samples. This
confirms earlier conclusions from PCA (see part II, Fig. 6).
Finally, the most important parameters for discriminating
authentic wines are V, Ethanol(D/H)1, Ethanol(D/H)2, U and
Yb/La, while Ethanol(D/H)1, B, P, Ti, Cl, Sr, U, Wine δ18O,
Zn and Gd/La are most important for the discrimination of the
commercial samples.
Regularized discriminant analysis
Following the conclusion of the CART analysis that
South African samples can be classified very easily by
means of Ethanol(D/H)1, it was decided to use the RDA
method for discriminating only Hungarian, Czech, and
Romanian wine samples. The aim of applying RDA is to find
‘‘good’’ discriminant models which contain the most
important variables for explaining differences between the
countries with minimal misclassification rates. The
interactive model-building process is described in [3]. The
performance of RDA models for the discrimination of the
three East European countries is summarized in Table 4.
Table 3 Performance (% re-substitution and % prediction) of CART models for the discrimination of Hungarian, Czech and Romanian wines

              Number of variables   Re-substitution   Prediction   Prediction
              in the model                            (CV-10)      (CV-test)
Authentic
  White              2                   86.1            77.9         79.4
  Red                2                   93.1            86.2         83.7
Commercial
  White              8                   88.7            75.4         72.8
  Red                2                   81.6            68.5         65.2
The discrimination of wines based on our preferred
models (M1), the preference being based on high-perfor-
mance rates and a minimal number of variables in the
model, is illustrated in Fig. 4.
Authentic white wines At first models were built using all
authentic white wines (N = 315) from the three East Euro-
pean countries. Five good RDA-models, including 10–15
different variables were built. The following variables were
important in the five models for discriminating the wines
from the three East European countries; the variables in bold
have shown a strong discriminating power in most (at least
four of the five) models:
M1: V, La, Cd, Si, Al, U, Tartaric acid, Li, Y, Na, Ethanol(D/H)2, Wine δ18O, Ca and Ni
M2: V, Cd, Si, K, Al, Mn, Zn, Sr, Y, U, Wine δ18O and Er/La
M3: Tartaric acid, Si, K, Li, Cr, Mn, Ni, Rb, Sr, Y, Cd, Ba, Pb, Ethanol(D/H)2 and Yb/La
M4: V, La, U, Tartaric acid, Mg, Ca, Li, Mn, Fe, As, Cs, and Wine δ18O
M5: V, Tartaric acid, Mg, Si, Mn, Fe, Cd, Ba, U and Yb/La.
The variables in the models are ordered according to
their discriminating power during the interactive model
building process. In most cases vanadium was the variable
with the largest discriminating power, followed by either
cadmium, lanthanum, tartaric acid, silicon or uranium.
For these models error rates between 0.3 and 1.6%
(re-substitution) and between 2.8 and 5.1% (CV-LOO and
simulation) were estimated. That is, with any of these models
at most 5.1% of the wines were not correctly assigned to the
country in which they were produced. The excellent
correct classification rates of our preferred RDA-model M1
($\lambda = 0.47$) with 14 variables were 99.7% (re-substitution),
95.9% (simulation, 6,000 samples per country), and 95.6%
(CV-LOO).
To achieve the same misclassification rates, the models
obtained with linear ($\lambda = 0$) and quadratic ($\lambda = 1$)
discriminant analysis would require many more
variables.
To estimate the error rate independently of the model,
the training (N = 208) and test sets (N = 107) of the three
East European countries from Table 1 were used. Models were
built again, now based on the training set, and correct
classification rates were estimated by re-substitution and
CV-LOO in the same way as for the whole data set. Afterwards,
the independent wine samples from the test set were classified
using the models of the training set (CV-test). In this case of
authentic white wines our preferred model M1 of all wine data
was also a good model for the training set ($\lambda = 0.47$), which
means that the selected variables are very stable in this
model. The performance rates (re-substitution: 100%,
simulation: 96.4%, CV-LOO: 96.6% and CV-test: 98.1%) compare
very well with those based on all data.
For the authentic red wines as well as for the commer-
cial white and red wines, discussed in the following sec-
tions, only the correct classification rates of our preferred
RDA-models M1 for the complete data set and for the
training set will be given. Here again the variables in bold
have shown a high discriminating power in different
models.
To compare the results of CART, RDA, and PLS-DA,
only performance rates for models built with the training
set are summarized in Table 4.
Authentic red wines Using all authentic red wines
(N = 130) the following good RDA-model with 8 vari-
ables could be estimated:
M1: V, Yb/La, Ethanol(D/H)2, Methanol, Cl, Cr, Pb and
Shikimic acid.
With this RDA-model ($\lambda = 0.9$) 100% of the wines
could be classified correctly by re-substitution, 98.5% by
simulation, and 93.9% by CV-LOO.
Table 4 Performance (% re-substitution, % prediction and % simulation) of the preferred RDA models M1 for the discrimination of Hungarian, Czech, and Romanian wines

              Number of variables   Re-substitution    Prediction         Prediction         Simulation
              in the model                             (CV-LOO)           (CV-test)
Authentic
  White       14 (10–16)            100 (99.0–99.7)    96.6 (93.7–98.1)   98.1 (94.1–98.1)   96.3 (96.3–99.1)
  Red         6 (6–8)               100 (100)          97.7 (97.3)        93.0 (88.4–93.0)   99.4 (98.5–99.4)
Commercial
  White       17 (17–21)            100 (100)          90.2 (87.7–90.2)   89.4 (86.4–89.4)   97.1 (97.1–98.6)
  Red         10 (9–10)             100 (97.8–100)     94.6 (87.0–94.6)   89.1 (82.6–95.6)   96.3 (95.1–97.0)

The performance of all good RDA models is summarized between brackets
A comparably good RDA-model ($\lambda = 0.8$) with only six
variables could be found when only the training set
(N = 87) was considered:
M1: V, Cd, Ethanol(D/H)2, S, Cr, and As.
The corresponding re-substitution, simulation, and
prediction (CV-LOO and CV-test) rates of 100, 99.4,
97.7, and 93.0%, respectively, were slightly higher. In both
cases V, Ethanol(D/H)2 and Cr have been selected as
important variables.
Commercial white wines In the case of commercial white
wines (N = 306), more variables were necessary to find a
good RDA-model in comparison to authentic white wines;
our preferred RDA-model based on all wine data contained
21 variables:
Fig. 4 Discriminating plots of the preferred RDA-models M1 for authentic and commercial white and red wines. Each panel plots canonical discriminant function 2 against canonical discriminant function 1 for the Czech Republic, Hungary and Romania. Panel titles: Authentic white wines, RDA-Model 1 (14 var.), λ = 0.466; Commercial red wines, Model 1 (10 var.), λ = 0.9; Commercial white wines, RDA-Model 1 (21 var.), λ = 0.92; Commercial red wines, Model 1 (10 var.), λ = 0.9
M1: Cl, Li, Zn, Rb, Ca, Invert sugar, Ti, Putrescine,
Mg, Er/La, Cu, Cd, Ethanol(D/H)2, Ethanol-
amine, Br, Ni, Shikimic acid, Na, Pb, Co and
2-Methylbutan-1-ol.
With this RDA-model (λ = 0.9) 100% of the wines
could be classified correctly by re-substitution, 97.5% by
simulation and 90.8% by CV-LOO. The discriminating plot
of the commercial white wines in Fig. 4 shows some
overlap between the wines of the three East European
countries.
The chosen RDA-model (λ = 0.8) based on the learning
set (N = 203) contained only 17 variables:
M1: Cd, Zn, Ca, Rb, Sr, Wine δ18O, Putrescine, Mg, Si,
Invert sugar, Li, Cu, Cr, P, Ba, Shikimic acid and
Tartaric acid.
For this model classification and prediction rates (re-
substitution, simulation, CV-LOO and CV-test) of 100,
97.1, 90.2 and 89.4%, respectively, could be obtained,
which are comparable with the above rates. Some important
variables appear in both M1-models, but there are also
some differences. This means that these models are less
stable and show greater variability.
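The overlap between the two M1 variable lists for commercial white wines can be quantified with a short sketch. The variable names are copied from the two models above; the Jaccard index is only one possible overlap measure and is an assumption of this sketch, not a statistic used in the study.

```python
def overlap(model_a, model_b):
    """Return the shared variables and the Jaccard index of two variable lists."""
    a, b = set(model_a), set(model_b)
    shared = a & b
    return sorted(shared), len(shared) / len(a | b)


# Preferred RDA-model M1 based on all commercial white wines (N = 306).
m1_all = ["Cl", "Li", "Zn", "Rb", "Ca", "Invert sugar", "Ti", "Putrescine",
          "Mg", "Er/La", "Cu", "Cd", "Ethanol(D/H)2", "Ethanolamine", "Br",
          "Ni", "Shikimic acid", "Na", "Pb", "Co", "2-Methylbutan-1-ol"]

# Chosen RDA-model M1 based on the learning set (N = 203).
m1_learning = ["Cd", "Zn", "Ca", "Rb", "Sr", "Wine d18O", "Putrescine", "Mg",
               "Si", "Invert sugar", "Li", "Cu", "Cr", "P", "Ba",
               "Shikimic acid", "Tartaric acid"]

shared, jaccard = overlap(m1_all, m1_learning)
print(len(shared), round(jaccard, 2))  # 10 shared variables, Jaccard index 0.36
print(shared)
```

Ten of the 28 distinct variables appear in both lists, which is consistent with the limited stability noted for the commercial white wine models.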
Commercial red wines Using all commercial red wines
(N = 138) the following good RDA-model with 10 vari-
ables could be estimated:
M1: K, B, Ti, Sr, Tartaric acid, Ethanol(D/H)2, La, Fe,
Zn, Shikimic acid
With this RDA-model (λ = 0.9) again 100% of the
wines could be classified correctly by re-substitution,
97.2% by simulation and 94.9% by CV-LOO method.
Based on the N = 92 commercial red wines of the
learning set the same variables were chosen as important
for our preferred RDA-model, but the best performance
rates could be obtained when using λ = 0.6 as regularization
parameter. The correct classification rates were:
100% by re-substitution, 96.3% by simulation, 94.6% by
CV-LOO and 89.1% by CV-test (Table 4).
Conclusions
1. While good discrimination models could be found with
10–14 variables for authentic white wines and with
6–9 variables for authentic red wines, the necessary
number of variables in the models for commercial wines
(for comparable error rates) was larger. For commercial
white wines good models should include 17–22 variables
and in the case of commercial red wines 8–10 variables.
2. The following variables had a very high (highlighted)
or a high discriminating power in different model
variants:
• Auth. white wines: Tartaric acid, Si, Mg, K, Ca,
Li, Al, V, Mn, Fe, Ni, Sr, Y, Cd, U, La or Yb/La or
Er/La and Ethanol(D/H)2 or Wine δ18O
• Auth. red wines: L-Lactic acid, Shikimic acid,
Methanol, 3-Methylbutan-1-ol, S, Ca, Cl, Putres-
cine, Ethylamine, V, Cr, Zn, As, Rb, Cd, Pb, La
or Yb/La and Ethanol(D/H)2
• Comm. white wines: Invert sugar, Shikimic acid,
2-Methylbutan-1-ol, Mg, Si, Cl, Ca, Ethanol-
amine, Putrescine, Li, Ti, Cr, Cu, Zn, Br, Rb, Sr,
Cd, Pb, La or Er/La and Ethanol(D/H)2
• Comm. red wines: Tartaric acid, Shikimic acid,
1-Propanol, K, B, Ti, Sr, La or Er/La and
Ethanol(D/H)2
3. If instead of taking the whole data set only the training
set of authentic wines is used for building discriminant
models, generally the same variables have a high
discriminating power. That means that many models,
especially in the case of authentic white wines, are very
stable. In the case of commercial white wines some
good model variants based on the training set included
also some other variables, such as Tartaric acid, Na, P,
Ba and Wine δ18O, which seem to play an important
role in the discrimination. Indeed, the “best”
discriminant models based on the whole data set are not
always identical to the “best” models from the
training set. However, in general the performance rates
(re-substitution, CV-LOO) of the “best” models in the
whole set and in the training set were comparable.
In the group of commercial red wines, besides our
preferred model M1 with excellent performance rates in
both data sets, another model with lower rates (re-sub-
stitution: 100%, simulation: 95.1%, CV-LOO: 87.0%,
CV-test: 84.6%) could be found, which included the
variables B, Ti, P, V, Si, Ba, 2-Methyl-1-propanol, and
Invert sugar. These variables were also identified as
important by the PLS-DA method in the following
section.
4. The good separation of the wines from the three East
European countries by our preferred models for authentic
as well as for commercial wines is illustrated in Fig. 4.
Partial least squares discriminant analysis (PLS-DA)
and partial least squares-uninformative variable
elimination (PLS-UVE)
Here too we will only focus on the discrimination of the
Hungarian, Czech, and Romanian samples. Each of the
discrimination situations is modeled in three steps. First, a
classical PLS-DA model is developed, and
then PLS-UVE is applied. Finally, a manual variable
selection is performed on the variables retained by PLS-
UVE in order to decrease even further the number of
parameters required to authenticate a wine sample. This
final selection is done iteratively, the parameter with the
smallest regression coefficient being discarded each time.
This procedure is repeated until the removal of a parameter
results in an unacceptable increase of the prediction error
[12]. As mentioned earlier, the discrimination of one
country versus the two other countries is considered here.
This is reasonable since the aim of the project is to
verify the origin of a given wine sample. If a wine is
claimed, e.g., to be Czech, it is important to discriminate
between Czech samples on the one hand and Hungarian
plus Romanian samples on the other hand. For practical
reasons, the most parsimonious PLS and PLS-UVE models
are discussed and their performances are summarized in
Tables 5 and 6.
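The final manual selection step can be sketched as a backward elimination loop. The sketch below is an illustration under stated assumptions, not the procedure actually run in the project: it uses a plain NIPALS PLS1 on autoscaled data, codes the claimed country as +1 and the two other countries as -1, judges the prediction error by leave-one-out cross-validation, and stops when that error rises by more than a hypothetical tolerance. All function names, the tolerance and the synthetic data are assumptions; the PLS-UVE step is not reproduced.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_factors):
    """NIPALS PLS1 on autoscaled X and centred y; returns (predict, coefficients)."""
    n = len(X)
    cols = list(zip(*X))
    mu = [sum(c) / n for c in cols]
    sd = [(sum((v - m) ** 2 for v in c) / (n - 1)) ** 0.5 for c, m in zip(cols, mu)]
    y_mean = sum(y) / n
    E = [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in X]
    f = [v - y_mean for v in y]
    factors = []
    for _ in range(n_factors):
        w = [dot(col, f) for col in zip(*E)]          # weight vector
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                # scores
        tt = dot(t, t)
        p = [dot(col, t) / tt for col in zip(*E)]     # X loadings
        c = dot(f, t) / tt                            # y loading
        E = [[e - ti * pj for e, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - c * ti for fi, ti in zip(f, t)]
        factors.append((w, p, c))

    def score(z):
        """Centred prediction for one autoscaled sample z."""
        z, out = list(z), 0.0
        for w, p, c in factors:
            t = dot(z, w)
            out += c * t
            z = [zj - t * pj for zj, pj in zip(z, p)]
        return out

    k = len(mu)
    # The model is linear, so predicting unit vectors yields the coefficients.
    coef = [score([1.0 if i == j else 0.0 for i in range(k)]) for j in range(k)]

    def predict(row):
        return y_mean + score([(v - m) / s for v, m, s in zip(row, mu, sd)])

    return predict, coef


def loo_error(X, y, n_factors):
    """Leave-one-out misclassification rate for +1/-1 coded classes."""
    wrong = 0
    for i in range(len(y)):
        predict, _ = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_factors)
        wrong += (predict(X[i]) > 0) != (y[i] > 0)
    return wrong / len(y)


def backward_eliminate(X, y, names, n_factors=2, tolerance=0.02):
    """Drop the variable with the smallest |coefficient| until the error rises."""
    keep = list(range(len(names)))

    def sub(idx):
        return [[row[j] for j in idx] for row in X]

    best = loo_error(sub(keep), y, min(n_factors, len(keep)))
    while len(keep) > 1:
        _, coef = fit_pls1(sub(keep), y, min(n_factors, len(keep)))
        weakest = min(range(len(keep)), key=lambda j: abs(coef[j]))
        trial = keep[:weakest] + keep[weakest + 1:]
        err = loo_error(sub(trial), y, min(n_factors, len(trial)))
        if err > best + tolerance:      # unacceptable increase: stop
            break
        keep, best = trial, min(best, err)
    return [names[j] for j in keep], best


# Synthetic data: only x0 separates the two classes; x1..x4 are pure noise.
random.seed(0)
y = [1.0] * 20 + [-1.0] * 20
X = [[3.0 * yi + random.gauss(0, 1)] + [random.gauss(0, 1) for _ in range(4)]
     for yi in y]
names = ["x0", "x1", "x2", "x3", "x4"]
kept, err = backward_eliminate(X, y, names)
print(kept, err)
```

Because the data are autoscaled before fitting, the magnitudes of the regression coefficients are comparable across variables, which is what makes the smallest coefficient a sensible candidate for removal in this sketch.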
Authentic white wines To build discriminating PLS
models 208 training samples are available while 107 rep-
resentative samples are used to test the predictive ability of
the developed models.
Hungary versus {Romania + Czech Republic}
A first PLS model is built using all 63 parameters. This model
uses three factors and the re-substitution and prediction
(CV-LOO and CV-test) rates are 96.7, 92.3 and 91.6%,
respectively. PLS-UVE allows decreasing the number of
variables to 27. A final model is built with the 13 most
discriminating parameters. These are eight trace elements
(As, Fe, Ba, Rb, Li, Si, Cu, Mn), two classical parameters
(Tartaric Acid, Ethylacetate), two macro elements (Na, P)
and one isotopic ratio (Ethanol(D/H)2). As follows from
Table 5, this model, which uses only 2 PLS factors, has
prediction abilities very similar to those of the model
using all parameters. The number of variables in the model as
well as the performance summarized in Table 5 also
indicate that the discrimination of Hungarian wine samples
from Czech and Romanian wines is the most difficult of the
three discriminations.
Romania versus {Hungary + Czech Republic}
The PLS model using all 63 parameters requires 7 factors
and the re-substitution and prediction (CV-LOO and
CV-test) rates are 100, 96.2 and 95.4%. The PLS-UVE
approach retains 16 variables but a further reduction is
possible since finally a 2-factor PLS-model using only 8
parameters is built with very similar performances (see
Table 5). The selected variables are Y, V, Fe, Si, Pb, Cu,
Ethylacetate, and As.
The selection of Y and V as most discriminating is
logical when looking at the boxplots (not shown) since
their concentration in the Romanian samples is consider-
ably higher than in the Hungarian and Czech samples.
Czech Republic versus {Hungary + Romania}
A final 4-factor PLS-model, built with only six parameters
(U, V, Cd, Pb, Mn, and Er) has the same performance
characteristics as the 4-factor model using the whole set of
parameters. The re-substitution, prediction (CV-LOO and
CV-test) rates for the latter are 98.6, 95.9 and 94.4% which
compares very well with the figures for the final model in
Table 5. The importance of U in the discrimination is
expected from the boxplots (part II, Fig. 8) and the CART
tree built on the authentic white samples.
Authentic red wines The PLS models were developed on
the 87 training samples available and they were tested on
43 representative samples.
Hungary versus {Romania + Czech Republic}
The PLS model built with all parameters uses three PLS
factors and the re-substitution and prediction rates (CV-
LOO and CV-test) are 100, 96.4 and 93.1%, respectively,
Table 5 Performance (% re-substitution and % prediction) of the final PLS models for the discrimination of Hungarian, Czech and Romanian authentic wines

Authentic wines                 Number of variables   Re-substitution   Prediction   Prediction
                                in the model                            (CV-LOO)     (CV-test)
White
  Hun. versus (Rom. + Cze.)     13                    94.7              91.9         88.8
  Rom. versus (Hun. + Cze.)     8                     96.6              96.1         96.4
  Cze. versus (Hun. + Rom.)     6                     98.1              96.2         95.3
Red
  Hun. versus (Rom. + Cze.)     9                     98.9              97.1         95.4
  Rom. versus (Hun. + Cze.)     5                     100               99.4         97.7
  Cze. versus (Hun. + Rom.)     8                     100               98.7         97.7
which is very similar to the performance of the final
3-factor model using nine parameters (Cd, S, Methanol, Cr,
La, Ethanol(D/H)2, V, 2-Methylbutan-1-ol, and Mn) in
Table 5. V and Ethanol(D/H)2 are also found in the CART
tree built for the authentic red wines. The importance of Cd
and S is not expected from the univariate or PCA analysis
(part II, Fig. 6). As explained earlier [4], the selection of
unexpected variables might be due to the fact that the PLS
factors try to find a compromise between the discrimination
of Hungary and the Czech Republic on the one hand and the
discrimination of Hungary and Romania on the other hand.
Romania versus {Hungary + Czech Republic}
Excellent performance (re-substitution and prediction (CV-LOO
and CV-test) rates of 100, 98.9 and 100%, respectively)
is obtained with the 2-factor model using all 63
parameters. As follows from Table 5 the performance
hardly changes with the final model based on only five
parameters (V, La, Er/La, Cl and Cr). The boxplots for V
and La (part II, Fig. 8), and Er/La (not shown) confirm the
high discriminating power of these variables and the
selection of V is also expected from the CART tree for
authentic red wines.
Czech Republic versus {Hungary + Romania}
The final 2-factor PLS model using only eight parameters
(Cd, Ethanol(D/H)2, S, 2-Methylbutan-1-ol, Pb, V, U, and
Mn) performs very well (see Table 5) although the selec-
tion of Cd as most discriminating variable is unexpected.
Commercial white wines The models were built using 203
training samples and tested on 103 representative samples.
Hungary versus {Romania + Czech Republic}
As for the authentic white wines it follows from Table 6
that this discrimination is the most difficult one. Indeed the
final 2-factor PLS model using eight parameters has prediction
(CV-LOO and CV-test) rates of 90.1 and 86.4%, respectively,
which is not satisfactory. The parameters selected
are Sr, Ca, Er/La, Ethylacetate, Rb, Wine δ18O, Ti, and
Ethanol(D/H)2. Sr, Ti, and Wine δ18O are also retained in
the CART tree. The importance of Er/La can be explained
by its correlation with Yb/La.
Increasing the number of variables hardly improves the
performance. Indeed, the prediction rates obtained with
PLS-UVE using 26 parameters are 90.8% (CV-LOO) and
88.4% (CV-test). Including all 63
parameters does not reduce the prediction errors.
Romania versus {Hungary + Czech Republic}
The PLS model using all 63 parameters requires three
factors and has prediction (CV-LOO and CV-test) rates of 91.8
and 97.1%, respectively. The final 2-factor model has
comparable performance (Table 6) but still retains 20
variables: Zn, Sr, Cd, Si, Li, Invert sugar, Cu, Ba, Cl, Er/
La, Cr, Mg, K, V, Pb, Rb, As, Na_Exc, 2-Methyl-1-pro-
panol, and Y. Several of the most discriminating variables
(Zn, Sr, Li, Cl) are also found in the CART tree and the
importance of Er/La can again be explained by its corre-
lation with Yb/La. The importance of several others is
logical when looking at the boxplots (not shown). The
concentration of Cd and Invert sugar, e.g., obviously is
higher in the Romanian samples.
Czech Republic versus {Hungary + Romania}
The final 2-factor PLS model is built with 12 parameters
(Putrescine, Cl, Zn, Cd, Ca, Ti, Wine δ18O, Cr, Gd/La, Cu,
Ethanol(D/H)2, and Rb). As follows from Table 6 it per-
forms at least as well as the 4-factor model using all 63
parameters (prediction (CV-LOO and CV-test) rates of
91.4 and 93.2%, respectively). The importance of Putres-
cine, Cd, Ca, and Cr is not expected from the CART tree
Table 6 Performance (% re-substitution and % prediction) of the final PLS models for the discrimination of Hungarian, Czech and Romanian commercial wines

Commercial wines                Number of variables   Re-substitution   Prediction   Prediction
                                in the model                            (CV-LOO)     (CV-test)
White
  Hun. versus (Rom. + Cze.)     8                     82.1              90.1         86.4
  Rom. versus (Hun. + Cze.)     20                    96.6              93.6         98.1
  Cze. versus (Hun. + Rom.)     12                    94.1              93.1         93.1
Red
  Hun. versus (Rom. + Cze.)     11                    94.6              91.8         91.3
  Rom. versus (Hun. + Cze.)     10                    100               98.4         100
  Cze. versus (Hun. + Rom.)     10                    97.9              95.9         87.0
but the boxplots (not shown) indicate that they indeed have
some discriminating power.
Commercial red wines The models were developed on
the 92 training samples available and tested on 46 repre-
sentative samples.
Hungary versus {Romania + Czech Republic}
The prediction error of the 2-factor PLS-UVE model using
21 parameters is relatively high (8%). A further reduction
of the number of variables to 11 (P, Si, Sr, Rb, Ethanol(D/H)2,
Ca, B, Wine δ18O, Gluconic acid, Tartaric acid, and U)
hardly influences the performance (see Table 6). P and B
are the only two variables also retained in the CART tree
built for the commercial red wines. The boxplots (not
shown) indicate that several others in the final PLS model
indeed have at least some discriminating power.
Romania versus {Hungary + Czech Republic}
As for the authentic red wines, excellent performance
(prediction (CV-LOO and CV-test) rates are 98.4 and
100%, respectively) is obtained with the 2-factor final PLS
model using only ten parameters (B, K, Sr, Tartaric acid,
Shikimic acid, Ethanolamine, Ba, Methylpropanol, Invert
sugar, and V). The most discriminating variable, B, is also
retained in the CART tree to separate Romania from the
two other Eastern European countries. The discriminating
power of several others in the final PLS model is logical
when looking at the boxplots (not shown): concentrations
of Sr, Shikimic acid, and Invert sugar are highest in the
Romanian samples while the concentrations of Tartaric
acid and Ethanolamine are lowest in these samples. The
boxplot of V (part II, Fig. 8) shows that this parameter
mainly separates Romanian (and Hungarian) wines from
the Czech ones.
Czech Republic versus {Hungary + Romania}
The PLS model built with all parameters uses three factors
and the prediction (CV-LOO and CV-test) rates are 93.2
and 95.7%, respectively. There is an important decrease of
the prediction (CV-test) rate when the number of variables
is reduced to ten (see Table 6). The variables retained in
this final model are P, Ethanol(D/H)2, Fe, Ba, Zn, Cu, Ti,
Original Malic acid, Invert sugar, and Al. P, which is most
discriminating, is also retained by CART. The selection of
most of the other parameters seems logical when consid-
ering the boxplots (not shown). They show that especially
Ethanol(D/H)2, Fe, Ba, Zn, Ti, and Invert sugar have some
discriminating power.
Conclusions
PLS modelling confirms that it is easier to discriminate the
authentic samples than the commercial ones. Although the
PLS models built on the 63 parameters have the best per-
formances, it is interesting to notice that models developed
with a smaller set of parameters have satisfying predictive
abilities.
The discrimination of white samples also seems more
difficult to perform than the discrimination of red samples.
Indeed, the percentage of correct re-substitution and pre-
diction is usually lower for white samples than for the red
ones. However, it should be noted that far fewer red
samples are available, which may also contribute to this
difference.
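How strongly the number of test samples limits the precision of a prediction rate can be illustrated with a binomial confidence interval. The sketch below uses the Wilson score interval, which is a choice made here and not a method used in the study. The test-set sizes (107 authentic white and 43 authentic red samples) are taken from the text; the counts of correctly classified samples are illustrative assumptions chosen to lie close to rates of about 92 and 93%.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Illustrative counts: 98 of 107 white and 40 of 43 red test samples correct.
white_low, white_high = wilson_interval(98, 107)
red_low, red_high = wilson_interval(40, 43)

print(f"white: {100 * white_low:.1f} to {100 * white_high:.1f}%")
print(f"red:   {100 * red_low:.1f} to {100 * red_high:.1f}%")
```

With similar point estimates, the interval for the smaller red test set is clearly wider, so small differences between the rates for white and red wines should be interpreted with caution.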
Romanian wines are most easily separated from the
other Eastern European wines whether authentic, com-
mercial, red or white wines are considered. This is, how-
ever, most pronounced for the commercial wines.
Concerning the importance of the variables for the dis-
crimination, there are some differences between authentic
and commercial samples, though trace elements as well as
isotopic ratios and rare earth ratios have a lot of discrimi-
nating power. The biogenic amines Putrescine and Etha-
nolamine are useful for the discrimination of commercial
samples. Finally, Ethylacetate and Tartaric acid are the
most useful classical parameters to classify authentic white
samples while Invert sugar is the most useful classical
parameter to discriminate commercial, white as well as red,
samples.
General conclusions
Considering the results for the different methods (Table 7)
the following conclusions can be drawn:
1. With only one variable, namely the isotopic ratio
Ethanol(D/H)1 (or Ethanol(D/H)2 which leads to a
similar discrimination as Ethanol(D/H)1), South Afri-
can white as well as red wines can be perfectly
separated from East European wines by CART.
Therefore, we focused on the discrimination of the
East European wines for which the results of CART
are less good. Indeed, only 65–75% of the commercial
European wines are correctly classified (see Table 7),
but it should be taken into account that these simple
CART models are based on a very reduced set of only
2–8 variables.
2. RDA and PLS-DA perform equally well. In general
RDA and PLS-DA models which include the following
variables in different compositions have a high
discriminating power and achieve satisfying predictive
abilities:
• Isotopic ratio: Ethanol(D/H)2
• Rare earth elements and ratios: La or Er or Yb/La
or Er/La, Y
• Trace elements: V, U, Cr, Ti, Cd, Pb, Rb, Mn, Fe,
Zn, Cu, Sr, As, B, Ba
• Macro elements: Si, P, S, Cl, Ca, Mg, Na, K
• Classical parameters: Tartaric acid, Shikimic acid,
Ethylacetate, 2-or 3-Methylbutan-1-ol or 2-Methyl-
1-propanol, Methanol, Invert sugar
• Biogenic amines: Putrescine, Ethanolamine.
The application of multivariate methods such as RDA
and PLS-DA leads to discriminating models, which
allow a correct classification of the wines from the
East European countries with rates between 88 and
100%. Of course, the number of variables given above
can be reduced if there is prior information about the
type and the colour of the wines.
The relationship between the geographic origin and the
first four groups of parameters could be expected.
Together with these parameters also classical param-
eters and biogenic amines which are influenced by the
variety of the wine and the winemaking process are
playing an important role for discriminating the three
East European wines concerning their origin.
3. Considering the authentic and commercial wines sep-
arately, there are some differences, e.g., rare earth
elements and the trace elements V and Mn are much
more important as discriminating variables in the
group of authentic wines. On the other hand, Titanium
and Invert sugar play an important role for differenti-
ating commercial wines.
4. Differences can also be found in the groups of white
and red wines. While Lithium and Sodium, respec-
tively, Sodium-Excess, are important variables for
separating white wines, most models for red wines
include variables such as Methanol, 2-or 3-Methyl-
butan-1-ol or 2-Methyl-1-propanol and Shikimic acid.
5. In RDA as well as in PLS-DA the following variables
are important in the four subgroups of wines:
• Authentic white wines: V, La or Er or Yb/La, Cd,
Si, U, Tartaric acid, Li, Y, Ethanol(D/H)2, and Mn
• Authentic red wines: V, Yb/La or Er/La or La,
Ethanol(D/H)2, Methanol, Cl, Cr, Pb, 2-or 3-
Methylbutan-1-ol, S and Cd
Table 7 Comparison of CART, RDA, and PLS for the discrimination of wines from the three East European countries (Hu, Cz, and Ro)

Wine type: Authentic white
  CART  Variables* in models: U, Yb/La
        Prediction rates (%): 78–79
  RDA   Variables* in models (a): V, La, Cd, Si, Al, U, Tartaric acid, Li, Y, Na, Ethanol(D/H)2, Wine δ18O, Ca, Mg, K, Mn, Fe, Ni, Sr, Yb/La
        Prediction rates (%): 95–98
  PLS   Variables* in models (b): As, Fe, Ba, Rb, Li, Si, Cu, Mn, Tartaric acid, Ethylacetate, Na, P, Ethanol(D/H)2, Y, V, Pb, U, Cd, Er
        Prediction rates (%): 88–96

Wine type: Authentic red
  CART  Variables* in models: V, Ethanol(D/H)1 (Ethanol(D/H)2)
        Prediction rates (%): 84–86
  RDA   Variables* in models (a): V, Yb/La, Ethanol(D/H)2, Methanol, Cl, Cr, Pb, Shikimic acid, L-Lactic acid, 3-Methylbutan-1-ol, S, Ca, Putrescine, Ethylacetate, Zn, As, Rb, Cd, La
        Prediction rates (%): 88–94
  PLS   Variables* in models (b): Cd, S, Methanol, Cr, La, Ethanol(D/H)2, V, 2-Methylbutan-1-ol, Mn, Er/La, Cl, Pb, U
        Prediction rates (%): 95–99

Wine type: Commercial white
  CART  Variables* in models: Ti, Cl, Sr, Li, Wine δ18O, Zn, Gd/La, Na-Excess
        Prediction rates (%): 73–75
  RDA   Variables* in models (a): Cl, Li, Zn, Rb, Ca, Invert sugar, Ti, Putrescine, Mg, Er/La, Cu, Cd, Ethanol(D/H)2, Ethanolamine, Br, Ni, Shikimic acid, Na, Pb, Co, 2-Methylbutan-1-ol, Tartaric acid, P, Ba, Wine δ18O
        Prediction rates (%): 86–91
  PLS   Variables* in models (b): Sr, Ca, Er/La, Ethylacetate, Rb, Wine δ18O, Ti, Ethanol(D/H)2, Zn, Cd, Si, Li, Invert sugar, Cu, Ba, Cl, Cr, Mg, K, V, Pb, As, Na-Excess, 2-Methyl-1-propanol, Y, Putrescine
        Prediction rates (%): 86–98

Wine type: Commercial red
  CART  Variables* in models: B, P
        Prediction rates (%): 65–69
  RDA   Variables* in models (a): K, B, Ti, Sr, Tartaric acid, Ethanol(D/H)2, La, Fe, Zn, Shikimic acid, 1-Propanol, Er/La, V, P, Si, Ba, 2-Methyl-1-propanol, Invert sugar
        Prediction rates (%): 89–95
  PLS   Variables* in models (b): P, Si, Sr, Rb, Ethanol(D/H)2, Ca, B, Wine δ18O, Gluconic acid, Tartaric acid, U, K, Shikimic acid, Ethanolamine, Ba, 2-Methyl-1-propanol, Invert sugar, V, Fe, Zn, Cu, Ti, Malic acid, Al
        Prediction rates (%): 87–100

* Variables with a high discriminating power in all three methods are highlighted bold and those selected important by RDA and PLS are bold and italic
(a) Variables in the different efficient RDA model variants
(b) For each wine type variables in the three “one country against the other countries” comparisons
• Commercial white wines: Cl, Li, Zn, Rb, Ca, Invert
sugar, Ti, Putrescine, Mg, Er/La or Gd/La, Cu, Cd,
Ethanol(D/H)2, Pb, 2-Methylbutan-1-ol, Ba, and
Wine δ18O
• Commercial red wines: K, B, Ti, Sr, Ethanol(D/H)2,
Fe, Zn, Shikimic acid, V, P, 2-Methylbutan-1-ol,
and Invert sugar.
The bold highlighted variables were also selected with
CART.
6. Comparing the RDA and PLS-DA models with similar
performance rates, RDA-models contain fewer vari-
ables than PLS-DA models, because PLS-DA allows
only two-group comparisons, e.g., the comparison of
each of the three European Countries with the two
others. The following minimal numbers of chemical
parameters are therefore necessary to differentiate the
three countries:
• Authentic white wines: RDA: 10–14; PLS-DA: 19
• Authentic red wines: RDA: 6–9; PLS-DA: 13
• Commercial white wines: RDA: 17–22; PLS-DA: 26
• Commercial red wines: RDA: 8–10; PLS-DA: 24.
7. In all cases the differentiation of the commercial wines
from Hungary, Czech Republic, and Romania requires
more parameters than the authentic wines and white
wines more than red wines. These results can be taken
as a basis for analysing further problems of traceability
of wines from different countries.
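The PLS-DA totals quoted in conclusion 6 can be read as the number of distinct variables across the three “one country against the other countries” models of a wine type. The following sketch checks this reading for the authentic wines, using the variable lists of the final PLS models reported in the PLS-DA section; the dictionary layout and function name are assumptions made for illustration.

```python
# Final PLS models per wine type, in the order Hungary, Romania, Czech Republic
# versus the two other countries (variable lists copied from the PLS-DA section).
pls_models = {
    "Authentic white": [
        ["As", "Fe", "Ba", "Rb", "Li", "Si", "Cu", "Mn", "Tartaric acid",
         "Ethylacetate", "Na", "P", "Ethanol(D/H)2"],
        ["Y", "V", "Fe", "Si", "Pb", "Cu", "Ethylacetate", "As"],
        ["U", "V", "Cd", "Pb", "Mn", "Er"],
    ],
    "Authentic red": [
        ["Cd", "S", "Methanol", "Cr", "La", "Ethanol(D/H)2", "V",
         "2-Methylbutan-1-ol", "Mn"],
        ["V", "La", "Er/La", "Cl", "Cr"],
        ["Cd", "Ethanol(D/H)2", "S", "2-Methylbutan-1-ol", "Pb", "V", "U", "Mn"],
    ],
}


def distinct_variables(models):
    """Union of the variables used by the three two-group models."""
    return set().union(*models)


counts = {wine: len(distinct_variables(models)) for wine, models in pls_models.items()}
print(counts)  # {'Authentic white': 19, 'Authentic red': 13}
```

Although each two-group model is small (13, 8 and 6 variables for authentic white wines), only partial overlap between them means that 19 different parameters have to be measured, which is why the three-group RDA models are more economical.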
Acknowledgments The authors thank
the European Commission for the financial support of this work,
which was carried out in the framework of the specific research and
technological development program ‘‘Competitive and Sustainable
Growth’’ (Contract G6RD-CT-2001-00676). The authors are solely
responsible for the content of this research article and the European
Community is not responsible for any use that might be made of the
data appearing therein. This article is dedicated to Professor D.L.
Massart, who initially took the lead of the chemometrics/statistics
group in the project. Unfortunately, he passed away during the project
on 26 December 2005. Finally, the authors would like to thank all the
partners of the European Wine DB project and especially the ones
who collected the different wine samples, performed the micro-
vinification of the authentic samples and did the analytical measure-
ments of all the parameters necessary to carry out this study.
References
1. Schlesier K, Fauhl-Hassek C, Forina M, Cotea V, Kocsi E,
Schoula R, van Jaarsveld F, Wittkowski R (2009) Characteriza-
tion and determination of the geographical origin of wines. Part I:
overview. Eur Food Res Technol (submitted)
2. Smeyers-Verbeke J, Jäger H, Lanteri S, Brereton P, Jamin E,
Fauhl-Hassek C, Forina M, Römisch U (2009) Characterization
and determination of the geographical origin of wines. Part II:
descriptive and inductive univariate statistics. Eur Food Res
Technol (submitted)
3. Römisch U, Vandev D, Zur K (2006) Application of interactive
regularized discriminant analysis to wine data. Aust J Stat
35(1):45–55
4. Capron X, Smeyers-Verbeke J, Massart DL (2007) Multivariate
determination of the geographical origin of wines from four
different countries. Food Chem 101:1608–1620
5. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification
and regression trees. Wadsworth, Pacific Grove
6. Vandeginste BGM, Massart DL, Buydens LMC, de Jong S, Lewi
PJ, Smeyers-Verbeke J (1998) Handbook of chemometrics and
qualimetrics. Part B. Elsevier Amsterdam, Nederland, p 238
7. McLachlan GJ (1992) Discriminant analysis and statistical pattern
recognition. Wiley, New York
8. Friedman JH (1989) Regularized discriminant analysis. J Am Stat
Assoc 84:165–175
9. Vandev D (2004) Interactive stepwise discriminant analysis in
MATLAB. Pliska Stud Math Bulg 16:291–298
10. Geladi P, Kowalski BR (1986) Partial least-squares regression: a
tutorial. Anal Chim Acta 185:1–17
11. Centner V, Massart DL, de Noord OE, de Jong S, Vandeginste
BM, Sterna C (1996) Elimination of uninformative variables for
multivariate calibration. Anal Chem 68:3851–3858
12. Snee RD (1977) Validation of regression models: method and
examples. Technometrics 19:415–428