
J. agric. Engng Res. (2000) 76, 409-418
doi:10.1006/jaer.2000.0557, available online at http://www.idealibrary.com

Adaptive Classification—a Case Study on Sorting Dates

Matti Picus; Kalman Peleg

Department of Agricultural Engineering, Technion Israel Institute of Technology, Haifa 32000, Israel; e-mail of corresponding author: matti@picus.org

(Received 18 August 1998; accepted in revised form 23 March 2000)

The probability of classifier errors in automated grading of fruits is much greater than in traditional, well-defined and highly separated, static classification tasks. Presently, operators of conventional sizers and colour sorters adjust the class boundaries manually based on observations of obvious misclassification trends in the packed fruit, with the goal of minimizing the classification errors. However, the new sorting machines utilize many features to reach the grade decision, and a human operator is unable to control such a multitude of parameters. Estimating the between-class discriminant functions requires estimation of the a priori class probabilities ('priors') and the class-conditional probability densities. The time-varying nature of the priors and the probability densities results in unsatisfactory classifier performance. To solve these problems, an adaptive grading approach by 'prototype populations' is proposed. The produce stream is classified into a discrete number of prototype streams or populations by a global 'population classifier'. For each unique prototype population a separate, optimal 'grade classifier' is designed for sorting individual fruits. The global 'population classifier' utilizes a finite-length stack of features continuously updated from the most recently sorted produce. The statistical attributes of the feature sample in the stack are analysed to determine which produce population is currently passing through the system. When the population classifier determines that the stack contents have originated from a different prototype population, it changes the active 'grade classifier' to the most appropriate one for the current fruit population. An example of simulated adaptive versus conventional train-once, sort-many grading is presented on data sets obtained from a system that sorts dates by machine vision. The example demonstrates that adaptive grading by prototype populations yields lower misclassification rates in comparison to conventional sorting.
© 2000 Silsoe Research Institute

1. Introduction

Grading agricultural produce has so far eluded full automation. Automated grading of fruit and vegetables differs from traditional classification tasks in four major areas.

(1) The criteria for classification are set with respect to a 'reference standard', which may be either a 'reference sensor' which measures some objective physical properties of the fruit, or a subjective panel of human experts. Sometimes, the precision and repeatability of grade determination by these 'reference sensors' may be worse than those of the machine 'estimator sensors', especially in the case of the human expert sorters. Nevertheless, the goal of an automatic classifier that uses estimator sensors is to imitate the human judgement and minimize classification errors with respect to the human expert reference sensor, in spite of the 'noisy' grade judgement.

(2) The on-line inspection system utilizes an 'estimated standard' from a set of features derived from one or more 'estimating sensors'. The estimated standard at best correlates highly to the reference one, and at worst can differ greatly from it.

(3) The class membership feature vectors do not cluster naturally in the feature space. The boundary between classes is usually fixed by marketing or packaging concerns, and often this boundary will pass through an area of the feature space densely filled with feature points.

(4) The a priori class probabilities and the class-conditional distributions of the measured features change with time, growth conditions and locations, seasonal variations and storage time. The classifier must reliably grade hundreds of thousands of items per day under time-varying conditions.



Using estimator sensors to divide a feature space into arbitrary categories leads to classification errors. The errors are concentrated at the category borders. The density of the feature space at the category borders determines the misclassification rates. These errors can be minimized using the Bayes rule for fixing the class discriminant boundaries, but the errors remain finite and measurable. Any estimate of Bayesian discriminant functions requires estimation of the a priori class probabilities ('priors') and the class-conditional probability densities. The time-varying nature of the priors and the probability densities, together with the finite sensor errors, has up to now resulted in unsatisfactory classifier performance in automated agricultural produce sorting.

If the produce stream is viewed as a signal (in statistical terms), it can be classified into a finite number of prototype streams or populations. Then, for each unique prototype population, a separate, optimal 'grade classifier' can be designed off-line for sorting individual fruits (Peleg & Ben-Hanan, 1993). A global 'population classifier' decides to which prototype population the current stream of produce belongs, and switches to the appropriate grade classifier. The population classifier utilizes a finite-length stack of features continuously updated from the most recently sorted produce. The statistical features of the stack are analysed to determine which population is currently passing through the system. When the population classifier determines that the stack contents no longer faithfully represent a sample taken from the current prototype population, it changes the active grade classifier to a more appropriate one. The adaptive classification by prototype population scheme may use some or all of the features used by the grade classifiers, as well as 'metafeatures': statistics calculated from the current stack features. Examples of metafeatures of a data set are the mean vector of the current stack values, the class-dependent covariance matrices calculated from the stack, or the sample of the cumulative distribution function (CDF) represented by the data in the stack. Use of the CDF in theory and practice is described, as it is the most appropriate of the metafeatures. The metafeatures may be treated as any other feature in the prototype population classification task. This approach assumes that a finite number of prototype populations exist in the data, each with different metafeatures.
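The switching scheme just described can be sketched in a few lines. The following Python toy (all names, thresholds, and the mean-based metafeature are ours, not the paper's) keeps a FIFO stack of recent feature values, lets a pluggable population classifier inspect it, and routes each item to the grade classifier for the population currently detected:

```python
from collections import deque

# Illustrative sketch of the adaptive scheme: a FIFO stack of recent
# feature values feeds a population classifier, which selects the
# currently active grade classifier.
def make_adaptive_sorter(grade_classifiers, population_classifier, stack_len):
    stack = deque(maxlen=stack_len)   # finite-length FIFO stack
    state = {"k": 0}                  # currently active population label

    def sort_item(x):
        stack.append(x)
        if len(stack) == stack_len:   # only decide on a full stack
            state["k"] = population_classifier(list(stack))
        return grade_classifiers[state["k"]](x)

    return sort_item

# Toy demo: two populations centred near 0 and 10; the grade boundary
# differs between them (2.5 versus 12.5).
graders = {0: lambda x: int(x > 2.5), 1: lambda x: int(x > 12.5)}
pop_cls = lambda s: 0 if sum(s) / len(s) < 5 else 1   # mean as metafeature
sorter = make_adaptive_sorter(graders, pop_cls, stack_len=4)
```

Here the metafeature is simply the stack mean; the paper's preferred metafeature, the CDF compared via a Kolmogorov-Smirnov test, would slot into `population_classifier` unchanged.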

In Section 2, the algorithm is briefly outlined, with the goal of introducing the notation used in the rest of the paper and providing tools for comparing the performance of different classifiers. The population classifier is then described in more depth in Section 3. A software simulator that implements the algorithm is presented in Section 4. Section 5 demonstrates an application of the algorithm to sorting dates, and Section 6 compares the results from the different sorting strategies.

2. Theory of adaptive sorting by prototype populations

The theory of adaptive sorting by prototype populations was published in detail by Picus and Peleg (1999). A discrete process is sampled and the feature vectors calculated for each item. The actual feature extraction itself is beyond the scope of this paper. Typical feature extraction techniques can be found in Davanel et al. (1988) or Alchanantis et al. (1993). The set of sampled feature vectors X is modelled as

X = {x_1, x_2, ..., x_N}   (1)

where x_i ∈ R^p is a feature vector in the p-dimensional space of real values R for a single item. In the current application, the features are determined using a commercial machine for sorting dates, with the full cooperation of the machine manufacturer.

All classification schemes rely on a 'reference sensor' to label a training set with one of C grade labels ω_i, i = 1, 2, ..., C. In agricultural applications, the reference sensor will most often be a human expert or a panel of experts, as no hard and fast rules for the grade labels exist. The reference sensor may use the feature space R^p, but more commonly will use a totally different set of features. Traditional non-adaptive classification schemes use any supervised classification technique to design a grade classifier function ω = g(x) for the N vectors in X, each of these vectors being derived from the readings of the estimating sensors of the automatic sorting machine.

In the on-line process, this grade classifier is used to classify the items x_i ∈ X. The design of the grade classifier is dependent on the training set used. Any variations in the actual data that are not represented in the training set X, if they are even noticed while the classifier is being used, require an adjustment of the classifier to compensate for the new data. While solutions to this problem in one or two dimensions exist (for example, Gutman et al., 1994), modern sorting machines with multiple sensors overwhelm the capacity of the human operator to compensate for changes in the produce stream.

In the adaptive sorting process, the data set X is partitioned into K a priori known prototype populations X_k, k = 1, 2, ..., K, each of length n_k:

X_k = {x_{m+1}, x_{m+2}, ..., x_{m+n_k}},  k = 1, 2, ..., K,  m = Σ_{j=1}^{k-1} n_j,  X = ∪_k X_k   (2)

For each prototype population, a grade classifier is designed:

ω = g_k(x)  ∀ x ∈ X_k   (3)


The parameters for each classifier are stored. A population classifier function k = G(X) is then designed. Usually, the feature vectors themselves do not suffice to discriminate between prototype populations, and some function of the feature set f(X) must be calculated. The population classifier then becomes

k = G(f(X)),  k ∈ {1, 2, ..., K}   (4)

where f(X) is a function of a set of features of X and G is a function that returns a population label k. This work relies upon the cumulative distribution function (CDF) for f(X) and an original use of the Kolmogorov-Smirnov goodness-of-fit test (Press et al., 1992) as the population classifier G.

In the on-line operation of the sorter, the past N_q data are stored in a first-in first-out (FIFO) stack. After the stack is filled, it can be modelled as

X = {x_m, x_{m+1}, ..., x_{m+N_q-1}}   (5)

where N_q is the length of the FIFO stack and m is the index of the sorted items, m ∈ {1, 2, ..., N - N_q}.
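The stack model of Eqn (5) amounts to a sliding window over the item stream; a minimal sketch (the function name is ours, indexing is 1-based to match the text):

```python
# Sketch of the FIFO stack of Eqn (5): after item m + Nq - 1 is sorted,
# the stack holds the window {x_m, ..., x_(m+Nq-1)} of the stream.
def stack_window(stream, m, Nq):
    # m is 1-based as in the text; valid while the window stays
    # inside the stream of N items
    return stream[m - 1 : m - 1 + Nq]

stream = list(range(1, 11))               # x_1 ... x_10
window = stack_window(stream, m=3, Nq=4)  # {x_3, x_4, x_5, x_6}
```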

After a certain number of new items have been classified, the features of the stack are calculated and the population classifier is used to determine the current population. The appropriate grade classifier g_k(x) is then used for subsequent grade classification.

To evaluate the effectiveness of the sorting techniques, the actual grade label ω_r is compared to a vector of labels that is obtained by applying the three types of classifier functions discussed:

ω = [ω_a, ω_p, ω_t, ω_1, ω_2, ..., ω_K]^T   (6)

where

ω_p = g_k(x) | x ∈ X_k
ω_a = g_k(x) | x ∈ X_m,  k = G(f(X_m))
ω_t = g(x) | x ∈ X
ω_k = g_k(x) | x ∈ X,  k = 1, 2, ..., K

and where ω_p is the adaptive classifier with an ideal, non-erring population classifier; ω_a is the adaptive classifier with an erring population classifier; ω_t is the non-adaptive classifier trained on a sample from all populations; and ω_k, k ∈ {1, 2, ..., K}, are the K non-adaptive classifiers, each trained on a single population and then used on all the data. The non-erring population classifier used the known population label of the inspected item to choose a grade classifier, and thus flawlessly identified the proper population. It can be used only in laboratory conditions.

In the general case, not all the populations are available at the time the training sets are chosen for grade classifier design. Thus, ω_t is an optimistic measure of the capabilities of the non-adaptive classifier. The difference between ω_a and ω_p depends on many factors. The sorting simulations of dates described in the following sections show a clear reduction in misclassification rate for ω_a compared to the other strategies.

The adaptive classification scheme must also recognize new prototype populations. To this end, the initial classifier design utilizes a statistical hypothesis test to check the current prototype population. When the population classifier returns a 'not-recognized' label for the data sample X,

k = G(X),  k ∉ {1, 2, ..., K}   (7)

the machine will know that an unknown population is currently in the stack. Either the stack is in transition between two populations, or the process has indeed encountered a distinctly new population. The Kolmogorov-Smirnov two-sample test is one example of a non-parametric statistical test particularly suited to the task, and is described below.

Note that, while the prototype populations may be distinguishable, they may not require unique grade classifiers. Reduction in the number of prototype populations and the subsequent simplification of the population classifier may improve the robustness of the system and reduce overall misclassification rates. This can be accomplished by comparing the performance of the grade classifiers on the 'wrong' populations:

ω_p = g_k(x) : x ∈ X_j,  j ≠ k   (8)

where the symbols are as in Eqn (6), and using the misclassification rates as an indication of the 'distance' between the populations. Those populations where the misclassification rates are not significantly different can be combined. The issue of defining 'significantly different' is covered in depth in Picus and Peleg (1999).
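The merging criterion of Eqn (8) can be sketched as follows. The nearest-mean grade classifier and the toy one-dimensional data are stand-ins for the paper's classifiers, used only to show how the cross-population error matrix is assembled:

```python
# Sketch of Eqn (8): score each population's grade classifier on the
# other populations; similar (low) error rates suggest populations
# that can share one grade classifier.
def nearest_mean_classifier(class_means):
    def g(x):
        return min(class_means, key=lambda c: abs(x - class_means[c]))
    return g

# Two populations of (feature, grade) pairs; grades 0 and 1.
pops = {
    0: [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)],
    1: [(3.0, 0), (4.0, 0), (4.5, 1), (6.0, 1)],
}

# Train one grade classifier per population, as in Eqn (3).
graders = {}
for k, data in pops.items():
    means = {}
    for grade in (0, 1):
        vals = [x for x, gr in data if gr == grade]
        means[grade] = sum(vals) / len(vals)
    graders[k] = nearest_mean_classifier(means)

# Cross-population error matrix: entry (k, j) is the error rate of
# classifier k applied to the data of population j.
errors = {
    (k, j): sum(graders[k](x) != gr for x, gr in pops[j]) / len(pops[j])
    for k in pops for j in pops
}
```

Rows of `errors` that are not significantly different indicate populations that could be merged.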

3. Population classifier design

The population classifier must return as outputs the current population label k = G(X_s), where X_s is the set of features in the stack, and a confidence measure α, 0 ≤ α ≤ 1, of the decision that X_s indeed comes from population k.

The population classifier can err in two cases. As a new population enters the stack, an optimal population classifier would immediately recognize the data as belonging to a new population and change the grade classifier accordingly. However, there is some latency in the identification of a new population. The machine continues to classify the grade by the old grade classifier until it notices that something has changed. Subsequently, the stack continues to fill with the second population. The population classifier may classify the mix as a third population, and sort the items according to its decision. As the first population empties out of the stack, the population classifier should settle on the correct identification of the new population.

Optimization of the stack length N_q was studied previously in Picus and Peleg (1999), where a minimum error criterion was developed. A shorter stack leads to less 'lag' in the decision to change the grade classifier, resulting in fewer items belonging to the new population being classified by the grade classifier of the previous population. However, too short a stack does not enable the population classifier to confidently identify the actual prototype population. As described in the previous study, rather than using a fixed length for all populations, the stack length can be allowed to vary dynamically, according to the current confidence level of the population classifier decision. The frequency of population transitions in actual packing house conditions remains to be tested.

3.1. Identifying prototype populations with a cumulative distribution function

The cumulative distribution function (CDF) incorporates within it the class-dependent probability distributions and the a priori probability of each class in the data. It therefore reflects the parameters needed to calculate the a posteriori probabilities. If a classifier can be designed that accurately estimates the a posteriori class-dependent probabilities, it will approach the optimal Bayes classifier (Duda & Hart, 1973). Any changes in the data that mandate a change in the Bayes classifier must be expressed in the a posteriori probabilities, and will change the CDF. Therefore, the CDF is a good starting point for a population classifier statistic.

The sample estimate of the CDF (created by sampling the random variable and, for each value, calculating the proportion of data with that value or less) is also much less noisy than its derivative, the probability density function (PDF), most often estimated with histogram techniques. The CDF must be a monotonically increasing function, and its steps are in increments of 1/N, where N is the number of data points in the sample. The PDF estimate, however, is dependent on the difference between consecutively ordered values of the sample. It usually requires some signal-processing type of smoothing to provide useful results.
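The sample CDF estimate described above is straightforward to compute; this sketch (our naming) returns a function that reports, for any value, the fraction of the sample at or below it, rising in steps of 1/N:

```python
# Sample estimate of the CDF: the proportion of data at or below each
# value. Monotonically non-decreasing, with steps of 1/N.
def ecdf(sample):
    xs = sorted(sample)
    n = len(xs)
    return lambda v: sum(1 for x in xs if x <= v) / n

F = ecdf([3, 1, 4, 1, 5])
```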

The CDF is visually less satisfying, since it is not obvious from the S-shaped curve whether the distribution is unimodal, skewed, or contains outliers. However, a wide variety of theoretical non-parametric techniques for analysing differences between CDFs are known in the literature (Duda & Hart, 1973; Fukunaga, 1990).

Peleg and Ben-Hanan (1993) trained a neural network to identify the CDF for the population. They tested a two-population data set collected on apples, with limited success. The neural network was built in such a fashion that it could return not only the distance metric but also a measure of the confidence in its decision, as described in the previous paragraph. In the present study, the Kolmogorov-Smirnov goodness-of-fit test was found to be more useful, and is described in detail below.

3.2. Population classification by the Kolmogorov-Smirnov statistic

The Kolmogorov-Smirnov two-sided statistic T, as described in Press et al. (1992), was proposed by Kolmogorov and tabulated by Smirnov. It is defined as the maximum difference between two cumulative distribution functions:

T = max |S(X_1) - S(X_2)|   (9)

where S(X_n) is the sample cumulative distribution from a prototype population X_n for n ∈ {1, 2}.

Under the null hypothesis H_0 that the underlying distributions P of the two populations are equal, P(X_1) = P(X_2), the distribution of T can be tabulated as a function of the number of samples in X_1 and X_2. An analytical expression to calculate the significance of the T value was proposed by Stephens (Press et al., 1992). The probability that the population distributions are the same, given a particular value of T, is

P(H_0) = P[P(X_1) = P(X_2) | T, N_1, N_2] = f_Stev(λ)

λ = T [√N_e + 0.12 + 0.11/√N_e]   (10)

f_Stev(λ) = 2 Σ_{j=1}^∞ (-1)^{j-1} e^{-2 j² λ²}

N_e = N_1 N_2 / (N_1 + N_2)

where λ is a derived factor, f_Stev(·) is an infinite series, N_e is the effective number of points, N_1 is the number of points in sample X_1, and N_2 is the number of points in sample X_2.

Figure 1 shows the expected value of the probability of the null hypothesis for differing values of λ. The factor λ is calculated from T, N_1 and N_2, and from it the value of the confidence level α = P(H_0) = f_Stev(λ) is calculated. Note that α decreases monotonically, and is dependent on the product of N_e and T. Given two known CDFs representing two prototype populations, and a third CDF based on a sample from one of the prototype populations, it is possible to calculate the λ values between the sample and the known populations. The sample is most likely to have been drawn from the prototype population with the smaller λ.

Fig. 1. The expected value of the analytical function f_Stev(λ)
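Eqn (10) can be transcribed almost directly. In the sketch below, the early-exit tolerance and the clamp to [0, 1] for very small λ follow common practice for this alternating series (an assumption on our part, not something the paper specifies):

```python
import math

# Significance function f_Stev of Eqn (10): the probability of the null
# hypothesis as a function of lambda. The series alternates and decays
# quickly; results are clamped to [0, 1].
def f_stev(lam, terms=100):
    if lam <= 0:
        return 1.0
    total = 0.0
    for j in range(1, terms + 1):
        term = (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, 2.0 * total))

def lam_from_T(T, n_e):
    # lambda = T (sqrt(Ne) + 0.12 + 0.11 / sqrt(Ne)), from Eqn (10)
    s = math.sqrt(n_e)
    return T * (s + 0.12 + 0.11 / s)
```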

The λ statistic can act as a population classifier, returning a population label k for a given sample of feature vectors X_s and a training set of K prototype populations:

k = arg min_n (λ_n),  n ∈ {1, 2, ..., K}

λ_n = T_n √N_qe,  N_qe = N_q N_k / (N_q + N_k)   (11)

T_n = max |S(X_n) - S(X_s)|

where S(X_s) is the CDF for the current sample, S(X_n) is the CDF for population n, N_q is the stack length, and N_k is the number of data in the training set for population k. The probability α = f_Stev(λ) indicates whether the CDFs are, indeed, drawn from the same prototype population.
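Putting Eqns (9) and (11) together, the CDF-based population classifier can be sketched as follows (names are ours; a scalar feature is used for brevity, whereas the paper works with feature vectors):

```python
import math

# Sketch of the population classifier of Eqn (11): compute the
# Kolmogorov-Smirnov distance T_n between the stack sample and each
# prototype population, scale by the effective sample size, and return
# the label with the smallest lambda_n.
def ecdf_at(sample, v):
    return sum(1 for x in sample if x <= v) / len(sample)

def ks_distance(a, b):
    # maximum difference between the two sample CDFs, Eqn (9)
    pts = sorted(set(a) | set(b))
    return max(abs(ecdf_at(a, v) - ecdf_at(b, v)) for v in pts)

def classify_population(stack, prototypes):
    # prototypes: {label: training sample for that population}
    lams = {}
    for k, train in prototypes.items():
        n_qe = len(stack) * len(train) / (len(stack) + len(train))
        lams[k] = ks_distance(train, stack) * math.sqrt(n_qe)
    return min(lams, key=lams.get)

protos = {1: [0, 1, 2, 3, 4, 5], 2: [10, 11, 12, 13, 14, 15]}
stack = [10, 12, 12, 14]
```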

4. Simulation of the adaptive sorting algorithm

Adoption of the adaptive sorting algorithm makes new demands on sorting machine users and manufacturers. A user-friendly proof of the advantages of the adaptive techniques is necessary to convince commercial interests. The possible reduction in classification errors must be clearly demonstrated in actual packing house conditions. A software tool was developed for this purpose that permits use of the adaptive algorithm while using a unified data input and presentation interface. Both synthetic data and actual data may be used with the program.

The simulator was built in Matlab (Mathworks, 1994), using its Graphical User Interface (GUI) capabilities to display the data and significant results, while allowing the user to easily change the simulation environment. Data are entered into the simulator in the form of a Matlab file containing the data itself as a matrix, the reference class labels, and the known population labels. The algorithm used for determining the optimal number of prototype populations is external to the simulator. The user chooses the file name from a dialog box, then selects the variables representing the data using a GUI front end.

After entering the data, the training set is chosen from the data, using either a fixed fraction of each population or a fixed number of data points from each population. Then the grade and population classifiers must be trained.

Various grade and population classifiers can be chosen through menus. The simulator currently includes k-nearest neighbour (Duda & Hart, 1973) and linear discriminant (SAS, 1985) grade classifiers. The population classifiers currently included are the Kolmogorov-Smirnov goodness-of-fit test and a classifier based on the differences in stack mean values, following control chart strategies.

Once the user chooses the stack size, the simulator can begin running through the test data. As the simulation runs, it displays a series of 'sorting matrices' (Michie et al., 1994). These compare the actual grade labels ω_r, as determined by a reference sensor, to those determined by the various adaptive classifier techniques described previously in Eqn (6), and display them in a matrix format with the reference label assigned to the rows and the estimated label assigned to the columns.
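The 'sorting matrix' display reduces to a confusion matrix with reference labels on the rows and estimated labels on the columns; a minimal sketch (our naming):

```python
# Sketch of a 'sorting matrix': rows are reference labels, columns
# are the labels estimated by the classifier under test.
def sorting_matrix(reference, estimated, n_grades):
    m = [[0] * n_grades for _ in range(n_grades)]
    for r, e in zip(reference, estimated):
        m[r][e] += 1
    return m

ref = [0, 0, 1, 1, 2]
est = [0, 1, 1, 1, 2]
M = sorting_matrix(ref, est, 3)   # off-diagonal entries are errors
```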

After processing the entire test data, the program displays information about the performance of the strategies, including the number of items processed and the number of errors made by the population classifier used in the adaptive technique, as well as the weighted classification error measure Cω described in Peleg (1985):

Cω = 1 - Pω = 1 - Σ_i Pω_i W_i   (12)

where Pω is the complement of Cω and is termed the 'Weighted Grade Purity Index', while Pω_i is the pure products fraction of grade i resulting from the sorting operation. The weighting function W_i = K_i P_i / Σ K_i P_i is composed of the relative costs or penalties K_i for misclassifying grade i, while the a priori grade probabilities P_i are determined by the proportion of grade i in the sorted produce stream. In practice, larger penalties K_i are associated with the higher priced grades, but in the context of this paper equal penalties for all grades, K_i = 1, i ∈ {1, 2, 3, ..., C}, are assumed.

The user can then change parameters and repeat the run, or save all the data and parameters for a future session. The simulator uses the breadth of the GUI capabilities of Matlab version 4 and is more fully described in Picus (1998).

Fig. 2. Schematic description of the hierarchical classification system for the classification of features into two tags F1 and F2, and the subsequent agglomeration of the tags to form an item grade ω
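Eqn (12) can be sketched directly; the grade purities, priors and penalties below are invented numbers for illustration:

```python
# Weighted classification error of Eqn (12): C = 1 - sum_i P_i W_i
# with weights W_i = K_i p_i / sum(K_i p_i), where p_i are a priori
# grade proportions, K_i misclassification penalties, and P_i the
# pure-product fraction of grade i after sorting.
def weighted_error(purity, priors, penalties):
    denom = sum(k * p for k, p in zip(penalties, priors))
    weights = [k * p / denom for k, p in zip(penalties, priors)]
    return 1.0 - sum(pu * w for pu, w in zip(purity, weights))

# Equal penalties (K_i = 1), as assumed in the paper.
c = weighted_error(purity=[0.9, 0.8], priors=[0.5, 0.5], penalties=[1, 1])
```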

5. Adaptive sorting of dates

Sorting experiments were conducted with artificial data sets. These were useful for studying the structure and operation of adaptive sorting by prototype populations. They were furthermore very useful for exploring sensitivities to changes in the system's parameters. For the sake of brevity, only experiments with real data, from a commercial date sorting operation, are reported.

This experiment demonstrates that the global feature spaces of commercial produce grading systems can be beneficially subdivided into prototype populations. Furthermore, these prototype populations are shown to be sufficiently dissimilar that a population classifier can be designed to distinguish between them; by using separate grade classifiers for each, the overall misclassification rate can be significantly reduced.

Dates are currently sorted by hand, a labour-intensive operation that is error-prone and expensive. The fruit is spread across a conveyor belt, and the human sorter must separate the fractions by moving each item to the proper partition. Most of the sorting criteria are visual: colour, size, shape, blemishes, folds and blistering in the skin, shininess, and uniformity. However, the sorter also feels the date for softness and, while moving it, can change the classification.

A machine for sorting dates was recently installed in the Ardom Regional Packing House, in the southern Arava region of Israel. The machine development was supervised by the authors. It sorts dates by machine vision at speeds of 20 dates/second. The manufacturers of the machine agreed to modify the software so that the machine saves the features of the last 24 dates sorted once every 5 min. The packing house tracks the source of the date lots through the packing process to properly compensate the growers; thus a priori population labels are available and were stored, as well as the time the sorting took place and the output of the sorting machine. With these modifications, the machine fulfils the necessary conditions for utilizing the adaptive sorting algorithm. The raw data were supplied in ASCII file format. Since the processing platform of choice is Matlab, a number of conversions were necessary to input the data. The data were parsed into a matrix form, with each row representing a data vector. Unreasonable data were rejected: data where the output grade label reflected the inability of the machine to reach a decision, or clear outliers on any feature that represented 'bugs' in the sorting machine software, such as buffer overflow or division by zero errors.

The resulting data set contains approximately 20 000 data vectors, arranged in a matrix with a row for each point and 31 columns. The first represents the time the date passed through the machine. The second is the shipment identification (ID); every incoming shipment of dates receives a unique label which is used to track its progress through the packing house and to compensate the grower. The third is the machine that sorted the date (there are two machines working in parallel). Columns 4-28 are features calculated on-line for each date by the sorting machine. The manufacturer requested that the description of these features not be revealed. An auxiliary grade label termed F1 is in column 29, and a second auxiliary grade label named F2 appears in column 30. The actual grade quality label is in column 31.

A unique demi-population label was constructed from the following: the time and date sorted; the shipment ID; and the machine sorted on (1 or 2).

5.1. Feature selection and date grade classifier design

The general structure of the hierarchical classifier used by the machine is depicted in Fig. 2. In classification phase 1, a first subset of the set of 25 features is used to compute the first auxiliary grade label F1 (an integer 1-5). In the second classification phase, a linear discriminant classifier uses a second subset of the 25 features to compute a secondary score F2 (an integer 1-4). In the third phase, the scores F1 and F2 are combined in various ways to yield quality grade labels. This combination of the F1 and F2 values, and even the number of quality grades itself, is fixed by the destination market supply and demand situations and the time of year.

Level F1 is based on a single feature, while F2 is obtained from all 24 features. Although the values of F1 and F2 are supplied in the data set, the feature composition of F2 was not disclosed by the manufacturer. The labels F1 and F2 were adopted as measurements by a reference sensor, and the auxiliary grade labels of the machine as reference labels. The task for the classifier is to reconstruct F1 and F2 given the data feature vector. As the estimation of F1 based on a classifier trained on feature 4 alone was quite close (less than 1% misclassification), it was clear that there would be no further improvement by adaptively estimating the F1 label. Consequently, the goal of the sorting simulation is limited to adaptively estimating F2.

In the machine, the third phase is performed by the sorting line operator. Using a touch screen for input, the operator chooses combinations of F1 and F2 to form a particular quality grade. The actual mapping of F1 and F2 to a grade label depends on the destination of the finished product and the market forces. This process is beyond the realm of the adaptive sorting algorithm.

Through consultation with the machine manufacturer, and after considerable experimentation, estimating feature subsets to approximate F1 and F2 and a hierarchical grade classifier similar to the one used in the machine (see Fig. 2) were designed. In the first phase, F1 is used to determine one of five scores. According to the first phase score, a linear classifier is chosen for the second phase of classification and a label for F2 assigned. Then the labels F1 and F2 are aggregated according to market and production demands to a grade label in phase 3.

After F1 divided the dates into subgroups, linear discriminant classifiers as described in the SAS package DISC determined the F2 estimate F̂2:

    F̂2 = g_F1(x) = min(D_uic(x))                                    (13)

where

    D_uic(x) = (x − X̄_uic)^T S(X_uic)^−1 (x − X̄_uic) + log_e ‖S(X_uic)‖ − 2 log_e P(X_uic)    (14)

such that c ∈ {F1} = {1, 2, 3, 4, 5}, and where X_uic is the data in the training set for the grade u_i with F1 = c, X̄_uic is the mean vector of X_uic, S(m) is the covariance matrix of m and P(m) is the prior probability of m.

For each population there are five different discriminant functions D_uic(x), one for each value of F1.
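The discriminant of Eqns (13)-(14) can be sketched as follows. This is an illustrative Python/NumPy sketch rather than the SAS DISC procedure actually used; ‖S‖ is interpreted as the determinant of the covariance matrix, as in standard quadratic discriminant analysis, and the function names are the author's own.

```python
import numpy as np

def fit_qda(groups):
    """groups: dict grade -> (n_samples, n_features) training matrix.
    Estimates, per grade u_i (within a fixed F1 = c): mean vector,
    inverse covariance, log-determinant and log-prior."""
    n_total = sum(len(g) for g in groups.values())
    params = {}
    for grade, X in groups.items():
        mean = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        params[grade] = (mean, np.linalg.inv(S),
                         np.log(np.linalg.det(S)),
                         np.log(len(X) / n_total))
    return params

def classify(x, params):
    """Assign the grade minimising the discriminant D of Eqn (14)."""
    def D(p):
        mean, S_inv, logdet, logprior = p
        d = x - mean
        return d @ S_inv @ d + logdet - 2.0 * logprior  # Eqn (14)
    return min(params, key=lambda g: D(params[g]))      # Eqn (13)
```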

This classifier is based on multivariate normal (Gaussian) distributions. However, the data are decidedly non-normal. Each feature was examined for skew (the degree of asymmetry around the mean). As some of the features were highly skewed, a sequence of Tukey ladder functions (…, x^−2, x^−1, x^−1/2, x^−1/4, log(x), x^1/4, x^1/2, x, x^2, …) was applied to each feature until the skew was less than 1.0 (Tukey, 1977). While the grade classifier construction has been described at some length here, the adaptive sorting algorithm is external to the particular grade classifier and assumes the machine manufacturer or the packing house staff can provide a suitable grade classifier. The specific case in hand demonstrates the ability of the adaptive classifier to incorporate even a complicated grade classifier.
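The skew-reduction step can be sketched as follows. This is a hedged illustration: the exact order in which the ladder rungs were tried is not stated in the text, so this sketch walks from the identity toward progressively stronger compressive transforms (with x^2 last, for left-skewed features) and keeps the first rung that brings the skew below 1.0. Strictly positive feature values are assumed, as the log and negative-power transforms require.

```python
import numpy as np

# Tukey's ladder of powers (Tukey, 1977), in an assumed trial order.
LADDER = [
    ("x",      lambda x: x),
    ("x^1/2",  np.sqrt),
    ("x^1/4",  lambda x: x ** 0.25),
    ("log",    np.log),
    ("x^-1/4", lambda x: x ** -0.25),
    ("x^-1/2", lambda x: x ** -0.5),
    ("x^-1",   lambda x: 1.0 / x),
    ("x^-2",   lambda x: x ** -2),
    ("x^2",    lambda x: x ** 2),
]

def skewness(x):
    """Sample skewness: third central moment over cubed standard deviation."""
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

def reduce_skew(x, limit=1.0):
    """Return (rung name, transformed feature) for the first ladder rung
    whose absolute skew is below `limit`."""
    for name, f in LADDER:
        y = f(x)
        if abs(skewness(y)) < limit:
            return name, y
    return "x", x  # no rung succeeded: fall back to the raw feature
```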

A blind stepwise process (patterned after the STEPWISE procedure in SAS) of feature selection was used to choose the features to be used for estimating F2. An upper limit of nine features yielded acceptable performance. Using a global training set, this classifier provided an overall misclassification rate of 35% versus the reference labels given by the machine, typical for an agricultural produce sorting task.

5.2. Partitioning the data into prototype populations

The first step in partitioning the data into prototype populations involves choosing an optimal feature for characterizing prototype populations by CDFs and the T statistic. Some consideration was given to using a two- or higher-dimensional form of the T statistic. These have been developed by Smallwood (1996), Feltz and Goldin (1991), and others. However, one of the prime advantages of the T statistic is its non-parametric character: the test does not depend on the underlying distribution of the data. In the tests developed for higher dimensions, this characteristic is not preserved. In addition, the 'curse of dimensionality', as described in Scott (1992), requires more data to adequately describe the two-dimensional distribution than the one-dimensional distribution.

The fact that the population classifier utilizes only one feature may not be a severe handicap. In most cases, the actual classification task relies on a main feature, with nuances supplied by the others. Many of the acquired features are highly interdependent. Any change in the makeup of the produce stream that has repercussions for sorting will manifest itself in all the interdependent features, including the one chosen for population classification.
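A one-dimensional, non-parametric two-sample statistic of the kind used here, the maximum distance between two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic), can be sketched as follows; this is an illustrative implementation, not the paper's own T-statistic code.

```python
import numpy as np

def cdf_distance(a, b):
    """Maximum absolute distance between the empirical CDFs of samples a and b.
    Non-parametric: no assumption on the underlying distributions."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])   # the CDFs can only differ at sample points
    Fa = np.searchsorted(a, grid, side="right") / len(a)
    Fb = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(Fa - Fb).max()
```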

Fig. 4. Population classifier performance as a function of stack length

416 M. PICUS; K. PELEG

Even with the one-dimensional CDFs, optimal partitioning of the data into prototype populations is a difficult task. A necessary pre-condition for a feasible partition is that the prototype populations must be 'identifiable', i.e. the differences between the resulting CDFs must be sufficiently large. One measure of the identifiability of the prototype populations is the statistic a; another is the misclassification rate in the resultant classifier. Obviously, there may be many identifiable partitions which invariably provide a 'payoff' in terms of increasing the weighted grade purity index Pu versus non-adaptive sorting, wherein the entire data set comprises just one prototype population.

Among the various identifiable partitions, the optimal partition should provide the maximal 'payoff' max(Pu). However, the partition with the minimal population misclassification rate is not necessarily the one which provides the maximal payoff max(Pu). A full exploration of these fine details, for finding optimal prototype populations, will be included in a future paper. In the context of this work, a semi-optimal partition of the data set was determined by choosing an optimal feature for generating CDFs and reducing the number of a priori prototype populations, as follows.

The maximum distance in the training-set sample CDFs between all possible binary combinations of a priori known data populations was determined. Then an agglomerative process (based on the CLUSTER procedure of SAS) of combining populations based on this distance measure was carried out. This resulted in the dendrogram depicted in Fig. 3, which shows the estimated confidence that the populations are different. The dendrogram for each feature was examined, and the feature with the maximum separation of the a priori populations was chosen.

The dendrogram also provided a rationale for combining populations. The original 20 populations were reduced to 16 by combining those with a low (<0.6) probability of coming from different populations.
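The merging step can be sketched as follows. This simplified illustration replaces the SAS CLUSTER-style agglomeration with a union-find pass over all population pairs: any pair whose confidence of being different falls below the 0.6 threshold is merged. The `conf(i, j)` function is assumed supplied (e.g. read off the dendrogram of Fig. 3).

```python
def merge_populations(n, conf, threshold=0.6):
    """Merge populations i, j whenever conf(i, j) < threshold.
    Returns a new (0-based) population label for each of the n old ones."""
    parent = list(range(n))
    def find(i):                    # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if conf(i, j) < threshold:   # likely the same population: merge
                parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    return [roots.index(find(i)) for i in range(n)]
```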

Fig. 3. Dendrogram of the probability of differences between the a priori populations

6. Results

After designing the classifiers and identifying the source populations, the sorting algorithm was applied to the data. The hypothesis that populations could be identified in the data was tested by training a population classifier. The hypothesis that dividing the data into populations and tuning a grade classifier for each population would reduce misclassification rates was also examined. Finally, the overall classification strategies (adaptive and non-adaptive) were compared.

6.1. Population classifier performance

The performance of the population classifier was tested by dividing the data into a training and a test set. The test set data were formed into an indexed sequence, and the data fed into a FIFO stack for classification. The length of the stack was varied. The results are depicted in Fig. 4. It may be seen that the population classifier exhibits the expected behaviour. For a given training set size, an optimum stack length is obvious.

The misclassification rate has a minimum as a function of the stack size: on the one hand, increasing the stack length improves the accuracy of the classification, but it also increases the number of classifications made while the stack is in transition between two populations.
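The FIFO-stack population classifier described above can be sketched as follows: the chosen feature of the most recent fruits is held in a fixed-length stack, and the stream is labelled with the prototype population whose reference CDF is nearest to the stack contents. This is an illustrative Python sketch; the prototype reference samples and function names are the author's own, and the max-CDF-distance stands in for the paper's T statistic.

```python
import numpy as np
from collections import deque

def cdf_distance(a, b):
    """Maximum absolute distance between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    return np.abs(np.searchsorted(a, grid, side="right") / len(a)
                  - np.searchsorted(b, grid, side="right") / len(b)).max()

def classify_stream(stream, prototypes, stack_len=100):
    """prototypes: dict label -> reference sample of the chosen feature.
    Returns the population label assigned after each incoming fruit."""
    stack, labels = deque(maxlen=stack_len), []
    for value in stream:
        stack.append(value)               # FIFO: oldest fruit drops out
        s = np.array(stack)
        labels.append(min(prototypes,
                          key=lambda p: cdf_distance(s, prototypes[p])))
    return labels
```

A short stack reacts quickly to a population change but classifies noisily; a long stack is accurate within a population but lingers on stale data during transitions, which is exactly the trade-off of Fig. 4.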

6.2. Adaptive versus non-adaptive grade classifier performance

The prepared data and grade classifier were input to the adaptive sorting simulator. Table 1 compares the weighted grade contamination index Cu for the grade classifiers. There are 17 grade classifiers: one for each of the 16 prototype populations, and another global one trained on data from all the populations together.

Table 1
Non-adaptive grade classifier performance; training set size is 30% of each population

                          Weighted grade contamination index
Training set source       Test set from trained     Test set from
population                population only           entire data set

1                         22.4                      39.2
2                         29.9                      45.8
3                         20.4                      43.2
4                         23.6                      43.6
5                         15.8                      39.5
6                         13.5                      49.9
7                         23.3                      64.9
8                         16.2                      39.8
9                         27.9                      36.1
10                        19.5                      51.9
11                        29.3                      47.3
12                        21.0                      43.8
13                        11.6                      48.2
14                        31.6                      51.6
15                        24.3                      46.7
16                        21.8                      58.4
All                                                 33.3


Each classifier was applied to a test set taken from the appropriate population, and another test set taken from the entire data set. The test set data were classified and the estimated quality value compared to the known reference value. The incorrect decisions were counted and the weighted grade contamination index calculated. It is obvious that each classifier performed best on the population for which it was trained. In addition, each of the single-population classifiers performed worse on the entire data set than the global classifier.

The size of the training set used for the various classifiers was altered while all other parameters were held constant. Likewise, the size of the stack used in the population classifier and the size of the data set were varied independently while all other parameters were held constant.

The test set data were classified and the estimated quality value compared to the known reference value.

Table 2
Overall classifier weighted grade contamination index Cu; stack length at 100 and training set size at 30%

Classifier strategy                                      Misclassification rate

Worst case non-adaptive                                  64.9
Best case non-adaptive (u_t)                             33.3
Adaptive with non-erring population classifier (u_p)     22.6
Adaptive with erring population classifier (u_a)         26.2

Various sizes of training sets were run through the process. The classifier was also trained and tested on each a priori population separately. Again, the incorrect decisions were counted as a percentage of the data. Table 1 consolidates the misclassification rates of the grade classifier, by training and test set.

The misclassification rates clearly demonstrate the advantage of adaptive sorting. Using the same training set, a clear reduction in misclassification can be obtained by tailoring the classifier to the particular feature space currently being sorted. The table also emphasizes the price of a mistake in the population classifier: using the wrong grade classifier for any extended period will undo all the reduction in overall misclassification.

Table 2 presents the overall comparison of the different classification strategies. The two optimal classifiers are compared to the adaptive classifier. In real-life situations, the non-adaptive classifier performance will be significantly worse, as the training set cannot anticipate all possible variation in the produce. The adaptive classifier avoids this problem through the identification of new populations. It can notify the packing house staff that something has changed and adjustments are necessary well before post-sorting quality control (if it exists).

7. Conclusions

A detailed description of an adaptive classification method based on observations of human assessment of fruit quality was presented. At the root of the simulation is the concept of prototype populations. This concept mimics the reaction of the experienced human sorter as the produce stream changes. The description was augmented by an example of sorting date fruits, and the effect of changing various parameters in the sorting process on the misclassification rate was explored. While the scope of the project was to develop the theory behind the adaptive sorting scheme, the actual simulation implementation takes into account the need to convince non-technical sorting line managers of the superiority of the adaptive approach. The simulation software can be used with almost any type of grade classifier. The adaptive algorithm itself can easily be incorporated into an existing sorting machine, with little intervention in the current sorting software.

References

Alchanantis V; Peleg K; Ziv M (1993). Classification of tissue culture segments by color machine vision. Journal of Agricultural Engineering Research, 55, 299-311

Davanel A; Guizard T; Labarre T; Sevila F (1988). Automatic detection of surface defects on fruit using a vision system. Journal of Agricultural Engineering Research, 41, 1-9

Duda R; Hart P (1973). Pattern Classification and Scene Analysis. Wiley, New York

Feltz C; Goldin G (1991). Generalization of the Kolmogorov-Smirnov goodness-of-fit test, using group invariance. DIMACS Technical Report, pp 91-125

Fukunaga K (1990). Introduction to Statistical Pattern Recognition, 2nd Edn. Academic Press, San Diego, CA

Gutman P; Peleg K; Ben-Hanan U (1994). Classification by varying features with an erring sensor. Automatica, 30, 1943-1948

Mathworks Inc. (1994). Matlab User Guide, Ver. 4. The MathWorks, Natick, MA

Michie D; Spiegelhalter J; Taylor C (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York

Peleg K (1985). Produce Handling, Packaging and Distribution. Avi Publishing Co., Westport, CT, USA

Peleg K; Ben Hanan U (1993). Adaptive classification by neural net based prototype populations. International Journal of Pattern Recognition and Artificial Intelligence, 7, 917-993

Picus M (1998). Optimal adaptive classification by prototype populations. Doctoral Thesis, Technion, Haifa, Israel

Picus M; Peleg K (1999). Optimal adaptive classification of agricultural produce. Computers and Electronics in Agriculture, 22, 11-27

Press W; Teukolsky S; Vetterling W; Flannery B (1992). Numerical Recipes in Fortran, 2nd Edn. Cambridge University Press, Cambridge

SAS Institute Inc. (1985). SAS User's Guide: Statistics, Version 5 Edn. SAS Institute Inc., Cary, NC

Scott D (1992). Multivariate Density Estimation. Wiley, New York

Smallwood R (1996). A two-dimensional Kolmogorov-Smirnov test for binned data. Physics in Medicine and Biology, 41, 125-135

Tukey J (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA