
Pattern Recognition Letters 15 (1994) 111-123 February 1994 North-Holland

PATREC 1157

Adaptive sorting by prototype populations

Kalman Peleg and Uri Ben-Hanan

Agricultural Engineering Department, Technion, Israel Institute of Technology, Haifa 32000, Israel

Received 23 October 1992

Abstract

Peleg, K. and U. Ben-Hanan, Adaptive sorting by prototype populations, Pattern Recognition Letters 15 (1994) 111-123.

An algorithm for unsupervised adaptive sorting is presented, based on a finite number of 'prototype populations', with distinctly different feature distributions, each representing a typically different source population of the inspected products. Updated feature distributions, of samples collected from the currently sorted products, are compared to the distributions of the stored prototype populations, and accordingly the system switches to the most appropriate classifier. Although the goal is similar to the objectives of previously proposed 'Decision Directed' adaptive classification algorithms, the present algorithm is particularly suitable for automatic inspection and classification on a production line, when the inspected items may come from different sources. The practical feasibility of the approach is demonstrated by two synthetic examples, using Bayes classifiers. This is followed by an applied example, wherein two prototype populations of apples are sorted by size, derived by machine vision. It is shown that misclassification by adaptive classification is reduced, in comparison to non-adaptive classification.

Keywords. Adaptive classification, agriculture, automatic sorting, machine vision.

1. Introduction

Automatic sorting on a production line in general, and particularly sorting and sizing of agricultural products, poses two special problems:

(a) The class conditional distributions of the characterizing features which serve as classification indexes may change with time, because the products come from various sources, e.g. different strains, growth conditions and locations, seasonal variations, etc. Also, the a priori probabilities are usually not constant, because the relative composition of the grades in the product stream input to the classification system may change with time. An optimal classification algorithm must be able to detect and adapt to these changes.

(b) The criterion for optimality is the misclassification with respect to a reference standard, which may

be set by a 'reference sensor' or subjectively by a panel of human expert sorters, while the characterizing features are measured by an 'estimator sensor' of the on-line inspection system. The measurement scale of the 'estimator sensor' and the physical property it measures are often quite different from the 'reference sensor' or the judgments of the human sorters.

The meaning of 'reference sensor' and 'estimator sensor' in the context of this paper will be presented by an example, wherein apples are sized by machine vision, while the fruit volume is chosen as the measure of size, Ben-Hanan (1992). By immersing each fruit in a calibrated beaker and noting the rise of the water level, we can measure its volume precisely. The fruit volume may also be measured automatically by machine vision, albeit with less precision. Here, the standard fruit volume threshold, for sizing the apples into 'small' and 'large' categories, is set by a precise

0167-8655/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0167-8655(93)E0017-I


Volume 15, Number 2 PATTERN RECOGNITION LETTERS February 1994

'reference sensor' while the 'estimator sensor' is clearly less accurate.

Classical pattern recognition strives to classify objects in a scene, by feature vectors, which tend to cluster naturally in feature space. In the above example, there are no natural clusters of 'small' and 'large' apples that can be identified. Here, the true class membership is determined by a standard threshold, set by a 'reference sensor', in one feature space, while the task of the 'estimator sensor' is to form a similar threshold in another feature space, such that the misclassification about the standard threshold is minimized. The standard threshold may be measured explicitly by a precise sensor, or it may be set implicitly by expert sorters.

A solution to these problems is proposed herein, by an unsupervised adaptive classification scheme, based on a finite number of 'prototype populations', with distinctly different feature distributions, stored in the computer, each representing a typical source population of the inspected product. Updated feature distributions, in statistically representative samples collected from the currently sorted products, are compared to the distributions of the stored prototype populations, and accordingly the system switches to the most appropriate classifier.

The goal of this approach is similar to the objectives of previously proposed 'Decision Directed' adaptive classification algorithms, as for example in Kazakos and Davisson (1980), McAulay and Denlinger (1973), Nguyen and Lee (1989), Stirling and Swindlehurst (1987), Young and Farjo (1972), or Widrow and Winter (1988). However, our solution is particularly suitable for on-line automatic inspection and classification on a production line, when the inspected items may come from different sources.

In some 'Decision Directed' adaptive classification algorithms, the system may 'run away', i.e., a sequence of errors may result in a degradation of performance, see for example Kazakos and Davisson (1980). Since our system simply switches between a finite set of classifiers, a 'runaway' situation is virtually impossible.

The practical feasibility of the approach is first demonstrated by two simple synthetic examples, using Bayes classifiers. This is followed by an applied example, wherein two prototype populations of apples are recognized and sorted by size, derived by machine vision.

2. Continuously adaptive classification

In this section we introduce the concept of adaptive sorting by prototype populations and show how a continuously self-adapting classifier can be constructed, for a special group of multi-feature, two-class systems. Most practical produce-sorting systems use just one feature, such as size or mean color, whence the feature vector is reduced to a scalar x, while the optimal discriminant functions are just thresholds tj (j = 1, 2, ... is the prototype population number). Our task is to recognize j prototype populations {Aj}, with feature frequency distributions Fj(x), while sorting each into two categories {a1j} and {a2j}, with respective class conditional feature frequency distributions fj(x|1) and fj(x|2), about optimal thresholds x = tj.

It is well known that by Bayes Rule (Duda and Hart (1973)), the optimal thresholds tj are determined by the intersections of the class conditional frequency distributions fj(x|1) and fj(x|2), composed of the products of the class conditional probability density distributions and the respective priors.

In some practical produce-sorting systems, the class conditional frequency distributions of the characterizing feature may be assumed to be approximately normal, whence given N(μ1j, σ1j) and N(μ2j, σ2j), we have

\frac{1}{\sqrt{2\pi}\,\sigma_{1j}} \exp\left(-\frac{1}{2}\left[\frac{t_j-\mu_{1j}}{\sigma_{1j}}\right]^2\right) = \frac{1}{\sqrt{2\pi}\,\sigma_{2j}} \exp\left(-\frac{1}{2}\left[\frac{t_j-\mu_{2j}}{\sigma_{2j}}\right]^2\right) .   (1)

This can be readily solved for tj, by equating the natural logarithms of both sides of (1) and solving the resultant quadratic equation:

t_j = \frac{\mu_{2j}\sigma_{1j}^2-\mu_{1j}\sigma_{2j}^2 \pm \left[\left(\mu_{2j}\sigma_{1j}^2-\mu_{1j}\sigma_{2j}^2\right)^2-\left(\sigma_{1j}^2-\sigma_{2j}^2\right)\left(\mu_{2j}^2\sigma_{1j}^2-\mu_{1j}^2\sigma_{2j}^2-2\sigma_{1j}^2\sigma_{2j}^2\ln(\sigma_{1j}/\sigma_{2j})\right)\right]^{1/2}}{\sigma_{1j}^2-\sigma_{2j}^2} .   (2)
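As a quick numerical check of equation (2), the threshold can be computed directly from the two class parameters. The sketch below is ours, not part of the original paper; the parameter values are illustrative and equal priors are assumed, so that the intersection of the two pure normal curves is the threshold.

```python
import math

def bayes_threshold(mu1, s1, mu2, s2):
    """Intersection of two normal class-conditional curves, per equation (2).

    Returns the root lying between mu1 and mu2 (assumes mu1 < mu2, s1 != s2).
    """
    a = s1**2 - s2**2
    b = mu2 * s1**2 - mu1 * s2**2
    c = mu2**2 * s1**2 - mu1**2 * s2**2 - 2 * s1**2 * s2**2 * math.log(s1 / s2)
    disc = math.sqrt(b**2 - a * c)
    for t in ((b + disc) / a, (b - disc) / a):
        if mu1 <= t <= mu2:
            return t
    raise ValueError("no intersection between the class means")

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

t = bayes_threshold(0.0, 1.0, 4.0, 2.0)   # illustrative class parameters
# at the threshold, the two class-conditional curves intersect:
assert abs(normal_pdf(t, 0.0, 1.0) - normal_pdf(t, 4.0, 2.0)) < 1e-9
```

Of the two roots of the quadratic, only the one between the class means is the usable sorting threshold, as noted below.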


In some cases, e.g. when the misclassification is large, the two curves will intersect in two places, corresponding to t1j and t2j, but normally there is just one intersection tj that fulfills μ1j ≤ tj ≤ μ2j, assuming μ1j < μ2j.

An interesting special case is when the variances of the two classes are equal, that is σmj = σ (m = 1, 2), which allows continuous unsupervised adaptive sorting. We shall examine this case in some detail, because it can be easily treated analytically, and the line taken will point the way to adaptive sorting in more complicated systems. It is easy to see that when σmj = σ, equation (1) leads directly to

t_j = (\mu_{1j}+\mu_{2j})/2 = \mu_j \quad (\text{if } \sigma_{mj} = \sigma).   (3)

Clearly, Fj(x) = fj(x|1) + fj(x|2), but generally Fj(x) is not normal, even when fj(x|1) and fj(x|2) are normal; however the means μj of {Aj} are always equal to tj, if σmj = σ.

In Figure 1A a prototype population is depicted, with F1(x) composed of f1(x|1) = N(−5, 3) and f1(x|2) = N(1, 3). Note that the area below F1(x) is normalized to 1, while f1(x|1) and f1(x|2) quantify both the respective probability density distributions and the priors, i.e., they represent actual training sets. By equation (3), the optimal threshold is t1 = (−5 + 1)/2 = −2, which minimizes the misclassification, quantified by the shaded areas about t1. If the mean of the sorted population changes, say to μ2 = 4, as in Figure 1B, the threshold t1 = −2 will no longer be optimal and the misclassification will increase dramatically. If the threshold is updated to t2 = 4, by equation (3), the misclassification will remain minimal, as depicted by the shaded areas about t1 and t2.

Figure 1. Continuously adaptive, two-class sorting systems, when the class conditional frequency distributions of the features are Gaussian with equal variances. (A) F1(x) = N(−5, 3) ∪ N(1, 3). (B) F2(x) = N(1, 3) ∪ N(7, 3). (C) Fu(x) = F1(x) ∪ F2(x). (D) Example of a two-dimensional case.

Variations of the means μj of prototype populations can be easily tracked on-line, by storing a stack Sκ(x) of a representative sample of the last κ classified items. As a new item enters the sorting system, its characterizing feature measurement x is added to the records in the stack, while the measurement of the oldest item is removed. Assuming that the sample of records in the stack adequately represents the currently sorted prototype population {Aj}, the mean of the stack μs can be used to adapt the threshold, e.g. μs = μj = tj, provided that fj(x|1) and fj(x|2) are normal and σmj = σ.
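In this equal-variance case the adaptation loop amounts to little more than a moving average. A minimal sketch (our illustration, with a hypothetical stack length κ and grade labels; not code from the paper):

```python
from collections import deque

class AdaptiveThreshold:
    """Track the population mean with a FIFO stack of the last kappa readings
    and use it as the sorting threshold (equation (3), equal-variance case)."""

    def __init__(self, kappa=500, t0=0.0):
        self.stack = deque(maxlen=kappa)  # oldest reading drops out automatically
        self.t = t0                       # current threshold t_j

    def classify(self, x):
        grade = 'small' if x < self.t else 'large'
        self.stack.append(x)
        self.t = sum(self.stack) / len(self.stack)  # stack mean mu_s estimates t_j
        return grade
```

Feeding readings drawn from a shifted source population drives the threshold from the old mean toward the new one, with no retraining step.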

We have thus created (for this simple special case) a continuously self-adapting sorting system, that automatically tracks the means μj of the source populations {Aj}, and adapts the thresholds tj, so as to minimize the misclassification at all times. Without adaptation, we may choose any one of the possible thresholds tj (in the above example t1 = −2 or t2 = 4), to sort all prototype populations, which may result in large misclassification, when the incorrect classifier is used.

A more sensible, but in most cases practically impossible strategy, is to use a sample from the union {Au} of all possible prototype populations {Aj}, to train the classifier, which in the two-class case with two prototype populations, is simply:

\{A_u\} = \{a_{11}\} \cup \{a_{21}\} \cup \{a_{12}\} \cup \{a_{22}\} = \bigcup_j \{A_j\}   (4)

so we can write, for the two-class case, while normalizing the area underneath Fu(x) to one:

F_u(x) = f_u(x|1) + f_u(x|2) = \tfrac{1}{2}\left[f_1(x|1) + f_2(x|1) + f_1(x|2) + f_2(x|2)\right] .   (5)

This is demonstrated in Figure 1C, where the suboptimal sorting threshold tu is obviously located at the mean μu = tu of Fu(x), given by

t_u = \mu_u = (\mu_{11}+\mu_{21}+\mu_{12}+\mu_{22})/4 = (\mu_{1u}+\mu_{2u})/2 .   (6)

In the above example this gives

t_u = (-5+1+1+7)/4 = (-2+4)/2 = 1 .

This threshold is a natural compromise between t1 and t2, as demonstrated by comparing the shaded areas in Figures 1A, 1B, and 1C, which quantify the respective misclassifications.

The same line can be taken in the two-class, multivariate normal feature frequency distribution case, when the feature vectors are n-dimensional and all the covariance matrices Σj are diagonal with equal components. For the sake of brevity, we shall only demonstrate the two-dimensional case, illustrated in Figure 1D, using two characterizing features x and y. Since σmj = σ, the equal probability contours fmj(x, y) = constant are concentric circles in the xy-plane, whence an optimal Bayes classifier threshold line tj(x, y) is given by

\frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left[(x-\mu_{x1j})^2+(y-\mu_{y1j})^2\right]\right) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}\left[(x-\mu_{x2j})^2+(y-\mu_{y2j})^2\right]\right)   (7)

with which the equations of the family of threshold lines tj(x, y) of the various prototype populations can be easily obtained by equating the exponents. The slopes tg(αj) of these lines are given by

tg(\alpha_j) = -\,\frac{\mu_{x2j}-\mu_{x1j}}{\mu_{y2j}-\mu_{y1j}} .   (8)

In the two prototype populations illustrated in Figure 1D, one has μx21 − μx11 = 7 − 5, μy21 − μy11 = 8 − 6, while in the other μx22 − μx12 = 5 − 3, μy22 − μy12 = 2 − 2, so the respective slopes are α1 = 135° and α2 = 90°. The threshold lines tj(x, y) are clearly orthogonal to the axes of symmetry CLj, that connect the centers of the respective class circles (μxmj, μymj).

Note also, that the intersections of tj(x, y) and CLj are equidistant from the centers of the class circles, marking the centers of gravity of the prototype population clusters in the xy-plane, Mxj = (μx1j + μx2j)/2 and Myj = (μy1j + μy2j)/2. In the case of the example in Figure 1D this gives Mx1 = (5 + 7)/2 = 6 and My1 = (6 + 8)/2 = 7, so t1(x, y) = x + y − 13 = 0, and Mx2 = (5 + 3)/2 = 4, My2 = (2 + 2)/2 = 2, whence t2(x, y) = 4 − x = 0.

Thus, an unsupervised continuously adaptive sorting system in this two-dimensional feature space can be constructed, similarly to the previously described one-dimensional case. However, here we have to track both αj and the coordinates of the center of gravity Mxj, Myj of the prototype population clusters. Variations of αj and Mxj, Myj in the prototype populations can easily be tracked on-line, by keeping a two-dimensional stack Sκ(x, y) of a representative sample of the last κ classified items. As a new item enters the sorting system, its characterizing feature measurement x, y is added to the records in the stack, while the oldest record is removed. Assuming that the sample of records in the stack adequately represents the currently sorted population {Aj}, the angle αj + 90°, that the axis of symmetry CLj forms with the coordinate axes, and the corresponding coordinates of the center of gravity Mxj, Myj may be estimated from the records in the stack. There are several ways to do so, which for the sake of brevity will not be discussed here.
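One such way (our suggestion; the paper does not spell out a particular estimator) is to take the centroid of the stack as (Mxj, Myj) and the principal axis of its 2×2 covariance matrix as the direction of CLj, which for a 2×2 covariance needs only one arctangent:

```python
import math

def track_axis(points):
    """Centroid and principal-axis angle (degrees) of a 2-D stack of readings.

    points: iterable of (x, y) pairs from the last kappa classified items.
    Returns (Mx, My, angle), where angle is one estimate of the orientation
    of the symmetry axis CL_j.
    """
    pts = list(points)
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    # orientation of the largest-variance direction of the 2x2 covariance
    angle = 0.5 * math.degrees(math.atan2(2 * sxy, sxx - syy))
    return mx, my, angle
```

The discriminant line tj(x, y) is then the line through the centroid at the estimated angle plus 90°, matching the orthogonality property noted above.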

Extension of this approach to 3-D and n-D feature spaces follows naturally, whence one can automatically track the discriminant planes or hyperplanes, tj(x) = 0, by CLj and the centers of gravity Mj. Note that we can do so because in this special case there is no need to train the classifier by a separate training set for each prototype population. It is sufficient to observe the external structure of the prototype population clusters, in terms of the vectors CLj and Mj, to uniquely determine the discriminant hyperplanes tj(x).

3. Discrete adaptive classification

The effective number of prototype populations in the above special cases was infinite. Implicitly, each stack of the last κ classified items was considered as a separate population, for which an optimal discriminant function could be found without a new training set. Unfortunately, this method cannot be simply extended to multi-class sorting systems, because there, the optimal discriminant function cannot be uniquely deduced by just observing the geometrical shape of the prototype population cluster in feature space. Thus, when the number of classes is more than two, we need a separate training set for each prototype population, in order to determine its optimal partition into the desired categories. Of course, this is also true in the two-class case, when σmj ≠ constant, as demonstrated in Figures 2A and 2B.

In Figure 2A, the prototype population distribution F1(x) is composed of f1(x|1) = N(−7, 1.5) and f1(x|2) = N(−2, 3), while in Figure 2B, F2(x) is composed of f2(x|1) = N(−2.5, 2.5) and f2(x|2) = N(4.5, 3). By equation (2), the corresponding optimal thresholds for these two prototype populations are t1 = −4.8 and t2 = 1.0. These thresholds are optimal Bayes classifiers, which minimize the misclassification, quantified by the shaded areas.

The union of these two prototype populations {Au} = ∪j {Aj}, with distribution Fu(x) and the constituent class conditional frequency distributions fu(x|1), fu(x|2), are depicted in Figure 2C, as computed by equation (5). For easy comparison, we replotted the two prototype population distributions F1(x), F2(x) and the distribution of their union Fu(x), in Figure 2D. In this case, all the variances are not equal, so equation (6) cannot be used for computing the suboptimal threshold tu = −1.5, but tu can be easily found at the intersection of fu(x|1) and fu(x|2). Note that, as in the previous examples, the areas below Fj(x) are normalized to 1, while fj(x|1) and fj(x|2) quantify both the respective probability density distributions and the priors, i.e., they represent the actual feature frequency distributions of the training sets.

In this example, our ability to design an adaptive classifier hinges on automatically detecting when the distribution of the feature x is closer to F1(x) or F2(x), whence the correct optimal threshold t1 or t2 may be switched to. From Figure 2D it is clear that in this example the distributions of the two prototype populations are quite different, so it should not be too difficult to design a classifier to distinguish between them.

As in the previous case, we can easily track on-line the distribution of the currently sorted population, by storing a stack Sκ(x) of a representative sample of

Figure 2. Example of adaptive sorting by discrete prototype populations, when the class conditional frequency distributions are Gaussian but the variances are not equal. (A) F1(x) = N(−7, 1.5) ∪ N(−2, 3). (B) F2(x) = N(−2.5, 2.5) ∪ N(4.5, 3). (C) Fu(x) = F1(x) ∪ F2(x). (D) Replotted, prototype population frequency distributions and their union F1(x) ∪ F2(x).

the last κ classified items. As a new item enters the sorting system, its characterizing feature measurement x is added to the records in the stack, while the oldest record is removed. Assuming that the sample of records in the stack adequately represents the currently sorted population {Aj}, the distribution of the stack Fs(x) can be compared to the stored distributions F1(x) and F2(x). When Fs(x) is closer to F1(x) than to F2(x), then t1 should be used, while t2 will yield fewer misclassifications when Fs(x) is closer to F2(x) than to F1(x).
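For binned (histogram) estimates of Fs(x), F1(x) and F2(x) on a common set of bins, this comparison can be sketched as a nearest-distribution lookup. The paper does not fix a particular distance; the sum of absolute bin differences below is our illustrative choice:

```python
def nearest_prototype(stack_hist, prototype_hists):
    """Return the index of the stored prototype distribution closest to the
    stack's histogram, using the sum of absolute bin differences."""
    def dist(h1, h2):
        return sum(abs(a - b) for a, b in zip(h1, h2))
    return min(range(len(prototype_hists)),
               key=lambda j: dist(stack_hist, prototype_hists[j]))
```

Usage: with stored thresholds `thresholds = [t1, t2]`, the active classifier is simply `thresholds[nearest_prototype(Fs, [F1, F2])]`.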

Unlike in the previous special case, this classifier is not continuously self-adapting, because here we

have to switch between two distinct thresholds; however, if the difference between the prototype populations is relatively large, this is not a big hindrance. As the difference between the prototype populations decreases, it will be more difficult to distinguish between them. But, at the same time, the difference between the corresponding optimal thresholds tj will also decrease, so it will not make much difference whether the correct population match is found or not. In other words, we need adaptive classification only when there are significant differences between the prototype populations, otherwise we might as well use one classifier for all. Practically, this means that the minimal difference between the prototype populations should enable more than 50% correct population recognitions.

Note that it is immaterial whether the shapes of the prototype population clusters in feature space vary due to changes of the class conditional feature distributions, or because of changes in the priors. Thus, there is no need to consider changes of priors separately, because each classifier is trained for a distinct prototype population, which is characterized by its class feature distributions and the priors.

4. Adaptive sorting in the probability domain

In most practical sorting systems, the class conditional frequency distributions of the features fj(x|m) and the population frequency distributions Fj(x) are quite noisy. It is therefore advantageous to transform them into the probability domain by:

p_j(x|m) = \int_{-\infty}^{x} f_j(r|m)\,dr ,   (9)

P_j(x) = \int_{-\infty}^{x} F_j(r)\,dr   (10)

where r is a dummy variable of integration and pj(x|m) are the cumulative feature distributions of the m classes in each of the j populations, while Pj(x) are the probabilities of the j populations. The result
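With histogram estimates, the transform of equations (9) and (10) is just a running sum over the bins. A minimal sketch (ours, for binned data rather than the continuous integrals above):

```python
def to_probability_domain(freq):
    """Cumulative distribution of a binned frequency distribution,
    in the spirit of equations (9)-(10): P(x_k) = sum of f over bins up to k."""
    cum, total = [], 0.0
    for f in freq:
        total += f
        cum.append(total)
    return cum
```

Because the cumulative curves are monotone, much of the bin-to-bin noise in fj(x|m) and Fj(x) is smoothed out, which is the motivation for working in the probability domain.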

Figure 3. Transformation from feature frequency distributions of Figure 2 to probability distributions and plots of cross contamination functions. (A), (B), (C) and (D) are respective transformations of Figure 2.


of this transformation, for the example of Figures 2A-2D, is demonstrated in Figures 3A-3D.

The misclassifications, which were graphically depicted by the shaded areas in Figure 2, can be directly quantified in the probability domain, as illustrated in Figures 3A, 3B and 3C. To this end, we define a cross contamination function Ctj(x), for each prototype population j, composed of the individual misclassifications of grades 1 and 2, C2j(x) and C1j(x):

Ct_j(x) = \left[C_{2j}(x) + C_{1j}(x)\right] .   (11)

In general, for any pair of class frequency distributions fj(x|1), fj(x|2), we can compute these sums by

Ct_j(x) = \left\{\int_{-\infty}^{x} f_j(r|2)\,dr + \int_{x}^{\infty} f_j(r|1)\,dr\right\} = \left\{p_j(x|2) + p_j(\infty|1) - p_j(x|1)\right\}   (12)

where pj(∞|1) is the value of the integral (9) for the largest possible value of the feature, x → +∞. If the feature frequency distributions are known, as in the case of the example in Figure 2A, the cross contamination function Ct1(x), when sorting prototype population 1, can be computed by

Ct_1(x) = \frac{1}{2\sqrt{2\pi}}\left\{\int_{-\infty}^{x} \frac{1}{\sigma_{21}}\exp\left(-\frac{1}{2}\left[\frac{r-\mu_{21}}{\sigma_{21}}\right]^2\right)dr + \int_{x}^{\infty} \frac{1}{\sigma_{11}}\exp\left(-\frac{1}{2}\left[\frac{r-\mu_{11}}{\sigma_{11}}\right]^2\right)dr\right\} .   (13)

Ct2(x) can be similarly computed by using the corresponding values of μm2 and σm2 in equation (13). If the feature frequency distributions are not known, Ctj(x) can be computed by numerical integration of the feature frequency histograms, or graphically, by plotting together Ctj(x), C2j(x) = pj(x|2) and C1j(x) = pj(∞|1) − pj(x|1), as shown in Figures 3A, 3B and 3C. The optimal thresholds are obviously located at x = tj, where Ctj(tj) is minimal, while the intersections of x = tj with C2j(x) and C1j(x) quantify the minimal misclassifications in grades 1 and 2. In the case of the examples in Figures 2A, 2B and 2C, we computed Ct1(−4.9) = 0.12, Ct2(1) = 0.10 and Ctu(−1.5) = 0.23.
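The cross contamination of equation (13) can be evaluated with the normal cumulative distribution function. The sketch below is ours, assuming equal priors of 1/2 (so each class curve carries half the probability mass), and locates the threshold by scanning Ctj(x) on a grid:

```python
import math

def norm_cdf(x, mu, s):
    """Cumulative distribution of N(mu, s), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

def cross_contamination(x, mu1, s1, mu2, s2):
    """Ct_j(x) in the spirit of equation (13): grade-2 readings below x plus
    grade-1 readings above x, each class carrying probability mass 1/2."""
    return 0.5 * norm_cdf(x, mu2, s2) + 0.5 * (1.0 - norm_cdf(x, mu1, s1))

def best_threshold(mu1, s1, mu2, s2, lo, hi, steps=2000):
    """Grid-search the threshold minimizing the cross contamination."""
    grid = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: cross_contamination(x, mu1, s1, mu2, s2))
```

For the prototype population of Figure 2A (μ11 = −7, σ11 = 1.5, μ21 = −2, σ21 = 3) this places the minimum of Ct1(x) near −4.8, with a minimal cross contamination of about 0.12, in line with the value quoted above.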

When the number of classes Mc is more than two, the number of cross contamination functions will be

Mc − 1, one for each pair of adjacent classes. The overall performance of the sorting system may then be evaluated by a mean cross contamination index, based on Mc − 1 cross contaminations between the adjacent classes, as described in Peleg (1985).

In the above example, the improvement in classification accuracy by using adaptive sorting is quite dramatic, because the difference between the two prototype populations is quite large. As this difference decreases, so will the difference between the minimal cross contaminations of the grades by adaptive and non-adaptive sorting. In most practical cases one does not have an overview of all possible prototype populations, so the suboptimal threshold tu cannot be derived by a training set from their union. Thus, it is quite possible that t1 and t2 may be reversed, to sort prototype populations 2 and 1, effectively raising the cross contaminations to Ct1(1) = 0.42 and Ct2(−4.8) = 0.41, as illustrated in Figures 3A and 3B.

Observe also, from Figure 3D, that a population identifier algorithm can be simply designed in the probability domain. That is, for any feature reading x, compute the absolute differences Δj(x) between the cumulative distribution of the stack Ps(x) and the probabilities of the prototype populations Pj(x). Select the most likely prototype population {Aj} and the associated classifier, for which the difference Δj(x) is minimal:

\Delta_j(x) = \min_j\left\{\,|P_s(x) - P_j(x)|\,\right\} .   (14)
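Per reading, the identifier of expression (14) reduces to comparing one cumulative value per prototype. A sketch (ours), where the `cdf` callables stand for the stored cumulative distributions of equation (10):

```python
def identify_population(x, stack_cdf, prototype_cdfs):
    """Expression (14): pick the prototype j whose cumulative probability
    P_j(x) is nearest to the stack's cumulative probability P_s(x)."""
    ps = stack_cdf(x)
    return min(range(len(prototype_cdfs)),
               key=lambda j: abs(ps - prototype_cdfs[j](x)))
```

In practice the differences would be accumulated over many readings before switching classifiers, rather than acting on a single x.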

Occasionally, a stack may be a perfect mean of two adjacent prototype populations, whence the set of differences given by expression (14) may not have a single minimum. For example, in the simple case of two prototype populations, one may obtain:

F_s = (F_1(x) + F_2(x))/2 ,   (15)

in which case:

|F_s(x) - F_1(x)| = |F_s(x) - F_2(x)| ,   (16)

and we would not know whether classifier 1 or 2 should be used. In practice, this can happen temporarily, during the transition period between two prototype populations. In this case, we have to retain the currently used classifier, until the new prototype population dominates the distribution of the stack, whence (15) does not hold anymore.


Theoretically, expression (15) may also hold continuously, when F1(x) ≈ F2(x), but in this case it does not make much difference which classifier is used. Effectively, this means that one of the two prototype populations is redundant. There is also a very remote chance that an unknown new prototype population F3(x) enters the system, exactly in between F1(x) and F2(x), such that expression (15) holds continuously, while F1(x) is significantly different from F2(x), that is:

F_s \approx F_3(x) = (F_1(x) + F_2(x))/2 .   (17)

This means that a new prototype population and corresponding classifier should be added to the system, whence expression (14) will again yield a distinct minimum.

Admittedly, the efficiency of our approach depends on the proper choice of prototype populations. Computation resources will be wasted if there are too many very similar prototype populations. Conversely, classification accuracy will be reduced if one uses just a few very dissimilar prototype populations.

In the case of an n-dimensional feature space, there are n stacks, one for each feature. The population identifier can then be constructed as a regular classifier in a 2n-dimensional feature space, composed of the n measured components of the feature vector x = x1, x2, x3, ..., xn and the corresponding prototype population probabilities, quantified by the n cumulative distributions of the stored stacks, Psi(xi), of the last tc classified items. In practice, the dimension of the population identifier will be less than 2n, because in most cases a small subset of the features suffices to recognize the prototype populations.
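As a sketch of how such a 2n-dimensional input vector might be assembled, assuming each stack is stored as a cumulative histogram over known bin edges (the function name and data layout are our own illustration):

```python
import bisect

def augmented_vector(x, stack_cdfs, bin_edges):
    """Form the 2n-dimensional input for the population identifier:
    the n measured features x, followed by the n stack cumulative
    distribution values Psi(xi), looked up at each feature reading."""
    aug = list(x)
    for xi, cdf, edges in zip(x, stack_cdfs, bin_edges):
        k = bisect.bisect_right(edges, xi) - 1   # bin containing xi
        k = max(0, min(k, len(cdf) - 1))         # clamp to valid range
        aug.append(cdf[k])
    return aug

# One feature (n = 1): a diameter of 81 mm falls in the 80-90 mm bin.
print(augmented_vector([81.0], [[0.1, 0.4, 0.8, 1.0]],
                       [[60, 70, 80, 90, 110]]))  # -> [81.0, 0.8]
```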

5. An applied example

In this section we demonstrate how adaptive sorting can be implemented in practice, when the class conditional probability densities cannot be approximated by known analytical distributions. We shall use an example of sizing apples by machine vision in lieu of weight. Some fruit sorting machines size fruits by weight sensors, while others measure fruit diameters by various mechanical devices, as described in Peleg (1985, 1989). New machines have recently been introduced, incorporating machine vision for sorting

fruits by color, size and, to some extent, surface blemishes as well. However, many markets still require weight classification as a standard. Accordingly, we use in this example a standard fruit weight of S = 140 g to partition two prototype populations of apples, marked '{A}' and '{M}' in Figure 4, into 'small' and 'large' categories. A machine vision algorithm was developed for estimating the mean fruit diameter of each apple from a single image, whence the scattergrams in Figure 4 depict the weight of each apple versus the computed diameter. The fruit sample from population {A} (marked by + signs) included 1046 apples, while the sample from population {M} (marked by o signs) contained 1269 apples. Electronic precision scales were the 'reference sensor' in this example, while the computer vision system was the 'estimator sensor'. Misclassifications may be minimized by selecting optimal diameter thresholds, which will presently be shown to be t1 = 80 mm and t2 = 83 mm, for populations {A} and {M} respectively. Since S is fixed, the misclassified measurements fall within the shaded quadrants; hence we want to position t1 and t2 so that the number of readings therein is minimized.

Visual comparison clearly indicates the differences between the two populations. In population {A} the fruit size is evenly distributed throughout the weight range, while in population {M} most of the fruit is concentrated below 150 g. The range of readings in population {A} is confined to about 77 to 210 g, while in population {M} the range is between 60 and 295 g.

Such differences in size distributions of apples are quite common among different orchards, so the fruit samples {A} and {M} may be considered as typical examples of prototype populations.

The class conditional frequency histograms of fruit samples {A} and {M} are depicted in Figures 5A and 5B respectively. As in the previously described synthetic example, the prototype population frequency distribution of sample {A}, F1(x), is composed of the class conditional frequency histograms f1(x|1) and f1(x|2), while the distribution of sample {M}, F2(x), is composed of the histograms f2(x|1) and f2(x|2). The intersections of these histograms mark the corresponding optimal thresholds for the two prototype populations, i.e. t1 = 80 mm and t2 = 83 mm. These thresholds are optimal Bayes classifiers, which minimize the misclassifications, quantified by the shaded areas, about the standard weight S = 140 g.

Figure 4. Two prototype populations of apples from two sources, sorted into two size classes.

The union of these two prototype populations, with frequency distribution Fu(x) and the constituent class conditional frequency histograms fu(x|1), fu(x|2), is depicted in Figure 5C. For easy comparison, we replotted the two population histograms F1(x), F2(x) and the histogram of their union Fu(x) in Figure 5D. The suboptimal threshold, tu = 81 mm, was found at the intersection of fu(x|1) and fu(x|2). The class conditional frequency histograms fj(x|m) and the population frequency histograms Fj(x) are evidently quite noisy.
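Rather than locating the intersection of two noisy histograms by eye, the optimal threshold can be found by directly counting the misclassified readings for every candidate cut. This is a brute-force sketch of the idea; the bin counts below are illustrative, not the paper's data.

```python
def optimal_threshold(hist_small, hist_large, bin_edges):
    """Pick the bin edge t minimizing misclassifications, assuming items
    below t are called 'small': errors = large-class counts below t
    plus small-class counts at or above t.  Requires
    len(bin_edges) == len(hist) + 1."""
    best_t, best_err = None, float("inf")
    for i in range(len(hist_small) + 1):
        err = sum(hist_large[:i]) + sum(hist_small[i:])
        if err < best_err:
            best_t, best_err = bin_edges[i], err
    return best_t, best_err

# Illustrative counts over four diameter bins with edges 74..90 mm.
print(optimal_threshold([5, 3, 1, 0], [0, 1, 4, 6],
                        [74, 78, 82, 86, 90]))  # -> (82, 2)
```

Exhaustive search over thresholds is cheap here, since the number of bins is small, and it gives exactly the minimum-error (Bayes, for equal costs) cut that the histogram intersection approximates.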

Because the graphs of the histograms in Figure 5D are so intertwined, one may think that it is virtually impossible to construct a classifier for distinguishing

between populations {A} and {M}. However, by converting the frequency histograms in Figure 5 into cumulative histograms, as depicted in Figure 6, we may transform them into the probability domain, where most of the noise is effectively eliminated. Similarly to the line taken in the synthetic example, here Pj(x|m) are the cumulative feature distribution histograms of the m grades, while Pj(x) are estimates of the probabilities of the j prototype populations.
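The conversion itself is just a normalized running sum; a minimal sketch:

```python
def to_cumulative(freq_hist):
    """Convert a (noisy) frequency histogram into a cumulative
    distribution normalized to 1; the partial sums average out much
    of the bin-to-bin noise."""
    total = sum(freq_hist)
    cum, running = [], 0.0
    for f in freq_hist:
        running += f
        cum.append(running / total)
    return cum

print(to_cumulative([2, 3, 5]))  # -> [0.2, 0.5, 1.0]
```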

In Figure 6D, the difference between populations {A} and {M} becomes apparent, so that a population identifier can be constructed as in the synthetic example, although the difference between these two prototype populations is not quite as large.

Figure 5. Class conditional frequency distributions and prototype population frequency distributions of apple diameters. (A) and (B) are from populations {A} and {M} respectively. (C) Union of the distributions of populations {A} and {M}. (D) Replotted prototype population feature frequency distributions and their union Fu(x) = F1(x) ∪ F2(x).

As in the synthetic example, this transformation also affords easy determination of the cross contaminations, by plotting Ctj(x) and Cmj(x), as illustrated in Figures 6A, 6B and 6C. The optimal thresholds in the probability domain are evidently located at x = tj, where the cross contaminations are minimal. The intersections of x = tj with C1j(x) and C2j(x) quantify the individual minimal misclassifications, C1j(tj) and C2j(tj). The optimal thresholds t1 = 80 mm and t2 = 83 mm for populations {A} and {M} give Ct1(80) = 0.08 and Ct2(83) = 0.06 respectively. Although the difference between these two prototype populations is not large, the improvement in classification accuracy by using adaptive sorting is significant.
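The cross-contamination fractions at a given cut are straightforward to compute from the class conditional histograms. A minimal sketch (our own formulation; the cut is given as a bin index rather than a physical threshold):

```python
def cross_contamination(hist_small, hist_large, cut):
    """Given a cut index (items in bins [0, cut) are called 'small'),
    return the misclassified fraction of each class."""
    n_small, n_large = sum(hist_small), sum(hist_large)
    c_small = sum(hist_small[cut:]) / n_small   # small items called large
    c_large = sum(hist_large[:cut]) / n_large   # large items called small
    return c_small, c_large

# Illustrative counts: cutting after the second bin misclassifies
# 20% of each class.
print(cross_contamination([5, 3, 1, 1], [0, 2, 4, 4], 2))  # -> (0.2, 0.2)
```

Sweeping the cut index over all bins and plotting the two fractions reproduces curves of the kind shown in Figures 6A-6C, whose crossing region contains the optimal threshold.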

In most practical cases one does not have an overview of all possible prototype populations of apples, so the suboptimal threshold tu = 81 mm is usually not known. Thus, it is quite possible that without adaptive sorting, t1 and t2 may be reversed quite often. In this case, the intersections of x = t2 with Ct1(x) in Figure 6A and of x = t1 with Ct2(x) in Figure 6B indicate a rise in the cross contamination to Ct1(83) = 0.16 and Ct2(80) = 0.13, respectively.

Figure 6. Transformation from the feature frequency distributions of Figure 5 to probability distributions, with plots of the cross contamination functions. (A), (B), (C) and (D) are the respective transformations of Figure 5.

Note that the success of the above approach depends on careful selection of the bin size for grouping the data into the cumulative histograms. On the one hand, accuracy is improved by the better resolution of a small bin; on the other hand, smaller bins increase noise, which may also decrease the accuracy. Some experimentation with different bin sizes can indicate the optimal bin size. In the above examples we found a bin size of 1 mm to be optimal for the 55 to 110 mm apple diameter range.
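Such experimentation amounts to a small grid search: histogram both classes at each candidate bin width and keep the width whose best threshold misclassifies the fewest readings. The sketch below is our own illustration of that procedure, on toy data rather than the paper's apple samples.

```python
def misclassifications(small, large, width, lo, hi):
    """Histogram both samples with the given bin width over [lo, hi),
    then return the minimum total misclassification over all candidate
    cut positions (items below the cut are called 'small')."""
    nbins = int((hi - lo) / width)
    def hist(vals):
        h = [0] * nbins
        for v in vals:
            k = min(int((v - lo) / width), nbins - 1)
            h[k] += 1
        return h
    hs, hl = hist(small), hist(large)
    return min(sum(hl[:i]) + sum(hs[i:]) for i in range(nbins + 1))

def best_bin_width(small, large, lo, hi, widths):
    """Grid search over candidate bin widths for the fewest errors."""
    return min(widths,
               key=lambda w: misclassifications(small, large, w, lo, hi))

# Toy diameters in mm: a 1 mm bin separates the classes better than 5 mm.
print(best_bin_width([76, 77, 78, 81], [79, 82, 83, 84],
                     75, 85, [1, 5]))  # -> 1
```

In practice one would score each width on held-out data, since very small bins can win on the training sample by fitting its noise.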

Summary

We have developed an algorithm for unsupervised adaptive classification, based on a finite number of 'prototype populations' with distinctly different feature distributions, each representing a typically different source population of the inspected products. Intermittently updated feature distributions of samples, collected from the currently classified products, are compared to the distributions of the stored prototype populations, and accordingly the system switches to the most appropriate classifier.

Additional work needs to be done in two directions. One is to determine how to optimally partition the feature space into 'prototype populations', when they are not known a priori. The other is to develop a robust algorithm for detecting 'new' prototype populations, which will signal a possible need to train an additional classifier and add another prototype population to the system.

The first is a classical clustering problem (e.g. see Hartigan (1975) or Brailovsky (1991)), whence we need to find distinct subfeature spaces that form well separated prototype populations, while using a minimal subset of the features that are used for the pattern classification proper. In many practical produce-sorting systems this problem is not very detrimental, because the potential sources of different prototype populations can be identified by experience.

The second can be solved by a suitable metric that measures the difference between the new prototype population, if any, and the currently used prototype population. When a new prototype population appears, the system will automatically switch to the known prototype population that is closest to the new one. Thus, we need to determine whether the new prototype population is significantly and sufficiently different from the current population to warrant a separate classifier. Note, however, that even without a facility to identify all possible prototype populations, adaptive sorting by a limited number of prototype populations is preferable to regular sorting. The unknown prototype populations will simply be sorted by the closest classifier currently known to the system.
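One natural candidate for such a metric, working directly on the cumulative distributions the system already maintains, is a Kolmogorov-Smirnov-style statistic. The sketch below is our own suggestion, not the paper's method, and the detection threshold of 0.1 is purely illustrative.

```python
def ks_distance(cdf_a, cdf_b):
    """Kolmogorov-Smirnov-style statistic: the largest absolute gap
    between two cumulative distributions on a common bin grid."""
    return max(abs(a - b) for a, b in zip(cdf_a, cdf_b))

def is_new_population(stack_cdf, prototype_cdfs, threshold=0.1):
    """Flag a possible new prototype population when the stack is far
    from every stored prototype (threshold value is illustrative)."""
    return all(ks_distance(stack_cdf, p) > threshold
               for p in prototype_cdfs)
```

A flagged stack would then prompt training of an additional classifier, whereas an unflagged one is simply routed to the closest known prototype, as described above.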

References

Ben-Hanan, U. (1992). Optimal classification by time variable features. D.Sc. Thesis, Technion Israel Institute of Technology, Haifa 32000, Israel (in Hebrew).

Brailovsky, V.L. (1991). A probabilistic approach to clustering. Pattern Recognition Lett. 12, 193-198.

Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. Wiley, New York.

Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.

Kazakos, D. and L.D. Davisson (1980). An improved decision-directed detector. IEEE Trans. Inform. Theory 26(1), 113-116.

McAulay, R.J. and E. Denlinger (1973). A decision-directed adaptive tracker. IEEE Trans. Aerospace Electr. Syst. 9(2).

Nguyen, D.D. and J.S. Lee (1989). A new LMS-based algorithm for rapid adaptive classification in dynamic environments. Neural Networks 2, 215-228.

Peleg, K. (1985). Produce Handling, Packaging and Distribution. AVI Publishing Corp., Westport, CT.

Peleg, K. (1989). Method and Apparatus for Automatically Inspecting and Classifying Different Objects. US Patent 4884696.

Stirling, W.C. and A.L. Swindlehurst (1987). Decision-directed multivariate empirical Bayes classification with nonstationary priors. IEEE Trans. Pattern Anal. Machine Intell. 9(5), 644-660.

Young, T.Y. and A.A. Farjo (1972). On decision-directed estimation and stochastic approximation. IEEE Trans. Inform. Theory 18, 671-673.

Widrow, B. and R. Winter (1988). Neural nets for adaptive pattern recognition. IEEE Comput., March, 25-39.