

Microprocessor arrays for pattern recognition

S.M. Boxer and B.G. Batchelor

Indexing terms: Computerised pattern recognition, Microcomputers

Abstract: A linear array of microprocessors provides a powerful computing system that is particularly well suited to many pattern-recognition and cluster-analysis algorithms. These often rely heavily upon the calculation of distances in high-dimensional vector spaces: distances can be computed at high speed by an array of identical processing elements, operating in parallel under the command of a central controller. To achieve high computing speeds in those pattern-recognition algorithms which refer an input vector to each member of a set of stored reference vectors, the processing elements should each contain some 'local' storage. Of course, not all pattern-recognition algorithms are parallel, and to accommodate these, the processing elements may be required to operate autonomously. Nevertheless, the system controller must, at all times, be able to force the entire array to operate under its control again. The array can operate in a third mode, namely acting as a pipe-line processor, which is useful in some situations (e.g. computing polynomials) and for transferring data between the array's local store and the system controller. A rectangular array is even faster than a linear one, but is, of course, more expensive. The cost and performance of an array of Intel 8080 microprocessors are compared to those of other systems.

1 Introduction

This paper explores the possibilities of using a linear array of identical processing elements (p.e.s) as a means of implementing pattern recognition (p.r.) calculations. One point we shall emphasise is that the p.e.s need not be very powerful; even an array of present-day microprocessors can exceed the speed of the CDC7600, at a fraction of the cost. We shall try to show that a line of p.e.s, operating under centralised control, is a 'natural' architecture to choose for many p.r. calculations.

This paper is based upon the following axioms:

(a) Serial processors are too slow for effective use in large p.r. problems.

(b) P.R. is sufficiently important to justify building special-purpose machines.

(c) P.R. calculations are highly parallel.

To illustrate axiom (a), let us consider the implementation of a popular family of algorithms, namely the 'nearest neighbour' procedures, using the CDC7600 computer. Even this powerful machine requires 100 μs to compute the squared Euclidean distance between two 64-dimensional vectors. In practice, hundreds or even thousands of such calculations may be needed to perform a single classification. Many important application areas require speeds far in excess of this. For example, the analysis of satellite pictures requires classification times of about 2 μs.3 Similar problems are likely to occur in the inspection of steel strip, in optical character recognition, in the classification of radar signals and the monitoring of industrial plant. Whereas methods are being developed for keeping the number of distance calculations as small as possible, faster hardware would make the analysis of large data sets easier.

Several distinct approaches seem promising. Those labelled by an asterisk in the following list have been studied with special regard to the p.r. problem:

Paper T141C, first received 3rd June 1977 and in revised form 6th January 1978. Dr. Batchelor is with the Department of Electronics, University of Southampton, Southampton SO9 5NH, England. Mr. Boxer's address is c/o Dr. E.I. Boxer, 12 Queen's Road, Ealing, London W5, England


(a) optical processing*16 (very fast for certain tasks, but inflexible)

(b) analogue techniques*15,4

(c) hard-wired digital arrays*4,5

(d) networks of r.a.m.-like units*1

(e) associative processors*13

(f) multiprocessors.9,17,14

Each of these has certain merits and demerits, and we believe that it is premature to make a final choice of one to the exclusion of all others.

2 Linear multiprocessor arrays and parallel p.r. calculations

It is easier to consider how one particular system architecture may be used for p.r. calculations than to explain the mental steps which led us to that design. For this reason, the reader is asked to relate the following remarks to the system block diagram shown in Fig. 1. The p.e.s each have the same internal structure (Fig. 2), although these details are unimportant for our present discussion.

The linear array of p.e.s was designed specifically to expedite certain vector/matrix operations, especially those listed below.

Fig. 1 Architecture of a linear array of m processing elements (p.e.1 to p.e.m); s.c. is the system controller. The s.c. issues broadcast and personal commands and data to the line of p.e.s, which feed the multi-input summer (see Fig. 3).

Detailed structure of the p.e.s is shown in Fig. 2 and the multi-input (pyramid) adder in Fig. 3.

COMPUTERS AND DIGITAL TECHNIQUES, MAY 1978, Vol. 1, No. 2

0140-1335/78/141C-0060 $1.50/0


The following operations are specified using the notation developed by Iverson,11 which is particularly well suited for vector and matrix manipulations; a brief summary of the Iverson notation is provided in Appendix 7.1:

(a) element-by-element arithmetic and logical operations:

z ← x ⊛ y, where ⊛ ∈ {+, −, ×, ÷, ∗, ∨, ∧}

(b) addition:

s ← +/x

(c) comparison:

z ← x R y, where R ∈ {>, ≥, <, ≤, =, ≠}

(d) absolute value:

z ← |x|

(e) scalar/vector operations:

z ← k ⊛ x

z ← x ⊛ k

(f) masking and compression:

z ← u/x, where ui ∈ {0, 1}

z ← /y;u;x/

(g) functional transformation:

z ← f(x), where zi ← f(xi)

If we can implement these operations in parallel, then it should be possible to achieve a significant improvement in the speed of a large number of statistical p.r. procedures. (We do not mean to imply that the whole procedure can be implemented using only these operations; the 'parallel' part of the procedure, which is time-consuming on a serial processor, is assigned to the vector processor.)
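As a rough software sketch (plain Python, not the authors' hardware), the broadside operations listed above can be modelled by treating each list element as the datum held by one p.e. and applying the same operation to every element; all function names here are illustrative, not part of the original design.

```python
# Model of the Iverson-style broadside operations (a)-(g) above.
# Each list element stands for the datum held by one p.e.; applying the
# same operation to every element models a single broadcast command.

def elementwise(x, y, op):      # (a) z <- x (op) y
    return [op(a, b) for a, b in zip(x, y)]

def reduce_add(x):              # (b) s <- +/x
    return sum(x)

def compare(x, y, rel):         # (c) z <- x R y  (logical result vector)
    return [1 if rel(a, b) else 0 for a, b in zip(x, y)]

def absolute(x):                # (d) z <- |x|
    return [abs(a) for a in x]

def scalar_op(k, x, op):        # (e) z <- k (op) x
    return [op(k, a) for a in x]

def compress(u, x):             # (f) z <- u/x : drop x_i where u_i = 0
    return [a for ui, a in zip(u, x) if ui]

def mask(x, u, y):              # (f) z <- /x;u;y/ : x_i if u_i = 0, else y_i
    return [b if ui else a for a, b, ui in zip(x, y, u)]

def transform(f, x):            # (g) z <- f(x), applied per element
    return [f(a) for a in x]
```

On the real array each of these would complete in one broadcast command time, independent of the vector length (up to m elements).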

2.1 Vector and matrix storage, parallel operations

Vectors are assigned element by element to the p.e.s in the array. Matrices are stored as vectors in the following manner:

        p.e.1   p.e.2   ...   p.e.n
V1:     v1,1    v1,2    ...   v1,n     (row vector V1)
V2:     v2,1    v2,2    ...   v2,n     (row vector V2)
...
Vm:     vm,1    vm,2    ...   vm,n     (row vector Vm)

Each p.e.j therefore holds one column vector of the matrix.

Thus, operations on the vector V1 can be performed in parallel by applying the same control signals to each p.e.j (j = 1, ..., m) from the system controller (s.c.).
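The column-per-p.e. storage scheme can be sketched in a few lines of illustrative Python (not the authors' hardware); the matrix values and the doubling command are hypothetical examples.

```python
# Store a 2 x 3 matrix so that p.e._j holds the j-th element of every
# row vector; one broadcast command then operates on a whole row.
matrix = [[1, 2, 3],   # row vector V1
          [4, 5, 6]]   # row vector V2

# local store of p.e._j = column j of the matrix
pe_store = [[row[j] for row in matrix] for j in range(3)]

# broadcast command: double every element of row vector V1 (index 0);
# each p.e. acts only on its own element, all in parallel
result = [pe[0] * 2 for pe in pe_store]
```

Because every p.e. receives the same control signals, the time taken is that of one element operation, regardless of the row length.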

2.2 Autonomous p.e. operation

To perform functional transformations of the type

z ← f(x), where zi ← f(xi)

it may be necessary to allow the p.e.s in the array to operate independently. To do this, we must ensure that:

(a) the p.e.s have (a small amount of) program-storage space

(b) the s.c. can initiate the stored program (presumably the same branching program but with different parameters) in each p.e. of the array

(c) the s.c. can monitor the state of the p.e. array (i.e. all p.e.s idle/some p.e.s active)

(d) the s.c. can at any time force the complete p.e. array to execute commands which the s.c. issues. The obvious method of doing this is to use program interrupt to restore direct s.c. control

(e) after a p.e. has completed its autonomous operation, that p.e. waits for direct s.c. control to be restored. (Presumably other p.e.s are still active, taking longer to complete the program.)

Tasks requiring autonomous operation include calculations of

exp(−x∗2), a ÷ (b + x∗2), etc.

Parzen probability-density-function estimation and the potential-function classifier both require functions of this type (see Section 2.8).

Fig. 2A Detailed structure of p.e.i and its connection to neighbouring units

Fig. 2B Instruction unit (the multiplexer inputs carry operation codes and broadcast data from the s.c., the hard-wired 'no operation' code, and the r.o.m./r.a.m.; the busy/idle line runs to the s.c. via an OR gate; the i.s.c. logic, multiplexer control and status register connect to the microprocessor's address and data highways)

The multiplexer can select any one of the four inputs A to D, depending upon the signal i.s.c. (instruction-select control) and the status register. This register may be set/reset by the logic unit, in response to personal commands by the s.c., or by the microprocessor, through the address highway. The input E to the logic unit carries the 'address' of the personal commands; the logic unit must compare this with the (hard-wired) unit number and, if appropriate, modify the appropriate bits of the status register. Broadcast data may be transmitted from the s.c., through the multiplexer, to the microprocessor, in lieu of either operation code 1 or 2.



2.3 Semi-autonomous p.e. operation

To allow the array to compute conditional functions of the form

z ← /f(x);u;g(x)/

(where u is a logical vector, stored in the usual manner in the p.e. array, one element of u per p.e.) it is desirable to allow the s.c. to specify two commands. Each p.e. selects one of these on the basis of the stored logical element ui

and executes it. The operation

z ← |x|

would be implemented by the following sequence:

u ← x > 0

z ← /−x;u;x/

The corresponding semi-autonomous operation commands from the s.c. for this operation would be as follows:

zi ← −xi, no operation

The logical vector u will be stored in active bistable devices (i.e. distinct from the r.a.m.); it may be desirable to provide storage for several such control variables.
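The semi-autonomous absolute-value sequence above can be sketched as follows (illustrative Python, not the authors' hardware): the s.c. issues two operation codes and each p.e. picks one according to its stored bistable bit.

```python
# Semi-autonomous mode: each p.e. selects operation code 0 or 1
# according to its stored logical element u_i, then executes it.
def semi_autonomous(x, u, op0, op1):
    # op0 is executed where u_i = 0, op1 where u_i = 1
    return [op1(xi) if ui else op0(xi) for xi, ui in zip(x, u)]

x = [-3, 5, -1, 2]
u = [1 if xi > 0 else 0 for xi in x]        # u <- x > 0
z = semi_autonomous(x, u,
                    lambda v: -v,           # code where u_i = 0: z_i <- -x_i
                    lambda v: v)            # code where u_i = 1: no operation
# z now holds |x|, element by element
```

One pass through the array thus computes the conditional function without any branching in the s.c. itself.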

2.4 Switching off p.e.s

If each p.e. knows its own identity (i.e. unit number), individual units can be switched off/on at will by the s.c. If the vector c containing the p.e. unit numbers is given by

c ← ι1(m)

then 'personal' commands for p.e.j are preceded by setting up a logical array u (as in the previous Section) as follows:

u ← (c = j)

where j is the unit number of the p.e. to be addressed and j ∈ [1, m]. Personal commands for p.e.j are then issued by the system controller as for any ordinary semi-autonomous command; p.e.i (i ≠ j) will ignore such commands, which will have the form

operation code 1 | operation code 2

where operation code 1 is executed by p.e.i (i = 1, ..., j − 1, j + 1, ..., m), and where operation code 2 is executed by p.e.j. Usually, of course, operation code 1 will be 'no operation', but this is not essential.

However, the solution just outlined does not provide the type of facility that would be most useful. We need to be able to switch off a p.e. with a single personal command, and keep it in that state until another personal command is issued by the s.c. to the same p.e. To do this, within each p.e. we provide a multiplexer with 3 inputs as follows:

(a) operation code 1, specified by the system controller

(b) operation code 2, specified by the system controller

(c) no operation (hard-wired).

Selection of input a or b is conditional upon the bistable ui (Section 2.3). Another bistable device, which is set or reset by a personal command to p.e.j from the s.c., selects


input c or a/b. If input c is selected, p.e.j must remain inactive until the s.c. issues a personal command to reactivate it. Clearly, p.e.j cannot detect this reactivation command, and so some additional hardware is needed to monitor the output of the s.c. and reactivate p.e.j at the appropriate time. Details of this are shown in Fig. 2.
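The personal-command addressing described above can be sketched as follows (hypothetical Python model, with an arbitrary 'clear local value' command standing in for operation code 2):

```python
# 'Personal' command addressing: each p.e. compares the broadcast unit
# number against its own hard-wired number and selects operation code 1
# or operation code 2 accordingly.
m = 4
c = list(range(1, m + 1))                 # c <- i1(m): unit numbers 1..m
j = 3                                     # unit being addressed
u = [1 if cj == j else 0 for cj in c]     # u <- (c = j)

store = [10, 10, 10, 10]                  # local datum in each p.e.
# operation code 1 = 'no operation'; operation code 2 = clear local value
store = [0 if ui else si for si, ui in zip(store, u)]
```

Only the addressed unit (p.e.3 here) executes operation code 2; the rest perform 'no operation', exactly as in the command form given in the text.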

Fig. 3 Detail of the fast pyramid adder; the inputs come from p.e.s 1 to m, and the final sum passes to the s.c.

Each box is a 2-input combinational logic digital summer.

2.5 Multi-input adder

This unit is provided to expedite calculations of the form

s ← +/x

The structure of a suitable adder is shown in Fig. 3. This pyramid structure is faster than a ladder-type network by a factor of m/log2(m) for an m-input adder.
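The pyramid adder's advantage can be illustrated by a small simulation (plain Python, not a hardware description): pairwise sums at each level deliver the total in about log2(m) stages rather than m − 1 sequential additions.

```python
# Simulate the pyramid (tree) adder of Fig. 3: each stage combines
# adjacent pairs with 2-input summers; an odd element passes through.
def pyramid_add(x):
    level = list(x)
    stages = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages

total, stages = pyramid_add([1, 2, 3, 4, 5, 6, 7, 8])
# 8 inputs are summed in 3 stages, versus 7 sequential additions
# in a ladder-type network
```

For m = 8 the ratio is 8/log2(8) = 8/3, consistent with the m/log2(m) figure quoted above.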

If necessary, hardware units could be connected tofacilitate other calculations, including

maximum: s ← ⌈/x

minimum: s ← ⌊/x

AND/threshold: s ← ∧/(x > t)

OR/threshold: s ← ∨/(x > t)

Whereas the last two are both simple to implement and useful in p.r. calculations, the first two are not particularly important in view of the fact that the vector which is to be maximised or minimised usually appears in serial form at the adder output. However, the Chebychev metric does require maximisation of a vector stored in the array, although it is doubtful whether this particular metric warrants the relatively expensive hardware unit which would be needed to implement it.*

*The justification for the non-Euclidean distances is usually based upon the reputed difficulty of implementing the Euclidean distance. However, recent electronic developments have largely circumvented the difficulty, which was much more troublesome 5-10 years ago.4



2.6 Transferring data into the p.e. array (Fig. 4)

Consider the s.c. as acting like p.e.0 or p.e.m+1, depending upon the direction of data shift. Then the array can be loaded using the vector-shift concept.

The process is quite fast, requiring 2mbτ seconds, where

(a) the p.e. array has m elements

(b) each p.e. has two (master/slave) registers, one for forward and the other for reverse transfer of data

(c) τ is the time to transfer data from a p.e. to the register (and from the register to a p.e.)

(d) each p.e. can store b bytes.
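The shift-register style of loading can be sketched as follows (illustrative Python, not the authors' hardware); after m shift steps, xi sits in p.e.i.

```python
# End-on loading: data enter at the s.c. end and shift along the line,
# exactly as in a shift register.
def load_array(values, m):
    registers = [None] * m          # the transfer register in each p.e.
    for v in reversed(values):      # feed x_m first, x_1 last
        # one shift step: every register passes its datum one place along
        registers = [v] + registers[:-1]
    return registers

regs = load_array([1, 2, 3], 3)     # eventually x_1 in p.e.1, x_2 in p.e.2, ...
```

Repeating this for each of the b bytes held per p.e., in both directions, gives the 2mbτ loading time quoted above.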

Fig. 4 Transfer of data into the array; the s.c. feeds p.e.1, which feeds p.e.2, and so on along the line to p.e.m.

Eventually x1 will be stored in p.e.1, x2 in p.e.2 etc. The numbers xi are moved along the array in just the same way as data are moved in a shift register. The p.e. array can, however, move vectors in either direction and into/out of the s.c. Once the vector is located in the correct position, it can be transferred from the registers within the microprocessors to either the r.a.m. or the 'drum'.

2.7 End-on mode of operation (Fig. 5)

We have described the p.e. array as operating 'broadside on'. By this, we mean that elements of an array x receive the same processing operations in parallel. The alternative mode of operation resembles pipeline processing,12 and can be used to good effect in performing functional transformations of a single variable, such as

sin (x)

log (x)

Distance calculations can be pipelined, but this is less efficient than the broadside mode. Ranking vectors (of length ≤ m) can be achieved by a series of pairwise comparisons and, if necessary, data-exchange operations between p.e.s. This process requires a sequence of, at most, m comparison-exchange operations (recall that m comparisons can be made simultaneously by the p.e. array). The transfer of data into the array (see previous Section) is an end-on mode of operation.

Fig. 5 Modes of connection
a Broadside mode
b End-on mode
The heavy arrows indicate the dominant direction of data flow

2.8 P.R. procedures implemented on the linear p.e. array

Many p.r. procedures require little processing other than the calculation of a distance matrix; apart from the calculation of the distance matrix, there is no computational bottleneck that cannot be solved using a serial processor. Those p.r. procedures listed below can all be conveniently implemented using a p.e. array.* (Where vector/matrix operations are required in addition to the distance-matrix evaluation, the appropriate function is given.)

(a) k-nearest-neighbour p.d.f. estimation/classification

(b) Potential-function method of classification. Additional function: z ← f(x)

(c) Parzen method of p.d.f. estimation. Additional function: z ← f(x)

(d) Single-linkage method of cluster analysis (c.a.)

(e) Complete-linkage method of c.a.

(f) Centroid method of c.a.

(g) Minimal-spanning-tree representation of data

(h) Sammon's data-mapping procedure

(i) C.a. using number of shared nearest neighbours

(j) Isodata/k-means analysis

(k) Forgey method of c.a. (see procedure j)

(l) Jancey method of c.a. (see procedure j)

(m) Maximindist (complete distance matrix is not needed)

(n) Compound classifier. Additional function: λ' ← λ − k × (λ − x)

(p) Adaptive-sample-set construction. Additional functions: (i) y ← exp(−x∗2); (ii) as procedure n

(q) Gitman and Levine's measure of grade of cluster membership.
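The end-on (pipeline) mode of Section 2.7 can be sketched in software as follows (illustrative Python, not the authors' hardware); each p.e. applies one stage of a functional transformation as data stream through the line, so once the pipe fills, one result emerges per step. The stage functions here are hypothetical examples.

```python
# End-on (pipeline) mode: each p.e. is one stage; data shift along the
# line and are transformed at every step.
def pipeline(stages, stream):
    held = [None] * len(stages)     # datum currently inside each p.e.
    out = []
    for step in range(len(stream) + len(stages)):
        if held[-1] is not None:    # last p.e. emits a finished result
            out.append(held[-1])
        # shift: each p.e. takes its neighbour's datum and transforms it
        for i in range(len(stages) - 1, 0, -1):
            held[i] = stages[i](held[i - 1]) if held[i - 1] is not None else None
        # first p.e. takes the next input from the s.c., if any remain
        held[0] = stages[0](stream[step]) if step < len(stream) else None
    return out

# two-stage pipe computing 2x + 1 for each input
results = pipeline([lambda v: 2 * v, lambda v: v + 1], [1, 2, 3])
```

After a fill-up delay equal to the number of stages, throughput is one transformed value per shift step, which is why this mode suits single-variable functional transformations better than distance calculations.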

*More complete details are provided by the standard texts, namely References 2, 10, 4 and 6. Reference 6 lists Iverson programs for some of these procedures.



Table 1: Distance-measuring methods

Distance                      Functional form                              Execution time on the p.e. array

Euclidean                     s ← +/(x − y)∗2                              T− + T∗2 + T+
City-block                    s ← +/|x − y|                                T− + Tm + T+
Hamming                       as for city-block, but with x and y          as for city-block
                              as logical vectors
Minkowski                     s ← +/(x − y)∗p                              T− + T∗p + T+
Mahalanobis                   s ← +/A(x − y)                               T− + ν(x)T× + T+
Direction cosine† (squared)   s ← (+/x × y)∗2 ÷ ((+/x × x) × (+/y × y))    3(T× + T+) + T∗2 + t× + t÷

Numerical values for these calculation times are given in Appendix 7.2.

Notation: T⊛ is the time to compute z ← x ⊛ y in the p.e. array; T∗2 is the time to perform the squaring operation; T∗p is the time to perform the pth-power exponentiation; t⊛ is the time to compute c ← a ⊛ b in the system controller; T+ is the time to compute s ← +/x in the multi-input adder; Tm is the time to compute x ← |x| in the p.e. array.

†Although this is not a true distance measure, it is used in lieu of one of the distances as a measure of dissimilarity.

This list reflects the central role of the distance functions. Of the many suggested methods of measuring distance, the most popular are given in Table 1.

Parallel implementation of the distance functions, plus the following vector operations, provides a significant speed improvement over a serial computer:

z ← y ⊛ k

z ← f(x)

s ← +/x

x ← |x|
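As a rough sketch (plain Python, not the authors' hardware), the squared-Euclidean and city-block rows of Table 1 decompose into a parallel subtract step, a parallel square or modulus step, and a final reduction by the multi-input adder:

```python
# Broadside distance calculations from Table 1. On the array, each
# coordinate is handled by one p.e., and the final sum is formed by
# the pyramid adder; here the steps are simulated with lists.
def squared_euclidean(x, y):
    d = [xi - yi for xi, yi in zip(x, y)]       # T-:  parallel subtraction
    sq = [di * di for di in d]                  # T*2: parallel squaring
    return sum(sq)                              # T+:  multi-input adder

def city_block(x, y):
    d = [abs(xi - yi) for xi, yi in zip(x, y)]  # T- then Tm (modulus)
    return sum(d)                               # T+
```

On the array each step costs a fixed time regardless of dimensionality, so the total time (e.g. T− + T∗2 + T+) is independent of the vector length, up to m coordinates.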

However, some functions are not easily implemented on an array processor. For example, polynomials require cross-product terms which the array finds difficult to calculate. Cross-product terms can be computed by storing all of the xi in each of the p.e.s, but of course this is wasteful of storage.

2.9 Rectangular p.e. arrays

A 2-dimensional array can be organised in such a manner that several distances can be computed in parallel, in a time (T− + T∗2 + T+) which is ν(Y) times faster (per distance function evaluated) than the linear array and ν(Y) × μ(Y) times faster than a single p.e. To control such a system, there would be a controller for each linear array (column vector) and a super s.c. that co-ordinates its subordinate units (Fig. 6). Other vector operations can, of course, be performed at high speed in a 2-dimensional array.

Certain picture-processing operations can be performed by a rectangular array, but to be effective in this role the array would have to be very large (to ensure good spatial resolution). This would be an inefficient use of a very sophisticated system, in a situation where much simpler hardware modules will suffice.8

3 P.E. structure: constraints imposed by p.r.

We have devised a system around the Intel 8080 microprocessor, which was, at the time of writing this article, the most attractive device on the market, by reason of its speed, instruction repertoire and immediate availability. (This particular design is described at length in Reference 7. The performance is summarised in Appendix 7.2.) The following remarks refer to Fig. 2.

3.1 Microprocessor

The microprocessor requires two data highways:

(a) 16-bit address highway

(b) 8-bit bidirectional data highway

Fig. 6 Structure of a rectangular array of p.e.s

s.s.c. is a super s.c. Vector Xj is presented in parallel to row n, where it is processed using parameter vector An, say. Xj is then shifted upwards to row (n − 1), where it is processed with parameter vector An−1. Meanwhile, row n is processing another vector Xj+1, again using An. The process of moving Xj upwards and reading in a new vector to row n is repeated until all of the p.e.s are operating in tandem. (Of course, for each upwards shift of the Xj vectors there may be more than one parameter needed; the above example has been simplified for clarity.) It is possible to calculate n squared-Euclidean distances simultaneously by this technique, thereby achieving a speed increase of mn over a single p.e. It is interesting to note that the bottom-most row of p.e.s may be used to store and supply the Xj vectors; these are supplied to row n from s.c.n

3.2 R.O.M. look-up tables

These are included to facilitate the rapid calculation of certain frequently used functions, e.g. squares, logarithms, exp(−x∗2). The output of a 16-bit r.o.m. must be multiplexed into the 8-bit port of the microprocessor, but all 16 bits can be presented to the adder. Floating-point values may be stored in the r.o.m.s.

3.3 R.A.M.

This may store instruction codes for autonomous operation of the p.e.s, and data.



3.4 Instruction unit (also see Section 2.4)

This can select one of two operation codes supplied by the s.c., or force a 'no operation' code into the microprocessor instruction-entry port. Control of the instruction unit is maintained by two bistable devices.

3.5 M.O.S. drum

The drum is a recirculating shift register that is used as the main site for storing reference data (vectors). The use of a cyclic store in p.r. is justified in Reference 5, but it could be replaced by enlarging the r.a.m. if the latter were cheap enough. It is felt that a capacity of about 16 Kbytes per p.e. (8 bits/byte) represents a reasonable compromise between cost and complexity of construction on the one hand and speed of computation on the other.

4 Discussion

It would be interesting to compare the linear array with the other fast-hardware approaches mentioned in Section 1. To do so effectively, we require a benchmark calculation for comparing their speeds. The natural choice would be one of the distances, but these are not convenient to calculate on an optical processor, nor in a network of r.a.m.s. The Euclidean distance can be computed in an analogue computer, but with rather a limited bandwidth (limited by the nonlinear element used to calculate squares). The other distances have usually been employed by workers who were frustrated in their attempts to use the Euclidean distance, which was still relatively difficult to compute only 5 years ago. (Batchelor5 pointed out a number of problems arising with the non-Euclidean distances which make their use even less desirable.) We are left then with a yardstick which cannot be applied fairly to four of our six options listed in Section 1. (Clearly the associative processor cannot be expected to compete with a parallel array of p.e.s in performing numeric calculations.) The two remaining options are perfectly well suited to the Euclidean distance calculation. The hard-wired logic array using e.c.l. can compute the squared Euclidean distance (in 64-space) in 0.15 μs (or 1.5 μs using t.t.l.). Clearly, the speed of a multi-p.e. array depends upon the speed of the individual p.e.s and upon the array geometry (Section 2.9). A linear array using Intel 8080 microprocessors requires 26 μs to compute the squared Euclidean distance, but is, of course, much more flexible than the hard-wired array. (For example, it can compute the city-block distance concurrently with the Euclidean distance.) The array built around present-day microprocessors can hardly be expected to compete for speed with its more powerful relatives (such as ILLIAC IV), which employ much faster p.e.s. The objection to the ILLIAC IV system is that it is both costly and not yet widely available.

The distributed-array processor is also promising*, but is not yet commercially available. Other multi-p.e. systems are being developed or are already in use (see for example References 17 and 9), and some of these would provide comparable execution times for the particular calculations which interest us.

*REDDAWAY, S.F.: private communication. This showed that the system can compute 8000 squared Euclidean distances in 64-space in 17 ms (approximately 2 μs/distance). (This calculation is based on a 32 × 32 array of p.e.s.)

There is another topic which requires at least a brief discussion here. We have designed a machine to implement certain procedures that were originally devised without regard to their implementation. The multi-p.e. array can alleviate the computational bottleneck for a number of these procedures. Can we now specify what kind of procedures the array is best suited to handle? Any procedure which can be expressed succinctly in that subset of the Iverson language given in the second paragraph of Section 2 is suitable for the p.e. array. Provided that most of the procedure can be expressed thus, the procedure can be executed at high speed using a linear array. If, however, we need cross-element operations, such as those required for polynomials, for example

s ← (a0 × x1 × x3 × x15) + (a1 × x2 × x4 × x5∗2) + ...

then the linear array cannot operate efficiently. Autonomous p.e. operation could overcome this, but the solution lacks the elegant simplicity of an array of p.e.s 'marching to orders' issued by the s.c. It demands the use of individual programs for each p.e., which, although possible, is difficult to organise. Cross-element operations are thus to be avoided wherever possible, unless they are operations which can be expressed involving only shift or rotate, for example

z ← x + ↑x

5 Conclusions

A linear array of p.e.s has been described which seems to be well suited for p.r. and c.a. applications. An array has been discussed which can calculate the squared Euclidean distance in 26 μs or the city-block distance in 16.5 μs. The cost of such an array, which employs Intel 8080 microprocessors, would be about £500 × m, where m is the array length (cost of components only). The performance of the proposed array has been compared briefly to several other alternative computing systems, and in terms of the cost/speed ratio it appears to be very attractive.

6 References

1 ALEKSANDER, I.: 'Pattern recognition with networks of memory elements', in BATCHELOR, B.G. (Ed.): 'Pattern recognition: ideas in practice' (Plenum, New York, 1978)

2 ANDERBERG, M.R.: 'Cluster analysis for applications' (Academic Press, New York, 1973)

3 BARRET, E.C., and CURTIS, L.F.: 'Introduction to environmental remote sensing' (Chapman and Hall, London, 1976)

4 BATCHELOR, B.G.: 'Practical approach to pattern classification' (Plenum, London, 1974)

5 BATCHELOR, B.G.: 'Design for a high-speed Euclidean distance calculator and its use in pattern recognition', in 'Computer systems and technology', IEE Conf. Publ. 121, 1974, pp. 213-218

6 BATCHELOR, B.G. (Ed.): 'Pattern recognition: ideas in practice' (Plenum, New York, 1978)

7 BOXER, S.M.: B.Sc. project report, Department of Electronics, University of Southampton. Available on application to B.G. Batchelor

8 DUFF, M.J.B.: 'Parallel processing techniques', in BATCHELOR, B.G. (Ed.): 'Pattern recognition: ideas in practice' (Plenum, New York, 1978)

9 ENSLOW, P. (Ed.): 'Multiprocessors and parallel processing' (Wiley, New York, 1974)

10 FUKUNAGA, K.: 'Introduction to statistical pattern recognition' (Academic Press, New York, 1972)

11 IVERSON, K.E.: 'A programming language' (Wiley, New York, 1962)

12 LEWIN, D.W.: 'Theory and design of computers' (Nelson, London, 1972)

COMPUTERS AND DIGITAL TECHNIQUES, MAY 1978, Vol. 1, No. 2


13 NAVARRO, A.B.: 'The role of the associative processor in pattern recognition'. Proceedings of the NATO-ASI conference on pattern recognition, theory and practice, Bandol, France, 1975

14 REDDAWAY, S.F.: 'DAP - a distributed array processor'. Proceedings of the 1st annual symposium on computer architecture, Florida, 1973, pp. 61-65

15 TAYLOR, W.K.: 'UCLM4 programmable speech recognition machine'. Proceedings of the NATO-ASI conference on pattern recognition, theory and practice, Bandol, France, 1975

16 ULLMAN, J.R.: 'Review of optical pattern recognition techniques', in BATCHELOR, B.G. (Ed.): 'Pattern recognition: ideas in practice' (Plenum, New York, 1978)

17 GRIMSDALE, R.L.: 'The architecture of a reconfigurable multimicrocomputer system POLYPROC'. Proceedings of the IERE-ICS conference on computer systems and technology, University of Sussex, Brighton, England, 1976

7 Appendixes

7.1 Summary of Iverson notation

Only that small subset used in this paper is included.

Operands        Notation                  Example
Scalar          lower case                x
Vector          lower case, bold type     x = (x1, x2, ..., x_ν(x))
Matrix          upper case, bold type     A = [a_ij], of dimensions μ(A) × ν(A)

Operation        Explanation
z ← x ⊛ y        logical/arithmetic operation with operator ⊛ ∈ {+, −, ×, ÷, ∧, ∨, ⊻}
z ← ⊛/x          equivalent to x1 ⊛ x2 ⊛ ... ⊛ x_ν(x)
z ← x ⊛1·⊛2 y    equivalent to ⊛1/(x ⊛2 y)
z ← |x|          modulus
z ← u/x          remove x_i if u_i = 0 (arithmetic 'zero' or logical 'false')
z ← /x;u;y/      x_i is selected if u_i = 0, and y_i if u_i ≠ 0
z ← ⌈x           ceiling function: z ≥ x_i, and z is equal to at least one of the x_i
z ← ⌊x           floor function: z ≤ x_i, and z is equal to at least one of the x_i
z ← ↑x           right shift: z = (0, x1, x2, ..., x_ν(x)−1)
z ← ↓x           left shift: z = (x2, x3, ..., x_ν(x), 0)
z ← f(x)         functional transformation: z_i ← f(x_i)
u ← x ℛ y        relation: creates a logical vector u where u_i ≠ 0 iff x_i ℛ y_i, and ℛ ∈ {>, ≥, <, ≤, =, ≠}
B ← ⊛/A          perform operation ⊛ along the rows of A
B ← ⊛//A         perform operation ⊛ along the columns of A

Precedence rule: right to left, brackets evaluated first.
Special vector: ι = 1, 2, 3, ... (length determined by context)
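The vector operations in this subset map directly onto ordinary code. The following Python sketch illustrates reduction, compression, masking and the two shifts; the function names are our own, chosen for readability, and the operator ⊛ is passed in as an ordinary binary function.

```python
# Illustrative sketch of part of the Iverson-notation subset above.
# Function names (reduce_op, compress, mask, shift_right, shift_left)
# are ours; they are not part of any standard library.
from functools import reduce
import operator

def reduce_op(op, x):
    """z <- (*)/x : x1 (*) x2 (*) ... (*) xn."""
    return reduce(op, x)

def compress(u, x):
    """z <- u/x : remove x_i where u_i = 0."""
    return [xi for ui, xi in zip(u, x) if ui]

def mask(x, u, y):
    """z <- /x;u;y/ : x_i where u_i = 0, else y_i."""
    return [yi if ui else xi for xi, ui, yi in zip(x, u, y)]

def shift_right(x):
    """z <- up-shift of x : (0, x1, ..., x_{n-1})."""
    return [0] + x[:-1]

def shift_left(x):
    """z <- down-shift of x : (x2, ..., xn, 0)."""
    return x[1:] + [0]

x = [3, 1, 4, 1]
print(reduce_op(operator.add, x))   # +/x = 9
print(compress([1, 0, 1, 0], x))    # [3, 4]
print(shift_right(x))               # [0, 3, 1, 4]
print(shift_left(x))                # [1, 4, 1, 0]
```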

7.2 A practical system

Byte length: 8 bits

In view of the fact that most digital i.c. devices have 4 or 8 logic functions/package, the 'natural' choice was from 4, 8 or 12 bits/byte. P.R. is a relatively imprecise subject, so 12 bits/byte is unnecessarily expensive, while 4 bits/byte is not precise enough. The choice of 8 bits/byte is ideal for use with the Intel 8080 microprocessor and the 256 × 16-bit r.o.m.s.

Dimensionality: 64

Few p.r. problems present more than 50 variables. Problems of greater dimensionality can be accommodated in two ways:
(a) adding more p.e.s to the array
(b) multiplexing the long vector into the array. Partial results are combined by the s.c.
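Option (b) amounts to feeding the long vector through the array in slices of m components and letting the s.c. sum the partial results. A minimal Python sketch, assuming the squared-Euclidean measure and our own function names:

```python
# Hypothetical sketch of option (b): a vector longer than the array is
# multiplexed through an m-element array in several passes; the system
# controller (s.c.) combines the partial squared-Euclidean distances.
def array_pass(x_slice, r_slice):
    """One parallel pass: each p.e. contributes one squared difference."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x_slice, r_slice))

def multiplexed_distance(x, r, m):
    """Split x and r into m-wide slices; the s.c. sums the partial results."""
    total = 0
    for i in range(0, len(x), m):
        total += array_pass(x[i:i + m], r[i:i + m])
    return total

x = list(range(128))   # a 128-dimensional input vector
r = [0] * 128          # a stored reference vector
print(multiplexed_distance(x, r, 64))  # two passes through a 64-p.e. array
```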

Local storage

Shift registers: 2^14 × 8 bits (32 i.c. packages).
r.o.m.s: 4 tables of 256 × 16 bits each (8 i.c. packages).
r.a.m.: 256 × 8 bits (2 i.c. packages).
The Intel 8080 was preferred on grounds of speed, availability and a good instruction repertoire.

Construction of p.e.s

3 boards, 20 × 20 cm; total of about 70-75 i.c.s/p.e.

Cost/p.e.

£500 (components only)

Pyramid adder

25 x 2-bit full adders

133 x 4-bit full adders
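The pyramid adder combines the per-p.e. outputs pairwise in a tree of full adders, so m values are summed in only ⌈log2 m⌉ adder levels rather than m − 1 serial additions. A small Python sketch of this combining scheme (a software simulation only, not the hardware itself):

```python
# Sketch of the pyramid (tree) adder: values are combined pairwise at each
# level, so 64 p.e. outputs are reduced in log2(64) = 6 adder levels.
def pyramid_sum(values):
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:   # odd count: pad with a zero input
            level.append(0)
        # one adder level: each adder sums one pair of partial results
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

outputs = [1] * 64           # one partial result per p.e.
print(pyramid_sum(outputs))  # 64, reached after 6 adder levels
```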

System controller

Conventional minicomputer with 18-bit word and cycle time less than 2 μs.
Storage requirements for the array-control programs are modest; 8 K words will usually suffice.

Speed of operations (Intel 8080A microprocessor)

Add, subtract, AND, OR, exclusive OR = 7 μs
Divide, multiply = 72 μs (or 31 μs using log tables)
Squaring with r.o.m. tables = 15.5 μs
Functions using r.o.m.s = 15.5 μs
Modulus = 14.5 μs
City-block distances = 16.5 μs
Squared Euclidean distances = 40 μs (can be reduced to 26 μs using Intel 8080A-1 microprocessors).
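The two distance measures timed above are defined as follows; in the array, each p.e. would compute one |x_i − r_i| or (x_i − r_i)² term in parallel, with the pyramid adder forming the sum. The reference implementations below are serial Python equivalents, given only to pin down the definitions.

```python
# Serial reference implementations of the two distances timed above.
def city_block(x, r):
    """d = sum of |x_i - r_i| over all components."""
    return sum(abs(xi - ri) for xi, ri in zip(x, r))

def squared_euclidean(x, r):
    """d = sum of (x_i - r_i)^2 over all components."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r))

x = [5, 2, 9]
r = [1, 4, 6]
print(city_block(x, r))         # |4| + |-2| + |3| = 9
print(squared_euclidean(x, r))  # 16 + 4 + 9 = 29
```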
