
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 4, NO. 6, JUNE 1993 601

A Sliding Memory Plane Array Processor

Myung Hoon Sunwoo, Member, IEEE, and J. K. Aggarwal, Fellow, IEEE

Abstract-This paper describes a new mesh-connected SIMD architecture, called a Sliding Memory Plane (SliM) Array Processor. On SliM, the inter-processing element (inter-PE) communication, using the sliding memory plane, and the data input/output (I/O), using two I/O planes, can occur without interrupting the PE's, which greatly diminishes the communication and I/O overhead. SliM is unique in its ability to overlap inter-PE communication with computation, regardless of window size and shape and without using a coprocessor or an on-chip DMA controller. In addition, SliM uses four rather than eight links per PE to provide eight-way connectivity using the by-passing path, thus reducing the diagonal communication time and eliminating the necessity of diagonal links. The realization of these virtual links for diagonal communication without instruction overhead is another novel feature of SliM. An alternative method to achieve diagonal communication is to use two sliding memory plane shifts that can be overlapped with computation. The by-passing path can also accomplish nonlocal communication and broadcast. This paper illustrates the unique advantages of these inter-PE and diagonal communication schemes and proposes new parallel algorithms for image processing on SliM that have a zero or an O(1) communication complexity. With these salient features, SliM shows a significant performance improvement, illustrated with several tasks including the DARPA low level vision benchmarks.

Index Terms-Computer architectures, computer vision, image processing, parallel architectures and algorithms, mesh-connected SIMD machines, VLSI architectures, VLSI design.

I. INTRODUCTION

Existing mesh-connected SIMD (single instruction stream, multiple data stream) architectures have several

disadvantages, such as inter-processing element (inter-PE) communication overhead, data I/O overhead, and relatively complicated interconnections, all of which limit the efficient implementation of image processing algorithms. To alleviate these disadvantages and to improve performance, a new fine-grained mesh-connected SIMD architecture called the Sliding Memory Plane (SliM) Array Processor is proposed for image processing [1]-[4].

Most operations in low level image processing tasks are window operations that transform the value of each pixel into a new value calculated from itself and the neighboring pixels. Such operations can be achieved with a high degree of concurrency by using mesh-connected SIMD architectures, which are well suited to the structure of image data [5]-[7]. Since Unger first proposed a computer based on a mesh topology for spatial
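As a concrete illustration (not from the paper), the following plain-Python sketch of a 3 by 3 mean filter shows the neighborhood access pattern that each PE would perform in lockstep on a mesh-connected machine; the function and image data are illustrative only.

```python
# Sketch of a 3x3 window (neighborhood) operation: each output pixel is
# computed from the pixel itself and its neighbors. On a mesh-connected
# SIMD machine, one PE per pixel performs these steps in lockstep.
def window_mean(image, rows, cols):
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total, count = 0, 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        total += image[ni][nj]
                        count += 1
            out[i][j] = total // count  # integer mean over the window
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(window_mean(img, 3, 3))  # -> [[3, 3, 4], [4, 5, 5], [6, 6, 7]]
```

On a sequential machine the cost grows with the image size; on a mesh, all pixels are processed concurrently and the cost is dominated by fetching the window's neighbors, which is exactly the overhead the paper targets.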

Manuscript received September 25, 1990; revised November 20, 1991.

M. H. Sunwoo is with the Department of Electronic Engineering, Ajou University, Suwon, 441-749 Korea.

J. K. Aggarwal is with the Computer and Vision Research Center, The University of Texas at Austin, Austin, TX 78712.

IEEE Log Number 9209541.

problems [7], many mesh-connected SIMD architectures have been proposed [8], [9]. Examples of such architectures include SOLOMON [10], ILLIAC IV [11], MPP [12]-[14], CLIP [15]-[18], DAP [19], GAPP [20], [21], GRID [22], NTT [23], BAP [24], CAAPP [25], Polymorphic-Torus [26], [27], Mesh with reconfigurable buses [28], CM-1 with a hypercube topology for the routers [29], MasPar [30], BLITZEN [31], etc.

However, these architectures may limit the speedup for image processing. During processing, almost all communications are localized. In other words, a great deal of local communication occurs between neighboring PE's. This inter-PE communication overhead is a significant problem of existing mesh-connected SIMD architectures [6], [32]-[35]. Moreover, when the size of a window operator is larger than 3 by 3, the resulting overhead may seriously decrease the performance. We discuss several of the previously mentioned architectures for more detailed comparisons.

To reduce the communication overhead, the LIPP architecture was proposed [33]. However, LIPP has several drawbacks. It requires complex gate logic circuits and complicated control for multiplexing, and needs a special purpose RAM (Random Access Memory) and processor [8]. Data from a memory module may be routed over several multiplexer levels inside one, two, or three processors before it reaches the final destination [33], resulting in considerable propagation delay. In addition, if the size of a window is larger than 3 by 3, or if another window shape (circular, diamond, rectangular, etc.), such as is commonly used in image processing [36], [37], is employed, the communication overhead may increase.

MPP does not have a separate controller for inter-PE communication and therefore cannot overlap communication with computation [12]-[14]. On MPP, 1-b communication between neighboring PE's requires one instruction cycle (100 ns) [32]. MPP has a separate controller for I/O and thus can overlap I/O with computation.

MasPar does not have a separate controller for inter-PE communication [30], and thus inter-PE communication cannot be overlapped with computation. For MasPar to send 8 b to a neighboring PE takes 3 clock cycles for setup plus 8 cycles for the data, a total of 11 cycles. The I/O subsystem, however, can overlap I/O with computation.
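The cycle accounting above can be sketched as follows; the function name is ours, and the setup figure is the 3-cycle cost quoted in the text.

```python
# Bit-serial neighbor transfer cost as described for MasPar: a fixed setup
# latency plus one clock cycle per bit sent. MPP's 1-b transfer (one
# instruction cycle, no separate setup) falls out as setup=0, bits=1.
def transfer_cycles(bits, setup):
    return setup + bits

print(transfer_cycles(8, 3))  # MasPar: 3 setup + 8 data = 11 cycles
print(transfer_cycles(1, 0))  # MPP-style 1-b transfer: 1 cycle
```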

A CLIP7A PE consists of two CLIP7 chips, a processor and a coprocessor. The processor mainly handles data manipulation, while the coprocessor deals with address generation. The coprocessor provides access to a 3 by 3 neighborhood of data. However, the coprocessor needs several steps (instructions) to read the appropriate rows (i-1, i, i+1) of a data array from external RAM. This data is loaded into the upper edge storage element (i-1), the lower edge storage element (i+1), and the CLIP7 processor (i) [18]. This external memory read/write overhead cannot be overlapped with computation, which creates communication overhead through the external RAM. The CLIP7 coprocessor is not complete in itself, as it requires external devices (latch, transceiver, and buffer) to isolate or connect the various data buses. In addition, this scheme provides only a 3 by 3 neighborhood of data. Since the processor and the coprocessor can handle the computation and data I/O separately, I/O can be overlapped with computation.

1045-9219/93$03.00 © 1993 IEEE

AMT DAP 510 and 610 have two processors: a 1-b processor for communication and computation, and an 8-b coprocessor for computation only [19]. Inter-PE communication can occur through the 1-b processor during coprocessor computation. Since the 1-b processor and the 8-b coprocessor reside on different chips, communication between the two can occur through the array memory. This causes external memory read/write overhead, which may degrade the performance. The 1-b processor executes simple computations, such as Boolean operations and integer addition, since there is no benefit in routing them to the external memory and coprocessor [19]. Inter-PE communication cannot occur during such computations. Above all, every PE on DAP requires the coprocessor for partial overlapping.

SliM requires neither a coprocessor nor an on-chip Direct Memory Access (DMA) controller to overlap inter-PE communication with computation without interrupting PE's. During computation, the contents of all register cells on the sliding memory plane can be shifted simultaneously, and in the same direction, to the neighboring cells. This inter-PE communication overlapping, regardless of the window size and shape, is unique to SliM.
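A minimal sketch of one "slide" of the memory plane, assuming the torus wrap-around described in Section II; the function is ours, not the paper's, and models only the data movement, not the timing.

```python
# One slide of the memory plane: every register cell s sends its pixel to
# the neighbor in a single common direction (here selectable), with
# wrap-around at the edges (torus). All cells move simultaneously, so the
# slide can overlap with whatever the ALUs are computing on the current s.
def slide(plane, direction):
    rows, cols = len(plane), len(plane[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if direction == "E":
                out[i][(j + 1) % cols] = plane[i][j]
            elif direction == "W":
                out[i][(j - 1) % cols] = plane[i][j]
            elif direction == "S":
                out[(i + 1) % rows][j] = plane[i][j]
            elif direction == "N":
                out[(i - 1) % rows][j] = plane[i][j]
    return out

p = [[1, 2], [3, 4]]
print(slide(p, "E"))  # each pixel moves one cell East -> [[2, 1], [4, 3]]
```

After the slide, every PE finds a neighbor's pixel in its own s register, which is what lets neighborhood access proceed without explicit send/receive instructions.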

Many of the mesh-connected SIMD architectures mentioned above are not capable of I/O overlapping. As a result, I/O overhead may degrade their performance. In contrast, SliM has two I/O planes that provide I/O overlapping without interrupting PE's. Since communication, I/O, and computation occur simultaneously, communication and I/O overhead can be overlapped with computation, significantly diminishing both.

Moreover, architectures such as CLIP4, CLIP7, BAP, and NTT have six or eight communication links per PE to reduce the overhead for diagonal communication. The realization of virtual links for diagonal communication using only four links per PE, without instruction overhead, is another unique feature of SliM. Although the XNet three-state interconnect on MasPar [30] (similar to the BLITZEN [31] grid network) provides eight-way connectivity using four links, each connection requires 3 instruction cycles for setup alone. In addition, MasPar requires 1 instruction cycle for communicating each bit. Thus, MasPar has inter-PE communication overhead for all communication. BLITZEN also needs 1 instruction cycle for any 1-b inter-PE communication [31], as does MPP. In contrast, virtual diagonal communication using the by-passing path on SliM needs only several nanoseconds of gate delay, which can be overlapped with computation. An alternative method to achieve diagonal communication is to use two sliding memory plane shifts that also can be overlapped with computation. Therefore, four rather than eight links are sufficient for eight-way

connectivity, greatly reducing the diagonal communication time and eliminating the necessity of diagonal links.

The by-passing path can also perform nonlocal communication and broadcast. SliM provides various types of communication (local communication, nonlocal communication, and broadcast). Each PE in SliM can operate separately based on three autonomies (operation autonomy, addressing autonomy, and connection autonomy). As in CLIP7A, DAP, and MasPar, SliM uses bit-serial communication and bit-parallel computation, which saves VLSI area and reduces the number of pins.

Fang et al. [35] describe the inter-PE communication requirements for a general 2-D convolution on a typical mesh-connected architecture. In their paper, a general 2-D convolution algorithm on a mesh-connected architecture has an O(W^2) communication complexity, where the size of a window operator is W by W. In contrast, we propose several new parallel algorithms for median filtering, the 2-D convolution, etc., on SliM having a zero or an O(1) communication complexity. SliM's performance shows significant improvements over existing mesh-connected SIMD architectures [3], [6], [8], [38], which is illustrated by several examples of image processing tasks, including the DARPA low level vision benchmarks.
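One way to see why the W by W window cost can be hidden (our sketch, not the paper's algorithm, which appears in later sections): moving the plane along a serpentine path brings every window pixel past each PE in W*W - 1 single-step slides, and on SliM each slide overlaps with the computation on the pixel currently in s.

```python
# Serpentine (boustrophedon) slide schedule for a W-by-W window: sweep each
# row of window offsets East/West alternately, stepping South between rows.
# The schedule length is W*W - 1 slides, each overlappable with computation.
def serpentine_slides(w):
    moves = []
    for row in range(w):
        step = "E" if row % 2 == 0 else "W"
        moves.extend([step] * (w - 1))
        if row < w - 1:
            moves.append("S")
    return moves

print(len(serpentine_slides(3)))  # 3*3 - 1 = 8 slides for a 3x3 window
print(len(serpentine_slides(5)))  # 5*5 - 1 = 24 slides for a 5x5 window
```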

The remainder of this paper is organized as follows. Section II introduces the SliM architecture and compares its features to existing mesh-connected architectures. The section addresses virtual connectivity, various types of communication, and local autonomies. Section III establishes the analytical model of SliM for performance evaluation and compares it to existing mesh-connected architectures. Section IV discusses the applications to image processing tasks. Section V describes the performance evaluation based on timing analysis using the analytical model and presents computation and communication complexities for image processing tasks. Finally, Section VI contains concluding remarks.

II. THE ARCHITECTURE OF SliM

This section describes the architecture of SliM and presents the overall system and structure of a PE. The section then addresses the issues of connectivity, communication, and autonomy.

A. The Overall System

Fig. 1 shows the logical diagram of SliM. The processor plane consists of N x N processors, and the total number of PE's is N^2. The sliding memory plane S consists of N x N shift registers connected by a grid network. The S plane, rather than the processor plane, forms the mesh topology. The top row of the sliding memory plane is connected to the bottom row, and the leftmost column to the rightmost column, forming a wrap-around (torus) interconnection. In SliM, as in most mesh-connected SIMD architectures, grid-like communication links among PE's are used for inter-PE communication. However, on SliM, inter-PE communication can occur during computation without interrupting PE's. The I/O planes, D and D', are exclusively used for input and output, whereas the sliding memory plane S is used for inter-PE communication (parallel data movement). The next section contains more details.

SUNWOO AND AGGARWAL: SLIDING MEMORY PLANE ARRAY PROCESSOR 603

Fig. 1. The Sliding Memory Plane (SliM) array processor (D' plane, D plane, and processor plane).

Fig. 2. The control unit (CU).

Fig. 3. A processing element.

The control unit (CU), shown in Fig. 2, consists of a processor control subunit and a sliding memory plane control subunit. Each subunit is connected to the host, which handles program loading. The processor control subunit broadcasts the instruction sets to processors from its program memory, while the sliding memory plane control subunit mediates the data movement in S simultaneously. The input and output processor (IOP) controls the data I/O in the D and D' planes. Since each control unit operates separately, computation, communication, and I/O can occur simultaneously.

The processors can process the data in the sliding memory plane S. The IOP can load the input data in a row-parallel (or column-parallel) manner into the I/O shift register plane D or D'. Of course, if D or D' uses a sensory array, image-parallel I/O can be achieved. After being loaded, the data in D or D' are shifted into the sliding memory plane S in one unit cycle time (parallel shift). While the processors process the data in S from D, the IOP can unload the output data from D' and load the next input data into D'. The output data in D' are those which were previously processed and shifted from S. While the processors process the data in S from D', the IOP can alternately unload the output data from D and load the next input data into D. This buffering capability allows I/O to be overlapped with computation.

Each processor can access its pixel in the corresponding register cell on the sliding memory plane. During computation, the contents of all register cells on the sliding memory plane S can be shifted in parallel to the neighboring register cells (North, East, West, or South). After shifting, every processor can access one of its neighboring pixels. Thus, communication can also be overlapped with computation. Accordingly, I/O and communication overhead can be overlapped with computation and so are diminished. The next section describes the operational details of the sliding memory plane.
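The alternating D/D' buffering can be laid out as a timeline (an illustrative sketch with hypothetical frame names, not a description of the hardware): in each step the PEs work on the frame that entered S from one plane while the IOP drains and refills the other.

```python
# Illustrative schedule of the double-buffered I/O planes: the plane feeding
# the sliding plane S alternates between D and D' each frame, while the IOP
# unloads the previous result from, and loads the next frame into, the
# other plane. Frame names f0..f3 are hypothetical.
frames = ["f0", "f1", "f2", "f3"]
schedule = []
for t, frame in enumerate(frames):
    compute_plane = "D" if t % 2 == 0 else "D'"   # plane that fed S this step
    io_plane = "D'" if t % 2 == 0 else "D"        # plane the IOP services
    schedule.append((frame, compute_plane, io_plane))

for frame, cp, iop in schedule:
    print(f"process {frame} from {cp}; IOP unloads/loads {iop}")
```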

B. A Processing Element

The processing element (PE) shown in Fig. 3 consists of an

ALU (Arithmetic Logic Unit) providing Boolean functions as well as arithmetic functions, registers, multiplexers (MUX's), a demultiplexer (DMUX), and a 4 x 2 switching element (SW). We can use two three-state drivers instead of a MUX. In practice, SW can be realized by two 4 x 1 MUX's with the same input lines and different output lines. The shift register s is an element of the sliding memory plane S shown in Fig. 1. Similarly, d and d' are elements of the I/O planes D and D'.

s is connected to four neighboring registers via the switching element SW and MUX. Thus, a PE is also connected to its four neighboring PE's via the register s. Upon completing a sliding operation, the content of s is copied to a latch l to prevent the data in s from conflicting with the newly incoming data from a neighbor. d and d' are connected only to their left and right neighboring registers. This scheme provides inter-PE communication and data I/O without interrupting PE's, as described below.

While the ALU processes the pixel in s, the pixel is shifted into a neighboring shift register and a new neighboring pixel is shifted into s. At the same time, the I/O operation can occur through d or d'. Again, while the ALU processes the new pixel shifted into s, the new pixel is shifted to another neighboring shift register, and another neighboring pixel is shifted into s. The I/O operation can occur simultaneously through d or d'. These operations are executed by all PE's simultaneously. All pixels are moved into the neighboring register cells simultaneously and in the same


direction. Each operation can be controlled separately by the processor control subunit, the sliding memory plane control subunit, or the IOP. Computation, communication, and I/O can occur simultaneously, and inter-PE communication and I/O overhead can be overlapped with computation and so are greatly diminished.

The other major components of a PE are a shift register (SH), a condition register (C) for local operation, an address register (A) for local addressing, four general registers (T's), and a small amount of memory (RAM) located either inside or outside of a PE and used for local data storage. SH, which performs arithmetic and logic shifts, is valuable for multiplication, division, and floating point calculations. C provides conditional operations and the control of SW and MUX for neighboring communication, and gives operation and connection autonomies. The register set T is used for storing intermediate results.

MPP, CLIP, DAP, GAPP, GRID, CAAPP, CM-1, etc., are based on bit-serial communication and computation. SOLOMON, ILLIAC IV, CLIP6, etc., on the other hand, are based on bit-parallel communication and computation. MasPar, CLIP7A, and DAP are based on bit-serial communication and bit-parallel computation. Since most operations in image processing are performed on grey-level rather than binary data, bit-parallel computation is better suited to image processing [16]-[18]. Bit-parallel links between PE's, however, occupy a large portion of the VLSI area and increase the number of output pins on a VLSI chip. Hence, the link between neighboring PE's on SliM is one bit wide, so that more VLSI area can be saved and the number of pins can be reduced. In Fig. 3, the thick lines represent 8-b parallel datapaths while the thin lines represent bit-serial datapaths. Each register cell contains one pixel, and the 8-b ALU operates in bit-parallel. Thus, SliM uses bit-serial communication and bit-parallel computation, as do MasPar, CLIP7A, and DAP.

For fast sliding, shifting, and I/O operations, two clock rates are used: one for normal operations and the other for sliding, shifting, and I/O operations. Sliding and I/O operations are between-chip operations, and the faster clock rate depends entirely on the delay between chips. As discussed, sliding and I/O operations are basically data transfers between shift registers. Hence, it may be possible to complete 8-b inter-PE communication for the sliding operation and 8-b data transfer for I/O within one instruction cycle; in this case, the faster clock rate is eight times the instruction clock rate. If the delay between chips is too large, then a slower clock must be used for sliding and I/O operations.

The number of transistors in the 16-b PE of CLIP7A is approximately twice that in the 8-b PE of SliM. 6800 transistors are used in the PE of CLIP7A [18], while 6000 transistors are used for eight bit-serial PE's in one chip on MPP [14]. Current VLSI technology achieves over one million transistors on a single chip [39]. To be conservative, it may be assumed that the 8-b PE of SliM requires approximately the same number of transistors as the 16-b PE of CLIP7A or as eight 1-b PE's of MPP. Therefore, it would be possible to build a number of SliM PE's on one VLSI chip.

Fig. 4. Three connection modes: (a) receiving mode; (b) by-passing mode; (c) receiving/by-passing mode.

C. Connectivity and Communication

As shown in Fig. 4, the switching element SW and MUX provide three connection modes via two different paths, a receiving path (through s) and a by-passing path. The by-passing path is a new feature that can perform various types of communication. It provides virtual communication links, eliminating the diagonal links and accomplishing various communication and broadcast schemes, which are described later. In Fig. 4, the solid lines represent the paths used for communication and the dotted lines represent the unused paths. Three connection modes can be achieved based on the status of SW and MUX.

1) receiving mode: one of the neighboring pixels is received in the sliding register s.

2) by-passing mode: one of the neighboring pixels is passed to another neighboring PE without being received in s.

3) receiving/by-passing mode: one of the neighboring pixels is received in s, and this pixel or another neighboring pixel is passed to one of the neighboring PE's.

Even though three connection modes are provided, only the receiving mode is necessary for the sliding operation. The other two connection modes are used for different functions which will be described later. The connection modes are determined by the status of the condition register C in each PE. The processor control subunit can globally change the status of C in every PE to achieve centralized control. On the other hand, each PE can locally change the status of C and thus control connectivity independently to achieve distributed control. The distributed control strategy for each PE depends on data and algorithms which will be discussed later.

Using the three connection modes makes it possible to produce virtual communication links for diagonal neighboring PE's, shown in Fig. 5. For example, if the west PE sets the by-passing mode for the center PE, and if the southwest PE sets the receiving mode for the west PE, then the virtual link


from center to southwest is realized. The dotted arrow line is referred to as a virtual link for diagonal communication. Similarly, other virtual links can be realized. However, two neighboring PE's in the same row (or column) cannot send data to their diagonal PE's at the same time because of communication link conflicts. All virtual diagonal links for all PE's (NE, NW, SE, SW) can be realized simultaneously by two sliding memory plane shifts that can be overlapped with computation. Even with four communication links, eight-way connectivity (four physical links and four virtual links) can be achieved by adding the by-passing path. These virtual links are especially advantageous for computations along the border pixels of regions. This type of computation, which is commonly used in image processing [40], will be discussed in detail later.

SliM employs three different communication schemes: local communication between nearest neighbors, nonlocal communication between non-nearest neighbors, and broadcast. As shown in the previous section, concurrent local communication in the same direction can be realized by using the sliding memory plane (sliding operation). Any two PE's can communicate by using the three connection modes. For instance, the left uppermost PE can communicate with the right lowermost PE by forming a virtual communication link: the PE's located right above or right below the diagonal direction set the by-passing mode, and the nonlocal communication link between the left uppermost PE and the right lowermost PE is formed. Hence, using the connection modes accomplishes any nonlocal communication.
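A toy model of how the connection modes form a path (our abstraction; PE names and the `route` function are illustrative): a message entering a PE is either latched into s ("receive"), forwarded without being stored ("bypass"), or both.

```python
# Toy model of the three connection modes along a chain of PEs. A virtual
# diagonal link from the center PE to its southwest neighbor is formed when
# the west PE bypasses and the southwest PE receives, as in the text.
def route(message, hops):
    delivered = []
    for pe, mode in hops:                    # hops: list of (PE name, mode)
        if mode in ("receive", "receive/bypass"):
            delivered.append(pe)             # stored in this PE's s register
        if mode == "receive":
            break                            # receiving mode ends the path
    return delivered

# center -> west (bypass) -> southwest (receive): a one-gate-delay virtual link
print(route("pixel", [("west", "bypass"), ("southwest", "receive")]))
```

The same model covers nonlocal communication: a chain of by-passing PEs forms a direct path between any two PEs, at the cost of the accumulated gate delay the text mentions.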

In addition, SliM provides three different broadcast schemes (shown in Fig. 6): row-broadcast, column-broadcast, and broadcast. For row-broadcast, every PE in a row, except the one issuing the broadcast, sets the receiving/by-passing mode. In each PE, the broadcast data is simultaneously stored in the s register and passed to a neighboring PE. All by-passing paths in the row form a bus via the SW's and MUX's, and each s register in a PE is connected to the formed bus, thus achieving the row-broadcast scheme. Similarly, column-broadcast can be realized.

For broadcast, all the PE's in the same row, except the one issuing the broadcast, set the receiving/by-passing mode as in row-broadcast. Every PE in the rows above sets the receiving mode for its south PE and the by-passing mode for its north PE. Every PE in the rows below sets the receiving mode for its north PE and the by-passing mode for its south PE. Then, all the PE's can receive the broadcast information.

Fig. 6. Broadcast schemes: (a) row-broadcast; (b) column-broadcast; (c) broadcast.

The control strategy for these communication schemes is determined by the processor control subunit or by each PE according to the data and algorithms (data- or algorithm-driven control strategy). Because propagation and gate delays for nonlocal communication and broadcast are not negligible [9], these communications may require several clock cycles.
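The two-phase broadcast above (fill the sender's row, then propagate along every column) can be sketched as follows; this models only the final data placement, not the bus formation or its delay, and the function name is ours.

```python
# Sketch of the full broadcast: the sender's value first fills its row via
# the by-passing bus, then every column forwards that value North and
# South, so all N^2 PEs end up holding it.
def broadcast(plane, si, sj):
    rows, cols = len(plane), len(plane[0])
    value = plane[si][sj]
    # phase 1: row-broadcast along row si via the by-passing bus
    row = [value] * cols
    # phase 2: each column propagates its row-si value up and down
    return [[row[j] for j in range(cols)] for _ in range(rows)]

grid = [[0, 0, 0],
        [0, 9, 0],
        [0, 0, 0]]
print(broadcast(grid, 1, 1))  # every PE receives 9
```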

D. Three Autonomies

Maresca et al. [9] define three different autonomies of SIMD mesh-connected architectures: operation autonomy, addressing autonomy, and connection autonomy. With these autonomies, SIMD architectures show great computational power [9], [17], [18]. CLIP7A [18] introduces operation and addressing autonomies. Connection autonomy is presented in [25]-[28]. BLITZEN [31] has operation and addressing autonomies. SliM has all three autonomies, as follows:

1) operation autonomy: The condition register C provides operation autonomy. Each PE can execute different op- erations based on the status of C. Conditional operations can be efficiently accomplished by using the condition register.

2) addressing autonomy: The address register A provides the local memory address. One bit in C indicates whether the next address is in A or in the broadcast instruction. When C in a PE indicates that the next address is in A, the PE uses the address in A instead of the broadcast address. Thus, SliM realizes addressing autonomy. This scheme efficiently supports a linked list data structure discussed later.

3) connection autonomy: Connection autonomy on mesh- connected SIMD architectures provides powerful com- munication routes and reconfigurability [9], [28]. Due to connection autonomy, various topologies can be embed- ded and more complex algorithms can be implemented. As the previous section showed, each PE can control SW and MUX, and can provide various connection modes and communication schemes. SliM, with flexible connection modes and various communication schemes, achieves connection autonomy.
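As a tiny illustration of the addressing autonomy in item 2), one bit of the condition register C selects between the broadcast address and the PE's own address register A. The Python sketch below is illustrative only; the function and argument names are mine.

```python
# Sketch of addressing autonomy: a condition bit decides whether a PE
# follows the broadcast address or its local address register A, so two
# PE's executing the same instruction can touch different locations.

def effective_address(use_local_bit, A, broadcast_address):
    """Each PE resolves its next local-memory address independently."""
    return A if use_local_bit else broadcast_address

# Two PE's receiving the same broadcast instruction:
assert effective_address(False, A=17, broadcast_address=3) == 3   # follows CU
assert effective_address(True,  A=17, broadcast_address=3) == 17  # follows A
```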



Because SliM supports three different autonomies, it can be used for more complex tasks. With these autonomies and the sliding memory plane, SliM is a flexible and reconfigurable architecture.

III. THE ANALYTICAL MODEL OF SliM

This section discusses the analytical model of SliM and compares it with those of typical mesh-connected architectures. Disregarding the time for program loading from the host to the CU, the total processing time (T_A) has three main components: the data I/O time (T_IO), the computation time (T_CP), and the inter-PE communication time for data exchange (T_PP). These times are functions of the program length (L) for a specific algorithm, the image size (I^2), and the number of PE's employed (N^2). If the image size is larger than the array processor, then the size of a subimage becomes N^2, and the total number of subimages (n_s) is ⌈I^2/N^2⌉. For simplicity, n_s is assumed to be one; in other words, the size of SliM is larger than or equal to the size of the input image. If n_s is larger than one, the following times are multiplied by n_s. The total time to process a whole image can be expressed by

T_A = T_IO + T_CP + T_PP. (1)

If input and output using row parallel (or column parallel) occur simultaneously by shifting, then the I/O time is expressed by

T_IO = n_io t_io (2)

where n_io is the number of columns in an image frame and t_io is the IOP cycle time for one column. If input and output cannot occur simultaneously, then T_IO is expressed by 2 n_io t_io. Of course, if D and D' are the actual sensor arrays, then all pixels can be loaded in one cycle.

The computation time is expressed by

T_CP = n_i t_i (3)

where n_i is the number of instructions to execute a specific algorithm and t_i is the processor instruction cycle time. Note that n_i is about 1/8 of the number of instructions on existing bit-serial mesh-connected SIMD architectures because of bit-parallel processing on SliM.

The inter-PE communication time is expressed by

T_PP = n_c t_c (4)

where n_c is the number of bits to be transferred to neighboring PE's for a specific algorithm, and t_c is the communication time for one bit between neighboring PE's. Note that t_c on SliM is about 8 times faster than on existing architectures because of SliM's separate fast clock for sliding operations. Since the width of SliM's communication link is one bit, the number of bits to be transferred must be considered instead of the number of bytes.

Therefore, the total time for a whole image is again ex- pressed by

T_A = n_io t_io + n_i t_i + n_c t_c. (5)

Since SliM has a buffering capability, I/O can be overlapped with processing. In this case, the total processing time reduces to

T_A = max(T_IO, T_CP + T_PP). (6)

In addition, SliM is capable of inter-PE communication during computation; thus, communication can be overlapped with computation, which further reduces the total processing time. However, for some tasks inter-PE communication cannot be fully overlapped, and some portion of T_PP may remain. The total time T_A can be expressed as follows:

T_A = T_CP + βT_PP, if T_CP + βT_PP ≥ (1 − β)T_PP and T_CP + βT_PP ≥ T_IO;
T_A = (1 − β)T_PP, if (1 − β)T_PP > T_CP + βT_PP and (1 − β)T_PP ≥ T_IO;
T_A = T_IO, otherwise, (7)

where β is the nonoverlapped portion of T_PP with T_CP.

As (7) shows, SliM's total processing time reduces to one of three components, possibly plus a small portion of T_PP. In general, the computation time is larger than the I/O time or the inter-PE communication time. Hence, on SliM the total processing time is composed of only pure computation time, with little or no communication time. In contrast, the total processing time for most bit-serial mesh-connected SIMD architectures is expressed by (5), with a larger n_i and a longer t_c.
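Equation (7) is easy to check numerically. The following Python sketch is a direct transcription of (7), with `beta` standing for the nonoverlapped fraction of T_PP; the function and argument names are mine.

```python
def total_time(t_io, t_cp, t_pp, beta):
    """Total processing time per (7): beta is the fraction of the
    inter-PE communication time T_PP that cannot be overlapped with
    computation; I/O is overlapped whenever possible."""
    overlapped_cp = t_cp + beta * t_pp   # computation-dominated case
    residual_pp = (1.0 - beta) * t_pp    # overlapped-communication case
    if overlapped_cp >= residual_pp and overlapped_cp >= t_io:
        return overlapped_cp
    if residual_pp > overlapped_cp and residual_pp >= t_io:
        return residual_pp
    return t_io                          # I/O-dominated case

# With full overlap (beta = 0) and dominant computation, only T_CP remains;
# without any overlap the machine would instead pay (5): 2 + 10 + 5 = 17.
assert total_time(t_io=2.0, t_cp=10.0, t_pp=5.0, beta=0.0) == 10.0
```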

IV. APPLICATIONS TO IMAGE PROCESSING

Because it performs window operations with little or no communication overhead, SliM is well suited to image processing, where extensive data exchange occurs between neighboring PE's. For instance, 2-D convolution, median filtering, average value, template matching, and zero-crossing are suitable applications for the proposed architecture. Edge detection can be performed by using 2-D convolution algorithms; the gradient, Laplacian, difference of Gaussians, Laplacian of Gaussian, and Sobel operators are some examples. After the convolution of an image with these operators, SliM can efficiently detect edges.

This section presents parallel algorithms for a general 2-D convolution and the Sobel operator and demonstrates how the communication overhead for convolution is entirely overlapped. In tasks such as the Sobel operator, the communication overhead cannot be entirely overlapped. In many image processing tasks, computation takes place along the border pixels of regions [40]; the K-curvature, 1-D Gaussian smoothing along the border, and perimeter and area calculations are a few examples. The virtual communication links are advantageous for these tasks, and the 1-D Gaussian smoothing along the border is illustrated to show the suitability of SliM's virtual links. On SliM, these tasks can be implemented without communication overhead. More details are discussed later.

SUNWOO AND AGGARWAL: SLIDING MEMORY PLANE ARRAY PROCESSOR 607


Fig. 7. A 3 by 3 square window and the direction of sliding operations.

A. 2-D Convolution

A parallel 2-D convolution algorithm [37] is highly suited for implementation on SliM. Fig. 7 shows a 3 by 3 convolution window; the arrow represents the direction of sliding operations. If the direction starts at the center pixel and ends at the southwest pixel in a counterclockwise direction, the direction is O → S → E → N → N → W → W → S → S. The sequence of pixels to be accessed in each PE is opposite to the direction of sliding operations: its own pixel, then the north, northwest, west, southwest, south, southeast, east, and northeast pixels. Every PE can receive its neighboring pixels in this order. The direction of sliding operations is like a Hamiltonian path that starts at any node and visits every node only once. After sliding through all neighbors within the window, every PE can get its final result concurrently.
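The stated sliding order can be verified mechanically. In the Python sketch below (my own simulation, with toroidal wraparound standing in for the array edges), one slide moves the whole data plane a single step, so after sliding south every PE holds its north neighbor's value.

```python
# A functional check of the sliding order for a 3-by-3 window.
DELTAS = {"S": (1, 0), "N": (-1, 0), "E": (0, 1), "W": (0, -1)}

def slide(plane, direction):
    """Shift the whole plane one PE in `direction` (toroidal wrap)."""
    di, dj = DELTAS[direction]
    n = len(plane)
    return [[plane[(i - di) % n][(j - dj) % n] for j in range(n)]
            for i in range(n)]

def pixels_seen(n, pe, directions):
    """The sequence of original coordinates observed by one PE."""
    plane = [[(i, j) for j in range(n)] for i in range(n)]
    seen = [plane[pe[0]][pe[1]]]
    for d in directions:
        plane = slide(plane, d)
        seen.append(plane[pe[0]][pe[1]])
    return seen

# The path O -> S -> E -> N -> N -> W -> W -> S -> S delivers, in order:
# own, N, NW, W, SW, S, SE, E, NE -- exactly the access order in the text.
order = pixels_seen(5, (2, 2), ["S", "E", "N", "N", "W", "W", "S", "S"])
assert order == [(2, 2), (1, 2), (1, 1), (2, 1), (3, 1),
                 (3, 2), (3, 3), (2, 3), (1, 3)]
```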

The equation of the general 2-D convolution is expressed as follows:

Î_ij = Σ_{k=0}^{W−1} Σ_{l=0}^{W−1} w_kl I_{i+k, j+l}

where I_ij is the value at the input pixel (i, j), w_kl is a window coefficient, and Î_ij is the value at the output pixel (i, j).

Suppose that an 8-b/pixel image is convolved with a W by W window. After multiplication, the result is 16 b; therefore, on existing bit-serial mesh-connected architectures, 8W² multiplications, 16(W² − 1) additions, and 16(W² − 1) inter-PE communication steps are required. Thus, the computation complexity is O(W²), and the communication complexity is also O(W²) on existing mesh-connected SIMD architectures [35].

Since the time for a sliding operation is much less than the time for one multiplication and one addition, a sliding operation can be completed within the computation time. If SliM is employed, W² multiplications and 2(W² − 1) additions are needed, regardless of the image size. The inter-PE communication overhead is completely overlapped, and the I/O overhead can also be overlapped if the computation time is larger than the I/O time. Thus, the computation complexity is O(W²) and the communication complexity is zero. Therefore, the total processing time for a whole image consists of only the computation time; Section V presents a more detailed algorithm. Since the direction of sliding operations is programmable and flexible, a window of any shape and any size can be employed on SliM with little or no communication overhead. In contrast, existing mesh-connected SIMD architectures may suffer from performance degradation,


particularly when the window size is larger than 3 by 3 or the shape of the window is not square.

B. The Sobel Operator

Fig. 8 shows the Sobel operator. While the direction of sliding operations for the X-magnitude passes through the center and its north and south coefficients, no computation is required because of the zero coefficients in the window. Thus, the communication overhead cannot be entirely overlapped. However, other necessary operations (e.g., storing intermediate results) can be executed during the nonoverlapped communication time, further reducing the communication overhead. Section V describes this case in detail.

C. The One-Dimensional Gaussian Smoothing Along the Border

SliM's virtual communication links can be efficiently used for computation along the border pixels of regions. Fig. 9 shows the border of a region, including virtual links and physical links. The 1-D Gaussian smoothing along the border is completed by summing the products of each pixel along the border with each coefficient. The route for this computation, shown in Fig. 9, is formed by the connection modes.

One possible routing strategy is determined by the following procedure:

Each PE checks its eight nearest neighbors using sliding operations to determine whether it holds a border pixel (a border PE) or not (a nonborder PE). Each border PE locates its two border neighbors (a clockwise neighbor and a counterclockwise neighbor). There are four possible orientations of two neighboring border PE's (0°, 45°, 90°, and 135°). For example, if two border PE's are located from the upper right to the lower left, the orientation is 45°. If the orientation of two border PE's is 0° or 90°, the physical communication link is used. However, if the orientation is 45° or 135°, then a virtual communication link is used. For example, if the orientation is 45°, the nonborder PE located outside the region sets the by-passing mode for a virtual link between the two border PE's; if the orientation is 135°, the PE located inside the region sets the by-passing mode for a virtual link between the two border PE's. After setting the route, the sliding operation is used in the clockwise or counterclockwise direction along the border within one cycle.
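The link-selection rule in this procedure reduces to a small decision on the offset between consecutive border PE's. The Python sketch below (function name mine) returns which kind of link the procedure would use.

```python
# Sketch of the routing decision for border following: a physical mesh
# link exists for 0/90-degree orientations; 45/135-degree orientations
# need a virtual link formed by a PE in by-passing mode.

def link_kind(di, dj):
    """di, dj: offset from one border PE to its neighboring border PE."""
    assert (di, dj) != (0, 0) and abs(di) <= 1 and abs(dj) <= 1
    if di == 0 or dj == 0:
        return "physical"   # 0 or 90 degrees: a direct mesh link exists
    return "virtual"        # 45 or 135 degrees: a PE sets by-passing mode

assert link_kind(0, 1) == "physical"   # 0 degrees
assert link_kind(1, 0) == "physical"   # 90 degrees
assert link_kind(-1, 1) == "virtual"   # 45 degrees (upper right / lower left)
assert link_kind(1, 1) == "virtual"    # 135 degrees
```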

Hence, the distributed control strategy is determined by the data. As in the 2-D convolution, during a multiplication and an addition, inter-PE communication occurs along the border. Thus, the communication overhead is overlapped. Computation along borders, as well as window operations, can be conveniently implemented on SliM with little or no communication overhead. In contrast, existing mesh-connected SIMD architectures with four links use two steps for diagonal communication. The next section describes more detailed algorithms, algorithm complexities, and expected times for image processing algorithms.

Fig. 9. The route for computation along a border (virtual paths, direct paths, and border pixels).

V. ALGORITHM COMPLEXITIES AND EXPECTED TIMES FOR IMAGE PROCESSING TASKS

This section discusses the algorithm complexity, including the computation complexity and the communication complexity, for image processing on SliM, and then presents estimates of the expected times for image processing tasks. In many papers, the algorithm complexity on mesh-connected architectures is based on the overall time complexity, including both the computation complexity and the communication complexity. Fang et al. describe the communication complexity of the generalized 2-D convolution on array processors [35]. In unit time, each PE can send or receive a word of data from each of its neighbors [35], [41]; standard arithmetic and Boolean operations can also be executed in unit time. In other words, an O(n) complexity means that an algorithm requires at most C1·n inter-PE communication steps and C2·n instructions, where C1 and C2 are positive constants.

Fang et al. [35] deal with the communication complexity and the computation complexity separately. In their paper, a general 2-D convolution algorithm on a mesh-connected architecture requires an O(W²) communication complexity [35]. In contrast, a general 2-D convolution algorithm on SliM has a zero communication complexity because the inter-PE communication overhead can be entirely overlapped. Similarly, many parallel image processing algorithms have either a zero communication complexity or an O(1) communication complexity, as this section discusses.

The assumptions used for our performance evaluation of SliM are based upon the figures for MPP. In MPP, one memory access and several operations can be merged into one instruction, which can be executed within one instruction cycle (100 ns) [12]-[14]. The actual memory access time is about 50 ns [13]. Moreover, 1-b communication between neighboring PE's takes 100 ns, namely, one instruction cycle [32]. In the following, the term cycle means the instruction cycle.

Our performance evaluation of SliM is based on the following conservative assumptions. First, the memory access time for 1 byte, 100 ns, is defined as a nominal instruction cycle time; in practice, the time for a 1-b memory access equals the time for a 1-byte memory access, and only one memory access, with no operation, can be executed in one cycle. Second, at most two operations can be merged into one instruction if no conflict exists, and this instruction can be executed within one cycle; thus, t_i is 100 ns. Third, 8-b passing to a neighboring PE is completed within one cycle time. Fourth, within one cycle the number of 1-b shifts is up to eight; in other words, an 8-b shift is the maximum for one cycle. Fifth, most operations, such as addition, shift, and compare, are assumed to be executed within one cycle, excepting multiplication and division. The multiplication of two 8-b integers requires eight additions and seven 1-b shifts; one addition and one 1-b shift can be combined into an instruction, which can then be executed in one cycle. If the two operands are in registers, and the result is stored into registers, 8 cycles are needed for a multiplication; if the two operands are in memory, and the result is stored into memory, 12 cycles are needed. Sixth, to simplify the performance evaluation, n_s is assumed to be 1; in other words, the size of SliM equals that of the image (512 × 512). If n_s is not 1, the total processing time becomes the processing time for a subimage multiplied by n_s. Since a set of image processing tasks is subsequently performed on the same image, the I/O time is assumed to be less than the total computation time and is overlapped.
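The multiplication cost in the fifth assumption can be sketched as a shift-and-add loop. The Python model below is my own, not SliM microcode; it counts one cycle per add/shift pair, giving the 8-cycle register-to-register figure.

```python
# Shift-and-add multiplication of two 8-b integers, mirroring the cost
# model in the text: one conditional addition and one 1-b shift pair
# into a single cycle, for eight cycles total.

def multiply_8bit(a, b):
    assert 0 <= a < 256 and 0 <= b < 256
    product, cycles = 0, 0
    for bit in range(8):
        if (b >> bit) & 1:
            product += a << bit   # addition, with the shift folded in
        cycles += 1               # one add/shift pair per cycle
    return product, cycles

product, cycles = multiply_8bit(200, 123)
assert product == 200 * 123
assert cycles == 8                # register-to-register case: 8 cycles
```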

The method for the performance evaluation is as follows. The register-transfer-level algorithms are written, and the number of instructions required for these algorithms is counted. Based on the analytical model of SliM, the expected times are then estimated. Since sliding (in other words, inter-PE communication) can occur during computation, sliding-operation and computing-operation statements are put on the same line; wherever more than one statement occurs on the same line in the following algorithms, these statements can be considered to overlap. As described earlier, on completing the sliding operation, the content of s is transferred in parallel into the latch l to prevent the data in s from conflicting with the new data coming from a neighbor. The step for transferring the content of s into l is omitted.

A. A Convolution Algorithm with a Zero Communication Complexity

Assume that the window coefficients are in the broadcast instructions. Fig. 10 shows the general algorithm, in which s represents a shift register on the sliding memory plane S, and T is a set of registers in a PE. As Fig. 10 shows, W² multiplications and 2(W² − 1) additions are required, and inter-PE communication is entirely overlapped. The portion of T_PP that does not overlap with T_CP, that is, β, is zero. Thus, the algorithm requires an O(W²) computation complexity and a zero communication complexity. If the window size is 3 by 3, and the number of bits per pixel is 8, the algorithm requires 9 n_m + 16 cycles, where n_m is the number of cycles for an 8-b integer multiplication. Since the register set T is used


T ← l · w_0;            s ← a neighboring pixel;  /* Sliding */
for i ← 1 until W² − 1 do
    T ← T + l · w_i;    s ← a neighboring pixel;  /* Sliding */

Fig. 10. A parallel convolution algorithm on SliM.

instead of memory, n_m is assumed to be 8 (8 shift/additions). Thus, 88 cycles are required, n_i is 88, and t_i is assumed to be 100 ns. From (3) and (7), the estimated total time is 8.8 μs.

B. A Sobel Operator Algorithm with an O(1) Communication Complexity

Some algorithms cannot be implemented without communication overhead; the Sobel operator is one example. Fig. 8 shows the Sobel operator and Fig. 11 shows the algorithm. Assume that each PE's own pixel is initially in the s register. As Fig. 8 shows, when the direction of sliding operations passes the center, north, and south coefficients, no computation is required; thus, other necessary steps can be executed. For example, moving its own pixel into memory and storing its pixel into T for the X-magnitude computation can be executed in the first two statements. If no step needs to be executed, the communication overhead cannot be overlapped. In the algorithm shown in Fig. 11, the sixth statement for each magnitude is not overlapped.

Hence, this algorithm has an O(1) computation complexity and an O(1) communication complexity. If the number of bits per pixel is 8, the algorithm needs 10 cycles for the X-magnitude and 10 cycles for the Y-magnitude. Thus, it takes about 2.0 μs (20 cycles) for an 8-b integer per pixel.

C. A Histogramming Algorithm with a Zero Communication Complexity

Histogramming, which is not a window operation, is a time-consuming task on MPP due to the inter-PE communication overhead [32]. The parallel algorithm for histogramming on SliM is based on the algorithm proposed in [32], which consists of two main steps: histogramming columns (voting) and totaling rows (summing). In the first step, every pixel is passed to the north (or the south) cyclically using the wraparound feature. Whenever the gray level of the received pixel is the same as the row number of a PE, the counter in that PE is incremented. After the voting, every counter value is passed to the west for summing. The leftmost column of PE's sums the counter values and, finally, contains the histogram for the image.

The differences between the Kushner et al. algorithm on MPP [32] and our algorithm are that the inter-PE communication overhead can be overlapped with computation and that bit-parallel computation can be performed on SliM. Assume that each PE has two numbers which represent its row and column numbers based on a mesh topology, and that N is the number of PE's in a row or a column. Fig. 12 shows the algorithm on SliM, where T1 and T2 are registers in the register set T. Each pixel is assumed to be in the s register. T1 contains the row number of the PE if the row number is less than or equal to the maximum intensity value; T2 acts as a counter.

As Fig. 12 shows, during a comparison and an addition, each pixel in the s register can be passed into one of its neighboring s registers, so the communication overhead is entirely overlapped. Each iteration within each for loop requires two operations. Thus, this algorithm requires an O(N) computation complexity and a zero communication complexity. If N is assumed to be 512 and the maximum intensity value is assumed to be 511, this task requires about 1539 cycles and takes about 153.9 μs. The detailed counting is omitted.

D. An Average Value Algorithm with a Zero Communication Complexity

The average value algorithm requires 2(W² − 1) additions and one division. The algorithm described below is similar to the 2-D convolution algorithm, except for the calculation. Since the inter-PE communication can be overlapped with computation, this algorithm requires an O(W²) computation complexity and a zero communication complexity. If a 3 by 3 window is used, this task requires 17 + n_d cycles, where n_d is the number of cycles for a division.

In the worst case, in other words, when the intensity values of all pixels within the window are 255, the expected maximum value of T is 2295, which can be expressed in 12 b. Thus, the division may consist of 12 additions and 12 shifts if the nonrestoring division technique is applied, since one addition and one 1-b shift can be executed in one cycle. This algorithm requires about 29 cycles and takes about 2.9 μs.

E. A Median Filtering Algorithm with a Zero Communication Complexity

In general, median filtering requires a sorting algorithm after all neighboring pixels are collected. Since the sorting algorithm itself may take a long time, and since sorting and collecting neighbors cannot occur simultaneously, median filtering is a time-consuming task. In contrast, on SliM, the collecting and sorting procedures are not required in the newly proposed algorithm.

The ordered, singly linked list shown in Fig. 14 is used for median filtering, where pixel i represents the ith received pixel. The order in the list is pixel 1 ≤ pixel 3 ≤ pixel 4 ≤ pixel 2. Each PE has its own list. After shifting the sliding memory plane, each PE can access its neighboring pixel and insert it into its list in order. While this insertion occurs, the sliding memory plane can be shifted; thus, collecting can be overlapped with inserting.

Fig. 15 describes the new parallel algorithm for median filtering, in which A represents the address register. When the condition register C indicates that the next address is in A, the PE ignores the address in the broadcast instruction and fetches the next address from A. The address register A is convenient for implementing linked-list data structures because each PE can fetch the next address from A. Hence, addressing autonomy is achieved.

Since the time for an insertion into the list is larger than the time for sliding, the inter-PE communication overhead is overlapped, and only the time for creating the ordered, singly linked list of neighboring pixels is needed. After making the


Computing the X-Magnitude:

own_pixel ← l;           s ← South pixel;      /* store pixel in memory; sliding to North */
T ← own_pixel;           s ← Southwest pixel;  /* sliding to East */
T ← T − l;               s ← West pixel;       /* sliding to South */
T ← T − (l << 1);        s ← Northwest pixel;  /* << represents a 1-b left shift; sliding to South */
T ← T − l;               s ← North pixel;      /* sliding to West */
T ← T + l;               s ← Northeast pixel;  /* sliding to West */
T ← T + (l << 1);        s ← East pixel;       /* sliding to North */
X-Mag ← (T + l) >> 2;    s ← Southeast pixel;  /* sliding to North */
/* after two 1-b right shifts, the result is stored into memory */

Computing the Y-Magnitude (the pixel is in own_pixel):

s ← own_pixel;           /* pixel into s */
T ← own_pixel;           s ← East pixel;       /* sliding to West */
T ← T − l;               s ← Southeast pixel;  /* sliding to North */
T ← T − (l << 1);        s ← South pixel;      /* sliding to East */
T ← T − l;               s ← Southwest pixel;  /* sliding to East */
T ← T + l;               s ← West pixel;       /* sliding to South */
T ← T + (l << 1);        s ← Northwest pixel;  /* sliding to South */
Y-Mag ← (T + l) >> 2;    s ← North pixel;      /* sliding to West */
                         s ← Northeast pixel;  /* sliding to West */
/* after two 1-b right shifts, the result is stored into memory */

Fig. 11. A parallel algorithm for the Sobel operator.
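As a cross-check on the listing, the Sobel magnitudes need only additions, subtractions, and 1-b shifts (x << 1 doubles a value). The Python sketch below applies the standard Sobel coefficients directly to a 3 by 3 neighborhood; the function names and sign convention are mine, and the final scaling shifts are omitted.

```python
# Sobel magnitudes with shift-based doubling only, on a 3x3 patch p,
# where p[0] is the north row and p[0][0] the northwest pixel.

def sobel_x(p):
    """Horizontal gradient: east column minus west column."""
    return (p[0][2] + (p[1][2] << 1) + p[2][2]) \
         - (p[0][0] + (p[1][0] << 1) + p[2][0])

def sobel_y(p):
    """Vertical gradient: south row minus north row."""
    return (p[2][0] + (p[2][1] << 1) + p[2][2]) \
         - (p[0][0] + (p[0][1] << 1) + p[0][2])

patch = [[10, 10, 50],
         [10, 10, 50],
         [10, 10, 50]]              # a vertical edge
assert sobel_x(patch) == 160       # (50+100+50) - (10+20+10)
assert sobel_y(patch) == 0         # no horizontal edge
```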

if (row_number ≤ max_intensity)
    T1 ← row_number;
T2 ← 0;                         /* counter */
for i ← 0 until N − 1 do
    if (l = T1)                 /* row_number = intensity */
        T2 ← T2 + 1;
    s ← South pixel;            /* sliding to North */
s ← T2;                         /* load counter into sliding memory */
for i ← 0 until N − 1 do
    if (leftmost PE)            /* column_number = 0 */
        T2 ← T2 + l;
    s ← East counter;           /* sliding to West */

Fig. 12. A parallel algorithm for histogramming.
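A functional simulation of the voting and summing phases may help; the Python sketch below is my own model of Fig. 12's two loops, not a cycle-accurate one. After k slides to the north, PE (r, c) observes the pixel originally in row (r + k) mod N.

```python
# Functional model of the two-phase histogramming scheme: voting (each
# pixel cycles north; the PE whose row number equals the gray level
# increments its counter) followed by summing the counters westward.

def histogram(image, levels):
    n = len(image)
    counters = [[0] * n for _ in range(n)]
    # Voting: PE (r, c) sees image[(r + k) % n][c] after k northward slides.
    for r in range(n):
        for c in range(n):
            for k in range(n):
                if r < levels and image[(r + k) % n][c] == r:
                    counters[r][c] += 1
    # Summing: the leftmost column accumulates the counters of its row.
    return [sum(counters[r]) for r in range(min(levels, n))]

img = [[0, 1],
       [1, 1]]
assert histogram(img, levels=2) == [1, 3]   # one 0-pixel, three 1-pixels
```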

T ← s;                    s ← one of neighbors;  /* Sliding */
for i ← 1 until W² − 1 do
    T ← T + s;            s ← one of neighbors;  /* Sliding */
s ← T / W²;

Fig. 13. A parallel algorithm for average value.

Fig. 14. An ordered, singly linked list for median filtering (pixel 1 → pixel 3 → pixel 4 → pixel 2).

list, the median value in each list can easily be found at the middle of the list, in all PE's simultaneously. Thus, the total processing time consists of only the time needed for making the list. In contrast with a typical median filtering algorithm consisting of collecting and sorting procedures, no sorting procedure is required, and the collecting procedure (inter-PE communication) is invisible in the new algorithm.

The worst case is one in which each pixel, every time it is received from a neighbor, is greater than all the pixels in the list; the pixel just received must then be compared with all the pixels in the list. In this case, the new algorithm requires an O(W²) computation complexity and a zero communication

PIXEL[1] ← s;                       /* store its own pixel */
for i ← 1 until W² − 1 do
begin
    A ← NEXT[0];                    s ← one of neighbors;  /* Sliding */
    for j ← 1 until j ≤ i do
        if (l < PIXEL[A])
        begin
            insert(l, i+1, j−1);
            break one innermost loop;
        end
        else if (i ≠ j)
            A ← NEXT[A];
        else                        /* i = j */
            insert(l, i+1, j);
end
k ← W² >> 1;                        /* 1-b right shift */
A ← NEXT[0];
for i ← 0 until k − 1 do
    A ← NEXT[A];
T ← PIXEL[A];                       /* median in T register */

procedure insert(data, empty, order)
begin
    PIXEL[empty] ← data;
    NEXT[empty] ← NEXT[order];
    NEXT[order] ← empty;
end

Fig. 15. A parallel algorithm for median filtering.

complexity. For a 3 by 3 window and 8 b/pixel, the estimated worst-case time is 11.2 μs on SliM. In contrast, a typical median filtering algorithm using collecting and sorting requires about 16.6 μs in the worst case on SliM. Further details of this calculation are omitted.
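The list-insertion idea of Fig. 15 can be modeled functionally. The following Python sketch is my own model and ignores the overlap with sliding; it inserts each arriving neighbor into an ordered list and reads the median from the middle.

```python
# Functional model of the median-filtering idea: each newly received
# neighbor is inserted into an ordered list while the next slide would
# be in flight, so no separate sort or collect phase is needed.

def median_filter_pe(own_pixel, neighbors):
    ordered = [own_pixel]              # PIXEL[1] <- s: store own pixel first
    for p in neighbors:                # one pixel arrives per sliding step
        j = 0                          # walk the ordered list (cf. Fig. 15)
        while j < len(ordered) and ordered[j] <= p:
            j += 1
        ordered.insert(j, p)
    return ordered[len(ordered) // 2]  # median sits in the middle

window = [7, 3, 9, 1, 5, 8, 2, 6]      # the eight neighbors of a 3x3 window
assert median_filter_pe(4, window) == 5
```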

In summary, Table I presents the computation and communication complexities and the expected times for the algorithms previously mentioned.

F. The DARPA Low Level Vision Benchmarks

The estimated performance figures of SliM for the DARPA low level image understanding benchmarks [40], presented in [3], show significant improvements compared with those of existing architectures [3], [38]. Tables II and III, respectively, list the performance for the 512 × 512 8-b integer/pixel intensity image and for the 512 × 512 32-b floating point/pixel depth image.


TABLE I
THE PERFORMANCE SUMMARY OF SliM FOR THE N × N (512 × 512) 8-b INTEGER/PIXEL INTENSITY IMAGE

Task              Computation Complexity   Communication Complexity   Estimated Time (μs)
2-D Convolution   O(W²)                    zero                       8.8
Sobel             O(1)                     O(1)                       2.0
Histogramming     O(N)                     zero                       153.9
Average           O(W²)                    zero                       2.9
Median Filtering  O(W²)                    zero                       11.2

TABLE II
THE PERFORMANCE OF SliM FOR THE DARPA 512 × 512 8-b INTEGER/PIXEL INTENSITY IMAGE

Task                                   Estimated Time (μs)
K-curvature                            12.7
1-D Gaussian Smoothing                 9.2
1st Derivative of Smoothed Curvature   1.2
Zero-crossing                          1.1
Thresholding                           0.3
ANDing                                 0.1

VI. CONCLUSIONS

This paper proposes a new mesh-connected SIMD architecture, called the Sliding Memory Plane (SliM) Array Processor. Differing from existing mesh-connected SIMD architectures, SliM has several salient features, such as a sliding memory plane that provides inter-PE communication during computation, and two I/O planes that provide an I/O overlapping capability. Thus, inter-PE communication and I/O overhead can be overlapped with computation. Inter-PE communication time is invisible in most image processing tasks because the computation time is larger than the communication time on SliM. The ability to overlap inter-PE communication with computation, regardless of window size and shape and without using a coprocessor or an on-chip DMA controller, is unique to SliM.

Diagonal communication links are not needed on SliM because the by-passing path or two sliding memory plane shifts can achieve diagonal communication; as a result, SliM eliminates diagonal links. These virtual links, which incur no instruction overhead, are unique to SliM and are especially advantageous for computations along the border pixels of regions, which are common in image processing. Therefore, SliM alleviates the disadvantages of existing mesh-connected SIMD architectures, namely, inter-PE communication overhead, I/O overhead, and relatively complicated interconnections. The by-passing path can also realize nonlocal communication and broadcast. In addition, each PE on SliM can operate separately, based on its local autonomies.

The paper proposes new parallel algorithms for SliM with a zero or an O(1) communication complexity. As discussed, SliM achieves performance improvements even with conservative assumptions. There are several reasons why SliM's performance surpasses that of existing mesh-connected SIMD architectures [6], [8], [38]. First, inter-PE communication overhead


TABLE III
THE PERFORMANCE OF SliM FOR THE DARPA 512 × 512 32-b FLOATING POINT/PIXEL DEPTH IMAGE

Task                 Estimated Time (μs)
Median Filtering     40.9
Sobel Operator       15.7
Gradient Magnitude   9.4
Thresholding         0.5

can be overlapped with the computation. Second, bit-parallel processing is faster than bit-serial processing. Third, in most existing mesh-connected SIMD machines, PE's store pixels into memory after the pixels are received; during processing, the pixels stored in memory must be accessed for computation, and this memory access overhead is significant. In contrast, SliM's S register plane contains all the pixels, which can be transferred to neighbors during computation and directly accessed by the ALU. Thus, the overhead for memory access is reduced. Fourth, the set of registers (T) can be effectively used in place of memory for storing intermediate results, further reducing the memory access overhead.

In summary, SliM is a flexible and reconfigurable architecture with unique features that alleviate the drawbacks of existing mesh-connected SIMD architectures. Performance degradation due to those drawbacks is minimized, allowing higher throughput. The concept of the sliding memory plane may also be applicable to other special-purpose VLSI architectures. Future research will investigate more detailed VLSI implementation issues and the applicability of the sliding memory plane idea to special-purpose architectures.

REFERENCES

[1] M. H. Sunwoo and J. K. Aggarwal, "A sliding memory plane array processor for low level vision," in Proc. Int. Conf. Pattern Recognition, Atlantic City, NJ, June 1990, pp. 312-317.
[2] -, "A sliding memory plane array processor," in Proc. 2nd Symp. Frontiers '88 Massively Parallel Computation, Fairfax, VA, Oct. 1988, pp. 537-540.
[3] -, "A vision tri-architecture (VisTA) for an integrated computer vision system," in Proc. DARPA Image Understanding Benchmark Workshop, Avon, CT, Oct. 1988.
[4] -, "VisTA - An image understanding architecture," in Parallel Architectures and Algorithms for Image Understanding, V. K. P. Kumar, Ed. New York: 1991, pp. 121-154.
[5] Special Issue on Computer Architecture for Image Processing, IEEE Comput. Mag., Jan. 1983.
[6] F. A. Gerritsen, "A comparison of the CLIP4, DAP and MPP processor-array implementations," in Computing Structures for Image Processing, M. J. B. Duff, Ed. New York: Academic, 1983, pp. 15-30.
[7] S. H. Unger, "A computer oriented toward spatial problems," Proc. IRE, vol. 46, pp. 1744-1750, Oct. 1958.
[8] T. J. Fountain, "A survey of bit-serial array processor circuits," in Computing Structures for Image Processing, M. J. B. Duff, Ed. New York: Academic, 1983, pp. 1-13.
[9] M. Maresca, M. A. Lavin, and H. Li, "Parallel architectures for vision," Proc. IEEE, vol. 76, Aug. 1988.
[10] J. Gregory and R. McReynolds, "The SOLOMON computer," IEEE Trans. Electron. Comput., vol. EC-12, pp. 774-780, Dec. 1963.
[11] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes, "The ILLIAC IV computer," IEEE Trans. Comput., vol. C-17, pp. 746-757, Aug. 1968.
[12] K. E. Batcher, "Design of a massively parallel processor," IEEE Trans. Comput., vol. C-29, pp. 836-840, Sept. 1980.
[13] -, "Bit-serial parallel processing systems," IEEE Trans. Comput., vol. C-31, pp. 377-384, May 1982.


[14] J. L. Potter, Ed., The Massively Parallel Processor. Cambridge, MA: M.I.T. Press, 1985.
[15] M. J. B. Duff, "Review of the CLIP image processing system," in Proc. Nat. Comput. Conf., 1978, pp. 1055-1060.
[16] T. J. Fountain, "Toward CLIP 6 - An extra dimension," in Proc. IEEE CS Workshop Comput. Architect. for Pattern Anal. and Image Database Management, 1981, pp. 25-30.
[17] —, "Plans for the CLIP 7 chip," in Integrated Technology for Parallel Image Processing, S. Levialdi, Ed. New York: 1985, pp. 199-214.
[18] T. J. Fountain, K. N. Matthews, and M. J. B. Duff, "The CLIP7A image processor," IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 310-319, May 1988.
[19] DAP Series Technical Overview, Active Memory Technology Inc., 1989.
[20] R. Davis and D. Thomas, "Systolic array chip matches the pace of high-speed processing," Electron. Design, vol. 32, no. 22, pp. 207-218, Oct. 1984.
[21] T. M. Silberberg, "The Hough transform on the geometric arithmetic parallel processor," in Proc. Comput. Architect. for Pattern Anal. Machine Intell., 1985, pp. 387-393.
[22] L. N. Robinson and W. R. Moore, "A parallel processor array architecture and its implementation in silicon," in Proc. IEEE Custom Integrated Circuits Conf., Rochester, NY, May 1982, pp. 41-45.
[23] T. Sudo and T. Nakashima, "An LSI adaptive array processor," in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, Feb. 1982, pp. 122-123, 307.
[24] A. P. Reeves, "A systematically designed binary array processor," IEEE Trans. Comput., vol. C-29, pp. 278-287, Apr. 1980.
[25] C. Weems, "Some sample algorithms for the image understanding architecture," in Proc. DARPA Image Understanding Workshop, Washington, DC, 1988, pp. 127-138.
[26] H. Li and M. Maresca, "Polymorphic-torus: A new architecture for vision computation," in Proc. IEEE CS Workshop Comput. Architect. for Pattern Anal. Machine Intell., Seattle, WA, Oct. 1987, pp. 176-183.
[27] —, "Polymorphic-torus network," in Proc. Int. Conf. Parallel Processing, Aug. 1987, pp. 411-414.
[28] R. Miller, V. K. P. Kumar, D. Reisis, and Q. F. Stout, "Meshes with reconfigurable buses," in Proc. MIT Conf. Advanced Research in VLSI, 1988, pp. 163-178.
[29] W. D. Hillis, The Connection Machine. Cambridge, MA: M.I.T. Press, 1985.
[30] J. R. Nickolls, "The design of the MasPar MP-1: A cost effective massively parallel computer," in Proc. IEEE Compcon Spring 90, 1990, pp. 25-28.
[31] E. W. Davis and J. H. Reif, "Architecture and operation of the BLITZEN processing element," in Proc. 3rd Int. Conf. Supercomput., vol. III, 1988, pp. 128-137.
[34] J. P. Strong, "The Fourier transform on mesh connected processing arrays such as the massively parallel processor," in Proc. IEEE CS Comput. Architecture for Pattern Anal. Machine Intell., 1985, pp. 190-196.
[35] Z. Fang, X. Li, and L. M. Ni, "On the communication complexity of generalized 2-D convolution on array processors," IEEE Trans. Comput., vol. 38, no. 2, pp. 184-194, Feb. 1989.
[36] A. Rosenfeld and A. C. Kak, Digital Picture Processing. New York: Academic, 1982.
[37] S.-Y. Lee and J. K. Aggarwal, "Parallel 2-D convolution on a mesh connected array processor," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-9, pp. 590-594, July 1987.
[38] Proc. DARPA Image Understanding Benchmark Workshop, Avon, CT, Oct. 1988.
[39] B. Ledbetter, Jr., R. McGarity, E. Quintana, and R. Reininger, "The 68040 integer and floating-point units," in Proc. IEEE Compcon Spring 90, 1990, pp. 259-263.
[40] C. Weems, E. Riseman, A. Hanson, and A. Rosenfeld, "An integrated image understanding benchmark: Recognition of a 2 1/2D mobile," in Proc. DARPA Image Understanding Workshop, Washington, DC, 1988, pp. 111-126.
[41] R. Miller and Q. F. Stout, "Mesh computer algorithms for computational geometry," IEEE Trans. Comput., vol. 38, no. 3, pp. 321-340, Mar. 1989.

Myung Hoon Sunwoo (S’87-M’90) received the B.S. degree in electronic engineering from Sogang University, Seoul, Korea, in 1980, the M.S. degree in electrical and electronic engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1982, and the Ph.D. degree in electrical and computer engineering from the University of Texas at Austin in 1990.

From 1982 to 1985, he was a member of the Research Staff at the Korea Electronics and Telecommunications Research Institute, Taejeon, Korea. From 1986 to 1990, he was a Research Assistant in the Computer and Vision Research Center at the University of Texas at Austin. During 1990-1992 he was a Research Staff Member at the Digital Signal Processor Operations, Motorola Inc., Austin, TX. He is currently an Assistant Professor in the Department of Electronic Engineering, Ajou University, Suwon, Korea. He has been involved in a variety of research projects at these institutions. His research interests include parallel architectures and algorithms, computer architectures, VLSI architectures and design, DSP chips, digital cellular system design, and image and speech processing.

[32] T. Kushner, A. Y. Wu, and A. Rosenfeld, "Image processing on MPP," Pattern Recognition, vol. 15, pp. 121-130, 1982.
[33] T. Ericsson and P.-E. Danielsson, "LIPP - A SIMD multiprocessor architecture for image processing," in Proc. 10th Annu. Int. Symp. Comput. Architecture, 1983, pp. 395-400.

J. K. Aggarwal (S’62-M’65-SM’74-F’76), for a photograph and biogra- phy, see the March 1993 issue of this TRANSACTIONS, p. 346.