
A scalable high-performance graphics processor: GVIP


Tsuneo Ikedo

Computer Architecture Laboratory, University of Aizu, Tsuruga, Ikki-machi, Fukushima 965, Japan

The GVIP (geometric and TV image processor) graphics processor, which creates and synthesizes computer graphics and TV images and meets the requirements of multi-media systems, is described. The hardware modules that make up this graphics processor include: a 32-bit embedded RISC processor, a Phong and Gouraud shading processor, a texture mapping processor, a hidden surface removal processor, an HDTV video image processor, a BitBlt processor, an image-processing module, and an outline font fill generator. These hardware modules, fabricated using 0.8 µm CMOS standard cells, have been placed in three integrated circuit chips. The total number of gates used for one set of chips is approximately 350,000.

Key words: Graphic processor - Multi-media systems - HDTV - Polygon rendering

Introduction

Research and development in the graphics processor field is embarking upon a new epoch as demands for ever higher drawing speeds and multi-media system requirements have made past solutions obsolete. Multi-media systems have to manage as well as synthesize various types of inputs such as computer graphics images, camera images, and sound. For example, data captured by a camera or scanner may have to be synthesized in multi-dimensional space and combined with character data and computer graphics images using filtering or pattern recognition techniques; the graphics output may be displayed on such devices as CRT monitors or color laser printers after converting the pixel primitives into appropriate data formats in real time.

Past research and development efforts in the graphics processing area have been mainly in the area of wireframe and polygon rendering techniques; they employed special software algorithms which met such graphics standards as PHIGS+ or used massively parallel computer architectures to enhance the speed of ray-tracing methods. Even for such extremely demanding applications as multi-media graphics, the trend has been to use massively parallel general-purpose processing element (PE) architectures or specialized symmetric hardware modules interconnected for parallel processing. For example, geometric images in current rendering systems are often produced by using special data formats (e.g. polynomial equations) for the drawing primitives, symmetric interconnection networks, and, for the processing elements (PEs), one of the following: general-purpose RISC processors [1], digital signal processors (DSP) [2], or application-specific integrated circuits (ASICs) [3, 4]. Their performance varies considerably depending on the data type being processed or on the function being carried out; in multi-media applications, they would not fare well because of their inefficiency in handling diverse data types concurrently and their poor utilization of computer resources. In fact, when a drawing speed exceeding a few million polygons per second is required, massively parallel general-purpose PE architectures become extremely cost inefficient because the required number of PEs becomes excessive for practical implementations. Hybrid organizations which combine PEs, used as graphics accelerators, and specialized hardware for display controllers are also used in currently available commercial systems [5]. Looking at the cost vs performance issues, the best solution is believed to be a graphics processor built with several highly specialized modules interconnected for parallel processing.

In this paper, a new type of graphics processor, the GVIP (geometric and TV image processor), will be described. This processor is built with multiple hardware modules; each hardware module has its own specialized function, such as execution of application-specific software code (RISC processor), rendering, BitBlt, image synthesis with geometric and HDTV primitives, hidden surface removal, and texture mapping. It can handle various data types without performance degradation because hardware resources do not have to be shared. Furthermore, it is scalable and hence drawing speeds exceeding 10 million polygons per second may be easily achieved.

The Visual Computer (1995) 11:121-133. © Springer-Verlag 1995

System overview

GVIP is a graphics system which can manage, synthesize, and display, as single or combined images, various kinds of data derived from computer graphics and TV pictures. The system consists of two parts: the graphics accelerator and the drawing primitives generator. In this paper, only the drawing primitives generator will be described; the GVIP graphics accelerator will be described in a forthcoming paper. The GVIP is scalable and hence its performance may be enhanced by interconnecting sets of modules in a MIMD architecture. One set of GVIP chips can produce 3 million three-dimensional polylines/s and 1.2 million polygons/s while carrying out 100-pixel hidden surface removal (24 bits depth), texture-mapping, and Phong shading operations. The remarkable feature of GVIP is that it is capable of maintaining this kind of performance even when it receives HDTV images (73 MHz), because it is able to store images in its frame buffer independently.

The basic system shown in Fig. 1 consists of three VLSI chips: the graphics processor GVIP01, the BitBlt (bit block transfer) processor GVIP02, and the video controller GVIP03. The GVIP system provides real-time interfaces with various devices without using special buffers. Conventional graphics processing computer architecture approaches such as pixel arrays, systolic arrays, or special purpose multi-processor organizations which employ microprogramming are not used in GVIP. This chip integrates several different hardware modules which perform different functions on different data. Parallel processing and pipelining are used extensively. The drawing speed is virtually unaffected by the type of data processed because there are hardware modules optimized for each data type. Furthermore, GVIP has the capability to operate in parallel at very high speeds in such tasks as simultaneous computer image generation (rendering process), HDTV image transmission, and true color image printing without any additional frame grabbers.

Fig. 1. GVIP system organization

Fig. 2. GVIP graphics display system

Fig. 3. GVIP graphics instruction set

The GVIP01 chip is connected to the system bus and acts as the master graphics processor for the system. It has an embedded 32-bit RISC processor and several hardware modules with specialized graphics functions. Main memory and program RAM/ROM of the RISC processor are available externally for the user's specified program and data processing. It analyzes instructions sent from the system processor and then sends the pixel primitives to the following hardware modules: the Gouraud and Phong shading module, the BitBlt module, the hidden surface removal (HSR) module, the TV (HDTV) image module, the texture mapping module, the outline font fill module, the image processing module, and the parallel link channel protocol module.

The BitBlt processor chip GVIP02 acts as an interface between the GVIP01 and the frame buffer. This architecture, which uses a master graphics processor and several BitBlt processors to manage pixel operations by dividing a frame buffer, has been used in the past [6-8]. In addition to the usual modules found in BitBlt processors, such as the pixel cache, the interior style generator, the boolean operation unit, and the z cache, the GVIP02 has four-way Phong shading generators, a texture mapping processor, a video signal converter, and a parallel link channel. The transmission speed (including the frame buffer writing time) from the external bus to the GVIP, passing through this chip and terminating in the frame buffer, is 8 ns per pixel (32 bits). This transmission speed figure is based on actual experimental data using a 60-ns access time DRAM (e.g. MB814260) as the frame buffer and 100 MHz video data frequency.

GVIP03 is a video controller chip which generates and controls cursors and video images.
It integrates the video signal sent by the GVIP02 chip with cursors and outputs the combined signals to a digital-to-analog converter (DAC). The input/output signals of this chip are organized with two-phase timing in order to be able to process high-frequency (100 MHz or higher) video signals. The basic organization of the GVIP system with true color capability is shown in Fig. 2; in this figure, the GVIP04 chip which carries out the parallel z component comparisons is shown. Detailed descriptions of all the modules in the GVIP system will be given in the subsequent sections.

As shown in Fig. 3, the GVIP graphics processing instructions are organized in a hierarchical fashion. The approximately 60 instructions in layer 1 are used to control hardware. Instructions in layer 2 are executed by two different modules: (a) the embedded RISC processor in GVIP01: image processing and highlight pre-processing; and (b) the graphics accelerator (attached to the GVIP01): coordinate transformations, window control, and user-specified highlighting. The instructions in the third layer are used by the system processor. In total there are nearly 200 instructions: 30 for hardware control, 19 for window control, 56 for primitives, 38 for attributes, 47 for coordinate transformations, and 16 for sound.

Graphics processor GVIP01

A block diagram of the GVIP01 chip is shown in Fig. 4. Many different types of highly specialized pipelined/parallel processors are integrated within the GVIP01 chip. These processors are interconnected using a route tree topology. The drawing primitives sent to GVIP01 by the system processor undergo several processes such as coordinate transformation, perspective projection, clipping, and highlighting in order to produce device coordinate primitives which are then used to generate pixels. The RISC processor inside the GVIP01 chip could be used to carry out all of these processes using software routines; however, its main role is restricted to the generation of pixels from the device coordinate primitives so that high throughput may be attained. The coordinate transformations are processed by a graphics accelerator; its details are not discussed in this paper. Other processes carried out or preprocessed by the GVIP01 chip include shading, texture mapping, hidden surface removal, and anti-aliasing. This chip works as the central graphics system processor, managing and controlling other chips such as GVIP02 and GVIP03. It was designed using standard cells and the chip occupies 14 mm². The GVIP01 chip consists of the following modules:


Fig. 4. GVIP01 graphics processor

Fig. 5. Example of rendering processing

- Thirty-two-bit RISC processor
- Rendering (Gouraud and Phong shading) preprocessor
- Texture mapping processor
- Thirty-two-bit hidden surface and line removal processor
- Anti-alias processor
- BitBlt and image copy/move processor
- Outline font fill processor
- Parallel link channel controller
- Image processing module
- Frame buffer address and timing generator
- Video timing generator

Embedded RISC processor

Currently, a 32-bit, 25-MIPS embedded RISC processor is used; it will be upgraded to a 64-bit 110-MIPS/100-MFLOPS RISC processor (0.6 µm CMOS) during 1994. Its main functions are to interpret instructions/data sent by the system processor and to distribute instructions and data to graphics modules; it seldom performs arithmetic operations. In addition to a general purpose register, it has 32-pixel (32-bit/pixel) image buffers which can directly transmit pixels to/from a frame buffer. The execution code and data reside outside the GVIP01 chip; graphics-specific instructions are used to minimize the machine cycle. Separate instruction and data busses are used. The RISC processor also performs functions such as control of the multiple local bus and special interrupt and fetch handling for hardware graphics modules. In a rendering mode such as Gouraud or Phong, this processor is used simply to manage data transmission (load and store) to the appropriate hardware modules; other rendering tasks are handled by the rendering processor. The logic gates used for the RISC processor occupy about 30% of the GVIP01 chip area; this implies that most of the GVIP01 functions are carried out by the specialized graphics modules on the GVIP01 chip.

Rendering processor

This processor works together with the embedded RISC processor in the GVIP01 chip. It consists of the sequence controller, 16 register files, 14 digital differential analyzers (DDAs) grouped into two sets (seven DDAs/set), and a high-speed bus which connects it to the RISC processor. Vertex interpolations are computed for three-dimensional coordinates, normal vectors, and for intensity transformation vectors along the outlines of a polygon. The transformation vector interpolations are needed for texture mapping. Two sets of DDAs are used so that it is possible to simultaneously output the coordinates and the attributes of the right and left branches of a polygon outline; this computation starts at the bottom vertex of a polygon and is performed at a rate of two cross-points every 20 ns. The cross-points obtained by this processor are transferred to the GVIP02 chip where there is circuitry to shade a polygon surface. Outline interpolation for arbitrary polygons with holes and font fill generation is also executed by this processor.

The main processes involved in Phong shading are shown in Fig. 5. The coordinates and attributes of a polygon vertex are input from the system processor to the RISC processor and then they are stored in the register file or main memory. The rendering processor interpolates the vertex coordinates and their attributes in order to trace the outlines of a polygon; it then outputs two cross-points per horizontal line which are transferred to the DDAs for further processing. After obtaining all the coordinates and attributes of a polygon surface, the data are sent to the appropriate modules for Phong shading and intensity calculations. The shaded pixel is then transferred to the pixel cache and written into the frame buffer. There are seven pipeline stages between the rendering processor and the pixel cache; the throughput of this pipeline is 20 ns/pixel independent of the texture mapping operation or the transmission/reception rate of the HDTV image.
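The outline tracking step can be illustrated with a minimal Python sketch. It is not the GVIP circuit: it only shows the DDA idea of adding a constant increment per scan line, applied to a left and a right edge at once so that two cross-points come out per horizontal line. The function names `dda` and `cross_points`, the vertex layout `(x, y, intensity)`, and the use of a single interpolated attribute are assumptions made for illustration.

```python
def dda(start, end, steps):
    """Digital differential analyzer: yield `steps` values beginning at
    `start`, adding one precomputed increment per step (no multiply or
    divide inside the loop)."""
    increment = (end - start) / steps
    value = start
    for _ in range(steps):
        yield value
        value += increment


def cross_points(left_edge, right_edge):
    """Trace two polygon edges that share the same y range.

    Each edge is ((x0, y0, i0), (x1, y1, i1)) with y0 < y1, where i is an
    intensity attribute (hypothetical layout). Returns one tuple
    (y, x_left, i_left, x_right, i_right) per scan line from y0 to y1 - 1.
    """
    (lx0, y0, li0), (lx1, y1, li1) = left_edge
    (rx0, _, ri0), (rx1, _, ri1) = right_edge
    lines = y1 - y0
    return list(zip(
        range(y0, y1),
        dda(lx0, lx1, lines), dda(li0, li1, lines),   # left branch
        dda(rx0, rx1, lines), dda(ri0, ri1, lines),   # right branch
    ))


# A triangle-like outline that starts at the bottom vertex (4, 0).
points = cross_points(((4, 0, 0.0), (0, 4, 1.0)),
                      ((4, 0, 0.0), (8, 4, 1.0)))
```

Each returned tuple corresponds to one pair of cross-points; a span processor would then interpolate between the left and right values along that horizontal line.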

Texture mapping processor

The generation of texture mapping in the GVIP01 chip is carried out by using an interpolation method which defines three-dimensional unit vectors for each specified vertex; these vectors are then used to gauge the transformation rate of rotation, scale division, and perspective division. Texture mapping is executed together with coordinate and attribute interpolations. Using these transformed vertex vectors, the texture mapping processor calculates the vectors for all points inside a polygon using four of the 14 DDAs used by the rendering processor. These vectors may be used for reverse mapping transformation by multiplying the vectors with an appropriate coordinate value in order to get the reference address of the texture pattern RAM. The pattern output from the RAM is concatenated with the shaded variable by the Phong shading module in the GVIP02 chip. The texture pattern RAM resides outside of the graphic processor chips. The texture mapping processor consists of several multipliers and dividers for reverse mapping transformations. Texture mapping is executed in parallel at a rate of 20 ns per pixel (32 bits/pixel), synchronizing with DDA movement. The speed at which texture mapping can be accomplished depends on the read-access time of the pattern RAM.

BitBlt processor

The BitBlt processor has two interfaces: one for direct memory mapped addressing and the other for program I/O between system bus and frame buffers. The number of address bits used in the frame buffer for its x, y, and z coordinates are 16, 16, and 32 bits respectively. There are three different ways in which pixel arrays can read/write to/from the frame buffer: horizontal (x), square (x, y), and box (x, y, z). The access rate (read and write) of the frame buffer is 140 ns/32 pixels when a video signal running at 100 MHz is used. The pixel array is transferred from the BitBlt processor to the permutation processor for the boundary shift operation. At the same time, the mask bits for boolean operations and the auto-warp boundary address (when it occurs) are generated automatically by the BitBlt processor. The BitBlt processor does not use any modules which are part of the RISC processor, the rendering processor, or the DDAs. Pipelining allows the BitBlt processor to work at a rate of 20 ns/pixel.

Image processor

Image processing (e.g. enlargement/reduction, interval and gradation filtering, density pattern conversion) is mainly carried out by the RISC processor using built-in program routines. To achieve very high pixel transmission rates to/from the frame buffer, a 32-pixel image buffer is provided as a shadow general-purpose register to the RISC processor. This image buffer is directly connected to the pixel cache inside the GVIP02 chip via a dedicated I/O channel. The image buffer has attached to it a special RGB multiplier and adder combined with the ALU in the RISC processor for true color enhancement operations. The multiply and add operations take at most 80 ns/pixel, including the read access time of the frame buffer. The RISC processor uses 16 special instruction sets for image processing.

Computer graphics/TV image synthesizer

The RISC processor uses the dedicated I/O channel described above for the synthesis of computer graphics and TV images. A TV image is stored in an arbitrary hidden area of the frame buffer using a parallel link channel. Texture mapping of static and dynamic (1/30 s) video images onto an arbitrarily shaped, real-time, three-dimensional polygon is produced by using a RAM which buffers the TV image from the frame buffer instead of using the texture pattern RAM shown in Fig. 5; this is simply done by redrawing the mapped surface at the appropriate video refresh rates. The transmission of a TV image from the frame buffer to the special RAM can also be done using the parallel link channel available on the GVIP02 chip.

Hidden surface removal processor

There are two possible ways to build an HSR processor using the z buffer method: one is to set the z buffer area in the frame buffer and the other is to use a special buffer. The GVIP system can support both of them. In order to have a common frame buffer configuration, the z buffer area may be arbitrarily defined within the hidden area of the frame buffer using the RISC processor; this makes it possible for the RISC processor to access and change the z value regardless of whether the system is in the HSR mode. Using this technique, it is possible to create a three-dimensional image from a two-dimensional one by assigning appropriate z component values to a two-dimensional image; thus, three-dimensional image combinations of computer graphics and TV images are possible.
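As a reminder of what the z buffer method does per pixel, the following Python sketch keeps a depth value next to every color value and accepts a new pixel only when it is nearer than the stored one. It is a generic software illustration, not the GVIP hardware. The names `make_buffers`, `write_pixel` and `assign_depth`, the convention that a smaller z is nearer, and the 24-bit far value are assumptions.

```python
FAR = 2**24 - 1  # assumption: largest value of a 24-bit depth entry


def make_buffers(width, height, background=0):
    """Return (color, depth) buffers; depth starts at the far plane."""
    color = [[background] * width for _ in range(height)]
    depth = [[FAR] * width for _ in range(height)]
    return color, depth


def write_pixel(color, depth, x, y, z, value):
    """z buffer test: store the pixel only if it is nearer than what is
    already there. Returns True when the pixel was written."""
    if z < depth[y][x]:
        depth[y][x] = z
        color[y][x] = value
        return True
    return False


def assign_depth(depth, z):
    """Give a flat two-dimensional image a single depth, so that later
    three-dimensional primitives can pass in front of or behind it."""
    for row in depth:
        for x in range(len(row)):
            row[x] = z


color, depth = make_buffers(4, 4)
first = write_pixel(color, depth, 1, 1, z=100, value=7)   # accepted
hidden = write_pixel(color, depth, 1, 1, z=200, value=9)  # behind, rejected
nearer = write_pixel(color, depth, 1, 1, z=50, value=3)   # in front, accepted
```

The `assign_depth` helper mirrors the idea in the text of giving a two-dimensional image (for example a TV frame) z values so that it can be combined with three-dimensional graphics.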

Frame-buffer address and video timing generator

The video timing signals for the CRT monitor, printer, and the parallel link are defined and generated on this chip. Most of the geometric image addresses coming from the hardware modules are multiplexed and output to the frame buffer using the frame buffer access timing for synchronization.

BitBlt processor GVIP02

As shown in Fig. 2, a cluster of GVIP02 chips is connected via a graphics bus to a GVIP01 chip; such cluster arrangements may be expanded without limits. The GVIP02 chip is able to transmit/receive instructions and partially processed pixels at a rate of 200 Mbytes/s to/from the GVIP01 chip. One GVIP02 chip directly controls four planes of the frame buffer; hence, six GVIP02 chips are used to control the 24 frame buffer planes which are needed for true color rendition. As shown in Fig. 6, the GVIP02 chip consists of the following components:

- Phong/Gouraud shading generator
- Line smoothing controller
- Texture mapping synthesizer
- Three-dimensional pixel cache and FIFO
- Boolean operation unit
- HSR controller
- A 128-pixel parallel-to-serial converter for video refresh
- 32-bit bi-directional parallel link

Fig. 6. GVIP02 BitBlt processor

Fig. 7. Frame buffer timing chart (RAS: row address; CAS: column address; MOE: memory output enable; WR: write)

Polygon fill processor

The polygon fill processor consists of two modules with a pipeline architecture: the outline tracking processor inside GVIP01 and the span processor inside GVIP02. Four sets of DDAs (six DDAs per set) in the span processor concurrently interpolate the cross-points (intersections between a polygon outline and a reference horizontal axis) output by the GVIP01 chip; this parallel interpolation process produces pixel coordinates and attributes simultaneously along four horizontal axes at a rate of 20 ns/pixel. Texture mapping is also performed in a parallel processing fashion by using four sets of texture mapping circuits and four sets of pattern RAMs. As shown in Fig. 5, the pattern RAMs are located outside the GVIP02 chip and their data is sent directly to the GVIP02 chip in 4-bit slices.

The linear interpolation method for computing the intensity of internal surface areas is used in Gouraud shading. Phong shading is also needed to calculate the specular and diffused reflection factors based on surface slopes and the attributes of light sources. A completely processed pixel is obtained by multiplying its surface attributes (surface color or pattern) with its reflection coefficients and then adding to it the environment light intensity. The conventional method [9] to calculate reflection data is to normalize the xy direction along the z component of the surface and apply it as the address of a special table RAM. Thus, the RAM table size has to be at least 32 Kbytes; this is too large for fast RAM table re-writes. In the GVIP, the surface slopes are pre-defined using horizontal and vertical components relative to the eye point axis. The specular reflection and diffusion rates vs surface slope and light source attributes, which are based on Phong light model equations, are prestored in the reflection coefficient RAM table; this table holds horizontal and vertical reflection components separately. The average of the vertical and horizontal factors is computed by the reflection coefficient circuit shown in Fig. 5 without using a divider, and the size of the RAM table is only 512 bytes; thus in the GVIP, ASICs may be readily used to build the necessary circuits since the size of the RAM table is small. This configuration can provide other shading effects if the update variable for the light model equation in the RAM can be defined by the interpolation of the interior polygon. By using the texture mapped pattern as the surface attributes and combining with it the reflection data, the texture-mapped and Phong-shaded pixel may be produced at a rate of 20 ns/pixel. When drawing polylines with line smoothing, the RAM is used as an intensity modulation table based on the line slope (distance method), instead of as a reflection coefficient table.
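The table-based shading described above can be sketched in Python as follows. This is only an illustration of the idea of two small separable tables whose outputs are averaged with a shift instead of a divider. The split into two 256-entry byte tables (which happens to total 512 bytes), the placeholder lobe used to fill them, the 8-bit fixed-point scaling, and all names (`build_table`, `reflection_coefficient`, `shade`) are assumptions, not details taken from the GVIP design.

```python
import math

TABLE_SIZE = 256  # assumption: 2 tables x 256 one-byte entries = 512 bytes


def build_table(shininess=8):
    """Fill one table with a placeholder diffuse + specular lobe, indexed
    by a quantized slope running from -90 to +90 degrees."""
    table = []
    for index in range(TABLE_SIZE):
        angle = (index / (TABLE_SIZE - 1) - 0.5) * math.pi
        c = max(math.cos(angle), 0.0)
        table.append(min(255, round(255 * (0.7 * c + 0.3 * c ** shininess))))
    return bytes(table)


def reflection_coefficient(h_table, v_table, h_slope, v_slope):
    """Average the horizontal and vertical factors with a right shift,
    so no divider is needed."""
    return (h_table[h_slope] + v_table[v_slope]) >> 1


def shade(surface, coefficient, ambient):
    """pixel = surface attribute x reflection coefficient + ambient light,
    in 8-bit fixed point, clamped to 255."""
    return min(255, (surface * coefficient) // 255 + ambient)


H_TABLE = build_table()
V_TABLE = build_table()
coefficient = reflection_coefficient(H_TABLE, V_TABLE, 128, 128)
pixel = shade(surface=200, coefficient=coefficient, ambient=16)
```

Rewriting two tables of this size when a light source changes touches 512 entries, compared with the 32 Kbytes quoted for the conventional single table, which is the trade-off the text points to.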

Pixel cache and frame buffer configuration

The pixels produced by the GVIP system for vector drawing, polygon filling, or BitBlt are stored in the cache register before being stored in the frame buffer. The GVIP system has adopted the multi-dimensional pixel array access configuration for the bus architecture of the frame buffer. As has often been pointed out in the past, the frame buffer access time is the bottleneck [10] when high drawing speeds are needed. For example, if a one-dimensional pixel array is used to access the frame buffer, data cannot be retrieved fast enough unless the bandwidth is made very wide. In the GVIP system, a pixel cache and a FIFO are used between the processor and the frame buffer in order to minimize the bottleneck; the frame buffer access time for three-dimensional 32-pixel data is within 140 ns.

Well-known architectural approaches [11, 12, 13] exist for reading or writing a multi-dimensional pixel array within one access time to/from the frame buffer. To access a pixel array of size m × n or of size m × n × l, the number of memory modules and their signal lines can be made equal to the number of columns and rows respectively of the pixel array. There are two problems associated with this approach: (a) the difficulty in the design of a simple and reliable video refresh circuit which can operate above 100 MHz, and (b) figuring out a good balance between cache size and frame buffer size.

A multi-port video DRAM (VRAM) has normally been used for the frame buffer. It has two ports for parallel data access and one serial output port. The serial output port has a 256-bit parallel load shift register to output the video signal. For various reasons, in the multi-dimensional pixel array system, it is not possible to use this kind of VRAM to access a pixel block unit in one step unless a complex external multiplexer is used with it to arrange the fragmented pixel data into an appropriate sequential video data format. If the required multiplexer and shift register are built with discrete components, video signal rate requirements exceeding 100 MHz will be difficult to meet and there will be reliability problems as well.

In the GVIP system, single-port DRAMs instead of VRAMs are used. When a single-port DRAM is used for the frame buffer, pixel writing and video refreshing compete against each other because of the limited machine cycle time. In order to increase the ratio between pixel access cycle time and video refresh cycle time, the number of pixel access cycles within one machine cycle can be increased; however, with this approach the size of the video refresh shift register will have to be increased. The GVIP02 chip has a direct bus connection with the frame buffer and a built-in parallel-to-serial converter (a shift register and a multiplexer) which arranges pixel data appropriately. A 128 pixels × 24 bits video refresh shift register is used to convert parallel pixel data into a serial format. Using this shift register, the GVIP system is able to read/write pixel data at a rate of 224 pixels/1.28 µs when single-port 60-ns access time DRAMs are used. Hence, the GVIP02 chip is able to quickly access as well as store large amounts of pixel data using its own bus and can produce a continuous pixel data stream without using any external components.

Determining the appropriate size of the cache is another problem. For example, the use of an 8 pixel × 8 pixel cache (or smaller) for vector drawing with arbitrary slopes may cause too many wait states. A larger cache would reduce the number of wait states; however, simply increasing cache size is not a good solution because the probability of holding the correct pixel data in an m × m cache is less than or equal to 1/m; furthermore, there are physical limitations on the number of parallel bus connections to/from the cache.
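The timing figures quoted in this section fit together arithmetically, as the short Python check below shows. The constant names are chosen here for illustration; the values (140 ns per 32-pixel access, 224 pixels per 1.28 µs, a 128-pixel refresh shift register) are the ones given in the text, and the split into seven access cycles plus one refresh cycle matches the machine cycle described in the parallel link section.

```python
ACCESS_NS = 140              # one 32-pixel frame buffer access
PIXELS_PER_ACCESS = 32
MACHINE_CYCLE_NS = 1280      # 1.28 microseconds
PIXELS_PER_CYCLE = 224       # pixels read/written per machine cycle
SHIFT_REGISTER_PIXELS = 128  # pixels loaded for video refresh per cycle

# 224 pixels at 32 pixels per access gives the number of access cycles.
access_cycles = PIXELS_PER_CYCLE // PIXELS_PER_ACCESS

# Time spent on pixel accesses, and what is left for video refresh.
pixel_time_ns = access_cycles * ACCESS_NS
refresh_time_ns = MACHINE_CYCLE_NS - pixel_time_ns

# 128 refreshed pixels per 1.28 us corresponds to the video pixel rate.
video_rate_mhz = SHIFT_REGISTER_PIXELS * 1000 // MACHINE_CYCLE_NS

# Sustained pixel read/write rate in Mpixels/s.
write_rate_mpixels = PIXELS_PER_CYCLE * 1000 // MACHINE_CYCLE_NS

print(access_cycles, pixel_time_ns, refresh_time_ns,
      video_rate_mhz, write_rate_mpixels)
```

The 128-pixel shift register emptied once per 1.28 µs cycle is consistent with the 100 MHz video rate mentioned earlier, which is why a larger share of pixel access cycles per machine cycle would require a larger shift register.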
The pixel cache in the GVIP02 chip is composed of a cache register and a first-in first-out (FIFO) memory; the outputs of the cache register feed into the FIFO. A cache register consists of an 8 pixel x 8 pixel block unit and the FIFO memory is able to hold 32 of these pixel block units. The FIFO stores both the pixel data and their respective xy addresses. With this pixel cache configuration, no wait states occur until the FIFO overflows. Writes into the frame buffer are done from the FIFO. The FIFO is loaded when the 8 pixel x 8 pixel block unit overflows. The use of the FIFO is particularly effective when there is a set-up time for DDA operations and/or when the frame buffer access time is sufficiently short. Four sets of DDAs and four sets of pixel caches are used to process the four horizontal scan lines. One set of DDAs is composed of six DDAs for x, y, z, intensity, tx, and ty; tx and ty are used for Phong shading. In the rendering mode, 24 DDAs operate simultaneously and the lower bits are used as cache register addresses. The pixel cache receives pixel information from the texture mapping processor or the tiling pattern from the interior style RAM. The pixel cache has a bi-directional bus which connects it to the GVIP01 chip; using this bus, it is possible for the pixel cache to read a three-dimensional pixel array from the frame buffer in a manner similar to a pixel cache write operation to the frame buffer.
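The interaction between the cache register and the FIFO can be illustrated with a small behavioral model. The Python sketch below is a simplification under stated assumptions: an "overflow" of the block unit is modeled as a write that falls outside the current 8 x 8 block, and a wait state is counted whenever a block must be spilled into a full FIFO. The class and method names are hypothetical.

```python
from collections import deque

BLOCK = 8          # cache register holds one 8 x 8 pixel block unit
FIFO_DEPTH = 32    # FIFO holds 32 block units


class PixelCache:
    """Behavioral sketch of a cache register feeding a FIFO (not the real RTL)."""

    def __init__(self):
        self.origin = None    # block address (bx, by) held in the cache register
        self.pixels = {}      # (x % 8, y % 8) -> pixel value
        self.fifo = deque()   # entries: (block address, pixel dict)
        self.wait_states = 0

    def write(self, x, y, value):
        block = (x // BLOCK, y // BLOCK)
        if self.origin is not None and block != self.origin:
            self._spill()     # assumed meaning of "block unit overflows"
        self.origin = block
        self.pixels[(x % BLOCK, y % BLOCK)] = value

    def _spill(self):
        if len(self.fifo) == FIFO_DEPTH:
            self.wait_states += 1   # drawing stalls until one block is written
            self.drain(1)
        self.fifo.append((self.origin, self.pixels))
        self.pixels = {}

    def drain(self, n):
        """Model n frame buffer write accesses taken from the FIFO."""
        return [self.fifo.popleft() for _ in range(min(n, len(self.fifo)))]


if __name__ == "__main__":
    cache = PixelCache()
    for i in range(8 * 34):          # diagonal vector crossing 34 blocks
        cache.write(i, i, 0xFFFFFF)
    print(len(cache.fifo), cache.wait_states)
```

In this model the drawing side only stalls once 32 blocks are queued, which mirrors the statement that no wait states occur until the FIFO overflows.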

Boolean operation unit

The image data (pixel) stored in the cache are supplied to the boolean operation unit (BOU) for boolean operations with destination data stored in the frame buffer. The BOU can perform 16 different logical functions. A mask bit register is used to select the pixel(s) which will be used to carry out logic operations. Mask bit data may be generated by the system processor, the BitBlt processor in GVIP01, the span processor in GVIP02, and the z-component comparison processor in GVIP04.
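The 16 logical functions correspond to the 16 possible truth tables of two inputs (source and destination). The Python sketch below shows one way to express them, together with a per-pixel mask. The 4-bit function code encoding and the function names are assumptions chosen for illustration; the paper does not specify the BOU encoding.

```python
def boolean_op(code, src, dst, width=24):
    """Apply one of the 16 two-input logical functions bitwise.

    Assumed encoding: bit 0 of `code` enables (~src & ~dst), bit 1 (~src & dst),
    bit 2 (src & ~dst), bit 3 (src & dst).
    """
    full = (1 << width) - 1
    result = 0
    if code & 0b0001:
        result |= ~src & ~dst
    if code & 0b0010:
        result |= ~src & dst
    if code & 0b0100:
        result |= src & ~dst
    if code & 0b1000:
        result |= src & dst
    return result & full


def apply_bou(code, src_pixels, dst_pixels, mask_bits, width=24):
    """Combine source and destination pixels; masked-out pixels keep dst."""
    return [
        boolean_op(code, s, d, width) if m else d
        for s, d, m in zip(src_pixels, dst_pixels, mask_bits)
    ]


if __name__ == "__main__":
    XOR = 0b0110
    print([hex(p) for p in apply_bou(XOR, [0xFF, 0xFF], [0xCC, 0xCC], [1, 0], 8)])
```

With this encoding, code 0b1100 copies the source, 0b1000 is AND, 0b1110 is OR, and 0b0110 is XOR.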

Parallel link

The bi-directional parallel link in the GVIP02 chip has a 400 Mbytes/s bandwidth and 128-pixel (32-bit) triple buffers for the input and 64-pixel double buffers for the output. The data in these buffers may be accessed by the frame buffer at a rate of 280 ns/pixel for the input buffers and 140 ns/pixel for the output buffer. The GVIP machine cycle is 1.28 µs, which allows seven pixel read/write cycles (980 ns) and one video refresh cycle (300 ns). These timings are shown in Fig. 7. The top timing chart in Fig. 7 shows the random access cycles (RACs) for the cache and the video refresh cycle (VRC) within one machine cycle. During one RAC, 8 x 4 (m x n) pixels in the cache are transferred to/from the frame buffer within 140 ns. During one VRC, 128 pixels are read within 300 ns from the frame buffer and then loaded into the shift register of the GVIP02 chip. The middle timing chart in Fig. 7 shows the page mode RAC timing; this type of RAC is executed to store the appropriate number of block (m x n) data into the FIFO. The number of pages can be either four or eight depending on the amount of data and the group address stored in the FIFO. The bottom timing chart in Fig. 7 shows the system timing during the HSR mode; in this mode, the number of RACs is decreased to five in order to allow write cycles (200 ns) for the z-component and pixel values. When separate buffers are used for storing the image and z-component values, six RACs are needed. A separate frame buffer was considered in order to improve the HSR performance by 15%; however, its cost did not justify the minor improvement in performance. The parallel architecture described in the following sections is believed to be a better solution than a system which uses a separate z buffer.

In case the parallel link receives a 100-MHz data stream, the system requires one write access request (280 ns) to the frame buffer within one machine cycle; this still leaves five cycles for pixel accesses. This means that it is possible to achieve an 8 ns/pixel access rate and still maintain real-time display of HDTV images. The GVIP system can also achieve 10 ns/pixel access rates when both an HDTV image and full-color laser printer output (up to 25 MHz) have to be produced simultaneously.

The copy and move operations of the BitBlt processor, which involve reading/writing of the frame buffer, may be performed together with the resizing and/or boolean functions.
To do this type of operation, two cache registers are used: one for the source register and the other for the destination register. When either the destination register becomes empty or the source register overflows, a frame buffer access occurs. The GVIP02 chip has direct control over these kinds of operations; neither the system processor nor the GVIP01 chip is involved. For the copy or move operations, a pixel is transmitted within 40 ns. A 1 K pixels x 1 K pixels video screen area can be copied within 0.025 s.
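The machine-cycle budget quoted in this section can be checked with a few lines of arithmetic. The Python sketch below uses only the figures stated in the text (a 1.28 µs machine cycle, 140 ns per RAC, a 300 ns VRC, and 8 x 4 pixels per RAC); the function names are illustrative. The four-RAC case is an inference that happens to reproduce the quoted 10 ns/pixel rate and is not stated explicitly in the text.

```python
MACHINE_CYCLE_NS = 1280   # 1.28 microsecond machine cycle
RAC_NS = 140              # one random access cycle
VRC_NS = 300              # one video refresh cycle
PIXELS_PER_RAC = 8 * 4    # m x n pixels moved per RAC


def racs_available(reserved_ns=0):
    """Number of RACs that fit after the VRC and any reserved access time."""
    return (MACHINE_CYCLE_NS - VRC_NS - reserved_ns) // RAC_NS


def pixels_per_machine_cycle(racs):
    return racs * PIXELS_PER_RAC


def ns_per_pixel(racs):
    """Effective pixel access time averaged over a whole machine cycle."""
    return MACHINE_CYCLE_NS / pixels_per_machine_cycle(racs)


if __name__ == "__main__":
    print(racs_available())       # normal mode
    print(racs_available(280))    # one 280 ns parallel-link write reserved
    print(ns_per_pixel(5))
    print(ns_per_pixel(4))        # assumed four-RAC case
```

Seven RACs of 32 pixels give the 224 pixels per 1.28 µs quoted for the frame buffer, and reserving one 280 ns parallel-link access leaves five RACs, that is, 160 pixels per machine cycle.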

Fig. 8. GVIP03 video controller (block labels: true color video bus, true color video pipeline register, video input register, overlay, user-specified cursor RAM controller, DAC bus)

The z cache

For hidden surface removal, a z-axis (depth) cache is commonly used. However, there are problems with its use, especially for very high speed parallel comparisons of multi-dimensional pixel array configurations. The image pixel and its z coordinate are usually stored in the same frame buffer. In order to read a block of pixel data in one step, a special two-dimensional bus is necessary. In GVIP, one GVIP02 chip is connected to all the frame buffer memory modules using a 4-bit-slice bus; this method does not allow z coordinate value comparisons to take place within 40 ns/32 pixels because of the propagation delay of the serially cascaded logic modules. To overcome this problem, GVIP uses the GVIP04 chip, which performs very high speed z coordinate comparisons. When the GVIP04 chip is used, a z cache becomes unnecessary. Details of the GVIP04 chip will be described in a forthcoming paper.

GVIP03 video controller

As shown in Fig. 8, the GVIP03 chip is directly connected to the GVIP02 chip via the video signal bus and performs cursor generation, plane masking, and plane overlaying. The video signal (100 MHz or higher) sent by the GVIP02 chip is split into two out-of-phase signals, each having half the original frequency, so that TTL device power supply levels may be used for the logic circuits. The output signal is fed into a DAC. The GVIP03 chip generates six different types of cursors: four are user specified (32 x 32 x 2 bits) and two are hardware generated. These cursors can be displayed simultaneously on the screen. Priority control for the planes is achieved by using a look-up table. This chip is not used in the parallel system, which is described in a subsequent section of this paper.

Minimum GVIP system configuration

The minimum GVIP system configuration is shown in Fig. 2 and its performance is shown in Table 1. The performance values shown in Table 1 are actual performance figures which were obtained when 60-ns access time DRAMs and a 55-MHz clock for the GVIP01 chip were used. Six GVIP02 chips are used for true color rendition. This system can perform graphics processing functions ranging from reception of polygon vertices to CRT monitor video signal output generation. The minimum RAM sizes required for the data and program memories of the RISC processor within the GVIP01 chip are 16 K words (32 bits/word) and 8 K words, respectively. The size of the static RAM needed for the user-specifiable cursors is 8 Kbytes.

The frame buffer is accessed by both GVIP01 and GVIP02. There are 64 K word x 64 K word addresses within the frame buffer. The GVIP02 chip is a building block unit for color tint generation. One GVIP02 chip can produce 16 different colors. Six GVIP02 chips can produce 16.7 million different colors. System performance is not dependent on the number of colors displayed or on pixel bit length. System performance data for vector drawing and polygon rendering, using as parameters the vector length or the number of pixels per polygon and the frame buffer bandwidth, are shown in Figs. 9 and 10, respectively.

Vector drawing speed as a function of vector length, vector slope, and number of pixel write cycles is shown in Fig. 9; in this figure, the RAC number refers to the number of pixel write cycles within a machine cycle. In the GVIP system, the RAC number is set at seven and the vector drawing speed ranges between 2.6 and 5.1 million vectors/s when the length of the vector is less than or equal to 8 pixels. Vector drawing performance depends on the slope of the vector being drawn and hence drawing speed is given as a range of upper and lower speed values. If an 8 x 4 cache is used, a modulo 8-pixel-length vector would result

Table 1. Architecture and performance of GVIP

Item                      Architecture/performance
Processor                 32-bit RISC processor; graphics-specialized hardware modules
Parallelism               MIMD
Topology                  Pipeline, parallel
Primitives                Polyline, polygon, pixel, spline, character, video
Frame buffer              Multi-dimensional pixel array access
HSR                       A z buffer (up to 32-bit)
Shading                   Flat, Gouraud, Phong, others (by RISC firmware)
Texture mapping           Interpolation method
Anti-aliasing             Distance method
Image processing          Firmware and specified hardware
Font                      Dot matrix and outline font (including spline)
Vector 2D/3D              3 million (10 pixels)/s
Polygon 3D (100 pixels)   1.6 million/s (without HSR); 1.2 million/s (HSR, Phong, texture mapping)
BitBlt                    400 Mbytes/s
Character                 5000 JIS (24 dots)
Video                     HDTV, NTSC, others (programmable)
Copy/move                 0.025 s/1 K x 1 K
Others                    Synthesis of CG and camera image

in the highest cache utilization, since no data in the cache are superfluous. The 10-pixel-length vector drawing speed may be estimated by looking at the 16-pixel-length vector data points in Fig. 9; GVIP is capable of drawing between 1.1 and 2.7 million 10-pixel-length vectors per second. As can be seen from the graphs in Fig. 9, if vectors are long (longer than 24 pixels), the drawing speed remains nearly constant. For a long vector, drawing speed may be estimated by simply reading off the 10-pixel graph points. For example, for the GVIP system, a long vector can be drawn at a rate of 2-4.5 million pixels/s by looking at the 10-pixel/vector data points.

The polygon drawing speed when an 8 x 4 cache and four-line interpolation are used is shown in Fig. 10. A drawing speed of 1.2 million polygons/s can be achieved assuming RAC = 5, 100 pixels/polygon, and HSR are used. This performance is not affected by texture mapping, Phong shading, or polygon shape.

Fig. 9. Performance of vector drawing (horizontal axis: vector length, 16 to 64 pixels; minimum and maximum curves for RAC numbers of 5, 7, and 9)

Fig. 10. Performance of polygon rendering (horizontal axis: pixels/polygon, 64 to 1024; 8 x 4 cache; curves for RAC numbers of 5, 7, and 9)

The results shown in Figs. 9 and 10 were obtained under the following conditions: (1) use of MB814260-60 DRAMs for the frame buffer; (2) 50-MHz data clock; (3) 100-MHz video clock; (4) true color rendering; (5) 24-bit z-buffer with common frame buffer configuration; (6) Phong-shaded triangle polygons; and (7) use of a DMA controller to load primitives onto GVIP01. The performance data take into account all the processing steps (from input handling to frame buffer access) necessary for vector/polygon drawing. A direct connection exists between the GVIP02 chip and the frame buffer. An external shift register is not needed for video refresh operations. Parallel link channels are connected to 32-bit buses and are controlled by signals such as the clock and the horizontal/vertical synchronization inputs. GVIP01, GVIP02, and GVIP03 chips fabricated by Toshiba are shown in Fig. 11. As shown in Fig. 12, the GVIP system chips are mounted on the mother VME board and the true color frame buffer chips are mounted on the daughter VME


Fig. 11. GVIP LSI (from left: GVIP01, GVIP02, and GVIP03)

Fig. 12. Graphics PCB module

Fig. 13. Massive parallel graphics processor cabinet

board. GVIP chips are put in quad flat packages (QFPs). This VME board has two GVIP01 chips for a dual sub-system configuration and can produce on average 3 million vectors/s and 1.2 million Phong-shaded polygons with texture mapping per second.

A multiple-instruction multiple-data (MIMD) GVIP system cabinet is shown in Fig. 13; this 10-sub-system parallel processing configuration is able to produce 10 million polygons/s. Sixteen minimum GVIP system configurations, shown in Fig. 2, may be placed inside this cabinet. A graphics accelerator built with a large number of highly interconnected processing elements is used with a MIMD GVIP system. A large-scale MIMD GVIP system can process HDTV and geometric images in real time and hence is ideal for virtual reality applications. A commercial large-scale MIMD GVIP system is expected to be available during the second quarter of 1994.
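The color counts quoted for the minimum configuration follow from treating each GVIP02 chip as a 4-bit slice of the pixel. The short Python sketch below makes that arithmetic explicit. The 4-bit width per chip is inferred from the "16 different colors" figure and the 4-bit-slice bus mentioned earlier; the function name is illustrative.

```python
BITS_PER_GVIP02 = 4   # assumption: one chip handles a 4-bit slice (16 colors)


def displayable_colors(n_chips):
    """Number of distinct colors available with n_chips GVIP02 slices."""
    return 2 ** (BITS_PER_GVIP02 * n_chips)


if __name__ == "__main__":
    for n in (1, 2, 6):
        print(n, displayable_colors(n))
```

Six slices give 24 bits per pixel, which is the 16.7 million colors stated for true color rendition.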


Fig. 14. Parallel architecture of GVIP (block labels: GVIP sub-systems No. 1 to No. 4, GVIP05 z comparator and multiplexer, to CRT)

Fig. 15. Multi-media BitBlt connection (block labels: HDTV, frame buffers, GVIP02; LGB: local graphics bus; PLNK: parallel link)

Parallel architecture

The system architecture of a minimum GVIP system configuration is shown in Fig. 2. For high graphics processing performance, several minimum GVIP system configurations may be interconnected using the system bus shown in Fig. 14. Each minimum GVIP system configuration can be considered to be a sub-system of a large-scale GVIP system. Each sub-system works on its own image data (image segment) and stores data in its own frame buffer. If image data dependencies do not exist, frame buffer memory sharing and message passing among sub-systems are unnecessary; in such cases, the system can be readily scaled up by interconnecting sub-systems. If the bandwidth of the system bus is sufficiently wide, a system performance nearly proportional to the number of interconnected sub-systems should be achievable.

It is actually quite rare to find instances where image data dependencies must be processed during the device coordinate stages; thus, although a large-scale GVIP system may suffer from uneven workloads in the various sub-systems due to differences in the modeling of shapes or in the handling of attributes, its big advantage is that the system does not need rules for loading primitives if the HSR and attribute processing are handled independently in each sub-system.

A GVIP system with four sub-systems connected by the system bus is shown in Fig. 14. Each sub-system outputs the image (pi) and the z (zi) coordinate values 128 pixels at a time in 1.28 µs, which matches the video output cycle time. The continuous pixel data streams from the sub-systems are fed into a series of pixel multiplexers which select the pixel to be displayed based on the z coordinate value. For a GVIP system composed of n sub-systems, n - 1 pixel multiplexers are needed. The selected pixel is sent to the parallel link port of the GVIP02 chip in one of the sub-systems. Since the frame buffers for the GVIP system can be built using single-port DRAMs instead of VRAMs, their cost is approximately two-thirds that of VRAM-based frame buffers, because there are a multitude of vendors of commercial single-port DRAMs and DRAM sizes range from 256 K bits to 64 M bits.

The outputs of each sub-system are strobed synchronously in order to fetch pixel data at the appropriate times. However, no synchronization for data processing exists among the sub-systems because the video refresh output and the pixel cache operations are independent of each other; in fact, different clocks may be used for the sub-systems.
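The z-based selection among sub-system outputs can be sketched as a chain of two-input multiplexers. The Python sketch below is a functional model only: it assumes that a smaller z value means nearer to the viewer and that ties go to the earlier sub-system, neither of which is specified in the text. The function names are hypothetical.

```python
from functools import reduce


def z_mux(a, b):
    """Two-input pixel multiplexer: a and b are (pixel, z) pairs.

    Assumption: the smaller z is nearer; ties keep the first input.
    """
    return a if a[1] <= b[1] else b


def composite(streams):
    """Merge n synchronized (pixel, z) streams into one pixel stream.

    reduce() applies z_mux n - 1 times per pixel position, matching the
    n - 1 multiplexers needed for n sub-systems.
    """
    return [reduce(z_mux, candidates)[0] for candidates in zip(*streams)]


if __name__ == "__main__":
    sub1 = [("a", 5), ("a", 1)]
    sub2 = [("b", 3), ("b", 2)]
    sub3 = [("c", 4), ("c", 0)]
    print(composite([sub1, sub2, sub3]))
```

Because each multiplexer only needs the two (pixel, z) pairs arriving in the same strobe period, the sub-systems do not have to synchronize their rendering work, only their video output.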

Interconnection among sub-systems

A single minimum GVIP system configuration (a sub-system) could be used to process data from various sources by using appropriate software code with the RISC processor in the GVIP01 chip. However, if several data sources have to be handled at the same time, the performance of a single sub-system would not be adequate. A distributed parallel processing architecture with sub-systems as PEs could be used in such cases. In a multi-media system, several image sources and destinations may be involved; for example, an HDTV camera, a facsimile, and a scanner may be image sources and laser printers and/or TV monitors may be destinations. Interconnections among minimum GVIP system configurations are needed to display and lay out multiple images simultaneously in real time. As shown in Fig. 15, a parallel link connects all sub-systems and devices together; this allows a sub-system to fetch image data from the frame buffer of another sub-system or to send its own frame buffer data to other sub-systems. A uniform data format is used for the image data stored in all the sub-system frame buffers; in this way, data gathered from several sub-systems can be processed in a uniform manner. The system processor determines the source and destination sub-systems for communication between/among sub-systems. The details of these types of communications are handled by the BitBlt processors in the sub-systems.

Conclusions

Since the GVIP system is a complex and large hardware graphics processor, only an overview has been given in this paper. Details of its architecture and its graphics algorithms will be presented in a forthcoming paper. It is no longer enough to come up with optimized designs for one or two specific graphics processing functions; in a multi-media environment, solutions which can efficiently handle multiple inputs and functions simultaneously are needed. The GVIP system has been designed to operate in exactly such an environment with extremely high performance.

The current crop of multi-media personal computers or workstations attempts to adapt existing computer platforms to graphics processing tasks by using additional hardware and specialized software packages. However, their performance is still not satisfactory. The goal of managing multiple functions in a systematic and uniform manner in hardware has not been attained yet.

The next generation GVIP system will incorporate specialized functions such as ray-trace rendering and synchronized sound and image generation. These functions are now being put in a graphics accelerator chip for multi-media and virtual reality applications.

Acknowledgements. The author wishes to thank K. Takahashi, G. Ishii, and K. Tateyama of WIN Corporation for carrying out ASIC simulations. The assistance of W. Chu of the University of Aizu in the performance evaluation of the cache is also acknowledged.

References

1. Ishihata Ineno, Hovi, et al (1990) Architecture of CAPII and memory system. Technical report. Comput Architect IPSJ 83-87:83-97

2. Sasaki S (1993) 3-Dimensional graphics accelerator SUBARU. NIKKEI ELECTRONICS 578:148-151

3. Fuchs H (1988) An introduction to pixel-plane and VLSI-intensive graphics systems. In: Theoretical foundation of computer graphics and CAD. Springer, Berlin Heidelberg New York, pp 675-688

4. Jayasinghe JAKS, Herrmann OE (1991) Two level pipeline of systolic array graphics engines. In: Advances in computer graphics hardware IV. Springer, Berlin Heidelberg New York, pp 133-148

5. Akeley K (1989) The silicon graphics 4D/240GTX Superworkstation. CG&A 9:71-83

6. Ikedo T (1977) Patent S53-110331 (Japan)

7. Ikedo T (1984) High speed techniques for a 3D color graphics terminal. IEEE CG&A 4(5):46-58

8. Carinalli C, et al (1989) National's advanced graphics chips set for high-performance graphics. IEEE CG&A 6(10):40-48

9. Jackel D (1991) A real-time raster scan display for 3D graphics. Advances in Hardware IV. Springer, Berlin Heidelberg New York, pp 213-227

10. Poulton J, et al (1992) Breaking the frame buffer bottleneck with logic-enhanced memories. IEEE CG&A 12(6):65-74

11. Sproul F, Sutherland IE, Thompson A, Gupta S, Minter C (1983) The 8 by 8 display. ACM Trans Graphics 2:32-56

12. Ikedo T (1983) Patent S58-24419 (Japan)

13. Goris A, et al (1987) A configurable pixel cache for fast image generation. IEEE CG&A 7(3):24-32

TSUNEO IKEDO is a Professor in the Computer Architecture Laboratory, Department of Computer Hardware, University of Aizu, Japan. He has done research on: (1) a 50-GFlops massively parallel processing system with a pyramid internetwork; (2) a one-chip graphics processor capable of a million polygons/s; and (3) a synthesis algorithm for geometric and TV images in 3-D space for virtual reality. He is now a project leader for the above research with companies doing joint research in an academic and industrial complex. He is interested in doing research on the processor architecture responsible for the various data types of a multi-media system.
