14
Comput. & Graphics, Vol. 21, No. 2, pp. 129-142, 1997 0 1996 ElsevierScience Ltd. All rights reserved Printed in Great Britain PII: soo97-8493(9t5)fMo76-3 009778493/97s17.oo+o.oo Graphics Hardware THE TAYRA 3-D GRAPHICS RASTER PROCESSOR MARTIN WHITE+, MIKE BASSETT, DAIRSIE LATIMER, SHAUN MCCANN, ALEX MAKRIS, MARCUS WALLER, GRAHAM DUNNETT, JOACHIM BINDER and PAUL LISTER University of Sussex, School of Engineering, Centre for VLSI and Computer Graphics, Falmer, Brighton, BNl 9QT, UK e-mail: [email protected] Abstract-This paper describes the architecture of a 3-D Graphics Raster Processor called TAYRA. TAYRA consists of a Graphics Raster Pipeline with five major external interfaces: PC1 Master/Target, Depth, Texture, CoIour and Video Interfaces. The Graphics Raster Pipeline performs all the major OpenGL style raster functions: scan conversion of lines, spans, triangles and rectangles, perspective correction of texture co-ordinates, mip-map level of detail selection, and many other texture modes, alpha blending, and other functionahties. Further, through TAYRA’s fast PC1 to buffer access mechanisms it can do advanced stencilling, multi-pass antiahasing, and other algorithms; all accelerated in hardware with a sustained pixel write speed of 29 Mpixels/s (peak of 33 Mpixels/s). This translates to an estimated peak performance of 890 K/triangles/s for 25 pixel triangles. 0 1997 Elsevier Science Ltd 1. TAYRA FEATURES 3. INTERFACE SPECIFICATION TAYRA is a 3-D Graphics Raster Processor aimed at the high end of 3-D rastergraphics. It is the latest in a generation of 2-D and 3-D graphics processors developed on European funded projects involving several European companies and universities: IM- AGE [1], a 3-D Gouraud shadingand antialiasing ASIC; STEP [2], a 3-D texture mappingASIC; and MEDIA [3], a 2-D graphics and video controller ASIC. TAYRA provides high quality and speed through its extensive functionalities and interfaces, and also providesa low system component count due to its high level of integration. TAYRA’s compre- hensive selection of features are summarizedin Table 1. TAYRA employs five major interfaces to the system environment: PC1 Master/Target, Depth Buffer, Texture Buffer, Colour Buffer, Video Inter- face. We have implemented a 32 bit Master/Target PC1 Local Bus Revision 2.1 Interface [5]. The memory clock frequency is always twice that of the graphics pipeline, which helps to remove the tradi- tional memory access bottleneck. Both the graphics pipeline and the memory systems are clocked independently from the PC1bus. 2. PERFORMANCE ESTIMATES Table 2 gives the estimated performance figures for TAYRA. The performance figures above are basedon a 33 MHz graphicspipeline clock and 66 MHz mem- ory controller clocks.However,we are still PC1input bandwidth limited to a peak 890Ktriangles/s. A future solution to this bottleneck could be the Accelerated Graphics Port (AGP) architecture [4]. AGP is an extensionto the PC1 architecturewhich adds a demultiplexed address bus,pipelined transfers and 133MHz transfer rates to boost graphics performance. Another solution is to integrate a setup engine and vertex level interface to allow triangle strips to be processed. This effectively decreases primitive vertex data crossing the PC1 from three vertices per triangle to about one. We have alsochosen to implement three memory interfaces: texture, depth and colour, giving complete independence between the access protocol used by each and enabling a dedicated and optimized memory systemto be configured for each system’s particular requirements. It alsomeans that the latest memory technology can be employed.For example, the advanced block write feature of SGRAM is useful for buffer clear operations in the depth memory and the advancedBitBLT capabilitiesand block write functions of the new Window RAM (WRAM) can be employed to bestadvantagein the colour buffer. The dual port capability of the WRAM can also be used to optimize the display of the colour data. + Author for correspondence. These memory interfaces have beendesigned in a genericand programmable way, which allows us to support SGRAM, SDRAM, VRAM, DRAM and WRAM, for accessing the depth, texture and colour buffers. These memory controllers all arbitrate between the host (via TAYRA’s PC1 interface through the on-chip communication FIFOs) and the graphics pipeline.Also, a global memory refresh controller which, discounting any memory refresh accesses, refreshes every row every 17ms. 129

The TAYRA 3-D Graphics Raster Processor

Embed Size (px)

Citation preview

Page 1: The TAYRA 3-D Graphics Raster Processor

Comput. & Graphics, Vol. 21, No. 2, pp. 129-142, 1997 0 1996 Elsevier Science Ltd. All rights reserved

Printed in Great Britain

PII: soo97-8493(9t5)fMo76-3

009778493/97 s17.oo+o.oo

Graphics Hardware

THE TAYRA 3-D GRAPHICS RASTER PROCESSOR

MARTIN WHITE+, MIKE BASSETT, DAIRSIE LATIMER, SHAUN MCCANN, ALEX MAKRIS, MARCUS WALLER, GRAHAM DUNNETT, JOACHIM BINDER and PAUL LISTER

University of Sussex, School of Engineering, Centre for VLSI and Computer Graphics, Falmer, Brighton, BNl 9QT, UK

e-mail: [email protected]

Abstract-This paper describes the architecture of a 3-D Graphics Raster Processor called TAYRA. TAYRA consists of a Graphics Raster Pipeline with five major external interfaces: PC1 Master/Target, Depth, Texture, CoIour and Video Interfaces. The Graphics Raster Pipeline performs all the major OpenGL style raster functions: scan conversion of lines, spans, triangles and rectangles, perspective correction of texture co-ordinates, mip-map level of detail selection, and many other texture modes, alpha blending, and other functionahties. Further, through TAYRA’s fast PC1 to buffer access mechanisms it can do advanced stencilling, multi-pass antiahasing, and other algorithms; all accelerated in hardware with a sustained pixel write speed of 29 Mpixels/s (peak of 33 Mpixels/s). This translates to an estimated peak performance of 890 K/triangles/s for 25 pixel triangles. 0 1997 Elsevier Science Ltd

1. TAYRA FEATURES 3. INTERFACE SPECIFICATION

TAYRA is a 3-D Graphics Raster Processor aimed at the high end of 3-D raster graphics. It is the latest in a generation of 2-D and 3-D graphics processors developed on European funded projects involving several European companies and universities: IM- AGE [1], a 3-D Gouraud shading and antialiasing ASIC; STEP [2], a 3-D texture mapping ASIC; and MEDIA [3], a 2-D graphics and video controller ASIC. TAYRA provides high quality and speed through its extensive functionalities and interfaces, and also provides a low system component count due to its high level of integration. TAYRA’s compre- hensive selection of features are summarized in Table 1.

TAYRA employs five major interfaces to the system environment: PC1 Master/Target, Depth Buffer, Texture Buffer, Colour Buffer, Video Inter- face. We have implemented a 32 bit Master/Target PC1 Local Bus Revision 2.1 Interface [5]. The memory clock frequency is always twice that of the graphics pipeline, which helps to remove the tradi- tional memory access bottleneck. Both the graphics pipeline and the memory systems are clocked independently from the PC1 bus.

2. PERFORMANCE ESTIMATES

Table 2 gives the estimated performance figures for TAYRA.

The performance figures above are based on a 33 MHz graphics pipeline clock and 66 MHz mem- ory controller clocks. However, we are still PC1 input bandwidth limited to a peak 890 Ktriangles/s. A future solution to this bottleneck could be the Accelerated Graphics Port (AGP) architecture [4]. AGP is an extension to the PC1 architecture which adds a demultiplexed address bus, pipelined transfers and 133 MHz transfer rates to boost graphics performance. Another solution is to integrate a setup engine and vertex level interface to allow triangle strips to be processed. This effectively decreases primitive vertex data crossing the PC1 from three vertices per triangle to about one.

We have also chosen to implement three memory interfaces: texture, depth and colour, giving complete independence between the access protocol used by each and enabling a dedicated and optimized memory system to be configured for each system’s particular requirements. It also means that the latest memory technology can be employed. For example, the advanced block write feature of SGRAM is useful for buffer clear operations in the depth memory and the advanced BitBLT capabilities and block write functions of the new Window RAM (WRAM) can be employed to best advantage in the colour buffer. The dual port capability of the WRAM can also be used to optimize the display of the colour data.

+ Author for correspondence.

These memory interfaces have been designed in a generic and programmable way, which allows us to support SGRAM, SDRAM, VRAM, DRAM and WRAM, for accessing the depth, texture and colour buffers. These memory controllers all arbitrate between the host (via TAYRA’s PC1 interface through the on-chip communication FIFOs) and the graphics pipeline. Also, a global memory refresh controller which, discounting any memory refresh accesses, refreshes every row every 17 ms.

129

Page 2: The TAYRA 3-D Graphics Raster Processor

130 M. White et al.

Table 1. TAYRA features

Features

32 bit PC1 Local Bus Interface High Performance Communication FIFOs Scan Conversion of Lines, Spans, Triangles and

Rectangles Integratzd Depth Test and Buffer Controller OpenGL Fogging Fast Context Switching Program Register Set Colour Buffer Controller True Colour Double Buffer (Up to 16 bit, 1600x1200) Display Resolution (320x240 to 1600x 1200) Depth Buffer Controller True Floating Point Perspective Correction Hardware Mip-Mapped Level of Detail 32 bit RGBA Texture Mip-Mapping Square, Linear, Point2D M&Map Support

Wrap (W), Invert (In), Ignore (Ig), Clamp (C) Texture Tiling Texture Colour Map Format Modes OpenGL Texture Decal and Modulate 32 bit RGBA Texture Point, Linear, Bilinear, Trilinear

Filtering OpenGL Blending (Transparency, Antialiasing) Multi-pass Operations, e.g. Antialiasing, OpenGL Stencilling Clipping to regions Host to Depth, Texture, Colour Buffer Access Fast 8.5 Gbyte/s Depth Buffer Clear Operation Fast 8.5 Gbyte/s Colour Buffer Clear Operation Depth and Texture Buffer Support (SGRAM or SDRAM) Colour Buffer support (DRAM, VRAM or WRAM) Global Depth, Texture, Colour Refresh Controller

Table 2. TAYRA peak performance comparison

Performance Equivalents

Pixel writes 25 pixels/triangle 50 pixels/triangle

TAYRA* 3-D triangles 32 bit RGBA Gouraud shaded, 32 bit depth buffered, clipped, stencilled, alpha blended, mip-mapped, texture mapped

890 K/s 515 K/s

Pixel writes

TAYRAt lines 32 bit RGBA Gouraud shaded, 24 bit depth buffered, antialiased TAYRA$ block moves 32 bit RGBA and 32 bit de&i

10 pixels/line

1.6 M/s

132 Mbytes/s

*T&c performance figures are based on TAYRA’s PC1 bandwidth and Z-Buffer memory controiler. For example, if TAYRA’s pipeline is clocked at 33 MHz (66 MHz memory controller) then for a triangle whose pixels exhibit no page faults TAYRA can output 33 Mpixels/s. However, TAYRA has to buffer 12% of a triangles pixels due to Z-Bit&r page faults. TAYRA, therefore outputs 29 Mpixels/s. Thus, TAYRA can stream 29 Mpixels/st25 pixels/triangle= 1.16 Mtrian- gles/s past the Z-Buffer controller. But an average textured and i!hnninated triangle at the PC1 input consists of 148 bytes of data loaded by the host. Thus, with a peak PC1 bandwidth of 132 Mbytes/s we can see that for 25 pixel triaugles TAYRA is limited by the PC1 bandwidth to 132 Mbytes/s+148 bytes/triangle= 890 Ktrian&s/s. Similarly, 50 pixel triangles are limited by the pipeline bandwidth.

tLimited by the peak PC1 bandwidth of 132 Mbytes/s. A line on average is composed of about 84 bytes of data. Therefore, 132 Mbytes/s +84 bytes = 1.6 Mlines/s.

$TAYRA’s block move drawing performance is based on the PC1 input bandwidth of 132 Mbytes/s and TAYRA’s WRAM colour memory controller output bandwidth of 132 Mbytes/s. The datapath through TAYRA can also sustain 132 Mbytes/s.

4. ARCHITRCTURAL OVERVIEW

The objectives for TAYRA were to design a high performance 3-D Graphics Raster Processor which implements all the common 3-D graphics raster functionalities. Further, TAYRA should interface to the latest memory technologies in order to achieve unrivalled performance. Ail this is to be integrated in a single VLSI package interfaced to a PC1 based host. An overview of the TAYRA is illustrated in Fig. 1.

4.1. Scan conversion module The scan conversion module exhibits the following

functionality: Fig. 1. TAYRA architecture overview.

Page 3: The TAYRA 3-D Graphics Raster Processor

TAYRA 3-D Graphics Raster Processor 131

Address Register

Colour, ’ Depth ) Depth Depth

Parameter Test Test ~terpoladon Request Return

t 2 Test in Depth Memory Controller 1

Fig. 2. Scan conversion module.

(I) Primitive traversal using linear edge functions [6-

81 (2) Incremental interpolation of all parameters (3) Antialiasing using coverage masks [9] (4) Floating point texture interpolation (5) Floating point perspective correction (6) Integer colour and depth interpolation (7) Depth test integrated with depth memory con-

troller (8) Texture mip-map level of detail calculation

The organization of these functionalities within the scan conversion module [lo] is illustrated in Fig. 2. It

is beyond the scope of this paper to describe in detail all the functional implementations of the scan conversion module. Instead we select several which may prove interesting.

4.1.1. Primitive traversal module. An abstract diagram of the primitive traversal module architec- ture is shown in Fig. 3. Note that the data inputs to the interpolators and the b, El, E2 edge function interpolator outputs are not shown to maintain clarity.

4.1.1.1. Interface description. The primitive identity code format is shown in Fig. 4. This code not only defines the primitive type but also provides

P 1 “ev im sea

Fig. 3. TAYRA scan conversion module architecture.

Page 4: The TAYRA 3-D Graphics Raster Processor

132 M. White et al

bit 6 5 4 3 2 I 0

mpl swap reverse mvefse purpose EWZ x,y y X

ID, ID, ID,

Fig. 4. Primitive ID code.

the information required to map the primitive from the default octant to another.

The three identity bits, ID*+, define the primitive to be drawn. Bits three and four allow the direction of scan conversion to be reversed, essential for octant re-mapping of lines. The default is to increment x and y. Bit five tells the scan converter that it should swap the x and y co-ordinates, so reflecting the primitive in the line y =x. Bit six defines what should occur when the interpolated midpoint line error is equal to zero. This bit allows the user to be consistent when drawing lines in different octants.

In addition to the primitive identity, primID, there are three other control words associated with the scan conversion module: exiCTRL, edgeType, and aaCTRL. The exiCTRL control output--short for external interpolator control--controls interpolators external to the scan conversion module. The control signal is identical to the internal edge function control signal.

The edgeType control input defines the character- istics of triangle edges. There are two bits per edge, one indicating if the edge N is thick or thin, thickE,, the other specifies whether the edge should be antiaiiased, AAN. The various fields of the edgeType control work are defined in Fig. 5.

The aaCTRL control output specifies what the antialiasing unit should do with the pixel, see Fig. 6. It is comprised of the edgeType control word, aaCTRLS4, and bits that specify the nature of the pixel with regard to the three edges. If thickP, is a

bit

logical “I” then the pixel is thick with regard to edge N. Similarly, if insideN is a logical ” 1” then the pixel is totally covered by edge N.

If a pixel is thick with regard to edge N, (thickP,=“l”), but the edge has been defined as a thin edge, (thickEN=“O”), then the pixel must be marked invalid by the antialiasing module, thus not displayed.

4.1.1.2. Traversal FSM descriptions. The primi- tive traversal module implements four traversal algorithms which are used to draw: lines, spans. triangles and rectangles in various modes. The scan conversion algorithms use linear edge functions and the concept of look ahead [I 11. Look ahead in general involves looking at neighbouring pixels in advance in order to determine the next pixel to move to. In the case of TAYRA we look ahead to two neighbouring pixels, one in the horizontal plane, and the other in the vertical plane. The current pixel parameters have already been calcu- lated and are waiting in interpolation registers, and so can be passed on to the next pipeline stage without further computation. However, the para- meter values for the two neighbouring pixels are calculated, and depending on the current state, are stored in interpolation registers at the end of the current cycle. They then become available in the next cycle.

4.1.1.3. Triangle FSM. The eight states of the triangle scan conversion algorithm and its FSM are illustrated in Fig. 7. The current pixel is marked by a

5 4 3 2 I 0

thickE, AA, thickE, AA, thickE, AA,

Fig. 5. Edge type control word (edgeType).

hit II IO 9 R 7 6 5 4

pwposc thickP, inside, thiokP, inside, thickP, inside., thickE$ AA,

Fig. 6. Anti&as control word (mCTRL).

Page 5: The TAYRA 3-D Graphics Raster Processor

TAYRA 3-D Graphics Raster Processor 133

tri-right

tri-back-to-centre

tri-down

tri-right-push

tri-back-to_centre-push

Fig. 7. Scan conversion of triangles and the triangle FSM.

E : edge distance s : slope

21srls ä 5.8 ns 26.2 ns Total ) 53.5 11s

Propagation delays for ES2 ecdm05 library using MAX-IND operating conditions, (industrial worst cast).

E,- Mask % LUT

octant Subpixel mirroliug count

r octaut

mirroriug

Fig. 8. Pixel coverage module.

Page 6: The TAYRA 3-D Graphics Raster Processor

134 M. White et ~1.

r-l-- ~ t-mult

1 sJ3ut

t$ut

wait out

Fig. 9. Architecture of the perspective correction module.

thick boundary and the two neighbouring pixels looked ahead to are marked with a dot at their centres. The lightly shaded pixel is the previous pixel, if there was one. The eight states are described below:

(1) tri-start: This is the first state in the scan conversion of any triangle, and so there is no previous pixel. The scan converter must start at the uppermost point of the triangle. After starting, the scan converter will try to go right, and so looks ahead to the pixel immediately to the right of the start pixel.

(2) tri-r: When in this state the pixel to the right of the previous pixel becomes the current pixel, and is output. The scan converter looks ahead to the pixel to the right of the current pixel.

(3) tri-btc: This state returns the rasterizer to the first pixel visited on the scan line, assumed to be near the centre of the scan line.

(4) tri_l: It is the exact opposite to tri-r. (5) tri-dz It takes the scan converter down to the

next scan line. (6) tri-rp: This state is similar to tri-r but it assumes

that a valid pixel has not been found on the next scan line, and so continues to push the pixel beneath the current pixel into the storage register.

(7) tri-btcp: This state is similar to tri-btc but assumes that a valid pixel has not been found on the next scan line, and so continues to push the pixel beneath the current pixel into the storage register.

(8) tri-fp: This state is similar to tri-1 bit assumes that a valid pixel has not been found on the next scan line, and so continues to push the pixel beneath the current pixel into the storage register.

The algorithm described above is a little inefficient in that the initial pixel in a scan line is always visited twice, and flagged as invalid on the second occasion. This inefficiency can be overcome by adding extra look ahead logic to determine the pixel co-ordinates to the left of the initial pixel in a scan line as well [ Ii]. However, the final choice of algorithm is a cost versus performance trade-off based on the level of look ahead desired.

Our scan conversion algorithm treats all triangle edges as thick. When a thin edge is encountered, some pixels will be produced that are outside that edge. These pixels will be marked invalid in a latter pipeline stage. This approach is necessary when using edge functions to define the boundary of a triangle, otherwise pixels will be omitted from very thin triangles.

4.1.1.4. Line, span and rectangles FSM. The line scan conversion algorithm is a hardware implemen- tation of the midpoint line algorithm described by Foley [12]. The algorithm can only step in two directions from any given pixel, East and Southeast. This only gives lines in one octant but mechtisms are provided to map the line into other octants. There are two types of span: the horizontal and the vertical span, but only the horizontal span is implemented by the scan conversion unit. The

Page 7: The TAYRA 3-D Graphics Raster Processor

TAYRA 3-D Graphics Raster Processor

exponent signifiesad

30 23 22 I5 14

135

I I I I

//8 t-adz

iS

weight

/

v V \

i-table j-table

256x25 256x16

0

exponent sigaitkand

Fig. 10. Architecture of the floating-point reciprocal unit.

vertical span is derived from the horizontal span by swapping the co-ordinates. This is performed by the octant re-mapping mechanisms. The scan conversion of rectangles start at the corner nearest the origin. The scan converter scans right until the rectangle boundary is met, and then moves down to the next scan line. The FSMs for lines, spans and rectangles can easily be derived, and are thus not shown for clarity.

4.1.2. Pixel coverage module. The pixel coverage module implementation is based on the algorithms described in [9], see Fig. 8, which incidentally indicates the propagation delays for the ES2 ECDMOS technology library. Clearly, for the

performance figures given above, pipelining is required.

The coverage values generated are used in the alpha blending module further down the graphics pipeline for antialiasing.

4.13. Perspective correction module. Without going into the theory of why texture mapping in particular suffers from perspective distortion effects, see [13-H], the function of the perspective correction module is to divide the floating point interpolated texture co-ordinates S and T by Q at pixel rate. Fig. 9 illustrates the architecture of the floating point perspective correction module. Note that because the reciprocal calculation and the multiplication

Page 8: The TAYRA 3-D Graphics Raster Processor

136 M. White et ul.

Vi&k bscwed

179/256

1541256

3 4 1281256

E

102Q56

511256

261256

EXP~: ./'= P"' -+ Lookup[(Z,,,Z)(Z,,,Z)]

-0 0,s I I.5 2 2.5 3 3.5 4 4.5 5 5.5 0 h.5 T:

Density x Depth (d.Z)

I

-.+-.. .-;

7.5 8

Fig. Il. Fog factor lookup table.

cannot be done in one clock cycle there is a second The architecture chosen for implementing the

pipeline register set in between them. A more detailed floating point reciprocal components was influenced explanation of how the perspective correction by work done at the IBM Thomas J. Watson

module works and the rationale for its architecture Research Center by Narayanaswami ]17] and at the is given in [ 161. University of Sussex by Westmore [18]. For interests

Program Register Set Program Register Set

Texture Mwry Controkr

Fig. 12. Overview of texture mapping module

Page 9: The TAYRA 3-D Graphics Raster Processor

TAYRA 3-D Graphics Raster Processor 137

So, a fixed-point division is used in any case. The floating-point division is just expanded by some adjustments and exponent calculations.

After studying all the above mentioned alterna- tives in terms of gate count and calculation time the final choice was made for splitting up the division in a floating-point reciprocal unit followed by a floating-point multiplier. In the special case of texture mapping this partitioning was made due to the fact, that the reciprocal unit is used once for calculating l/Q, and the multiplier is used twice for calculating S/Q and T/Q for the texture parameters S and T, respectively. The schematic for the division unit is shown in Fig. 10.

col-result

Fig. 13. co&,,,, in dependence of frac for co& =0 and col, = 255.

sake we indicate the approach used to compute the floating point reciprocal unit. The floating point format of the reciprocal unit is defined by the IEEE Standard 154 on floating-point arithmetic, which defines single-precision floating-point format to be a sign bit, an 8 bit exponent and a 23 bit significand, see [19, 201.

Because a divider cannot be synthesized automa- tically by a synthesis tool (like Synopsys or Auto- Logic II) it has to be designed by hand. Several alternatives can be used (refer to standard texts on the design of arithmetic logic) to realise a divider:

l By sequential shift-subtract/add division l By convergence division, for example Newton-

Raphson-iteration l By an array divider l By splitting up the division in a reciprocator and a

multiplier [ 17, 181

Furthermore, you have the choice of either using fixed-point numbers or floating-point numbers. Note, that if you design a floating-point divider the division of the significands is a fixed-point operation.

col-result

t White black

4.2. Depth test module The depth interpolator is implemented early to

eliminate unnecessary texture and colour accesses. The depth test takes a variable amount of time due to page misses, etc. so we soak up this latency with FIFOs throughout the graphics pipeline. Inciden- tally, only about 19 Kbits of on-chip memory is needed to implement FIFOs, however, without DRAM macrocells this translates into about 130 K gate equivalents. The depth test operations are those supported by OpenGL [21]: GL-NEVER, GL-EQUAL, GL-GREATER, GL-GEQUAL, GL-ALWAYS, GL-NOTEQUAL, GL-LESS, GL-LEQUAL.

4.3. Fogging module We have implemented the OpenGL fogging

modes: GL LINEAR, GL-EXP and GL-EXP2. The expone&als are implemented through optimised lookup tables, see Fig. 11, which illustrates the theory and architecture.

4.4. Texture mapping module The texture mapping module provides point

sampling, linear, bilinear and trilinear filtering, palletized 8 bit colour mode and OpenGL style texture illumination (decal and modulate). Figure 12 provides an overview of the texture mapping

white black

Fig. 14. Correct and wrong values of CO&,,, on a black/white texture map.

Page 10: The TAYRA 3-D Graphics Raster Processor

138 M. White et (11.

cohxlrl 33 - ,

>

colod! 32, - , -

>

coloar3 32, - ,

>

colourd 32, - t

g1zan8x4 32, r I

blucllx4 32,

lb I

-n-

m- ‘-61 TF xl--

Linearc&

1

8/ filkr-,lt 5

5 filterJ3 I

8 7{

fibrJ3

Fig. 15. Texture filter architecture.

module, while shows in detail the architecture of the filter component.

It is beyond the scope of this paper to describe every part of the texture mapping module, however, because texture filtering plays an important part in high quality graphics, we describe this component in more detail.

4.4.1. Texture filter component. The most basic functional unit of the texture filter component is a linear interpolator. The linear interpolator is ar- ranged in various ways to implement linear, bilinear and trilinear interpolation of texture values. In general, a linear interpolation unit is used for interpolating between two given values, the start point and the end point, respectively. In addition to the start and end point there is a third parameter used in linear interpolation, called the weighting

factor which defines the distance of the resulting value to the two others. Geometrically speaking, the resulting value lies on a line between the two others. The mathematical equation is:

valt4e,,it = valueszaP* + weight * (vaZle,,d - value.yl,,l)

In the context of texture filtering, the two input values are texture colour values co& and toll and the weighting factor is called the fraction. So, the above mentioned equation becomes

wt-mi~ = co10 +fiac(cofl - coI0)

where frac is a 8 bit fraction value, and the two colour components co10 and co11 are treated as integer numbers, also frac is in the range 0 to 25.51 256. In contrast to some special applications (for

Page 11: The TAYRA 3-D Graphics Raster Processor

TAYRA 3-D Graphics Raster Processor 139

colo”rO colourl frac

8 8

:

sub

9 diff

A- mult

lin-inter 2

,8 f T- 8

Prod

C add

sum 8 Linear Interpolator

V /

8 colo”rO colourZ colo”rl colour3 frac0 fracl

bilin-colour

Bilinear Interpolator

Fig. 16. Linear and bilinear interpolators used in the BilinearCalc and LinearCalc units of the texture filter.

example the illumination equation, see discussion on alpha blending below, where you want to have FF*O. FF=FF in order to avoid a decrease in bright- ness), here in texture filtering, you want to have FF*O .FF=FE. This is because, for example, suppose that co10 = 0 and col, = FF, then if frac goes from O/ 256 to 2551256, co&,,,* takes all values between 0 and 254. Note that in Fig. 13, for simplification, col,,,l, is drawn as a line but to be precise, it is a step function. It is stressed here again that in texture filtering you don,t want COI~~~~ to be 255 (you want 254) for the input values co&= 0, coil= 255, and frac=255. You will have co&,,, to be 255 if you leave that interval (which is, in this context, the distance from one texel value to the next) and frac becomes 0 for the next interval. Then the start point is color 255 and therefore col,,lr as well, because ,fiac is zero.

If you have a texture map consisting of black and white texels in alternating order, the resulting colour

value would go from 0 to 255 and back to 0 and so on as long as you move from one texel to the next (shown as the straight line in Fig. 14). If co&,,,, was 255 for frac = 255, then you would have the dotted line.

Let us now look at the architecture of the texture filter component (see Fig. 15). The basic components are the BilinearCalc component, and the Line-

arCalc component. These components are con- structed from bilinear interpolators and linear interpolators respectively (see Fig. 16). Because the complete trilinear interpolation cannot be done in one clock cycle the filter module was split up into these components and a second pipeline register set

has been inserted between them. The BilinearCalc

component is constructed from six bilinear inter- polators (see Fig. 16). The LinearCalc component

includes the registers to store the result of the first phase of a conventional RGBA trilinear filtering operation, and the linear interpolators to calculate the trilinear value in the second phase.

Table 3.

filter-mode (1:O) mode8x4 Supported mode

Point sampling 00 0 Linear filtering 01 0 Bilinear filtering 10 0 Conventional RGBA trilinear filtering 11 0 Mode8x4 trilinear filtering 11 1

Page 12: The TAYRA 3-D Graphics Raster Processor

140 M. White et ul.

4.42. Modes supported by the texture filtering module. The texture filtering module supports several modes of filtering. All supported modes are listed in Table 3. To select a specific mode, the 2 bit wide control bus filter-mode and the control line mode8x4 is used.

4.5. Alpha blending module Although it is possible to implement an alpha

blending module in a similar style to that given in the texture filter module, i.e. with linear interpolators, two factors suggest a variation. First, the OpenGL alpha blending operation suggests that a blending factor setup stage and a blending calculation stage be used. The architecture shown in Fig. 17 illustrates the blending calculation stage for one colour component. It also depicts a correction factor which is explained next with respect to the texture filtering described above. The linear interpolation component used for texture filtering interpolates between two 8 bit colour values co&, and toll according to the equation:

cofresu~, = co10 +frac(col, - co&)

As shown for texture filtering, if the equation above is calculated with a certain assignment of input values then CO~,,,,,,~ will result in a value less than it should be. For example, let co10 be 0, toll be 255, and frac be 255/256. Then the multiplication will give you:

Lrl

muadhg

‘i 8 t

bkmiedsdmu

Fig. 17. Block diagram of the blending module.

O.FF*(FF-O)=FE,Ol.

Therefore, when col,, is 0, col,,,~,,,, will be FE. 01.

Even with rounding it is not possible to get FF

because the first binary digit right of the decimal point is zero. To avoid this, a correction factor is needed which expands the range of ,fiac from 0.. .255/256 to 0.. . 1. We note that this correction factor appears in the linear interpolators for illumi- nation (OpenGL modulate equations) and alpha blending logic. It can easily be seen that this correction factor has to be 256/255 = 1 + 11255. Using this correction factor, the above mentioned multi- plication results in:

,~ac*correction~,,,,,*(c~l~- coil)

Using a correction factor of 1+ l/255, we can concatenate as many interpolation units as we want to, and we will have no decrease in the final coiour value. But it is difficult to handle 1 + l/255, which is in hex notation 1,01010101. It is much easier to handle 1 + l/256, which is in hex notation 1,01000000. The error that we make is just

error = (1 + l/255) - (1 + l/256)

= l/255 - l/256 = l/65280

If we multiply this approximated correction factor 1+ l/256 by the fraction ,frac, we will get the corrected fraction frac’ to be:

,frac’ =fkac * (1 + l/256) =.frac * 257/256

5. SOFTWARE CONSlDERATfONS

TAYRA was the first model at the system and algorithmic or functional level in C++. This produced an environment in which rapid (compared to the RTL VHDL) debugging of graphics algo- rithms could be effected. The software simulation was in fact a mix of high level software descriptions and low level software implemented state machines. For example, state machine descriptions of scan conversion algorithms and software modeling of OpenGL style alpha blending. This software envir- onment provided us with a reference model with which to compare our hardware results from the RTL VHDL simulations, see below. The software environment included a detailed model of TAY RA’s register set which would allow potential users to start building device drivers.

TAYRA macrocells are currently being revised to map into an FPGA based system architecture. In this new project we are implementing the reference board, device drivers and example applications. The device drivers being implemented include an OpenGL 1.1 MCD for Windows NT 4.0, and support for Microsoft Direct3-D when it becomes standard on Windows NT 4.0. Because OpenGL is

Page 13: The TAYRA 3-D Graphics Raster Processor

TAYRA 3-D Graphics Raster Processor

(a)

(b)

(c)

(4

(e)

Fig. 18. TAYRA VHDL RTL simulation results. (a) Texture mapped plane tiling and exhibiting linear fog, (b) exponential fog, (c) exponential squared fog, (d) Gouraud shaded teapot, (e) video backdrop with two Gouraud shaded and antialiased triangles, (f) triangles from (e) above with antialiasing magnified.

Page 14: The TAYRA 3-D Graphics Raster Processor

142 M. White et ul.

more mature we have concentrated on providing device drivers and applications for OpenGL. This implies we are targeting OpenGL style markets.

6. IWRDWARE SIMULATION BFSULTS

The images depicted in Fig. 18 are the result of hardware simulations which took more than 3 h each on a 133 MHz Pentium using Model Technology’s V-System VHDL simulator. The simulation was all RTL VHDL. All simulations were performed on a 133 MHz Pentium using Model Technology’s V- System VHDL simulator. Figure 18(e) is interesting in that it shows an image which has been manipu- lated with Adobe Photoshop to produce the lens flare effect, this was then used as a video backdrop (instead of the normal background colour), and then the two triangles were rendered with transparency and antialiasing. This illustrates the use of TAYRA two channels-graphics pipeline channel and PC1 to buffers channel.

7. CONCLUSIONS

TAYRA has been fully implemented in C code to provide algorithmic simulation verification, and in VHDL at the register transfer level (RTL). Gate level simulations have been performed on most modules after technology mapping to the ES2 ECDMOS 0.6 micron library, which indicates a total gate equiva- lent count of around 270 K gates and about 19 Kbits of on-chip DRAM are required. We are currently mapping TAYRA’s VHDL models to an FPGA based (with off-the shelf system components) system architecture to provide a working prototype. Further, we are developing TAYRA’s modules into a library of macrocells for future design reuse.

Acknowledgements-We would like to acknowledge the European Commission for providing part of the funding for this work under an Esprit initiative called the Monograph Project. We would also like to extend our thanks to past and present members of the Monograph consortium, and independent consultants to the project, namely, Michael Malms (IBM Germany), Bengt-Olaf Schneider, Chandra- sehar Naraynaswami, Wolfgang Bultmann (IBM Thomas J. Watson Research Center), Andreas &hilling, Anders Kugler, Wolfgang Strasser (Universitgt Tubingen), and many other people too numerous to mention, but never- theless contributed in no small way to the Monograph Project.

REFERENCES

1. Dunnett, G. J., White, M., Lister, P. F., Grimsdale, R. L. and Glemot, F., The IMAGE chip for high performance 3-D rendering. IEEE Computer Gruphics and Applications, 1992, 12,41-52.

2.

3.

4.

5.

6.

7.

8.

9.

IO.

11.

12.

13.

14.

15.

16.

17.

18.

19.

20.

21.

Centre for VLSI and Computer Graphics, The Sussex Texture Processor (STEP), http://www.susx.ac.ukkJ engg/research/vlsi/index.html. Pearce, S. F., Bassett, M. C. and Lister, P. F., The Sussex Multimediu Macrocell. ASK and MCM. The European Design and Test Conference 1996, User Forum, Paris, France, 1 l-14 March 1996. Intel Corporation, Accelerated Graphics Port Speczjka- bon, Revision 0.9. 9 May 1996. Lister, P., PC1 Bus Toolkit VO. 1 Data Sheet. Centre for VLSI and Computer Graphics, University of Sussex. Schilling, A., Some practical aspects of rendering. In Advunces in Computer Gruphics Hurdwure V. Springer- Verlag, 1992, pp. 54-66. Juan, P.. A parallel algorithm for polygon rasterization. Computer Graphics, 1988, 22, 17-20. Fuchs, H., Poulton, J., Eyles, J. and Greer, T. Coarse- grain and fine-grain parallelism in the next generation Pixel-planes graphics system. In Parallel Processing for Computer Vision, ed. P. M. Dew, R. A. Earnshaw and T. R. Heywood. Addison-Wesley Publishing Company, 1989, pp. 241-253. Schilling, A. G., A new simple and efficient antialiasing with subpixel masks. Computer Graphics, 1991.25, 133- 141. Waller, M., White, M. and Lister, P., TAYRA Scan Conversion Module. Technical Report Sussex/Mono- graph/044, Centre for VLSI and Computer Graphics, University of Sussex, 12 October 1995. White, M., Three-dimensional computer graphics ren- dering. Doctor of Philosophy, T’he University of Sussex, Centre for VLSI and Computer Graphics, February 1994. Foley, J. D., Van Dam, A., Feiner, S. K. and Hughes, J. F., Computer Graphics Principles and Pructice. Addison Wesley, 1990. Wolberg, G., Digital Image Warping. IEEE Computer Society Press, 1990. Heckbert, P. and Moreton, H. P., interpolation for polygon texture mapping and shading. In State of the Art in Computer Graphics Vizuulistztion and Modeling, ed. D. Rogers and R. Eamshaw, Springer Verlag, 1991. Blinn, J. F., Hyperbolic Interpolation,. IEEE Computer Graphics and Applications, 1992, 12, 89-94. Binder, J. W., Wailer, M. D., White, M. and Lister, P. F., TAYRA architecture: perspective correction mod- ule. Monograph Project, Centre for VLSI and Com- puter Graphics, University of Sussex. Narayanaswami, C. and Luken, W., Efficient evalua- tion -of I/x for texture mapping. IBM Thomas J. Watson Research Center. Yorktown Heights. NY, USA, 25 August 1995. Westmore, R. J., Real time texture synthesis in computer generated imagery. Doctor of Philosophy, The University of Sussex, Centre for VLSI and Computer Graphics, November 1983. Omondi, A. R., Computer Arithmetic Systems Algo- rithms, Architecture und Implementutions. Prentice Hall, 1994. Feldman, J. M. and Retter, C. T., Compurcr Architcc- ture: A Designer’s Text Based on a Generic RISC McGraw-Hill, 1994. Neider, J., Davis, T. and Woo, M., OpenGL Program- ming Guide. The O&ial Guide to Learning OpenGL, Release 1, Addison Wesley, 1993.