15

A specification for a reconfigurable optoelectronic VLSI processor

Embed Size (px)

Citation preview

Page 1: A specification for a reconfigurable optoelectronic VLSI processor

A speci�cation for a recon�gurable optoelectronic

VLSI processor suitable for digital signal processing

D� Fey� B� Kasche� C� Burkert

Friedrich�Schiller�University Jena� Department for Computer Architectue and �communicationErnst Abbe Platz ���� ����� Jena� Germany

O� Tsch�ache

Friedrich�Alexander�University Erlangen�Nurnberg� Department for Computer StructuresMartensstr �� ���� Erlangen� Germany

�April �� �����revised version September �� ����

Keywords� smart pixel� optoelectronic VLSI� digital signal processor� CORDIC

Abstract

A concept for a parallel digital signal processor basedon optical interconnections and optoelectronic very largescale integrated circuits is presented� It is shownthat the right combination of optical communication�architecture and algorithms allows a throughput thatoutperforms pure electronic solutions�

The usefulness of low�level algorithms from theadd�and�shift class is emphasized� These algorithmslead to ne�grain� massively parallel on�chip processorarchitectures with high demands for optical o��chipinterconnections� A comparative performance analysisshows the superiority of a bit serial architecture� Thisarchitecture is mapped onto an optoelectronic ��Dcircuit and the necessary optical interconnection schemeis speci ed�

�� Introduction

At present one of major problems in current VLSIsystems is limited bandwidth because of too few andtoo slow external pins�� Optoelectronic very largescale integrated �VLSI� circuits��� using high�denseoptical interconnection��� schemes o�er the potentialto eliminate these bottlenecks� The success ofoptoelectronic VLSI systems depends largely uponapplications with a need for high input�output �I�O�bandwidth� Digital signal processing �DSP� is oneof such bandwidth intensive computational tasks�Typical signal processing tasks� as for example medicalimage processing and detection of radar signals� ischaracterized by high data accesses as well as highcomputing performance� An e�cient solution fora DSP requires the right combination of processor

architecture� low�level algorithms and powerful o��chipcommunication�

In this paper we present an architecturefor a recon gurable parallel digital signal processorsolving the communication bottleneck by using opticalinterconnections� The recon gurability in our solutionis realized by setting the content of our processor statemachine to a speci c value� Depending on this value ourprocessor carries out in hardware one of eight di�erentelementary functions� which are often used in signalprocessing tasks� To these functions belong exponentialfunction� logarithm� sine� cosine� arc tangent� squareroot� multiplication and division� Signal processors oftenuse convergence procedures as CORDIC or so�called bitalgorithms for a hardwired solution of such functions�As long as such algorithms are implemented in a singleprocessor this makes no serious di�culties with currentmicroelectronics technology� In contrast to this a parallelimplementation of such recon gurable DSP elementsis di�cult to achieve� due to area consuming units�The reason for that is that area consuming units astables and barrel shifters are needed� Tables and barrelshifters are fundamental for an e�cient implementationof such low�level algorithms� The table stores constantvalues needed in the algorithm� The barrel shifteris necessary to carry out bit shifts about variablebit lengths in constant time� Moreover� long on�chipinterconnections are necessary to distribute data thathave to be broadcasted� In the paper we will showthat some of the mentioned problems are avoidable bymeans of parallel optical o��chip interconnections� Thususing optoelectronic technology an implementation of amassively parallel DSP system becomes possible�

As solution for the optoelectronic VLSI �OE�VLSI�circuits we laid our focus on a H�ybrid��SEED

Page 2: A specification for a reconfigurable optoelectronic VLSI processor

technology� This technology seems to be one of the mostmature OE�VLSI techniques so far� at least if we judgethis in the sense of public availability via the CO�OPconsortium� However� our architecture proposal can bemapped also without di�culties onto OE�VLSI circuitsbased on vertical�cavity surface�emitting laser �VCSEL�arrays�� For the H�SEED technology we determine thecomputing performance to expect versus the clock cycleand compare this result with existing pure�electronicDSP systems� Since we aspire a ne�grain processorarchitecture with as much as possible DSP elements onchip� we consider fault�tolerant mechanisms for essential�A rst step towards this direction is the introduction of afault model in the VHDL description of our DSP systemto investigate optical cross talk between neighboredoptical detectors�

The paper is organized as follows� Section reviewsthe functionality of the low�level algorithms we favorfor our parallel optoelectronic DSP elements� Moreover�we present in Section � di�erent alternatives forarchitectures implementing thoselow�level algorithms�The structure of the processing elements �PEs� inclusivetheir interconnection requirements are speci ed in moredetails in Section �� In Section � the di�erentarchitecture models are evaluated by a theoreticalanalysis of their computing performance to expect� InSection � that architecture model� we evaluated asthe best one� is mapped onto an OE�VLSI circuit�Furthermore the corresponding optical interconnectionscheme is speci ed� The mapping process was supportedby a high�level synthesis of a VHDL description ofour processor system using the design tool Synopsys��The usefulness of a high�level description like VHDL isdemonstrated in Section �� There we present a faultmodel introduced in the VHDL description to simulatethe e�ects of optical cross talk between neighboreddetectors� Finally we nish with a summary and anoutlook�

�� Algorithm study

In this section we give a short overview of the used classof algorithms and we will take a closer look to the basicideas of the mathematical background� Our researchaimed at a development of algorithms well suited forsmart pixel architectures� That means� we were lookingfor algorithms which match very well the boundaryconditions given by the optoelectronic hardware� Toexploit the potential of optical interconnections moste�ciently we are favoring ne grained architectures�These architectures are characterized by a high need

for o��chip communication� Hence� the PEs shouldbe as small as possible to integrate as much aspossible PEs onto one chip� The bene ts from opticalinterconnections for ne grained smart pixel PEs werealso stressed in literature in the past�����

A� Add�and�shift algorithms

Our aim was to nd a uniform optoelectronic hardwareto calculate standard functions like sine� cosine� arctangent� square root� logarithm� exponential� multiplyand division function� This can be done by a Taylorseries development using �oating point multiplicationunits �FPMUs�� Since FPMUs need comparatively mucharea on chip we didn�t pursued this possibility�

Instead of using FPMUs we investigated di�erentarchitectures based on add�and�shift operations� Thosealgorithms are using only simple operations such asadding � shifting and multiplexing� CORDIC algorithms�being well known in signal processing� belong to theclass of add�and�shift algorithms� They have beendeveloped and investigated for a long time especiallyfor trigonometric functions��� CORDIC is an iterativeprocedure as well as so called bit algorithms� which alsobelong to the class of add�and�shift algorithms� Theyare called bit algorithms because within each iteration anew bit� starting from the most signi cant bit position�becomes valid� Bit algorithms are more appropriate forthe calculation of square root� multiplication� division�exponential and logarithmic function�

By comparing bit algorithms and CORDIC we haveobserved that the calculation scheme of CORDIC showsa more uniform structure� But for the non trigonometricfunctions it is sometimes necessary to use the CORDICscheme twice� i�e� we need n iteration steps� On theother hand bit algorithms deliver the results in n steps�But they show lower e�ciency to calculate trigonometricfunctions�

Unfortunately in literature a uniform scheme to integrateall the eight functions� mentioned above� was not found�It was our task to modify them in such a way thatthey could be mapped onto a ne grained optoelectronichardware structure� We combined bit and CORDICalgorithms and got as result the scheme shown in Fig� ��

To our knowledge� such a uniform algorithm scheme�which allows to carry out eight di�erent standardfunctions simultaneously� can not be found in literature�

Page 3: A specification for a reconfigurable optoelectronic VLSI processor

We developed these algortihms especially for anoptoelectronic smart pixel architecture� A uniformalgorithm structure will lead to a regular hardware setupof a PE� and additionally� to a more space�invariantoptical o��chip interconnection scheme� The higher thespace invariance is� the more dense the interconnectionsshould be� We pursued this approach to exploitin our DSP architecture one strength of optics overelectronics� the high space bandwidth� This will lead to aparalllelism that is not possible to achieve in electronics�In electronics there is no stringent requirement foruniform structures� Mostly barrel shifters and fastadders are used� This is the way to prefer if the intentionis just to integrate one or two processors� But ourintention is to implement a massively parallel DSP� ThisDSP must be realized as OE�VLSI circuit and since thistechnology is young� the optical interconnection schemeshould be as simple as possible�

The calculations� shown in Fig� �� are carried out on atriple �x� y� z�� The idea of our algorithms is that they component goes to zero and the other componentsconverge to the searched function values simultaneously�The procedure starts with �a� a shift operation on twocomponents of the triple� Next we have to perform �b�

three add operations on x� y and z� Finally �c� we checkthe value of y to measure the progress of convergence�Within the CORDIC algorithms we have a two sideconvergence� That means� if y is below zero we haveto increase it� i�e� we have to add in the next iteration�Otherwise� if y is above zero we have to carry out asubtraction� Within the bit algorithms there is a one sideconvergence� If y is below zero the calculations carriedout in �a� and �b� are obsolete� Hence� we have tocontinue with the old values in the next iteration�

The concrete iteration scheme for the eight functions�which we have developed� is shown in Tab� ��The rst ve algorithms belong to the class of bitalgorithms� The last three are CORDIC procedures�The multiplications with �i are replaced by shifts tothe right� The values for arctan��i� and log�� � �i�are precalculated constant values� as well as the initialvalue � �

Qn��i� �

p� � ��i��� required in the CORDIC

procedures�

The square root function seems to be outside of auniform scheme due to an additional add operation� Theaddend �����i�� �� which is to be seen in the y variable�can be realized by a bit set operation� This is possibledue to the fact� that the �i � �th bit is always zero�when the add operation is performed�

If the condition given in the last column of Tab� � is

ful lled then for the rst ve functions the calculationkeeps valid� otherwise not� In case of the last threefunctions the calculation is always done� but the abovealgebraic signs are used for the next iteration if the resultof the question is true� But it is still a uniform structureas it shows Fig� �� Hence� the inquiry can be alwaysdone at the end of each iteration on the y variable of thejust nished iteration�

B� The logarithm function as an exapmle

As an exapmle for all eight functions we want to showthe derivation of the algorithm calculating the functionvalues of the logarithm f��� � log���� For that we usethe additivity of the logarithm values of each factor� ifthe logarithm of a product is to carry out�

log�nYi�

yi� �nXi�

log�yi� ���

Using ��� it is practical to express � as a product ofcommon values� So we start with the basic approach

� �

nYk�

���kk � ��

whereby �k � �� � �k� and �k � f�� �g� Rewriting ��we get

� � �

nYk�

��kk � ���

Consequently we have �� ful lled if and only if ��� isful lled as well� Hence� we can de ne yn with

yn � �

nYk�

��kk � ���

To approximate � the series �yk�nk� shall converge to

�� Rewriting ��� we deduce that yn�� � yn � �n�� and nally

yn�� � yn � yn � ��n�� � ���

By approximating � with the productQn

i� ���kk there

will be a residue R��� n� such that

� �

nYk�

���kk �R��� n� � ���

Applying the logarithm to both sides we get

log��� � log�nY

k�

���kk �R��� n�� � ���

Page 4: A specification for a reconfigurable optoelectronic VLSI processor

If yn�� goes to � it is ensured that the residue R��� n�goes to zero� such that

log��� � log�nY

k�

���kk �

� �nX

k�

log���kk � � ���

So it is self�evident to de ne xn as this sum� such thatwe get

xn�� � xn � log��nn � ���

Since the series �yk�nk� is a monoton increasing

series� that means if yn�� is greater than �� the lastapproximation step would be irreversible� Hence� thelast iteration step has to be canceled� This is expressedin a formula language by

�k �

�� if yk�� � �

� else����

We still have to determine the initial values for xk and yk�Since we add in the x�variable the precalculated valuesof a table value we have to initialize x� with zero� Forthe y�variable it holds the following� due to the validityof ��� we get y� � ��� � ��� and deduced by ��� y� �y� � y� � ��� such that we get

x� � � y� � � ����

Since this algorithm inquires about y greater than � itis not the used version� We selected another algorithm�using the condition y greater than �� but its derivationis more complicated and longer� It can be read inliterature������ In a more or less complicated way allthe other � algorithms for standard functions can bederivated� All of them force to a uniform hardwarestructure���

�� Architecture model for the DSP System

As can be seen in our algorithms of Tab� � theDSP system requires only simple operations as shifting�followed by adding and nally checking a condition� Forthat we use an array of PEs �see Fig� � which will bemore speci ed in the next section� In each row of thisarray exactly one iteration of our algorithms is carriedout� So n rows are required� each one upon the other�The communication structure is a NEWS structure� i�e�to the � points of the compass� Within one row thereare also long lines �not shown in Fig� � to carry out a

data transfer over a distance of two or more processingelement positions�

For our DSP architecture we want to investigate a bitparallel and a bit serial approach�

A� Bit parallel approach

For the bit parallel processing we focused on twoalternatives for the addition process a ripple carry adder�RCA� and a conditional sum adder �CSA��

The most simple way to add two numbers would bea RCA� The carry bit is always given to the next bitposition to the left �see Fig� ��� So the PEs shown inFig� correspond to ��bit full adders� The functionalityis shown in ����

s � ai xor bi xor ci��

c � aibi or aici�� or bici�����

All communication takes place only between nextneighbors� So a communication over a longer distanceis not required� But the whole add operation takes nsteps to nish� An additional step is necessary to carryout the shift operation� To process the whole algorithmit takes

steps � n � �n� �� ����

This disadvantage of n steps per addition is avoidedby using a conditional sum adder� but for the price ofa more complicated interconnection scheme� The CSAexploits the hardware most e�ciently in relation to thetime necessary for one add operation������ Although anadder based on carry look ahead techniques can carryout an addition in constant number of time steps aCSA has a better gate per bit ratio� The same holdsfor look�ahead adders with logarithmic time costs������Hence� we don�t consider look�ahead techniques sincefor massively parallel structures using a CSA will leadto better throughput rates� The general idea behind aCSA is to carry out for each bit ��bit additions at thebeginning� considering both no carry occurs ���� and acarry ���� occurs from the nearest right position�

S� � ab� ab

C� � ab����

S� � ab� ab

C� � a� b����

Page 5: A specification for a reconfigurable optoelectronic VLSI processor

Afterwards the valid bits are selected by multiplexoperations�

S�out � C�inS� � C�inS�

C�out � C�inC� � C�inC�

S�out � C�inS� � C�inS�

C�out � C�inC� � C�inC�

����

The multiplexing starts in blocks of one bit width� Thewidth of a block increases consequently from � bit widthto ������� and so on �see Fig� ��� Finally one operationis nished in logarithmic time� exactly speaking after�� � ldn� steps�

The interconnections of the modules with such afunctionality yield the sum of a and b at S� and thedi�erence of a and b at S�� assuming the �complementrepresentation of the numbers� The most signi cant bit�MSB� is on the left and the least signi cant bit �LSB�is on the right� For each bit position there is one PE�In this case each PE of Fig� realizes the half adderfunctionality shown in ���� and ����� and the multiplexeroperation shown in �����

B� Bit serial approach

This approach of the bit and CORDIC algorithmsintegrates a complete iteration in one PE combining theoperations adding� shifting and checking a condition� Incontrast to the bit parallel approach the PEs expect theirdata in a serial fashion�

The functionality of one row of Fig� is realized by onePE� Such a PE consists of a full adder followed by ashift register to store the sum bits of the add operation�A vector of n vertically arranged PEs calculates exactlyone of the eight functions of Tab� �� Hence� several ofthese vectors arranged side by side� can execute di�erentfunctions simultaneously� The input bits are fed into thefull adder step by step starting with the LSB� The sumbit is stored in the rst �ip �op of the shift register andthe carry bit goes to the carry �ip �op �see Fig� ��� Inthe next step each bit in the shift register is moved to theleft� All PEs of a vector are pipelined� After nishingthe add operation for the MSB it is possible to decide themultiplex condition �see Tab� � right most column�� Todecide whether the value of the corresponding variableis greater or below zero we just have to check the signbit� The following shift can be realized by a so calledmemory grabber functionality� Due to calculating onlyone iteration within each PE the grab position is clearly

determined and the shift operation can be realized with xed interconnections�

�� Structural description of one PE

A� Bit parallel method

The speci cation for one PE for a bit parallel approachusing the conditional sum adder is described in thissection�

We have already shown realizing a PE takes onlyhardware for shifting� adding and a control unit� Toprevent long lines on chip we want to use opticalinterconnections� Long lines are necessary for therealization of the bit shift operation� the transfer of thecarry bits and the sign bit� Then we need less area andcan operate with a higher clock cycle�

The PEs require one optical receiver �OR� and oneoptical transmitter �OT� for the shift and for the controlunit� The adder takes two OT and two OR for itsfunctionality �see Fig� ��� Each row is responsible forone iteration� So after nishing the calculation of oneiteration the result is transferred to the next neighboredPE in the north� Since this is a next neighborhoodcommunication this can be done electronically� Dueto the principles of a CSA we have to transfer thecarry bits C� and C� �see ����� ����� and Fig� ��within a row and not only to the next neighbors�The communication paths� running always from a lesssigni cant bit to a more signi cant bit �see Fig� ��� shallbe optically realized� We also need a shift operation� Thecharacteristic within our approach is� that there is alwayswithin one row the same distance of shift positions�

Depending on the sign bit of y �see Tab� �� the controlunit of each PE has the task to select the right operandsfor the add and shift unit� Due to the �complementrepresentation of the numbers the leftmost PE of a rowcan determine whether the value of y is less than zero�So this PE can broadcast the sign bit over the wholerow to the right to the control units of the other PEs�see Fig� ��� Then all other PEs know the result ofthe selection condition� This allows to determine for thefunctions ������ of Tab� � if the old or the new values arevalid� For all the other functions the operator if we haveto add or to subtract is inverted or can be kept� Thiswas already described in more details in Section A�

The necessary optical wiring for the bit shifting� thetransfer of the carry bits and the sign bit is shown in

Page 6: A specification for a reconfigurable optoelectronic VLSI processor

Fig� � schematically� The optical interconnections canbe realized for example in a glass substrate or waveguide�which is located above the electronic PE array� Usingoptical interconnections has the following advantages�The transfer of the carry bits in a kind of segmentedbusses allows to save hardware because the multiplexer�shardware for each PE �see Fig� �� is needed onlyonce� Furthermore realizing the bit shift and the signbit transfer in optics saves area since long lines are notnecessary anymore�

As mentioned above we need �� � ldn� steps for anaddition� To carry out the shift operation we needtwo further steps� Hence� we get nally for the wholecalculation

steps � n � �ldn� �� � ����

B� Bit serial method

Next we explain in more detail the setup of the PE ofthe bit serial architecture approach �see Fig� ���

The PE needs � electrical ports for I�O� �� inputs and�� outputs� and � optical inputs serving the PE with theglobal clock and the values of precalculated constants forlog�� � �i� and arctan��i� �see Section A�� Duringthe rst clock cycle the input ports are used to selectthe function� which the PE should do on the followingbit stream� When the complete stream is read� the PEgenerates at its output ports the code� which programsa following PE to select the right function� In the nextcycle the calculated operands are shifted out� Since thecompatibility between input and output ports all PEscan be consecutively connected to build the completeiteration scheme as shown in Fig� �� The PEs areannounced � xed� because every PE only works at theiteration stage it was build for� Hence� we also needno dynamic shift operations due to the xed numberof bits to be shifted� Since the shift operation can beimplemented with xed interconnections it will not takemore time than a single ripple carry step� The shiftoperation is hardwired instead of using a multiplexerwith the number of inputs corresponding to the numberof bits in the mantissa� So� PE xed to stage � willnot calculate correct results for any other iteration stagethan number �� The advantage of xing the PE to aniteration stage is an optimization leading to signi cantlyless chip area�

Additional to the � xed� attribute all PEs worksynchronously� This approach allows the usage of onetable generator for all PE vectors shown in Fig� ��

Internally the PE mainly consists of eight shift registersholding the operands �� � and x� y and z with theircorresponding values of the previous iteration� The shiftregisters for the new calculated operands x� y and z gettheir input from an add�sub unit� This unit adds orsubtracts the two input signals� A multiplexer with �inputs selects the operand according to the programedfunction� The operands are shifted in with the LSB rst�After the MSB initiates the last add�sub operation� thecontroller checks the selection condition and switches themultiplexers� so that they select the right value shiftedout to the next PE� The sel port is used to select theadd or sub operation for the CORDIC algorithms in thenext stage� The sync port works like a reset for thecontroller and is activated everytime the PE receivesthe start of a new calculation� The PEs operate fullypipelined meaning that the input stream feeds new datainto the PE without any wait states� Therefore� thecalculation of a function needs one clock cycle for eachbit in the mantissa plus one additional clock cycle forthe programming of the PE� The result has a latency ofthe number of iterations n times the number of bits inthe mantissa n plus ��

steps � n�n� �� ����

�� Performance comparison of the three architecture

models

As we have seen there are di�erent possibilities forthe realization of architectures in which the presentedlow�level algorithms are hardwired implemented� Themaximum throughput � we can achieve in such anarchitecture is expressed by the relation of the numberof parallel operations p versus the number of necessarycalculation steps s for one iteration�

throughput� �� parallel calculations

� steps per one iteration�

p

s ����

Our ultimate aim is to nd that model that deliversthe highest computing performance with regard tothroughput� To nd an answer we determine thenecessary number of steps for one iteration at rst� Tab� shows this for the di�erent models� We assumed ane�ective wordlength of n� ldn instead of n to get a bitprecision of �n� Hence� we have to replace all n in ��������� and ����� which correspond to the wordlength n�by n � ldn to get the number of steps for the di�erentarchitecture models�

The number of calculations we can carry out in paralleldepends on the number of PEs that are needed in each

Page 7: A specification for a reconfigurable optoelectronic VLSI processor

architecture model to carry out one calculation� Thisnumber is di�erent for the three architecture modelsbecause we need one PE for one iteration in the bitserial case �bs� and we work with one PE per bit in bothbit parallel architecture models �bpRCA and bpCS�� Ofcourse the PEs are di�erent in size� Hence� we get theformula in the right most column of Tab� for thenumber of parallel calculations p� if Abs� AbpRCA andAbpCS are the corresponding sizes for one PE and Achip

is the available circuit area�

With the expressions of Tab� it is now possible tocalculate the throughputs for the di�erent architecturemodels� But this presumes that we know the necessarysizes for the PEs� These are technology depending gures we don�t know so far� To compare thethroughputs independently of the technology we considerthe relations of the throughputs for the bpRCA caseversus the bs case ��a�� the bpCS case versus the bscase ��b�� and the bpRCA case versus the bpCS case��c��

�bpRCA�bs ��bpRCA

�bs�

Abs

AbpRCA� �

ldn� n

� � �

n� ldn

��a�

�bpCS�bs ��bpCS

�bs

�Abs

AbpCS� ldn� n� �

�n� ldn��ld�n� ldn� � ��

� � ldn� n� �

�n� ldn��ld�n� ldn� � ��

��b�

�bpRCA�bpCS ��bpRCA

�bpCS

�AbpCS

AbpRCA� ld�n� ldn� � �

ldn� n� �

� � � ld�n� ldn� � �

ldn� n� �

��c�

For that� these relational throughput values areexpressed in dependence on the relational PE sizes � and � � Although these relational PE sizes are stilltechnology dependent gures� they help us to nd a rst answer to the question� which architecture deliversthe highest throughput� This is demonstrated withthe curves for �bpRCA�bs��bpCS�bs� and �bpRCA�bpCS �shown in Fig� � for a wordlength n � ��

For example� the bpRCA architecture model o�ers onlya higher throughput versus the bs architecture� if thePE size of the bs architecture will be about �� times

greater than the PE in the bpRCA model� A di�erentsituation is given for the comparison between the bpCSand the bs model� As long as the PE size in the bs caseis not about eight times greater than the PE in the bpCSmodel� the bs architecture is predominant versus thebpCS architecture model� A similar relationship yieldsfor the comparison bpCS to bpRCA� Only if the size ofthe PE� which exploits the conditional sum algorithm� ismore than ve times greater than the PE in the bpRCAmodel� the more simple ripple carry architecture bpRCAis advantageous�

Hence� by using the relational PE sizes we have reducedthe question� which architecture is the most e�cient one�to a comparison of PE sizes� This comparison we carrythrough with the corresponding gate numbers for eachPE� These gate numbers were determined by a high�levelsynthesis of a VHDL description of our PEs� Due to thebad theoretical results of the bit parallel method using aripple carry adder� this model is not taken into accountany more� A synthesis for the bit serial �bs� approachwith the design analyzer of Synopsys� generated a gatelevel model with �� cells of a standard library� For thebit parallel architecture model using the CSA �bpCSA�this number was ���� cells�

Consequently we favour the bit serial approach since itdelivers the highest throughput�

�� Mapping onto optoelectronic hardware

In this section we discuss how the bs approach canbe realized with real optoelectronic hardware� Weare interested in the costs for the logic measured byan estimated transistor number� The task of theoptical interconnection scheme is the broadcasting ofthe necessary constant values and the clock signal�There are two reasons why we would realize these linksby optical means� The rst is to avoid long on�chipinterconnections because they increase the clock cycletime� This concerns the clock signal line itself and thedistribution of the constant bits within one row of thePE array� Furthermore chip area is saved if the memoryfor the constant is realized outside the circuit and thedata can be fetched by a parallel data access�

The necessary fan�out for the optical broadcast links isdetermined by the number of PEs that can be integratedin one row� To determine this number it is necessary toestimate the area requirements for one PE� In the bsmodel we need six shift registers to store the values forx� y� z and their addends� For a mantissa length with

Page 8: A specification for a reconfigurable optoelectronic VLSI processor

� bits and an accuracy of ��� we would need � bits�Two more bits are necessary to avoid over�ow and anadditional one to store the sign bit� Hence� we need�x���� �ip �ops� which can be realized as dynamicdelay �ip �ops �D�FF�� A single dynamic D�FF needs inCMOS � transistors� if they are cycled continuously witha high clock rate� Then we need ���� transistors for thememory�

Besides� we have to integrate the addsub� the control andthe two �to�� multiplexers� and the ��to�� multiplexer�see Fig� ��� Realized as complex gate the full adder inthe addsub unit needs � transistors� Including the costsfor generating the complement and a further multiplexercomponent we need about �� transistors for the addsubunit� Using transmission gates the �to�� multiplexersrequire � and the ��to�� multiplexer needs � transistors�The control unit consists mainly of a counter and a nitestate machine to recon gure the addsub unit dependingon the function we would like to execute� The controlunit requires about �� AND and OR gates in the VHDLspeci cation� The cost for that is estimated with about��� transistors� So� let us summarize the total transistorcosts� Since the addsub and the multiplexer units existthree times� � � ������� � �� transistors are necessary�plus the transistor costs for control unit and memory� wereceive nally about ����������� � �� transistors�Furthermore we assume about ��� additional area forwiring� So we get about ��� transistors for one PE�

In a current design��� which was made for a H�SEEDcircuit o�ered by the CO�OP consortium� we obtainedan average transistor size of about ��m�� This currentdesign represents a good mix� because ��� transistorswere included in an area of �� � ����m containingmemory� logic gates� wires and also unused chip area�The technology was a ����m CMOS process� Assuminga �����m technology� which is o�ered by the currentCO�OP foundry run� we can double our transistornumber� This is possible according to the relation� thatthe number of devices on chip increases by the squareof the scaling factor � � ���

��� � ����� This would

lead to ��m� average transistor size� For our rstorder estimation we will calculate with an average size of��m� per transistor� Hence� the area for one PE is then��� � ��m� � �����m�� This would t exactly in anarea of ��� � ����m size� which allows to integrate anarray of ���� PEs on a chip with ������cm area� Thatmeans� we can execute �� calculations simultaneously�We calculate only with � PEs in the vertical direction�since this is the minimum number of iterations we haveto accomplish� The further � bits are only presentto get the required accuracy� Assuming a completely

lled pipeline the maximum performance is equal to thecalculation time for one iteration� i�e� the time one PEneeds� This time is �� �� bits � � step for control code�times the reciprocal of the used clock frequency� Hence�the maximum computing performance we can achieve isshown in ����

P � �� � �

�� � �clock cycle

���

In Fig� �� we compare the computing performance withpure electronic systems�We can see� that for a clock cycle of ��� MHz our singleoptoelectronic chip performs �� Mega operations persecond �MOPS�� This outperforms systems like CISC�RISC and signal processors with the exception of theCray supercomputer� We think� that ��� MHz are easyto achieve� since we need not more than �� gate delays inour PE and �ns gate delay is a reserved assumption� Theresults show that our approach o�ers enough potentialto get a competitive edge versus pure electronic designs�

Let us now take a closer look to the required opticalinterconnection scheme� As already mentioned onlythree optical inputs are needed per PE� two for theconstant values and one for the clock� If we refer to thedefaults of the CO�OP workshop again� we can assumea vertical and horizontal distance of ��� � ���m forthe self�electrooptic e�ect device �SEED� elements usedas optical �pins�� So� we can arrange the three opticalinputs in vertical direction with a pitch size of ���m�Fig� �� shows schematically the optical �wiring� and thesetup for our bit serial parallel signal processor�

In addition to the processor array we need a memorycircuit containing the constant values� The memoryitself can be realized again as a H�SEED circuit� Asalternative also an one�dimensional array of VCSELs�ip�chip bonded on a silicon circuit is conceivable� Thesilicon circuit is structured in rows� In the ith row theconstant values for the ith iteration is stored� The bits ofthe constants are retained in a ring shifter� So� in eachclock period the corresponding bit is shifted out and isused to switch on the VCSEL� The emitted power mustbe su�cient to drive �� receivers� which are assignedto the PEs in one row of the processor circuit� Thelight has to be broadened horizontally to realize thefan�out� Such an interconnection scheme had to berealized two times for each PE to broadcast the twoconstant bits� In the same way we organize the clockdistribution for all PEs in one row� The correspondingcircuit to generate the clock can be integrated directlyin the memory chip� We emphasize that the number ofoptical links running between the memory and processor

Page 9: A specification for a reconfigurable optoelectronic VLSI processor

circuit is ������ � ����� what is impossible to achievewith current electronic technology�

If we assume an optical output power of ��mW perVCSEL� the dissipated power in the one�dimensionalVCSEL array should be handled without any problems�An important question on the receiver side concernsthe necessity for an ampli cation in the receiver circuit�Using SPICE simulations for a ��� �m CMOS process weshowed that about ����A photo current are su�cient toswitch an inverter with ��MHz directly by the opticalinput signal� Assuming a responsivity of about ��� A�W�as it is characteristic for SEED diodes� we need ���Woptical input power to get a photo current of ����A�Then each VCSEL had to emit unrealistic ��mW for afan�out of ��� To operate with reasonable values a photocurrent of ���A must be su�cient� Hence� in this casean ampli cation of the photo current should be providedto avoid mistakes caused by noise currents� Then theoptical power budget available with VCSELs is su�cientto realize the broadcast connection�

For the light coupling onto the chip we propose theuse of multiple beam splitters��� Special binary phasehologramms with a period number of N� known asDammann gratings� can equally distribute an input lightintensity onto N�� points on a line� As alternativeto Dammann gratings devices of the Talbot type are ofinterest�� for two� or one�dimensional array illumination�too� They avoid Dammann handicaps as an additionalFourier lens and the lambda drift sensivity� Detailsconcerning the realization are to discuss with the opticscommunity� We hope� that in the mid�term future�optical systems exploiting the Talbot e�ect for large��to�N interconnections� can be realized either in planaroptics techniques or with stacked optics to integratebroadcast interconnectioncs with OE�VLSI circuits�

� Remarks on fault tolerance

Since we aspire a massively parallel system faulttolerated mechanisms are essential� Therefore wedeveloped an approach to simulate fault models foroptoelectronic circuits using VHDL� VHDL is wellknown for the design of digital systems� The completearchitectures presented in Section � and � was validatedby VHDL simulation� A testbench generates stimuli forthe input ports of the PE array� Comparing the outputof the system with the expected results was the indicatorfor a bugfree system� The VHDL model of the systemis able to be processed by the Synopsys� synthesis tool�

This tool maps the low�level behavior description of thePEs to a cell library of a chip manufacturer integratingthe functionality in silicon�

The new aspect of this section is a VHDL modelfor the optical communication and the optoelectronictransmitters and receivers� Our model integratesthe possibilities to simulate the system with faultytransmitters�receivers and cross talk between opticalcommunication channels with the aim to investigate theimpact of these faults to the system�

To simulate di�erent fault sources in receivers we canconsider either a stuck at � or a stuck at � fault model� Ifthe receiver delivers always a zero� even it is illuminated�a stuck at � model can be used� This holds also for anoptical sender that emits no light although the thresholdcurrent is su�cient� To consider also fault models indi�erential coded optical inputs� which are mostly usedin SEED based systems� a stuck at � fault model is touse in the simulation�

Since cross talk depends on the layout of the opticalchannels� models of the optics are coupled very closely tothe architecture� By carrying out simulations based onan architectural speci c fault model we want to derivatetolerances concerning the optical devices� For examplewe want to nd out at which level of cross talk the systemfails� As a result of our simulations we want to de nespeci cations for quality of optical devices� These valuescan serve as inputs for physicists�

For the simulation and evaluation of the abovementioned faults we use the tool VERIFY����� VERIFYis a tool which injects single faults during the simulationof a system� e�g� stuck at ��� or cross talk� In addition�VERIFY evaluates the impact of the injected faultsby comparing the waveforms of the output signals ofa golden run �no faults are injected� with every runbelonging to a speci c fault� VERIFY was used for thistask because it works on VHDL code and the extensionof the VHDL models with the above mentioned faultmodels is very easy�

The model of the optoelectronic system can be seperatedinto two parts� One part covers the model of theelectronic components� the PEs� This part is designedwithout respect to the optics� Hence� the designof the PEs itself is exactly the same process likeconventional circuit design� The other part� the opticalmodel� covers the model of the optical interconnectionsand tranformation of the optical interfaces to digitalinterfaces� the optical sender and receiver� In thefollowing paragraphs only the optical model will be

Page 10: A specification for a reconfigurable optoelectronic VLSI processor

described�The optical interconnections are realized with integersignals� For each possible optical link exists a parameterdescribing the amount of light coming in if the senderis on� Connections representing a real connection froma sender to a receiver are weighted high� full line inFig� �� connections representing cross talk are weightedlow� dashed lines in Fig� �� Summing up all weightsfrom one optical receiver to every active sender is theamount of light reaching the receiver� The receiver onlysends a � to the digital interface if the sum is greaterthan a threshold� else it sends a �� State of the art isthat we generate the interconnection parameters witha program� calculating the parameter depending on thedistance from the emitter to the receiver where cross talkis less weighted than desired communication� Workingtogether with physicists more tting parameters forthis model will be searched to program algorithms forautomatic generation of the parameter matrix�

� Summary and outlook

Our goal was the design of a recon gurable parallel signalprocessor using optical interconnections to avoid longon�chip lines� Fine grained architectures are preferableto exploit the potential of high space bandwidth inoptics� Therefore we developed uniform low�levelalgorithms from the add�and�shift class to calculatetypical operations for signal processing� These arethe eight operations sine� cosine� exponential function�logarithm� arctangent� square root� multiplication anddivision� This can be achieved by using only simpleoperations like adding� shifting and an access to a table�No sophisticated multiplication units are necessary�

The low�level algorithms were mapped onto threedi�erent optoelectronic processor architectures for ahardwired implementation� By a theoretical analysiswe determined that a bit serial architecture approachdelivers the highest throughput� but the bit parallelapproach based on a conditional sum adder shows thebest latency behavior� Since we were interested inmaximizing the throughput we mapped the bit serialmodel onto two OE�VLSI circuits linked together byoptical chip to chip interconnections� One circuitcontains the ne grained processor array and the otherone the memory� Supported by a VHDL synthesiswe deduced an size of � � �� for the processorarray� This results in a computing performanceof �� Mega operations per second for a moderateclock speed of ��� MHz� Comparing this to currentpure electronic solutions this approach signi es an

improvement� Further improvements are possiblesince our processor architecture o�ers still potentialfor optimization� The requirements for the opticalinterconnection scheme should represent no unsolvabledi�culties� We need �� optical senders with a fanoutcapability of ��� The emitted light of each sender has tobe broadcasted along one row of the array�

Since we aspire massively parallel systems we think faulttolerant mechanisms are necessary� A rst step towardsthis direction is the development of a concept to consideroptical cross talk and stuck at � fault models for opticaltransmitters and receivers in the VHDL speci cation ofan optoelectronic system�

One of the next steps we intend to do in the future isto carry out simulation experiments to derivate resultsconcerning the allowable tolerance of the optical system�Furthermore� since the gained performance measuresfor our architecture are promising and we judge therequirements for optics and optoelectronics as moderate�we think a hardware realization must be one of the nextgoals� To achieve an e�cient solution� especially forthe realization of the optical interconnection system� aclose cooperation between physics and computer scienceis indispensable� Such interdisciplinary work is anywayone of the main future tasks to lead optoelectroniccomputing to success�

�� Acknowledgements

The authors wish to acknowledge support for this workfrom the Deutsche Forschungsgemeinschaft �DFG� �

��

Page 11: A specification for a reconfigurable optoelectronic VLSI processor

Figures

initialize

for i � � to n

�a� shift � shift �

�b� ADDSUB x ADDSUB y ADDSUB z

�c� if �y� � � then

calculation �b� is valid �bit algorithm�

or

keep add�sub operation signs �CORDIC�

else

use old values for next iteration again

�bit algorithm�

or

invert the add�sub operation signs for

next iteration �CORDIC�

Fig� �� Digit algorithm structure

iteration 1

input

output

iteration 2

iteration 3

iteration n

Fig� �� Processor array of the DSP system�

FA

a bs

c

FA

a bs

c

FA

a bs

c

FA

a bs

c

FA

a bs

c

FA

a bs

c

FA

a bs

c

2 1122 1

ADD_Y

ADD_Z

PE

Fig� �� One PE with the communication structure of aripple carry adder�

b a1 1

b a8 8

MUX MUX

MUX MUXMUX MUX

MUX MUX MUX MUX

MUX MUX

S1 S0

C1 C0

S1 S0

C1 C0

MUX MUX

MUX MUXMUX MUX

MUX MUX MUX MUX

MUX MUXMUX MUX

MUX MUXMUX MUX

MUX MUX MUX MUX

MUX MUX

Fig� �� Communication structure of a � bit conditionalsum adder in a time�PE diagram�

shift register

FAFFFFFF

C

FF . . . X adder

Y adder

Z adder

bit serial

processing element

iteration 1

iteration n

FAFFFFFF

C

FF . . . X adder

Y adder

Z adder

iteration n

iteration 1

PE vector

Fig� �� Bit serial PE vectors for a DSP system arrangedto a matrix�

��

Page 12: A specification for a reconfigurable optoelectronic VLSI processor

adder

iteration 1

iteration 2

iteration 3

iteration n

sh

ift

control

Fig� � Partitioning of a PE with the corresponding optical I�Os

realization of

bit shift

realization of

carry bit C0 transfer

realization of

sign bit transfer

PE1PE8

optical receiver

optical sender (light is always send to the left for

carry bit transfer and shift operation)

realization of

carry bit C1 transfer

Fig� �� Optical interconnection scheme for one row ofPEs�

FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF

MU

X

FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF

MU

X

AD

DSU

B

MU

X8

Controller

log atn clk

xshifted

x

sel

yyshifted

zzshifted

ashiftedbshifted

sync

xshifted

x

sel

yyshifted

zzshifted

ashiftedbshifted

sync

optic inputs

Bitserial Processing Element

Fig� �� Setup of a PE of bit serial approach�

Fig� �� Comparison of the throughputs for each pair ofbit parallel ripple carry�� bit parallel CSA� and bit serialapproach

Fig� ��� Comparison of the expected performance of theoptoelectronic bit serial DSP system with pure electronic architectures

Fig� ��� Schematic of the optoelectronic bit serial DSParchitecture

Page 13: A specification for a reconfigurable optoelectronic VLSI processor

optical model

digital interface

digi

tal i

nter

face

desired communication

optic

al r

ecei

ver

optical sender

low

cro

ss ta

lk

high

cro

ss ta

lk

Fig� ��� Cross talk of one light beam to several receivers

Tables

function iteration formulae init condition

log�� � �� xi�� � xi � log�� � ��i� x� � �

yi��

�� �

yi�� � yi � ��izi y� � �

zi�� � zi � ��izi z� � �

e� xi�� � xi � xi��i x� � �

yi��

�� �

yi�� � yi � log�� � ��i� y� � �

p� xi�� � xi � ���i��� x� � �

yi��

�� �

yi�� � yi � xi���i� � ����i��� y� � �

� � � xi�� � xi � ���i x� � �yi��

�� �

yi�� � yi � ��i y� � �

�xi�� � xi � ���i x� � �

yi��

�� �

yi�� � yi � ���i y� � �

sin��� xi�� � xi � ��izi x� � �

yi�� �zi�� � zi � ��ixi y� � �

yi�� � yi � arctan���i� z� � �

cos��� zi�� � zi � ��ixi x� � �

yi�� �xi�� � xi � ��izi y� � �

yi�� � yi � arctan���i� z� � �

arctan��� xi�� � xi � arctan���i� x� � �

yi�� �yi�� � yi � ��izi y� � �

zi�� � zi � ��iyi z� � �

Table �� Developed uniform lowlevel algorithms for standard functions

architecture �steps per �PEs for whole �parallelmodel iteration calculation calculations

bit serial n� ldn� � nAChip

n � Abs

bit parallelripple carry n� ldn� � �n� ldn� � nAChip

AbpRCA � �n� ldn� � n

bit parallelconditional sum ld�n� ldn� � � �n� ldn� � nAChip

AbpCS � �n� ldn� � n

Table �� Number of steps for one iteration and for thewhole calculation for the di�erent investigated architecturemodels

��

Page 14: A specification for a reconfigurable optoelectronic VLSI processor

List of Figures

� Digit algorithm structure � � � � � � � � � � Processor array of the DSP system� � � � � �� One PE with the communication struc�

ture of a ripple carry adder� � � � � � � � � �� Communication structure of a � bit con�

ditional sum adder in a time�PE diagram� �� Bit serial PE vectors for a DSP system

arranged to a matrix� � � � � � � � � � � � �� Partitioning of a PE with the correspond�

ing optical I�Os � � � � � � � � � � � � � � � �� Optical interconnection scheme for one

row of PEs� � � � � � � � � � � � � � � � � � �� Setup of a PE of bit serial approach� � � � ��� Comparison of the throughputs for each

pair of bit parallel �ripple carry�� bit par�allel �CSA� and bit serial approach � � � � ��

�� Comparison of the expected performanceof the optoelectronic bit serial DSP sys�tem with pure electronic architectures � � ��

�� Schematic of the optoelectronic bit serialDSP architecture � � � � � � � � � � � � � � ��

� Cross talk of one light beam to severalreceivers � � � � � � � � � � � � � � � � � � � �

List of Tables

� Developed uniform low�level algorithmsfor standard functions � � � � � � � � � � � ��

Number of steps for one iteration and forthe whole calculation for the di�erent in�vestigated architecture models � � � � � � � ��

�� G� SaiHalasz� Performance Trends in HighEnd Processors� Proceeding of the IEEE� �� �������� Jan� �����

�� K�W� Goossen� J�A� Walker� L�A� D�Asaro� S�P� Hui�B� Tseng� R� Leibenguth� D� Kossives� D�D� Bacon� D�Dahringer� L�M�F� Chirovsky� A�L� Lentine� and D�A�B�Miller� GaAs MQW Modulators Integrated with SiliconCMOS� IEEE Phot Tech Lett� �������� �����

�� A�V� Krishnamoorthy and D�A�B� Miller� ScalingOptoelectronicVLSI Circuits into the ��st Century� ATechnology Roadmap� IEEE Journal of Selected Topisin Quantum Electronics� � ��� April ���� pages �����

�� T�K� Woodward� VSLIcompatible smartpixel circuits and technology� In Smart Pixel Digital Digest�IEEE�LEOS Summer Topical Meetings� page �� Keystone Colorado� August ����

�� S� Araki� M� Kajita� K� Kubota� K� Kurihara� I� Redmond� E� Schenfeld� and T� Suzaki� Experimental FreeSpace Optical Network for Massively Parallel Computers� Applied Optics� �� ������������ March ����

� J� Jahns� Planar Packaging of Free Space Optical Interconnections� Proc of the IEEE� ����������� �����

�� F�B� MCCormick et al� Experimental investigation of afreespace optical switching network by using symmetricselfelectrooptice�ect devices � Applied Optics� ������������ �����

�� J�A� Ne�� C� Chen� T� McLaren� C�C� Mao� A� Fedor�W� Berseth� Y�C� Lee� and V� Morozov� VCSEL�CMOSSmart Pixel Arrays for FreeSpace Optical Interconnects� In Proceedings of MPPOI���� A� Gottlieb� Y� Li�and E� Schenfeld� eds��� IEEE CS Press Los Alamitos�California� October ���� pages �������

�� The mention of brand names in this paper is for information purposes only and does not constitude an endorsement of the product by the authors or their institutions�SYNOPSYS Corp���

��� A�V� Krishnamoorthy� P�J� Marchand� F�E� Kiamilev�and S�C� Esener� Grainsize considerations for optoelectronic multistage interconnection networks� Applied Op�tics� �� ������������� Sep� �����

��� D� Fey and W� Erhard� Algorithms for High�Performance Computing with Smart Pixels� In Appli�cations of Photonic Technology� Lampropoulos� G�A�et al�� eds��� pages ������� New York� ����� PlenumPress�

��� D� Fey� A� Kurschat� B� Kasche� and W� Erhard� A�D Optoelectronic Processor For Smart Pixel ProcessingUnits� In Proceedings of MPPOI���� A� Gottlieb� Y� Li�and E� Schenfeld� eds��� IEEE CS Press Los Alamitos�California� October ���� pages �������

��� M� Ishikawa� System Architecture for Integrated Optoelectronic Computing� Optoelectronics � Devices andTechnology� � �������� �����

��� J� Volder� The CORDIC trigonometric computing technique� IRE Transaction on Electronic Computers EC� �No� �� pages ������� Sep� �����

��� B� Kasche and D� Fey� Optimale Algorithmen zurBerechnung von Standardfunktionen mittels Smart PixelRechenwerke� Technical Report� University of Jena� W�Erhard editor�� volume � ��� February ���� ISSN ���������

�� I� Koren� Computer Arithmetic Algorithms� PrenticeHall Inc�� Englewood Cli�s� New Jersey� �����

��� K� Hwang� Computer Arithmetic � Principles� Architec�ture and Design� J� Wiley � Sons Inc�� New York� �����

��� A� R� Omondi� Computer Arithmetic Systems� Algo�rithms� Architecture and Implementation� Prentice HallInc�� Englewood Cli�s� New Jersey� �����

��� J� Slansky� An Evaluation of Several TwoSummand Binary Adders� In IRE Trans� EC�� No��� pages �������June ����

��� G� Grimm� Entwicklung eines VLSI Layouts fur op�

��

Page 15: A specification for a reconfigurable optoelectronic VLSI processor

toelektronische programmierbare Schaltkreise� TechnicalReport� University of Jena� W� Erhard editor�� volume� ���� April ����� ISSN ���������

��� U� Krackhardt� and N� Streibl� Design of Dammanngratings for array generation� In Optics communication�No� ��� pages ��� �����

��� A�W� Lohmann� and J�A� Thomas� Making an array illuminator based on the Talbot e�ect� In Appl Opt� No���� pages �������� �����

��� V� Sieh� O� Tsch�ache� and F� Balbach� Evalutaion ofDependable Systems using VERIFY� In Preprints of �thConference on Dependable Computing for Critical Appli�cations �DCCA���� M� Dal Cin editor�� Grainau� Germany� March ����� pp� ����

��� V� Sieh� O� Tsch�ache� and F� Balbach� VERIFY� Evaluation of Reliability Using VHDLModels with Embedded Fault Descriptions� In The Twenty�Seventh AnnualInternational Symposium on Fault�Tolerant Computing�FTCS����� P� Storms editor�� IEEE� Seattle� Washington� June ����� pp� ����

��