
EXTENSIONS TO THE C PROGRAMMING LANGUAGE FOR SIMD/MIMD PARALLELISM

James T. Kuehn
Howard Jay Siegel

PASM Parallel Processing Laboratory
School of Electrical Engineering

Purdue University
West Lafayette, Indiana 47907 USA

Abstract

A superset of the C programming language that is applicable to the SIMD/MIMD mode processing environment of PASM is described. The language extensions for SIMD mode include the definition of parallel variables, functions, and expressions; a scheme for accessing parallel variables; and extended control structure semantics. Extensions for MIMD mode are realized by defining a preprocessor to convert a generalized CSP-like language to standard C with operating system calls inserted to support the parallelism. Extensions to the libraries of I/O and operating system functions used in parallel mode are also discussed. The PASM parallel processing system is used as an example target machine.

I. Introduction

There are currently two general strategies for programming parallel processors. One is the definition of languages that allow the programmer to express the parallelism of a problem explicitly. Parallel languages that have been proposed/implemented thus far have either been targeted toward SIMD machines where a given number of processors is assumed at compile time (e.g., Glypnir [10], Actus [14, 15], Lucas [4], Vector C [11]) or toward multiprocessing systems using the Concurrent Sequential Process (CSP) model (e.g., Concurrent C [13], Occam [5], Ada tasking facilities [3]). Another strategy is the adaptation of existing serial languages for parallel machines, relying on extensive analysis during compilation to extract the parallelism (e.g., KAP and VAST [2], Parafrase [7]). This paper employs the first strategy by defining extensions to the C programming language [6] for the SIMD and MIMD modes of parallelism [8].

PASM is a partitionable SIMD/MIMD system [17] which can be structured as one or more independent SIMD and/or MIMD machines. PASM consists of N Processing Elements (PEs), Q Microcontrollers (MCs), a partitionable multistage interconnection network, multiple secondary storage units, and additional processors dedicated to job scheduling and I/O. There are N/Q PEs associated with each MC. A virtual SIMD or MIMD machine of MN/Q PEs is formed by combining the efforts of M MCs. The MCs act as control units for their PEs in SIMD mode and perform scheduling and coordination duties in MIMD mode. Each PE consists of a processor and a substantial-sized local memory. A prototype PASM system with N = 16 PEs, Q = 4 MCs, an Extra-Stage Cube interconnection network [1], five Winchester-technology secondary-storage devices, and related scheduling and memory management processors is being constructed at Purdue University using Motorola MC68010 processors and other off-the-shelf components [12].

Some of Parallel-C's goals and philosophies are outlined in Section II. In Section III, new constructs are described for SIMD mode. Section IV discusses support for MIMD mode.

II. Goals and Philosophies

In keeping with the tradition of C, the language extensions proposed attempt to minimize the number of new keywords, constructs, and idioms. Extensions that can reasonably be implemented as a call to a system- or user-supplied function are preferred over new operators that would make the compiler more complex.

The C programming language has found favor with systems and applications programmers alike because it is efficiently compiled, has modern control structures, and allows the user to get very close to the machine hardware if desired. The laissez-faire tradition of C is also continued; execution continues unless a program error is non-recoverable (e.g., an access to non-existent or protected memory, or communication with PEs not assigned to the task).

Processors are a resource just as memory is. In serial languages, the variable declarations indicate how much memory is to be allocated to the program. If additional memory is needed during execution, a system function can be called to provide it. In practice, however, the amount of memory desired may outstrip the physical memory available. Virtual memory techniques are used to overcome this difficulty. Extending this concept to parallelism introduces the notion of a Virtual PE (VPE). If a parallel program has a declaration of an array containing N elements, it is most natural to map the array onto N physical PEs (one element per PE) so that the elements can be accessed and processed in parallel. However, the amount of parallelism desired may be larger than that physically available. A parallel machine with only N/2 PEs can operate on the N-element array if each physical PE simulates two VPEs. Similarly, a uniprocessor can simulate the N VPEs. It is natural, therefore, to have the program indicate the extent of the parallelism (the number of VPEs desired) and for the compiler to map the needs onto the existing hardware. Techniques for performing this mapping have been described for other languages (e.g., [11, 14, 15]). Parallel-C provides a way to specify the initial allocation of VPEs desired at compile time and allows system functions to be called to allocate/deallocate VPEs as necessary at run time.

III. Extensions for SIMD Mode

Declarations of parallel variables and functions

As in other parallel languages, both scalar (normal) and parallel variables may be defined. These types of variables are introduced with two new keywords: scalar (the default) and parallel. When parallel variables are introduced, their extent is usually defined as well. The parallel keyword is followed by one or more subscripts that define the shape of the parallelism. Assuming N and MAXLINE are compile-time constant expressions, some declarations of parallel variables are:
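The declaration figure itself is not preserved in this copy; a plausible reconstruction from the description that follows (treating "node" as a previously defined structure type) is:

parallel[N] int a;                   /* one integer per VPE */
parallel[N] char line[MAXLINE];      /* an array of MAXLINE chars per VPE */
parallel[N] struct node nodes[100];  /* an array of 100 nodes per VPE */
parallel[N] struct node *nptr;       /* a node pointer per VPE */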



which define for each of N VPEs one integer 'a', an array of MAXLINE chars, an array of 100 nodes, and a node pointer. The shape of the parallelism may also be multi-dimensional, as in:
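Again reconstructing the lost figure, the multi-dimensional declaration was presumably of the form:

parallel[2][2][2] int a;   /* 8 VPEs configured as a 2-by-2-by-2 array */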

which declares 'a' to be an integer in each of 8 VPEs configured as a 2-by-2-by-2 array. Multi-dimensional parallelism often aids the formulation and coding of certain multi-dimensional problems (e.g., image processing).
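The next declaration figure is also lost; judging from the discussion below, it contrasted a scalar structure array with a parallel one, roughly:

scalar struct node cnodes[100];      /* allocated once, in the control unit */
parallel[N] struct node nodes[100];  /* 100 nodes in each of N VPEs */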


The 100 "cnodes" are allocated in the control unit (or in each PASM MC participating in the virtual SIMD machine), while the last declaration allocates 100N nodes in total, 100 in each of N VPEs.

The type of variable a function returns is known as the "type of a function." Extending this concept to parallel functions, suppose that the square root of a parallel variable is to be calculated. Each VPE will calculate the square root of one element of the parallel variable. The function will return a double precision floating point (double) value in each of the VPEs; therefore, its type declaration in the calling function will be:
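The declaration is lost in this copy; given the empty-bracket convention noted below and the function name psqrt used later in this section, it was presumably of the form:

parallel[] double psqrt();   /* returns a double in each active VPE */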

Note that the function declaration and definition have empty brackets. This is because functions can be defined to work for any extent of parallelism; the actual extent is determined at run time based on the extents of the function's arguments and returned value. The notation "answer[*]" selects "all of the current extent" and is discussed in the next section.

Selecting parallel variables

When operating on parallel variables, references along the parallel dimension(s) can be done at the same time and will be referred to as selection. Selection is typically done using special variables called selectors (like the selectors in [4]). A selector is a boolean bit vector of any shape that is used to define the set of VPEs that will perform a certain operation. For some target machines, the selector keyword allows the compiler to generate special instructions so that hardware dedicated to enabling and disabling VPEs may be used. In others, selectors are compiled as a special type of parallel variable. For example:
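Reconstructing the lost example with the same subscript syntax as parallel declarations:

selector[N] mask1;      /* selector for an array of N VPEs */
selector[2][4] mask2;   /* selector for a 2-by-4 array of VPEs */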

define mask1 to be a selector for an array of N VPEs and mask2 to be a selector for a 2-by-4 array of VPEs. Arrays of selectors, pointers to selectors, etc. may also be defined.


The initialization of selectors can be a tedious process since the extent of the parallelism can be large. Therefore, a simplified notation based on PE Address Masks [17] is allowed. For N VPEs, an n-position mask (n = log2 N), called a PE address mask, specifies which of the N VPEs are to be selected. Each position of the mask corresponds to a bit position in the logical numbering of the VPEs and consists of a 0, 1, or X (don't care) specification. VPEs whose addresses match the mask (0 matches 0, 1 matches 1, and 0 or 1 matches X) are selected. Mask specifications are grouped with braces, expressions indicate repetition factors, and square brackets denote a complete mask specification. For example, [n-1{X}{0}] enables all even-numbered VPEs, while [n-i{0}i{X}] enables VPEs 0 to 2^i - 1.

Examples of selector initialization are:
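The examples themselves are lost; plausible forms, assuming N = 8 for the directly specified initializer, are:

selector[N] odds = [n-1{X}{1}];               /* PE address mask form */
selector[8] odds8 = {0, 1, 0, 1, 0, 1, 0, 1}; /* aggregate initializer given directly */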

The first example expands the PE address mask into the aggregate initializer {0, 1, 0, 1, 0, ...} and assigns it to the selector. The second example shows how the aggregate initializer can be specified directly (without a mask). Selectors may be complemented, "or"-ed, "and"-ed, "xor"-ed, or differenced to obtain the complement, union, intersection, equivalence, or set difference. A special mask, "[*]", selects all VPEs. In Actus and Vector C, the "start:(increment)finish" construct (or its equivalent) and the set operators are used to build "index sets." This is an equivalent mechanism for expressing the extent of parallelism, but the resulting index sets are static in these languages. Selectors are dynamic and, if specified using masks, more space-efficient.

When a parallel variable of a certain shape is referenced, the selector(s) used must match the shape. For example,
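a plausible reconstruction, with 'a' declared parallel[N] int and 'odds' the selector defined above, is:

a[odds]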

selects the value of 'a' in the odd-numbered VPEs. Similarly,
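perhaps (assuming 'a' is here an array within each VPE, e.g., parallel[N] int a[10]):

a[odds][0]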

selects the value of 'a[0]' in the odd-numbered VPEs. "0" is the index of the first element of the array 'a' within a VPE.

When scalar variables or constants are used to select parallel variables, only one element of the parallel variable is accessed. Parallel variables may be used to select and index other parallel variables. For example,
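a reconstruction consistent with the explanation that follows, where 'b' and 'c' are parallel integers and 'a' is again an array within each VPE:

a[b][c]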

causes 'a' to be selected in VPEs where the local element of 'b' is non-zero. Indexing of array 'a' within a VPE is determined by the local value of 'c'.

The type casting operator of C is extended to change the type and shape of parallel variables. The convention is that the reshaping is done assuming a row-major storage order (rightmost subscript varies fastest), just as it is for standard array indexing and storage in C.
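The paper's own cast example is lost; a hypothetical illustration, reshaping a variable 'm' declared parallel[2][4] int into a linear arrangement of 8 VPEs (element (i,j) mapping to VPE 4i + j):

(parallel[8] int) m   /* hypothetical reshape; row-major order preserved */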

Since functions can return parallel data types, the returned values are selected just like ordinary parallel variables. For example, to have each VPE calculate and return the square root of 'y' and to select all of the returned values, the following would be used:
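Reconstructed from the description, with 'answer' assumed to be declared parallel[] double:

answer[*] = psqrt(y[*]);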

To perform the operation in all VPEs, but to select the returned results from only the even-numbered VPEs, the following would be used:
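Reconstructed similarly:

answer[evens] = psqrt(y[*]);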

Finally, to perform the operation in only the even-numbered VPEs and to select only the even-numbered VPEs' results, the following is used:
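Again reconstructed:

answer[evens] = psqrt(y[evens]);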

Since functions can return parallel data types, the returnedvalues are selected just like ordinary parallel variables. Forexample, to have each VPE calculate and return the square rootof 'y' and to select all of the returned values, the following wouldbe used:

selects the value of *a[0]' in the odd-numbered VPEs. "0" is theindex of the first element of the array 'a' within a VPE.

For each formal parameter of a given type declared in the called function there must be a corresponding actual parameter of the same type in the call. Assume that within the psqrt() function, 'fpy' is the name of the formal parameter. While every VPE active at the time of the call to psqrt() "executes" the function, accesses to 'fpy' are restricted to those VPEs for which 'fpy' is selected (VPEs having the corresponding selector '1' on the stack). For example, if the actual parameter 'y' had selector [evens] as in "psqrt(y[evens])," but inside the function 'fpy' was used in an expression as "fpy[*]," the value of 'fpy' would be selected only for the even-numbered VPEs. If inside the function 'fpy' was used in an expression like "fpy[first4]," where 'first4' selected VPEs numbered 0-3, the value of 'fpy' would be defined only for VPEs 0 and 2. This is because within the scope of the called function, 'fpy' was defined only for even-numbered VPEs; the 'first4' selector further narrows the selection set within the function. This mechanism allows selection of expressions within a function.


The conventions for returning values are similar. An expression returned by a function may be defined for all VPEs, e.g., "return(answer[*]);", but only those results in the VPEs selected by the calling function are used.

Expressions and assignment

Scalar and parallel variables in expressions and assignments can be combined as long as no parallel shape conflicts occur. Expressions containing parallel variables always result in a parallel value. The rules are similar to those employed in other SIMD languages; they are:

1. Left side scalar; right side scalar. This is the normal C assignment statement.

2. Left side scalar; right side parallel. The right side should have one and only one element selected. Selecting other than one element is not an error, but the results are undefined.

3. Left side parallel; right side scalar. The right side is promoted to the parallel type and shape of the left side. Selected elements of the left side are assigned the value of the scalar.

4. Left side parallel; right side parallel. The selected elements of the left side are assigned the corresponding values of the right side expression. The range of the selected items on the right side should equal or overlap that of the left for the results to be defined.

In an assignment where both the right and left sides are parallel, corresponding elements are assigned. For example,
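The example is lost in this copy; a sketch consistent with the explanation below, using hypothetical selectors 'vpe2' and 'vpe3' that each select exactly one VPE (VPE 2 and VPE 3, respectively):

a[vpe2] = b[vpe3];   /* 'vpe2' and 'vpe3' are hypothetical one-VPE selectors */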

does not move the value of 'b' from VPE 3 to 'a' in VPE 2. In fact, VPE 2's 'a' becomes undefined since no element of the right-hand side is selected for VPE 2. Special functions (described later) are used to move data values from VPE to VPE.

Control structure

C has five different constructs for affecting control flow: if-then-else, switch-case, while, for, and do. The first two are used to select different execution paths, while the remaining constructs control repetition of statements. The semantics of each of these constructs have been extended for parallel mode in ways identical to earlier SIMD languages (e.g., Actus, LUCAS). For example, in the parallel equivalent of an "if-then-else" construct:
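Schematically (the concrete syntax of the lost figure is assumed):

if (<parallel-expression>)
    <then-block>
else
    <else-block>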

the <parallel-expression> can be thought of as defining a selector which is "true-valued" for some VPEs and "false-valued" for others. Thus some of the VPEs will perform <then-block> while others will execute <else-block>.

In a strict SIMD environment, <then-block> and <else-block> cannot be executed by the VPEs simultaneously since there is a single instruction stream. Thus, side effects from <then-block> may affect <else-block>. PASM, which can operate in SIMD or MIMD mode, can temporarily switch to MIMD mode so that <then-block> can be executed in parallel with <else-block>. Any references to parallel variables within a block are narrowed to the extent of parallelism within the block. For example, if the <then-block> is executed only by even-numbered VPEs, and a parallel variable within the block is referenced with [*], only the even-numbered elements of the variable are selected.

In SIMD mode, PEs receive their instructions from the control unit and process data in their local memories. In MIMD mode, PEs have both data and program in their local memories. For PASM, whenever a PE accesses a certain instruction address range AR, it operates in SIMD mode and an instruction request is issued to the control unit. When all active PEs associated with the control unit have requested, the control unit broadcasts an instruction to them. The control unit switches the PEs from SIMD to MIMD mode by broadcasting to the PEs an unconditional jump to the beginning of the MIMD program, where the address of the MIMD program is outside AR. The PEs (independently) return to SIMD mode by jumping into the address range AR. When all PEs have returned, SIMD processing continues. As an alternative, the control unit could broadcast a "jump subroutine" instruction to begin MIMD mode and the PEs could "return" to SIMD mode.

This is one example of the effect of allowing both modes of parallelism in a single system. Optimizations such as this are also possible for the switch-case and the looping constructs.

Functions for data alignment and I/O

There are no "standard" or "built-in" functions in C; libraries of functions written in C and assembly language are provided at load time. Users may define or redefine these functions by calling the underlying operating system primitives directly. In this section, a small subset of operating system primitives and user-level functions for PASM will be described.

System Calls. Functions for allowing a PASM PE to obtain its virtual and physical address are necessary because each PE generates its own interconnection network routing requests. For example, if PE i wants to communicate with PE i + 2, it must first obtain its own address i, add 2, and write the resulting address to the routing register associated with its port to the network.

parallel[] int getvirprocn(), getphyprocn();

are two functions that can be used to obtain the information.

The network used in PASM [16] is set for an allowable one-to-one or broadcast (one-to-many) permutation when PEs establish connections by writing the desired destination addresses and broadcast tags to their network routing registers. The function used is

icnbset(destaddr, broadcastmask)

where both destaddr and broadcastmask are parallel unsigned integers.

The transfer of a parallel integer variable "data" using an established setting of the network is performed by

icntransfer(data)

which returns a parallel integer.

Blocks of data are transferred using an established setting of the network by

icnwrite(ptrsrc, ptrdest, nbytes)

where "ptrsrc" and "ptrdest" are parallel character pointers indicating the source and destination addresses of the block to be transferred and "nbytes" is a scalar integer indicating the block size.
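As a sketch (not from the paper) of how these primitives might compose for the PE i to PE i + 2 example above; the broadcast-mask value used for a plain one-to-one transfer is an assumption:

parallel[] int me, data, result;
me[*] = getphyprocn();              /* each PE obtains its own physical address */
icnbset(me[*] + 2, 0);              /* request a route to PE i + 2; mask value assumed */
result[*] = icntransfer(data[*]);   /* move one parallel integer through the network */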

User-level Calls. User-level functions provide a higher-level interface than do the system calls given above. For example,

cube(data, i, N)

performs a cube interconnection function (transferring data between N PEs whose addresses differ only in the ith bit position [16]) using parallel integer variable "data", returning a parallel integer. This function calls icnbset() and icntransfer() to carry out the transfer.

Similarly,

shufflewrite(ptrsrc, ptrdest, nbytes, N)

sets up the shuffle interconnection function [18] among N PEs and transfers blocks of "nbytes" bytes among them. Because the PASM network cannot perform the shuffle function in one pass, shufflewrite is implemented with multiple calls to icnbset() and icnwrite().

There are dozens of other functions at this level for data alignment (e.g., shift and rotate), for testing (e.g., obtain the "first" non-zero element of a selector, determine if all VPEs are active), and for parallel I/O and mathematical functions.

IV. Extensions for MIMD Mode

The constructs given in the last section allowed a single program (instruction stream) to access and manipulate data items located in multiple VPEs. MIMD mode implies that each PE will execute its own instructions from its own memory; thus none of the SIMD constructs are necessary in MIMD mode: MIMD programs will compile using the serial C compiler. However, MIMD processes must coordinate by communicating with each other and by synchronizing the order of their execution. Furthermore, processes operating on shared data structures must do so in a mutually exclusive manner to maintain data integrity.




We propose to use the approach suggested by Concurrent C [13]: to define a C meta-language with new constructs and keywords that provides general mechanisms for process interaction, control, concurrent execution, event supervision, and sharing. An operating-system-specific preprocessor is used to convert the extended language to serial C with operating system calls to support the parallelism. Therefore, the MIMD extensions would not affect the C compiler. Of course, MIMD programs may call functions to start up SIMD processes and vice versa, so the eventual goal is a preprocessor that accepts a set of serial/SIMD/MIMD programs, converts the MIMD portion to serial C (with system calls), and passes the serial/SIMD portions through unchanged. The extended (SIMD) compiler would then be called to compile the resulting serial/SIMD program.

V. Summary

A superset of the C programming language that is applicable to the SIMD/MIMD mode processing environment of PASM has been described. The language extensions proposed for SIMD mode are, in general, similar in concept to those introduced in other SIMD mode languages. Some of the new ideas here include the dynamically-settable selectors that indicate the current extent of parallelism, the use of PE address masks, and the handling of the "if-then-else" construct in MIMD mode. The extensions for MIMD mode are much more operating-system-related than language-related. This is a desirable attribute since the development of the compiler for the SIMD extensions and the translator for the MIMD extensions can proceed independently.

We propose to develop a compiler for the language using the PASM prototype as a target. The target assembly language and object file formats for PASM prototype serial, SIMD, and MIMD programs have already been developed and have been in use for some time [9]. The first steps toward the compiler extensions for SIMD mode, modifying the lexical and syntactic analysis routines, are being made. The set of operating system calls and user-level functions required to support the parallelism in both SIMD and MIMD modes is also being developed and refined. Work on the code generation phases of the compiler will proceed in stages. Initially, the compiler will assume a static allocation of physical PEs (i.e., VPE = physical PE). Next, dynamic allocation of VPEs up to the maximum physical machine size will be allowed. The first "production" compiler will allow dynamic allocation of VPEs to any number, but will assume a fixed-size physical machine is available. Therefore, code compiled for a physical machine of 64 PEs will run only when 64 PEs are available. An eventual research goal is the development of a machine-size-independent program representation that allows the operating system to choose the physical machine size on which to run the program. This is desirable in the PASM environment where the sizes of the available virtual machines vary.

This research was supported by the Rome Air Development Center under contract number F30602-83-K-0119 and by an IBM Graduate Fellowship.

References

[1] G. B. Adams III and H. J. Siegel, "The extra stage cube: a fault-tolerant interconnection network for supersystems," IEEE Trans. Comp., Vol. C-31, May 1982, pp. 443-454.

[2] C. N. Arnold, "Performance evaluation of three automatic vectorizer packages," 1982 Int'l. Conf. Parallel Processing, Aug. 1982, pp. 235-242.

[3] J. G. P. Barnes, Programming in Ada, Addison-Wesley, London, England, 1982.

[4] C. Fernstrom, "Programming techniques on the LUCAS associative array computer," 1982 Int'l. Conf. Parallel Processing, Aug. 1982, pp. 253-261.

[5] Inmos Corporation, Occam, Inmos Corporation, Colorado Springs, CO, 1983.

[6] B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ, 1978.

[7] D. J. Kuck, A. H. Sameh, R. Cytron, A. V. Veidenbaum, C. D. Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B. Davies, and C. P. Kruskal, "The effects of program restructuring, algorithm change, and architecture choice on program performance," 1984 Int'l. Conf. Parallel Processing, Aug. 1984, pp. 129-138.

[8] J. T. Kuehn, Parallel C Language Extensions for PASM, Internal report, unpublished, 1984.

[9] J. T. Kuehn, PA68 Assembler Reference Manual, Internal report, unpublished, 1984.

[10] D. H. Lawrie, T. Layman, D. Baer, and J. M. Randall, "Glypnir - a programming language for Illiac IV," Communications of the ACM, Vol. 18, Mar. 1975, pp. 157-164.

[11] K-C. Li, Vector C - a Programming Language for Vector Processing, Ph.D. Dissertation, Department of Computer Science, Purdue University, 1984.

[12] D. G. Meyer, H. J. Siegel, T. Schwederski, N. J. Davis IV, and J. T. Kuehn, "The PASM parallel system prototype," IEEE Comp. Soc. Spring Compcon 85, Feb. 1985, pp. 429-434.

[13] R. Naeini, "A few statement types adapt C language to parallel processing," Electronics, June 28, 1984, pp. 125-129.

[14] R. H. Perrott, "A language for array and vector processors," ACM Trans. Programming Languages and Systems, Vol. 1, Oct. 1979, pp. 177-195.

[15] R. H. Perrott, D. Crookes, P. Milligan, and W. R. M. Purdy, "A compiler for an array and vector processing language," IEEE Trans. Software Engineering, Vol. SE-11, May 1985, pp. 471-478.

[16] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, Lexington Books, Lexington, MA, 1985.

[17] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, Jr., and S. D. Smith, "PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Comp., Vol. C-30, Dec. 1981, pp. 934-947.

[18] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comp., Vol. C-20, Feb. 1971, pp. 153-161.
