
International Journal of Parallel Programming, Vol. 16, No. 2, 1987

Flat Parlog" A Basis for Comparison

Ian Foster 1 and Stephen Taylor 2

Received May 1987; Accepted September 1987

1 Department of Computing, Imperial College, London SW7 2BZ, England.
2 Department of Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel.

Three similar parallel logic programming languages have been proposed: Parlog, Flat Concurrent Prolog, and Guarded Horn Clauses. Quantitative comparison of the languages has not previously been possible since they employ different execution models and implementation techniques. In order to uncover the effects of semantic differences on efficiency, a common basis is required for experimentation. This paper presents a subset of the language Parlog called Flat Parlog which provides a basis for quantitative comparison. The language combines the directional semantics of Parlog with the simple execution model of Flat Concurrent Prolog. A performance comparison between Flat Parlog and Flat Concurrent Prolog based on new implementations of both languages is presented. These new implementations are identical except for optimizations that are possible by virtue of semantic differences. Benchmark results indicate that Flat Parlog is more efficient; experiments have been able to quantify and explain this performance differential. A detailed description of the abstract machine for Flat Parlog is presented to illustrate the simplicity of the language.

KEY WORDS: Concurrent logic programming; performance evaluation; Concurrent Prolog; Parlog.

1. INTRODUCTION

Three principal parallel logic programming languages have been proposed: Flat Concurrent Prolog (FCP),(1,2) Parlog,(3) and Guarded Horn Clauses.(4) Programs in these languages are expressed as sets of logical axioms that have a declarative reading. They have an alternative behavioral reading in which they define systems of processes. These execute concurrently, communicate through shared logical variables and synchronize using dataflow constraints.

The languages share many common features but have slightly different semantics, which is reflected in their expressiveness and ease of implementation.(5) Although substantial implementation efforts are in progress, little quantitative comparison has been performed. This is unfortunate, as implementation efforts can benefit from a clear understanding of the impact on performance of semantic variations. This permits important implementation issues to be identified and performance tradeoffs to be made. It also facilitates the sharing of knowledge and implementation technology.

Current Parlog implementations use an And-Or tree model in which the state of a computation is viewed as a tree. In contrast, FCP is implemented using a flat pool of processes. These implementations do not permit quantitative comparison as they employ differing approaches and optimizations that obscure deeper semantic differences. A common framework is required in which the languages can be compared at the same level of optimization. This permits the impact on performance of semantic variations to be measured and compared.

To establish this common framework a subset of Parlog termed Flat Parlog has been defined. New abstract machines for both this subset and FCP have been designed and implemented. The advantage of this approach is that FCP and Flat Parlog are closely related and can be implemented using the same techniques. In addition, it is known that Parlog can be compiled into Flat Parlog, which is thus no less expressive than the full language. The new implementations are of interest in their own right since they outperform existing implementations of both languages. Both implementations exist at the same level of optimization, which permits evaluation of the two languages' uniprocessor performance.

This paper presents a performance comparison of the two languages. Both the new and previous implementations are benchmarked. The differences in performance are examined and their causes in terms of language differences are explained. As pointed out by Takeuchi and Furukawa,(5) Flat Parlog is essentially equivalent to Flat GHC; thus the results presented also apply to this language. The paper also presents a detailed description of the Flat Parlog abstract machine which has been benchmarked. This is derived from the FCP machine presented in Ref. 6, differing only in terms of simplifications made by virtue of Flat Parlog semantics. These simplifications are described and discussed.

2. LANGUAGE OVERVIEW

A parallel logic program is a collection of clauses that have the form:

H ← G1,..., Gm | B1,..., Bn.    m, n ≥ 0


where H is the clause's head, '|' is the commit operator and the Gs and Bs are goals. Each clause can be read declaratively as: "H is true if the Gs and Bs are true." Clauses with the same name and number of arguments can be grouped into procedures.

The Gi goals comprise the guard of the clause. A clause's guard must be solved before the clause can be used to reduce a process. If arbitrary goals are permitted in clause guards, support for a process hierarchy is required in the implementation.(7) In contrast, the flat languages considered in this paper, Flat Parlog and Flat Concurrent Prolog (FCP), restrict the operations that may be performed in clause guards to predefined primitive operations. As a result, no process hierarchy needs to be maintained; this leads to a simple operational model. The state of a computation is represented by a pool of processes. Computation proceeds by repeatedly selecting a process and attempting to reduce it using the clauses in the associated procedure. The semantics of the languages allows nondeterministic process and clause selection; these operations may thus be performed in any order or in parallel.

A clause can be used to reduce a process if its head unifies (matches) with the process arguments (or data state) and all guard operations succeed. Dataflow constraints are used to restrict unification to permit process synchronization. If one or more clauses are found to be capable of reducing a goal, then one is nondeterministically selected and its body goals added to the process pool. The following example illustrates the formalism using simple dictionary and lookup procedures:

dictionary(Is) ← dictionary(Is, [ ]).                              % initially empty dictionary

dictionary([{Key, Data} | Is], Dict) ←                             % receive request
    lookup(Key, Data, Dict, NewDict),                              % look up Key in Dict for Data
    dictionary(Is, NewDict).                                       % recurse with new dictionary
dictionary([ ], Dict).                                             % close dictionary

lookup(Key, Data, [{Key, Data} | Ds], [{Key, Data} | Ds]).         % found
lookup(Key, Data, [{Key1, Data1} | Ds], [{Key1, Data1} | Ds1]) ←   % not found
    Key ≠ Key1 |                                                   % check Key
    lookup(Key, Data, Ds, Ds1).                                    % continue
lookup(Key, Data, [ ], [{Key, Data}]).                             % enter

The notation [Head | Tail] denotes a list structure with a Head and Tail. Structured data can also be represented as tuples, which are written as {Arg1,..., ArgN}. Strings beginning with uppercase letters denote variables, while those with lower case denote constants.


The dictionary can be viewed as a process which receives a stream of requests (Is) and maintains an internal state (Dict). When a message of the form {Key, Data} is received, a lookup process is spawned to search the internal state for a Data item associated with the appropriate Key. The process inspects the internal state recursively. If the Key is found the data item is returned. If it is not present the Key and Data are entered into the dictionary and the updated dictionary is returned.
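
For illustration (this goal is ours, not taken from the paper), a caller can both enter and retrieve items by incrementally binding the request stream; the unbound Data argument of the second message is bound by lookup and thus acts as a reply:

    dictionary(Is), Is = [{k1, 1}, {k1, X} | Is1]

    % The first message enters the pair {k1, 1} into the dictionary.
    % The second message finds k1 already present, so the first clause
    % of lookup binds X to 1, returning the answer to the caller.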

The dictionary process illustrates the need for dataflow constraints: it should not reduce until its first argument is bound to a list. The syntactic annotations used to describe these constraints are shown later.

The lookup procedure illustrates clause selection. The first clause can be selected if the Key specified in the message (the first argument) unifies with that of the first element in the dictionary. The second clause is selected if the guard test ≠ succeeds, indicating that they do not unify. The last clause is selected when the end of the dictionary is encountered.

Unifying a process and clause head can potentially modify the process data state by binding variables. For example, in the dictionary program, if the Data component of a message is a variable, the first clause of lookup may bind it. Such bindings will be incorrect if it is subsequently determined that the clause cannot be used to reduce the process. The major distinction between Flat Parlog and FCP is the manner in which incorrect variable bindings are avoided.

Flat Parlog does not allow process variables to be bound during a clause try. Instead, clause tries are restricted to testing the data state. Process variables can only be bound once a clause has been selected to reduce the process; this avoids conflicting variable bindings. This restriction leads to a two-phase model of execution, in which input data is tested and then output data generated. A syntactic addition to the language, called mode declarations, specifies which head argument positions define input tests and which define output unifications.

In contrast, FCP permits process variables to be bound during a clause try. This requires a mechanism to implement unification as an atomic operation. It can be achieved by maintaining multiple environments or alternatively by recording bindings to process variables so that they can be reset if necessary.

The technique used to constrain the binding of process variables affects both the expressiveness of the language and the ease of implementation. The ability to perform unification in the guard makes it possible to write elegant programs, such as concise meta-interpreters, in FCP. It has also led to the development of a number of elegant programming techniques. In most applications, however, Flat Parlog's simple matching scheme is sufficient. Flat Parlog has a simple semantics which, as this paper demonstrates, permits it to be implemented more efficiently.
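
The kind of program at stake can be sketched as follows: a minimal FCP meta-interpreter in the style common in the literature (our sketch, not from the paper; clause/2 is assumed to be a primitive that unifies a goal with the heads of the interpreted program's clauses and returns the selected clause body):

    reduce(true).                                        % empty conjunction: done
    reduce((A, B)) ← reduce(A?), reduce(B?).             % fork a conjunction
    reduce(Goal) ← clause(Goal?, Body), reduce(Body?).   % reduce a goal by one step

Its conciseness depends on unification against the interpreted clause heads being atomic; under Flat Parlog's input matching the same interpreter requires more explicit encoding.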


3. FLAT PARLOG

Flat Parlog is a subset of the parallel logic language Parlog (8) in which guard calls are restricted to be primitive test predicates. Other differences between Flat Parlog and Parlog are discussed in the final section of this paper.

A Flat Parlog procedure consists of a number of guarded logic clauses plus a mode declaration. For example, the dictionary and lookup procedures given previously are augmented with the following mode declarations:

mode dictionary(Requests?, Dictionary?).

mode lookup(Key?, Data↑, Dictionary?, NewDictionary↑).

The mode declaration for lookup specifies that the first and third arguments of the clauses defining the procedure are to be used for input (mode ?) and the second and fourth for output (mode ↑). The argument names are comments of no semantic significance.

A procedure's mode declaration defines its translation to a standard form. (3) In this form head arguments are all distinct variables and unification is performed by explicit calls to unification and matching primitives.

The standard form of the lookup procedure is:

lookup(Key, Data, A3, A4) ←
    [{Key1, Data1} | Ds] <= A3, Key == Key1 |
    Data = Data1, A4 = [{Key1, Data1} | Ds].

lookup(Key, Data, A3, A4) ←
    [{Key1, Data1} | Ds] <= A3, Key ≠ Key1 |
    A4 = [{Key1, Data1} | Ds1], lookup(Key, Data, Ds, Ds1).

lookup(Key, Data, A3, A4) ←
    [ ] <= A3 |
    A4 = [{Key, Data}].

Input arguments are compiled to calls to a matching primitive <= in the guard. A call to this primitive suspends until its right-hand argument is sufficiently instantiated to determine whether it matches the left-hand argument. It then either succeeds or fails. Multiple occurrences of a variable in input mode positions are compiled to calls to an equality test ==, which suspends until it can determine that both arguments match. Output arguments are compiled to calls to a unification primitive = in the body of the clause. This performs general unification and may bind variables in either of its arguments.
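
The behavior of the three primitives can be illustrated with the following goals (illustrative examples, ours):

    [X | Xs] <= [1, 2, 3]    % succeeds, binding the local variables X = 1, Xs = [2, 3]
    [X | Xs] <= A            % suspends until A is instantiated
    a <= b                   % fails
    X == Y                   % suspends until both arguments are sufficiently bound
    A = [1 | B]              % general unification; may bind A (or B)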

In the first clause of lookup the input argument [{Key1, Data1} | Ds] is compiled to the call [{Key1, Data1} | Ds] <= A3. This suspends until A3 is instantiated. If A3 is bound to the appropriate structure, the call binds the local variables Key1, Data1 and Ds to components of this structure. The call fails if A3 is not of the required form. An explicit call to the test unification primitive == is also added to the guard to check the equality of the first argument Key and the local variable Key1, because the variable Key occurs twice in the source clause. Finally, the output arguments are compiled into two calls to the general unification primitive =.
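
For comparison (our sketch, following the translation rules above), the recursive dictionary clause translates to standard form as:

    dictionary(A1, Dict) ←
        [{Key, Data} | Is] <= A1 |
        lookup(Key, Data, Dict, NewDict),
        dictionary(Is, NewDict).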

3.1. Sequential Guard Evaluation

The guard of a Flat Parlog clause, expressed in standard form, specifies a number of simple tests on the process data state that must be performed before the clause can be used to reduce a process. Semantically these tests can be performed in any order or in parallel. If one component of the input data is missing the testing does not suspend but continues to test other components.

A uniprocessor algorithm that simulates this parallel evaluation strategy has the advantage that a goal will fail as soon as possible. It has the disadvantage of making process suspension more expensive; all tests for a clause must be performed even when an earlier test suspends. Recognizing data dependencies between tests can reduce the number of tests that must be performed after an initial suspension but requires a more complex machine capable of representing these dependencies.

An alternative strategy is to perform testing sequentially, suspending a clause try as soon as a single test suspends. This simplifies implementation but may cause a goal to suspend when it could fail. For example, consider a clause:

f(a, b).

and the goal:

G1: f(A, a)

Although G1 can fail immediately, sequential evaluation will cause it to suspend on A; failure is thus delayed until A is bound.

Sequential guard evaluation cannot, however, prevent a goal from succeeding; a clause cannot reduce until all necessary data items are available. The order in which they become available is therefore not important. Consider the clause shown previously with the following goal:

G2: f(A, b)

G2, which can potentially succeed, cannot do so until A is bound. The sequential evaluation strategy cannot therefore delay success.
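
To see why, consider the standard form of the clause, assuming the mode declaration mode f(?, ?) (our illustration):

    f(A1, A2) ←
        a <= A1, b <= A2 | true.

    % For G1 = f(A, a): the first test a <= A suspends on the unbound A,
    % so the failing test b <= a is never reached and the goal suspends.
    % For G2 = f(A, b): the clause try must wait for A under any
    % evaluation order, so success is not delayed.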


In summary, a sequential evaluation strategy provides the same success set as a parallel strategy but a potentially reduced failure set. In a flat language, process failure is not particularly interesting since it necessarily leads to the entire conjunctive goal failing. The Flat Parlog machine described in this paper incorporates a sequential evaluation strategy.

3.2. Compilation of Flat Parlog

Mapping Parlog to its standard form may introduce calls to Parlog's three matching and unification primitives, <=, == and =. Gregory(7) describes how to compile calls to <= so that they only perform matching on a single level of structure. This permits one-way unification to be compiled into primitive nonrecursive test operations. He also shows how calls to == and = in which both arguments are known to be nonvariables at compile time can be partially compiled into calls on substructures.

Unification can be further compiled into primitive nonrecursive binding and data construction operations when one of the arguments is known to be a nonvariable term at compile time. The recursive unification primitive only needs to be called when both arguments are variables. This compilation is trivial for constants, but for structures requires the generation of unification graphs. Alternative branches in these graphs encode substructure unifications. This compilation strategy is described in detail in Ref. 8.

The compilation of the first lookup clause using the previous techniques is shown here. Recall the standard form of this clause:

lookup(Key, Data, A3, A4) ←
    [{Key1, Data1} | Ds] <= A3, Key == Key1 |
    Data = Data1, A4 = [{Key1, Data1} | Ds].

This indicates that four unification operations must be performed. Two of these are known at compile time to involve complex structures and can thus be compiled to primitive nonrecursive operations. Figure 1 outlines the result of the compilation process.

The call to <= is compiled to a sequence of primitive test operations. The call A4 = [{Key1, Data1} | Ds] is compiled to a decision graph with two branches. The first treats the simplest and most common case: unification with a variable. In this branch the entire structure is built immediately using primitive variable binding and structure building operations. The second branch deals with unification of two lists. In this case a matching operation is required to access the head and tail of the list. Unification of the head structure requires another subgraph. To contain code size this open-coding strategy is only applied to the top few layers of complex structures.

[Fig. 1. Decision graph for the lookup clause.]

4. FLAT CONCURRENT PROLOG

FCP differs from Flat Parlog in two principal respects. As noted, general unification is permitted during a clause try. Data-flow synchronization is specified by annotating variables as read-only occurrences (for example, X?); processes which attempt to bind a read-only occurrence of a variable suspend until it is bound. The use of these mechanisms is illustrated using the dictionary program as follows:


dictionary(Is) ← dictionary(Is?, [ ]).                             % initially empty dictionary

dictionary([{Key, Data} | Is], Dict) ←                             % receive request
    lookup(Key?, Data, Dict?, NewDict),                            % look up Key in Dict for Data
    dictionary(Is?, NewDict?).                                     % recurse with new dictionary
dictionary([ ], Dict).                                             % close dictionary

lookup(Key, Data, [{Key, Data} | Ds], [{Key, Data} | Ds]).         % found
lookup(Key, Data, [{Key1, Data1} | Ds], [{Key1, Data1} | Ds1]) ←   % not found
    Key ≠ Key1 |                                                   % check Key
    lookup(Key?, Data, Ds?, Ds1).                                  % continue
lookup(Key, Data, [ ], [{Key, Data}]).                             % enter

In this program, an initial call to the dictionary of the form:

dictionary(Is)

reduces to a call dictionary(Is?, [ ]) in which the argument Is is annotated as a read-only occurrence. This causes the dictionary process to suspend until Is is bound. When input is available the process can reduce. Recursive calls to the dictionary also carry read-only occurrences and thus act in a similar manner.

As the example illustrates, data-flow constraints are specified in the call to a procedure rather than in the procedure itself. This has the consequence that in the absence of global program analysis and mode declarations no information is available on the use of a procedure and hence whether unification will be used to match or construct terms. A runtime check is therefore required in the implementation.
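
For example (our illustration), the body unification in:

    p(X) ← true | X = {msg, Y}.

constructs a tuple if p is called with X unbound, but performs matching against an existing term if the caller has already bound X; which case applies is only known at run time.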

Fortunately, it is possible to use global analysis of FCP programs to deduce at compile time that certain arguments are always called with a read-only annotation. This avoids the need for the run-time check on input and permits the arguments to be compiled to the same test operations used to encode Parlog's input arguments. As explained previously (under Compilation of Flat Parlog), open-coding of nonvariable arguments for which the mode is not known at compile time allows the run-time check to be avoided in these cases also (e.g., output binding). The current state of research on these strategies is reflected in the benchmarks in this paper. The global analysis used only derives information on the top level of structures used in arguments.

Since FCP uses general unification during clause tries, the input and output phases of a reduction are combined in a single phase which must be implemented as an atomic action. The implementation of unification in FCP thus requires either multiple environments and a mechanism for exporting bindings or, if process and clause tries are performed sequentially, a trail to record bindings. The trail allows any bindings made to be retracted later if necessary. The latter approach has been used to date and appears the more efficient.

4.1. Sequential Guard Evaluation in FCP

Recall that Flat Parlog can use a sequential evaluation strategy to evaluate guards. Flat Parlog has an order independent semantics in the sense that a reduction is deterministic; its outcome is known irrespective of the order in which operations are performed.

The sequential evaluation strategy suspends a clause try as soon as input data is found to be unavailable. Naively applying this optimization in FCP leads to order-dependence when programs both test and bind the same variable. For example, consider the program:

f(P1, P2) ← P1 = a, P2 = a | true

and the two apparently equivalent goals:

f(X?, X) and f(X, X?)

The first suspends but the second succeeds. Although the example appears contrived it has important consequences for the compiler writer. There exists a class of compiler optimizations(8) and partial evaluation strategies(9) which heavily utilize parallel semantics in order to produce more efficient code. Order dependencies complicate these tasks and reduce the likelihood that program transformations can be proved correct.

Two order-independent solutions to this problem have been proposed. The first is an eager solution in which both of these goals always succeed. The second solution would cause both to suspend.(2)

The implementation reported by Houri(10) uses the naive method for efficiency. This paper reports results which compare both this initial method and the eager order-independent solution. The second order-independent solution is under investigation.

5. BENCHMARK RESULTS

This section presents benchmark results for five language implementations. The primary comparisons are those of the Flat Parlog (FP) abstract machine described in this paper and two FCP abstract machines, FCPd and FCPi. FCPd implements the order-dependent semantics described in Ref. 10. FCPi implements the eager order-independent semantics. In addition, performance figures are presented for previous implementations of the Sequential Parlog Machine(11) (SPM) and Houri's FCP machine(10) (Emu).

Five benchmark programs were executed on each implementation: a program proposed by Takeuchi (Takeuchi), quicksort of a 1000 element list (QSort), all solutions to the eight queens problem (Queens), naive reverse of a 500 element list (Reverse) and the FCP assembler assembling itself (Assembler). The last four benchmark programs are presented in Appendix 2. Other test data is available on request. Table I characterizes the programs by presenting the number of process reductions performed and process suspensions that occurred during the program executions. All of the programs are deterministic in nature and thus the number of reductions does not vary between implementations. The ratio of suspensions to reductions depends largely on the process scheduling algorithm; the figures in the table refer to the algorithm employed by the FP, FCPd, FCPi and Emu implementations.

Table I. The Benchmark Programs

Name         Reductions (1000s)   Suspensions (1000s)   Suspensions/Reduction
Assembler          35.9                  10.1                   0.28
Takeuchi           63.6                  12.1                   0.19
QSort              16.9                   9.4                   0.56
Queens             62.1                   0                     0.00
Reverse           126.2                   0.5                   0.00

Naive reverse is a tight computational loop that performs many reductions but little matching, unification or process suspension. The queens program is a search program that performs more complex matching and unification but no suspension. Quick sort and Takeuchi's benchmark each generate many suspensions. The assembler is a typical application program which uses common logic programming techniques such as incomplete messages(2) and difference lists.(12) Table II presents the performance results, in reductions per second, for the programs executed on a CCI (Power 6) processor by each implementation. These are accurate to within 0.5%.

Table II. Performance in RPS (Mean of 10 Runs)

              FP      FCPd     FCPi     Emu      SPM
Assembler    5703     5012     4400     4215      --
Takeuchi     8284     7488     6738     3394     1537
QSort        7388     6428     4584     4042     2074
Queens       6640     6262     6036     2755     2318
Reverse     17045    16157    15930    14914     4774

6. COMPARISON OF PREVIOUS IMPLEMENTATIONS

The benchmark results show a number of inconsistencies between the performance of the SPM and Emu implementations. The SPM performs at one third the speed of Emu when executing the simplest benchmark (Reverse) but at approximately the same speed on the more complex Queens problem.

These inconsistencies are due to differing computational models and implementation techniques. The impact of these differences on performance can be better understood if the factors which determine the cost of a reduction are considered individually. Speed, as measured in reductions per second, is determined by the number of abstract machine instructions executed and the time taken to execute an instruction. Table III gives the mean time to execute an abstract machine instruction in each implementation.

These numbers are dependent on instruction mix and thus vary according to the application; however, the figures for a particular application show how average instruction cost varies according to the implementation. These figures highlight the overhead of the SPM And-Or tree model: each abstract instruction is considerably more expensive in the SPM. This accounts for the poor performance of the SPM on Reverse but not for its apparent efficiency on the Queens problem. Table IV gives the number of abstract machine instructions executed per reduction by each implementation.

Table III. Mean Time to Execute an Instruction (Microseconds)

              Emu      SPM
Assembler     7.53      --
Takeuchi      8.09     47.1
QSort         7.53     34.7
Queens        7.55     21.0
Reverse       5.14     19.0


Table IV. Abstract Machine Instructions per Reduction

              Emu      SPM
Assembler     31.5      --
Takeuchi      36.4     13.8
QSort         32.9     13.9
Queens        48.1     20.5
Reverse       13.1     11.0

Table IV reveals the reason for improved speed on the Queens problem: the SPM executes considerably fewer instructions than Emu on this program. Many other factors also influence the relative performance. For example, the Emu machine uses a byte encoding scheme for code whilst the SPM uses word encoding. This affects both code size and the speed at which instructions can be decoded. Different data structure formats, scheduling algorithms and suspension mechanisms are employed. In the Emu machine, clause search is performed sequentially; the SPM allows any combination of sequential, concurrent, programmer or compiler controlled clause search. The type of clause search employed in the SPM benchmark figures is loosely analogous to that used by other implementations; alternative strategies can lead to markedly different figures.

It is clear that the comparison of previous implementations does not provide meaningful information about the impact of language differences on performance. Performance differences are not related to language characteristics, but are instead due to differences in abstract machine design. Differing implementation and optimization techniques accentuate these differences.

7. A BASIS FOR COMPARISON

A comparative study of Flat Parlog and FCP performance requires language implementations that are identical in terms of compilation strategy, computational model, implementation techniques and level of optimization.

Implementations that meet these criteria have been constructed; the Flat Parlog (FP), FCPd and FCPi emulators benchmarked here are implemented using the same model and techniques. Indeed, removing nonshared code from the implementations reveals that they share approximately 90% of their code. The remaining 10% is principally concerned with additional unification mechanisms required by FCPd and FCPi. Great care has been taken to ensure that the 10% of nonshared code employs the same data structures and optimizations. The only remaining differences are those due to simplifications which can be made by virtue of Parlog's simpler semantics. Differences in the FCPi implementation are only concerned with achieving order independence during unification. Table II shows that these new implementations do not compromise performance for language compatibility: they execute significantly faster than previous implementations.

The similarity of the implementations allows even small differences in performance to be judged significant when evaluating benchmark results. A study of the implementations allows the source of these differences to be localized and associated with specific features of the language semantics.

7.1. Comparison of FP with FCPd

Table V compares the FP and FCPd implementations. The first column summarises the information from Table II by showing the performance of FP relative to FCPd. The last column shows the mean instruction execution time of FCPd relative to FP. FP executes only 5% faster on the simple Reverse benchmark but between 10 and 15% faster on more complex problems.

Recall that the essential differences between the two languages are that FCP, unlike Flat Parlog, uses and manipulates read-only occurrences of variables and allows unification in the guard of a clause. Flat Parlog uses simpler matching and binding operations to encode unification. A proportion of the unification involved in FCP can be compiled into similar operations using global program analysis. Poor FCP performance can thus be attributed to two factors: added complexity in the abstract machine and uncompiled unification.

Table V. Performance Differences between FP and FCPd

              Performance    Instr. Time
               FP/FCPd        FCPd/FP
Assembler        1.14           1.11
Takeuchi         1.10           1.12
QSort            1.15           1.15
Queens           1.06           1.06
Reverse          1.05           1.05


It is possible to systematically remove the overheads involved in uncompiled unification. This is achieved by explicit coding of input unification in FCP programs. For example, the clause:

f([{msg, Data} | Is]) ← ...

can be coded to perform explicit input matching on all levels of structure as follows:

f(A) ← A? = [H | Is], H? = {Msg, Data}, Msg? = msg | ...

This allows FCP programs to be compiled into abstract instruction sequences equivalent to those employed by Flat Parlog. This strategy was only necessary in the Assembler benchmark; the other programs do not involve complex structural unifications.

The performance differences shown in column one of Table V can thus be attributed to differences in the architecture of the abstract instruction set. This is emphasized by column two, which shows that FCPd takes consistently longer to execute abstract instructions. A study of the abstract machines and inspection of the implementation source code isolated the areas of potential overhead which result from semantic differences.

7.1.1. Read-Only Occurrences of Variables

In order to support read-only occurrences of variables FCP requires a more costly dereferencing algorithm. In addition, abstract machine instructions that attempt to bind variables must check if a read-only reference was detected during dereferencing. Additional abstract instructions which allocate read-only occurrences of variables when spawning processes are required. The use of this type of instruction can hinder tail recursion optimization. For example, in the following program:

f(X) ← g(X) | f(X?)

the Flat Parlog machine does not need to manipulate the variable X before executing the tail recursive call to f. In FCP, an instruction is required to specify that the argument of f is a read-only occurrence of X.

7.1.2. Guard Unification

Guard unification requires a mechanism for undoing variable bindings when a clause try fails. This requires a trail data structure to record bindings made during a clause try. Since processes may be suspended on these variables an activation list is required to record processes to be activated if the clause try succeeds. In addition, a heap backtracking pointer is required to record the top of the heap at the beginning of a clause try. This is used to discard structures constructed during an unsuccessful clause try. These data structures introduce two sources of overhead. Control instructions in the abstract machine must check and reset the structures. Abstract machine instructions that bind variables must modify the data structures.

In Flat Parlog, unification and binding of variables only occurs in the body of a clause. It is thus not necessary to trail variable bindings or delay the activation of processes. In addition, Flat Parlog can only create structures during guard evaluation with explicit calls to guard predicates. These calls occur rarely and have negligible space overhead; thus in Flat Parlog no backtracking pointer is necessary.

7.2. Quantification

The previous analysis makes it possible to isolate where language differences incur overheads. These costs are very low level, constituting a small proportion of the cost of individual abstract machine instructions. As a result, it is not possible to accurately measure the overhead but only to give an indication of which are the most important.

The overheads fall into two main categories. The first comprises overheads incurred whenever particular abstract machine instructions are executed. These result from routine maintenance of data structures and are thus largely application independent. They include checking and resetting the trail and associated pointers and checking for read-only references when binding or unifying. The second category comprises those overheads associated with operations whose frequency is application dependent. These include overheads associated with the binding of variables and suspending or waking processes. A third possible source of overhead is slower dereferencing due to read-only variables.

An inspection of the source code for the implementations reveals three operations which incur particular overheads: process suspension, clause try failure and trail operations. Columns three and four of Table VI show how frequently these operations are performed. The first column repeats the performance numbers in Table II and the second shows the proportion of instructions executed which were expensive instructions.

Only general trends can be inferred from this data. The proportion of expensive instructions is similar in all cases. There is a marked decrease in FCP performance when suspension, and to a lesser extent trailing and failure, occur. The figures for Reverse and Queens involve no suspension overheads and indicate that the cost of the residual overheads of simply maintaining FCP's additional data structures is approximately 5%.


Table VI. Expensive Operations

              Perf.      Expensive    Susp's/      Fails + Trails/
             FP/FCPd     Instrns      Reduction    Reduction
Assembler      1.14        0.32         0.28           2.53
Takeuchi       1.10        0.34         0.19           1.88
QSort          1.15        0.24         0.56           3.48
Queens         1.06        0.31         0              2.06
Reverse        1.05        0.27         0              0.99

7.2.1. Dereferencing

Logic programming languages use references to bind variables. A dereferencing operation is required to follow a (possibly zero-length) chain of references to reach a term. In FCP two forms of dereferencing operations can be performed. Slow dereferencing returns a result which indicates whether a read-only reference was detected during dereferencing. Fast dereferencing can be used when this result is not required; it is of comparable performance to the operation used in Flat Parlog.

Table VII presents statistics on dereferencing in the FP and FCPd machines. Column one shows the average number of dereferences per abstract machine instruction and column two the average length of a reference chain. Column three shows the proportion of fast dereferences in FCPd. In the Flat Parlog machine, all dereferences are fast.

Table VII indicates that reference chains are generally short and that most of the dereferencing operations performed by FCPd are fast. The number of slow dereferences performed is not considered to be a significant source of overhead for FCPd compared to FP.

Table VII. Dereferencing

              Dereferences/    Chain      Fast Derefs/
              Instruction      Length     Total
Assembler        0.54           0.95         0.88
Takeuchi         0.48           0.85         0.86
QSort            0.39           0.70         0.85
Queens           0.58           0.34         0.67
Reverse          0.19           1.01         0.51


7.3. Comparison of Compilation Techniques

Recall that the benchmark figures for the Assembler were obtained by explicit coding of input matching, which removed overheads due to uncompiled unification. To quantify this overhead a straightforward encoding of the program can be used. The compilation strategy uses global analysis to infer which arguments are used for input matching. Since it is not always possible to infer mode information at compile time, the performance measured in this manner indicates the quality of the current FCP compilation scheme.

Table VIII shows the run-time behavior of the Assembler benchmark in terms of the number of testing and binding operations performed by FP and FCPd (in thousands). These instructions encode the languages' matching and unification operations.

Recall that the directional unification in Flat Parlog allows input mode arguments to be compiled entirely to testing operations. This is reflected in the high proportion of test instructions (88%) as compared to unification instructions. The goal of FCP compilation is to generate the less expensive test instructions whenever possible. Table VIII indicates that this is actually achieved in 55% of the cases for the Assembler benchmark. The present global analysis technique for FCP, whilst able to infer the top level of structure for lists, generates unification instructions for subsequent levels of structure. Flat Parlog represents an upper bound on the extent to which testing can be used.

Unification is more expensive than testing for a number of reasons. The abstract machine instructions are more complex and require more sophisticated dereferencing. Unification of structures leads to more complex instruction sequences. These factors account for a 6.5% increase in instruction counts. The performance of the Assembler dropped to 4658 RPS, a 7.1% reduction due to uncompiled unification.

Table VIII. Unification-Related Operations

                      FP                    FCPd
              Matching  Unifying    Matching  Unifying
Nil ([ ])        7.5       0           7.5       0
String          26.3       0           0.8      25.6
List            41.3       7.0        41.3       7.0
Tuple           78.7       0          66.8      31.7
== / =          15.3      16.5         0        31.8


8. DISCUSSION OF FCPi RESULTS

Table IX compares the performance of FCPi to that of FP and FCPd. FCPi uses the eager form of order-independent semantics reported in Ref. 13. This semantics allows unification to continue even if it is detected that a subterm unification suspends. The decision on whether a process should suspend or succeed is thus delayed until commit time. This permits unifications which first suspend on and subsequently bind a variable to be detected. Since this arises rarely in practice, unification can simply be repeated under the current bindings to determine if the process should succeed. As a result, both the processes shown earlier:

f(X?, X) and f(X, X?)

succeed when unified with the clause:

f(A, B) ← A = a, B = a | true

Table IX shows that the performance of this semantics is comparable to that of FCPd on the Queens and Reverse examples, where there are no suspensions. The performance degrades with the number of suspensions, and in particular when there are a large number of suspensions (30% to 50%). An inspection of the ratio of instructions executed by FCPi to that of FCPd identifies the reason for this performance degradation. FCPi executes considerably more instructions when suspensions arise. This is expected because the abstract machine continues to execute instructions after the first suspension occurs in order to achieve order independence.

It should be noted that the semantics only incurs overhead in the event that a clause try suspends. The QSort and Assembler benchmarks accentuate the overheads because they involve frequent suspensions. This is visible in Table I from the ratio of suspensions to reductions for these applications: 0.56 and 0.28, respectively. In stream-based applications where processes iterate over data structures, these overheads would be less pronounced.

Table IX. Benchmark Comparison for FCPi

              Performance   Performance   Instructions
               FP/FCPi      FCPd/FCPi     FCPi/FCPd
Assembler        1.30          1.14          1.09
Takeuchi         1.23          1.11          1.11
QSort            1.61          1.40          1.32
Queens           1.10          1.04          1.00
Reverse          1.07          1.01          1.01

Finally, one aspect of the semantics is that it does not allow variables in the environment of a process to bind with read-only occurrences of variables in the same environment. This simplifies the parallel implementation of the language.(13) From a pragmatic viewpoint the main effect of this restriction is that it may cause additional process suspensions. The Assembler benchmark uses difference lists and variable-to-variable binding extensively, suggesting that this feature of the semantics would be pronounced. The actual measured increase in the number of suspensions was 0.8%.

9. THE FLAT PARLOG MACHINE

The Flat Parlog abstract machine used in the previous experiments is presented here to demonstrate the simplicity of the Flat Parlog language and its computational model. In addition, the description permits the experiments to be repeated.

An abstract machine for a language implements its computational model. It provides an abstract architecture that represents the state of a computation in terms of the chosen computational model and executes abstract machine instructions that encode tests on or modifications to the computation state. Programs to be executed on the abstract machine are encoded as sequences of abstract machine instructions.

9.1. Computational Model

The Flat Parlog machine implements the process pool computational model described previously, with two important optimizations. It introduces a scheduling structure to avoid the overhead of repeatedly attempting to reduce suspended processes (busy waiting) and supports tail recursion, to avoid the overhead of selecting a process from the process pool at every reduction.

The scheduling structure consists of a single active queue containing reducible processes plus multiple suspension lists which link together processes that require particular data. Suspension lists enable suspended processes to be located and moved to the active queue when the variable that they are suspended on is instantiated. The suspension structure used is based on that proposed for FCP by Houri,(10) which permits a process to be suspended on more than one variable. Flat Parlog's simpler semantics allows certain optimizations. For example, when binding a variable, suspended processes can be moved to the active queue immediately. In FCP, this decision must be delayed until a clause is selected.

The tail recursion optimization is a generalization of that commonly used in Prolog implementations. When a clause with body goals is used to reduce a process, reduction can continue with one of those goals. This saves the overhead of adding that process to the process pool and subsequently selecting it.
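
For example (an illustration of ours, in the notation of Section 3), in a stream-copying loop such as:

    mode copy(In?, Out↑).
    copy([X | In], [X | Out]) ← copy(In, Out).    % forward one element and recurse
    copy([ ], [ ]).                               % close the output stream

the recursive call to copy can continue in the current process record rather than being returned to the process pool, for as long as the timeslice described below permits.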

It is useful for the operational semantics of Flat Parlog to guarantee that reduction is just: that is, that any process capable of being reduced will eventually be reduced. The tail recursion optimization is therefore only applied a finite number of times before the current process is moved to the end of the active queue and a new process selected for reduction. The number of tail recursive calls permitted before such a process switch occurs is the process timeslice.

9.2. Machine Architecture

The principal data area in the Flat Parlog machine is a heap. This holds tagged words representing both Parlog data structures and process records representing Flat Parlog processes. Valid data types are constants (integers, reals, strings and the special constant nil), compound structures (tuples and lists), variables and references to other data structures. Process records contain pointers to the code that the process is executing, its sibling in the scheduling structure and a fixed number of arguments.

The only other data structure used by the Flat Parlog machine is the suspension table used during a reduction attempt to record variables to suspend upon if no clause try succeeds.

The current state of the abstract machine is recorded in various registers. These form three distinct groups according to the time at which their values are relevant. General registers are used for storing global aspects of the machine state. Process try registers are only used during a reduction attempt. Clause try registers are used at each clause try.

General Registers

HP   Heap Pointer, points to the top of the heap.
QF   Queue Front, points to the first process in the active queue.
QB   Queue Back, points to the last process in the active queue.

Process Try Registers

CP   Current Process, points to the process currently being reduced.
TS   Time Slice, the remaining time slice for the current process.
PC   Program Counter, the instruction pointer; contains the address of the next instruction to be executed.
STP  Suspension Top Pointer, points to the top of the suspension table.
Xi   X registers, a set of heap words used to hold process arguments and temporary values.

Clause Try Registers

FL   Failure Label, the address of the instruction to branch to in case of clause try failure.
SP   Structure Pointer, pointer used for building structures.

9.3. Abstract Instruction Set

The instruction set of the Flat Parlog machine includes instructions that encode the primitive unification operations generated when the language is compiled to the standard form. Control instructions are used to encode Flat Parlog's computational model.

Some sample encodings are presented in Appendix 3. The first clause of the lookup procedure is encoded into the following sequence of abstract instructions:

Lookup/4:
    load(4)                      % arguments in registers X0-X3
    try_one_else(Lookup4)        % start 1st clause
    test_list(2, 4)              % [P | Dict] <= A3; X4 := P, X5 := Dict
    test_tuple(4, 2, 6)          % {Key1, Data1} <= P
    equal(6, 0)                  % Key == Key1
    unify(1, 7)                  % Data = Data1
    bind_list(3, Lookup1)        % A4 = [W | X] ?
    get(8)                       % A4 is variable; save W
    put_value(5)                 % X := Dict
    put_tuple(2, 8)              % W := {U, V}
    put_value(0)                 % U := Key
    put_value(7)                 % V := Data
    halt                         % end of unification and clause

Lookup1:                         % A4 is instantiated to a list
    get(2, 8, 11)                % get head and tail into R and S
    bind_tuple(8, 2, Lookup3)    % R = {Y, Z} ?
    put_value(0)                 % R is variable; Y := Key
    put_value(7)                 % Z := Data

Lookup2:
    unify(11, 5)                 % S = Dict
    halt                         % end of unification and clause

Lookup3:                         % R is instantiated to a tuple
    get(2, 9, 10)                % get tuple args into M and N
    unify(9, 0)                  % M = Key
    unify(10, 7)                 % N = Data
    goto(Lookup2)                % go to unify tail

Lookup4:                         % next clause

The load instruction encodes the start of the procedure. It loads the process arguments into machine registers. The try_one_else instruction encodes the beginning of the clause try. This has a label as an argument, which indicates where execution should continue if the clause try fails. Test and equal instructions follow. These encode the testing component of the clause try. Next, bind, unify and put instructions encode calls to binding, unification and data construction primitives generated by compilation. As no special actions need to be performed when a clause is selected to reduce in Flat Parlog, no 'commit' operator is required to separate these from the test instructions. Note the use of the bind_list and bind_tuple instructions, which encode branchpoints in the decision graph presented in the section on compilation. These have a label as an argument, which indicates where execution should continue if a structure is detected at runtime. Finally, as the clause does not spawn body goals, a halt instruction is used to terminate the clause.

Execution proceeds to a suspend instruction if no clause try succeeds. This suspends the process on any entries in the suspension table and signals process failure if the table is empty.

The spawning of body goals and tail recursion is encoded using additional instructions; this is illustrated by the example encodings in Appendix 3. A summary of the abstract instruction set is presented in Appendix 1.

9.4. Test Instructions

These instructions encode the tests introduced by compilation to standard form. Four instructions are used to test for simple types: test_nil(Reg), test_integer(Reg, Val), test_string(Reg, Strp) and test_real(Reg, Real). Each instruction operates on a temporary register (Reg) and tests for a particular value. The test_integer instruction is emulated as follows:


p := address_of(X[Reg])
deref(p)
case tag_of(p)
{
    var     : wait(p); PC := FL        % unbound: note the variable, try next clause
    int     : if (value_of(p) = Val)
                  PC := PC + 1         % matching integer: continue
              else fail                % wrong integer: clause try fails
    default : fail                     % wrong type: clause try fails
}

The other simple test instructions are emulated in a similar fashion. Two instructions, test_list(Reg1, Reg2) and test_tuple(Reg1, Arity, Reg2), are used for testing and decomposing compound structures. Compilation of calls to <= ensures that a primitive test on a compound structure is always a single level in depth and always contains the first occurrence of variables. test_list and test_tuple are thus able to place pointers to substructures in contiguous registers, beginning at a register (Reg2) specified by the compiler. test_list is emulated by:

p := address_of(X[Reg1])
deref(p)
case tag_of(p)
{
    var     : wait(p); PC := FL            % unbound: note the variable, try next clause
    list    : X[Reg2]     := (ref, p + 1)  % pointer to head
              X[Reg2 + 1] := (ref, p + 2)  % pointer to tail
              PC := PC + 1
    default : fail
}

9.5. Bind Instructions

These instructions encode output unification with nonvariable terms. Four instructions are used to unify with constants: bind_nil(Reg), bind_integer(Reg, Val), bind_string(Reg, Strp) and bind_real(Reg, Real). Each instruction operates on a temporary register Reg. If this contains a variable, it is bound to the constant specified by the instruction and any processes suspended on the variable are moved to the active queue. Otherwise, its tag and contents are checked to determine whether it matches the constant. If it does not, the entire computation is aborted. The bind_integer instruction is emulated by:


p := address_of(X[Reg])
deref(p)
case tag_of(p)
{
    var     : wake_up(p)                 % move suspended processes to the active queue
              heap_at(p) := (int, Val)   % bind the variable to the integer
              PC := PC + 1
    int     : if (value_of(p) = Val)
                  then PC := PC + 1
                  else abort             % unification failure aborts the computation
    default : abort
}

The other simple bind instructions are emulated in a similar fashion. Two instructions, bind_list(Reg, Label) and bind_tuple(Reg, Arity, Label), are used for performing output unification with compound structures. These instructions test the tag of the contents of the temporary register Reg. If this is the correct kind of compound structure, the structure pointer is set to point to its first element and program execution continues at Label. If it is a variable, a structure is created on the heap, the structure pointer is set to point to its first element and a reference to this structure is assigned to the variable. Execution continues at the next instruction. bind_list is emulated by:

p := address_of(X[Reg])
deref(p)
case tag_of(p)
{
    var     : wake_up(p)
              heap_at(p)  := (ref, HP)     % bind variable to a new list cell
              heap_at(HP) := (list, null)
              SP := HP + 1                 % SP points at the head of the new cell
              HP := HP + ListSize
              PC := PC + 1                 % continue: build the structure in-line
    list    : SP := p + 1                  % SP points at the head of the existing cell
              PC := address_of(Label)      % branch to the matching code
    default : abort
}

bind_tuple is emulated in a similar fashion.

9.6. Unify Instruction

unify(Reg, Reg) encodes Flat Parlog's general unification primitive =. It performs general unification. In Flat Parlog, unify can only be called in the body of a clause. There is therefore no need to trail bindings made: if unification fails, the auxiliary instruction abort is called to fail the entire computation. The implementation of unify can exploit the fact that compilation of Flat Parlog to standard form ensures that its first argument will generally be a variable.

9.7. Equal Instruction

equal(Reg, Reg) encodes Flat Parlog's test unification primitive ==. It recursively tests two structures for equality and suspends if it encounters a variable. If the equality test fails, the current clause try fails. equal is used to encode input matching on two variables.

9.8. Put Instructions

Put instructions are used to build structures on the heap. They differ in the structures that they build and in where they place references to them. The put instructions for simple data types create a term on the heap at the position pointed to by SP, then increment SP. put_real creates a term at the end of the heap and places a pointer to it at the position pointed to by SP. put_list and put_tuple create a new term at the end of the heap and place a pointer to this term in a temporary register. put_arg_var creates a variable at the end of the heap and places pointers to it both at SP and in a temporary register. This is used, for example, to allocate a variable that occurs as an argument to a process; variables are not placed inside process records in order to allow tail recursion optimization.

put_nil places the constant nil at the position pointed to by SP.

heap_at(SP) := (nil, null)    % put nil at SP
SP := SP + 1                  % do next argument
PC := PC + 1                  % do next instruction

put_integer(Int) places an integer in the heap at the position pointed to by SP.

heap_at(SP) := <int, Int>         % put integer at SP
SP := SP + 1                      % do next argument
PC := PC + 1                      % do next instruction


put_string(Add) places a pointer to a string in the heap at the position pointed to by SP.

heap_at(SP) := <ref, Add>         % point to string from SP
SP := SP + 1                      % do next argument
PC := PC + 1                      % do next instruction

put_real(Real) places the real number Real on the end of the heap and places a reference to it at the position pointed to by SP.

heap_at(SP) := <ref, HP>          % point to real from SP
heap_at(HP) := <real, RealSize>   % place real header
heap_at(HP + 1) := Real           % place real on heap
HP := HP + RealSize               % point after real
SP := SP + 1                      % do next argument
PC := PC + 1                      % do next instruction

put_var(Reg) allocates a variable on the heap at the position pointed to by SP and makes Reg point to it. This instruction is used to encode the first occurrence of a variable.

heap_at(SP) := <var, null>        % create 1st occurrence
X[Reg] := <ref, SP>               % Reg points to var
SP := SP + 1                      % do next argument
PC := PC + 1                      % do next instruction

put_value(Reg) copies the heap word in Reg to the heap at the location pointed to by SP. The instruction is used for subsequent references to a variable.

heap_at(SP) := X[Reg]             % refer to variable
SP := SP + 1                      % do next argument
PC := PC + 1                      % do next instruction

put_tuple(Arity, Reg) allocates a tuple of size Arity on the end of the heap, makes SP point to its first argument and places a reference to the tuple in the heap at the location pointed to by register Reg.

p := pointer_value_of(X[Reg])     % get the pointer
heap_at(p) := <ref, HP>           % point to tuple from Reg
heap_at(HP) := <tuple, Arity>     % allocate tuple
SP := HP + 1                      % set structure pointer
HP := HP + (Arity + 1)            % past tuple
PC := PC + 1                      % do next instruction


put_list(Reg) allocates a list structure at the end of the heap, makes SP point to the head of the list and places a reference to the list in the heap at the place pointed to by register Reg.

p := pointer_value_of(X[Reg])     % get pointer
heap_at(p) := <ref, HP>           % point to list from Reg
heap_at(HP) := <list, null>       % allocate list header
SP := HP + 1                      % SP points at head of list
HP := HP + ListSize               % allocate list
PC := PC + 1                      % do next instruction

put_arg_var(Reg) allocates a variable at the end of the heap and places a reference to it both in the heap at SP and in Reg.

heap_at(SP) := <ref, HP>          % reference end of heap
heap_at(HP) := <var, null>        % create 1st occurrence
X[Reg] := <ref, HP>               % Reg points to variable
HP := HP + 1                      % point past new var
SP := SP + 1                      % do next argument
PC := PC + 1                      % do next instruction

9.9. Manipulation of the Structure Pointer

The following instructions manipulate the structure pointer SP. get(N, Reg1,..., RegN) sets the registers Reg1 to RegN to N consecutive values beginning at SP.

PC := PC + 2
for i := 1 to N do
{ Reg := value_at(PC)
  X[Reg] := value_at(SP)
  PC := PC + 1
  SP := SP + 1
}

get(Reg) sets the register Reg to refer to the location SP and increments SP.

X[Reg] := <ref, SP>               % get SP
SP := SP + 1                      % increment SP
PC := PC + 1                      % do next instruction

skip increments SP.

SP := SP + 1                      % increment SP
PC := PC + 1                      % do next instruction


load(N) loads the N arguments beginning at SP into consecutive registers beginning with register 0. When a process is scheduled, SP is set to point at its first argument.

for i := 0 to (N - 1) do
{ X[i] := value_at(SP)
  SP := SP + 1
}
PC := PC + 1

set(Reg) makes SP point to the end of the heap and stores SP in the register Reg.

SP := HP
X[Reg] := <ref, SP>
PC := PC + 1

9.10. Control

The control instructions are based on those developed by Houri. (10) spawn(Label) constructs a process record and places it into the active queue. The code pointer of the process is set to Label and SP is set to point at the first argument.

execute(Label) is used for tail recursion optimization. It uses the current process record for the next reduction and thus saves process scheduling. If the time slice is over, the process is enqueued on the active queue with its code pointer set to Label; otherwise the time slice is decremented and execution proceeds from Label.
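A sketch of this time-slice test in C, under assumed names for the process record and queue primitives (none of these are fixed by the paper):

typedef struct Process {
    long code_ptr;             /* label at which the process resumes */
    struct Process *next;      /* active-queue link */
} Process;

static Process *CP;            /* current process record */
static long PC;                /* program counter */
static long TS;                /* remaining time slice */

extern void enqueue_active(Process *p);  /* assumed queue primitive */

/* execute(Label): reuse the current process record for the next reduction */
static void execute(long label) {
    if (TS == 0) {
        CP->code_ptr = label;     /* time slice over: park the process */
        enqueue_active(CP);       /* it will be rescheduled later */
    } else {
        TS = TS - 1;              /* spend one unit of the slice */
        PC = label;               /* proceed immediately from Label */
    }
}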

iterate operates in the same manner as execute but uses the current procedure and thus saves storing the code pointer in the current process record.

prepare sets SP to point to the first argument of the current process in preparation for tail recursion optimization. Execution continues at the next instruction.

SP := ptr_to_first_arg_of(CP)
PC := PC + 1

halt places the current process in the process free list and invokes the scheduler for another process.

suspend causes the process to suspend on the variables pointed to from the suspension table. If this is empty, it signals process failure.
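A possible C rendering of suspension, assuming each variable carries a list of waiting processes and that wait(p) has filled the suspension table during the clause tries; all names here are our invention.

#include <stdlib.h>

typedef struct Process Process;

typedef struct SuspNode {
    Process *proc;
    struct SuspNode *next;
} SuspNode;

typedef struct Var {
    SuspNode *suspended;       /* processes waiting on this variable */
} Var;

static Var *susp_table[64];    /* entries recorded by wait(p) */
static int STP;                /* suspension table pointer */

extern void signal_process_failure(Process *p);  /* assumed primitive */

static void suspend(Process *cp) {
    if (STP == 0) {
        signal_process_failure(cp);  /* no variable can ever wake this process */
        return;
    }
    for (int i = 0; i < STP; i++) {  /* hook the process onto each variable */
        SuspNode *n = malloc(sizeof *n);
        n->proc = cp;
        n->next = susp_table[i]->suspended;
        susp_table[i]->suspended = n;
    }
}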


kernel(Index, Reg1, Reg2, Reg3) calls the guard predicate designated by the number Index. Its arguments are assumed to have been placed in registers Reg1, Reg2 and Reg3 prior to the call. No kernel predicate uses more than three arguments.
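The fixed three-register interface suggests a simple dispatch table. A hedged C sketch follows; the Index values, predicate names and dereferenced-integer assumption are invented for illustration.

typedef struct { long tag, value; } Word;

typedef int (*KernelFn)(Word *a1, Word *a2, Word *a3);

/* two invented kernel predicates, assuming integer arguments */
static int kernel_ge(Word *a1, Word *a2, Word *a3) {
    (void)a3;                            /* unused third argument */
    return a1->value >= a2->value;
}
static int kernel_lt(Word *a1, Word *a2, Word *a3) {
    (void)a3;
    return a1->value < a2->value;
}

static KernelFn kernel_table[] = { kernel_ge, kernel_lt /* , ... */ };

static Word X[256];

/* kernel(Index, Reg1, Reg2, Reg3): call the guard predicate numbered Index */
static int kernel(int index, int r1, int r2, int r3) {
    return kernel_table[index](&X[r1], &X[r2], &X[r3]);
}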

try_me_else(G1) is used to encode conditional execution. It sets the failure label (FL) to G1 and causes execution to continue at the next instruction.

goto(G1) is used to return to the main code for a clause after a bind_list or bind_tuple instruction encounters a structure. It causes execution to continue at the instruction at label G1.

9.11. Auxiliary Functions

Certain instructions are defined in terms of auxiliary functions. These are defined here.

fail is invoked when a clause try fails; execution proceeds at the current failure label (FL).

PC := FL

wait(p) places a pointer to the variable at p in the suspension table and increments the suspension table pointer.

value_at(STP) := p
STP := STP + 1

wake_up(p) moves all processes suspended on the variable at p to the active queue.
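A matching C sketch of wake_up, the inverse of suspension, using the same invented Var and queue structures as the suspend sketch above:

#include <stdlib.h>

typedef struct Process Process;
typedef struct SuspNode { Process *proc; struct SuspNode *next; } SuspNode;
typedef struct Var { SuspNode *suspended; } Var;

extern void enqueue_active(Process *p);  /* assumed queue primitive */

/* wake_up(p): move every process suspended on the variable to the active queue */
static void wake_up(Var *v) {
    SuspNode *n = v->suspended;
    while (n != NULL) {
        SuspNode *next = n->next;
        enqueue_active(n->proc);   /* the process may now retry its clauses */
        free(n);
        n = next;
    }
    v->suspended = NULL;           /* the variable is being bound; clear the list */
}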

abort fails the entire computation. If the control metacall is implemented, it fails only the subcomputation of which the current process is a part.

9.12. Process Scheduling

When a process is scheduled it is removed from the active queue and the current process (CP) register is set to point at it. Some initialization is performed and execution begins from the place pointed to by the code pointer in the process record.

CP := dequeue()
reset STP
PC := value_at(CP + offset_to_code_ptr)
SP := CP + offset_to_first_argument
TS := some_constant_value
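Putting the pieces together, the top-level emulator loop might look like this in C. This is entirely our sketch: instruction decoding, the layout accessor and the time-slice constant are assumptions.

typedef struct Process { long code_ptr; struct Process *next; } Process;

static Process *CP;              /* current process */
static long PC, SP, TS, STP;     /* machine registers */

extern Process *dequeue_active(void);        /* assumed queue primitive */
extern long first_arg_of(Process *p);        /* assumed layout accessor */
extern void run_until_descheduled(void);     /* emulate instructions until the
                                                process halts, suspends or
                                                exhausts its time slice */

static void scheduler(void) {
    for (;;) {
        CP = dequeue_active();     /* pick the next runnable process */
        if (CP == NULL) return;    /* no work left */
        STP = 0;                   /* reset the suspension table */
        PC = CP->code_ptr;         /* resume at the stored code pointer */
        SP = first_arg_of(CP);     /* SP points at the first argument */
        TS = 64;                   /* assumed time-slice constant */
        run_until_descheduled();
    }
}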

10. RELATED WORK

The abstract machine described in this paper has been strongly influenced by the work of Gregory, (11) Houri, (10) and Warren. (14) Although it uses a different computational model, the Flat Parlog machine has similarities to the Sequential Parlog Machine (SPM). (11) Its instruction set, like that of the SPM, reflects the directional nature of Parlog programs. This permits the concept of mode found in the Warren machine to be discarded. The current machine differs from the SPM in compiling output unification as well as input matching.

The Parlog language developed by Clark and Gregory (3) differs from the Flat Parlog language described here in three respects. The most important is its Or-parallel execution, which allows user-defined procedures to be called in the guard of a clause. The second is the sequential conjunction operator (&), which when used in a conjunction A & B delays reduction of the goal B until evaluation of A has successfully terminated. The last is a sequential clause search operator (;), which is used to separate clauses in a procedure definition. This constrains a reduction so that clauses after the operator are only tried if all previous clause tries fail.

Most Parlog programs are also Flat Parlog programs. Those that are not can be compiled to Flat Parlog using two techniques. Gregory (7) shows how Or-parallel evaluation of user-defined calls can be compiled to And-parallel evaluation using a control metacall. This is a language primitive that permits execution of a user call to be controlled and its termination detected. Codish (15) shows how source-to-source transformation techniques can be used to achieve a similar effect without the introduction of a language primitive. Further research is required to determine the relative efficiency of the two approaches. These techniques can also be used to implement Parlog's sequential operators.

11. CONCLUSIONS AND FUTURE WORK

This paper has presented a subset of the parallel logic programming language Parlog called Flat Parlog. This language has a number of interesting characteristics. It combines the well-defined semantics of Parlog with the simple computational model of FCP. This permits an efficient implementation. In addition, as Parlog can be compiled into it, it is sufficiently expressive for most applications.

Benchmarks under carefully controlled conditions indicate that Flat Parlog executes 5 to 15% faster than FCP. On a nontrivial application, the Assembler benchmark, Flat Parlog performed 14% faster than FCP. The experiments enabled the reasons for performance differences to be isolated and quantified. FCP's use of read-only occurrences of variables and guard unification leads to a more complex abstract machine. Simply maintaining more complex data structures leads to an overhead of approximately 5%. Process suspensions increase this overhead substantially. Dereferencing is not considered to be a significant source of overhead for FCP relative to Flat Parlog.

A major reason for Flat Parlog's efficiency is that mode declarations permit unification to be compiled to simple test and binding operations. In certain cases, global analysis of FCP programs can be used to deduce modes and generate code approaching this efficiency. Flat Parlog provides an upper bound on the performance that can be attained using these techniques. Uncompiled unification requires more complex instructions and leads to higher instruction counts. This resulted in an additional 7% performance decrease in the Assembler benchmark.

An interesting feature of Flat Parlog that can be explored is source-to-source transformations on the procedure graph, which allow replicated tests to be removed. In general, these require the ability to reorder clauses and test operations. This is possible in Flat Parlog because of its simple, order-independent semantics. Application of the same techniques to FCP is complicated if an order-independent semantics is not used. The order-independent semantics investigated in this paper permits these optimizations but has inferior performance (10-40%) on applications involving a high number of suspensions. It exhibits performance comparable to the other implementations in the absence of process suspensions. Houri's (10) order-dependent semantics, used to obtain the FCPd figures, suspends as soon as possible and thus provides a lower bound on the overheads involved in process suspension.

The abstract machine described in this paper and the companion FCP machine described in Ref. 6 provide a framework for further comparative studies of Flat Parlog and FCP. The companion machine incorporates support for parallel execution. This same support can be introduced in the Flat Parlog machine. The machines will thus also provide a basis for comparison of the languages on parallel architectures.

As indicated previously, Parlog can be compiled to Flat Parlog using either a control metacall or source-to-source transformation techniques. Similar techniques can be used to implement the control functionality required for systems programming. (16) Quantitative comparison of these two approaches has not previously been possible. Further research based on the abstract machine reported in this paper will extend it to incorporate control metacalls. This will permit direct evaluation of the relative efficiency of the two approaches.

APPENDIX 1: SUMMARY OF ABSTRACT INSTRUCTIONS

test_nil(Reg)
test_integer(Reg, Val)
test_string(Reg, Strp)
test_real(Reg, Real)
test_list(Reg, Reg)
test_tuple(Reg, Arity, Reg)
equal(Reg, Reg)

put_nil
put_integer(Val)
put_string(Strp)
put_real(Real)
put_list(Reg)
put_tuple(Arity, Reg)
put_var(Reg)
put_value(Reg)
put_arg_var(Reg)

bind_nil(Reg)
bind_string(Reg, Strp)
bind_integer(Reg, Val)
bind_real(Reg, Real)
bind_list(Reg, Label)
bind_tuple(Reg, Arity, Label)
unify(Reg, Reg)

load(N)
get(Reg)
get(N, Reg1,..., RegN)
set(Reg)
skip
spawn(Label)
execute(Label)
iterate



prepare
halt
suspend
goto(Label)
try_me_else(Label)
kernel(Index, Reg, Reg, Reg)

APPENDIX 2: THE BENCHMARK PROGRAMS

A2.1. Takeuchi's Program

mode tak(?, ?, ?, ^).
tak(X, Y, Z, A) <-
    X > Y |
    X1 is X - 1, Y1 is Y - 1, Z1 is Z - 1,
    tak(X1, Y, Z, A1), tak(Y1, Z, X, A2), tak(Z1, X, Y, A3),
    tak(A1, A2, A3, A).
tak(X, Y, Z, Z) <- X =< Y | true.

A2.2. Quicksort

mode qsort(?, ^, ?).
qsort([X | Xs], S1, L2) <-
    part(X, Xs, S, L), qsort(S, S1, [X | L1]), qsort(L, L1, L2).
qsort([], X, X).

mode part(?, ?, ^, ^).
part(X, [Y | Ys], [Y | S], L) <-
    X >= Y | part(X, Ys, S, L).
part(X, [Y | Ys], S, [Y | L]) <-
    X < Y | part(X, Ys, S, L).
part(X, [], [], []).

A2.3. Naive Reverse

mode rev(?, ^).
rev([X | Xs], Y) <- rev(Xs, Ys), append(Ys, [X], Y).
rev([], []).

mode append(?, ?, ^).
append([X | Xs], Ys, [X | Zs]) <- append(Xs, Ys, Zs).
append([], Ys, Ys).

A2.4. All Solutions to 8 Queens

go <- queens(8, R).

mode queens(?, ^).
queens(N, S) <- p(1, [], 0, N, S).

mode p(?, ?, ?, ?, ^).
p(0, [X | S], R, N, S1) <-
    X < N |
    Y := X + 1, check(Y, S, 1, A), p(A, [Y | S], R, N, S1).
p(0, [N | S], R, N, S1) <-
    R1 := R - 1, p(0, S, R1, N, S1).
p(0, [], _, _, []).
p(1, S, R, N, S1) <-
    R < N |
    R1 := R + 1, p(0, [0 | S], R1, N, S1).
p(1, S, N, N, [S | S1]) <- p(0, S, N, N, S1).

mode check(?, ?, ?, ^).
check(X, [Y | _], N, 0) <- N := X - Y | true.
check(X, [Y | _], N, 0) <- N := Y - X | true.
check(X, [X | _], _, 0).
check(X, [_ | S], N, R) <-
    otherwise |
    M := N + 1, check(X, S, M, R).
check(_, [], _, 1).

APPENDIX 3: EXAMPLE ENCODINGS

A3.1. Quicksort

Standard Form:

qsort(A, S1, L2) <- [X | Xs] <= A |
    part(X, Xs, S, L), qsort(S, S1, [X | L1]), qsort(L, L1, L2).
qsort(A, X, B) <- [] <= A | X = B.

Compiled Code:

Qsort:                            % qsort/3
    load(3)                       % load args to X0-X2
    try_me_else(Qsort1)           % start 1st clause
    test_list(X0, X3)             % [X|Xs] <= A, X3 := X, X4 := Xs
    prepare                       % prepare part/4
    put_value(X3)                 % X
    put_value(X4)                 % Xs
    put_arg_var(X5)               % S
    put_arg_var(X6)               % L
    spawn(Qsort)                  % make qsort/3
    put_value(X5)                 % S
    put_value(X1)                 % S1
    get(X7)                       % save arg3
    put_list(X7)                  % arg3 := [P|Q]
    put_value(X3)                 % P := X
    put_var(X8)                   % Q := L1
    spawn(Qsort)                  % make qsort/3
    put_value(X6)                 % L
    put_value(X8)                 % L1
    put_value(X2)                 % L2
    execute(Part)                 % part/4

Qsort1:
    try_me_else(Qsort2)           % start 2nd clause
    test_nil(X0)                  % [] <= A
    unify(X1, X2)                 % X = B
    halt                          % halt

Qsort2:
    suspend                       % process suspensions

A3.2. Partition

Standard Form:

part(X, A, B, L) <- [Y | Ys] <= A, X >= Y |
    B = [Y | S], part(X, Ys, S, L).
part(X, A, S, C) <- [Y | Ys] <= A, X < Y |
    C = [Y | L], part(X, Ys, S, L).
part(X, A, B, C) <- [] <= A | B = [], C = [].

Compiled Code:

Part:                             % part/4
    load(4)                       % args in X0-X3
    try_me_else(Part1)            % start 1st clause
    test_list(X1, X4)             % [Y|Ys] <= A, X4 := Y, X5 := Ys
    kernel(ge, X0, X4, X6)        % X >= Y
    bind_list(X2, Part4)          % B = [P|Q]
    put_value(X4)                 % P := Y
    put_var(X8)                   % Q := S

Part5:
    prepare                       % prepare part/4
    skip                          % X already in process record
    put_value(X5)                 % Ys
    put_value(X8)                 % S
    iterate                       % L already there

Part1:
    try_me_else(Part2)            % start 2nd clause
    test_list(X1, X4)             % [Y|Ys] <= A, X4 := Y, X5 := Ys
    kernel(lt, X0, X4, X6)        % X < Y
    bind_list(X3, Part6)          % C = [P|Q]
    put_value(X4)                 % P := Y
    put_var(X8)                   % Q := L

Part7:
    prepare                       % prepare part/4
    skip                          % X already in process record
    put_value(X5)                 % Ys
    skip                          % S already in process record
    put_value(X8)                 % L
    iterate                       %

Part2:
    try_me_else(Part3)            % last clause
    test_nil(X1)                  % [] <= A
    bind_nil(X2)                  % B := []
    bind_nil(X3)                  % C := []
    halt                          % halt

Part3:
    suspend                       % process suspensions

Part4:
    get(2, X11, X8)               % [P|Q] <= B
    unify(X11, X4)                % P = Y
    goto(Part5)                   %

Part6:
    get(2, X11, X8)               % [P|Q] <= C
    unify(X11, X4)                % P = Y
    goto(Part7)                   %

ACKNOWLEDGMENTS

The research reported in this paper was carried out whilst the first author was visiting the Weizmann Institute. The authors would like to thank Ehud Shapiro, Keith Clark and Steve Gregory for making this visit possible. The research benefited from access to tools developed by Michael Hirsch, Avshalom Houri, William Silverman and others. The decision to experiment with Flat Parlog was motivated in part by discussions on this subject with Steve Gregory, Graem Ringwood and others at Imperial College.

REFERENCES

1. C. Mierowsky, S. Taylor, E. Shapiro, J. Levy, and M. Safra, The design and implementation of Flat Concurrent Prolog, Technical Report CS85-09, Weizmann Institute, Rehovot (1985).

2. E. Shapiro, Concurrent Prolog: A progress report, IEEE Computer (August 1986).

3. K. L. Clark and S. Gregory, PARLOG: Parallel programming in logic, ACM Trans. on Programming Languages and Systems, 8(1):1-49 (1986).

4. K. Ueda, Guarded Horn Clauses, EngD thesis, University of Tokyo (1986).

5. A. Takeuchi and K. Furukawa, Parallel logic programming languages, in Proc. of the 3rd Intl. Logic Programming Conf. (London, July 1986), E. Shapiro (ed.), New York: Springer-Verlag, pp. 242-254.

6. S. Taylor, An Abstract Machine for Implementing FCP on Parallel Architectures, Weizmann Institute of Science (February 1986).

7. S. Gregory, Parallel Logic Programming in PARLOG, Reading, Mass.: Addison-Wesley (1987).

8. S. Taylor and E. Shapiro, Compiling Guarded Logic Programs into Decision Graphs, Weizmann Institute of Science (February 1986).

9. S. Safra, Partial Evaluation of Concurrent Prolog and its Implications, M.Sc. thesis, Technical Report CS86-24, Weizmann Institute of Science (July 1986).

10. A. Houri and E. Shapiro, A sequential abstract machine for Flat Concurrent Prolog, Technical Report CS86-19, Weizmann Institute, Rehovot (1986).

11. S. Gregory, I. T. Foster, A. D. Burt, and G. A. Ringwood, An abstract machine for the implementation of PARLOG on uniprocessors, submitted for publication.

12. K. L. Clark and S. A. Tarnlund, A first order theory of data and programs, in Information Processing 77: Proc. of the IFIP Congress 77, B. Gilchrist (ed.), Amsterdam: North-Holland, pp. 939-944.

13. S. Taylor, S. Safra, and E. Shapiro, Parallel execution of Flat Concurrent Prolog, Intl. Journal of Parallel Programming, 15(3):245-275.

14. D. H. D. Warren, An abstract Prolog instruction set, Technical Note 309, SRI International, Menlo Park, California (1983).

15. M. Codish and E. Shapiro, Compiling Or-parallelism to And-parallelism, in Proc. of the 3rd Intl. Logic Programming Conf. (London, July 1986), E. Shapiro (ed.), New York: Springer-Verlag, pp. 283-297.

16. I. T. Foster, Logic operating systems: Design issues, in Proc. 4th Intl. Conf. on Logic Programming, Melbourne (May 1987).