
Future Generation Computer Systems 9 (1993) 259-280, North-Holland

Exploiting OR-parallelism in logic programs: A review

Kang Zhang

Department of Computing, School of MPCE, Macquarie University, NSW 2109, Australia

Abstract

Different forms of parallelism have been identified in logic programs for efficient implementations on multiprocessor systems. Among these, OR- and AND-parallelism have been the focus of exploitation for the past decade. Research in such exploitation has led to the proposal and implementation of various execution models and logic programming systems. This paper attempts to summarise, in the form of a structured review, the major activities in the area of exploiting OR-parallelism. Problems arising in implementing OR-parallelism, and various models and working systems featuring different techniques for solving these problems, are discussed.

Keywords. Logic programming; Prolog; OR-parallelism; multiprocessor systems

1. Introduction

The execution speed of sequential logic programming systems has been constantly improving since D.H.D. Warren's Prolog interpreter/compiler [61] for the DEC-System 10 proved the usefulness of logic as a practical programming tool [37]. Yet, in order to meet the requirements of today's and tomorrow's applications, substantial improvements in performance are still needed. A promising approach is through the introduction of parallel evaluation strategies into the language executor. VLSI technology and parallel computer architecture advances also provide an opportunity for performance improvement. Investigations into this possibility have already led to the proposal of a number of schemes and machine architectures for parallel processing of logic programs [21,36,44].

The three major forms of parallelism exploitable in a logic program can be explained in terms of the structure of the program. Among these three forms of parallelism, low level parallelism, typically unification parallelism, may be obtained during the unification of a goal and a clause head. By executing a stream of intermediate instructions, the low level parallelism can also be exploited in a pipelined fashion [59]. A critical review of unification and its potential parallelism can be found in [35]. At a high level, when alternative clauses in a procedure are evaluated simultaneously, OR-parallelism is achieved [19]. OR-parallelism can also be exploited at a higher level in terms of the search tree, which represents the execution of the program [5,63]. The unification of more than one clause head with a given goal is treated as separate branches of the search tree. Multiple such branches, each including the clause involved and its continuation, can be executed in parallel. AND-parallelism is exploited when more than one goal in a clause body is evaluated in parallel. Each goal in the body may include multiple OR branches.

In order to exploit AND/OR parallelism more efficiently through explicit language syntax and semantics, a number of new logic programming languages supporting concurrent processing have been proposed. This is by itself an interesting research area, and is outside the scope of the present review. We refer readers to Shapiro's comprehensive survey [53] and collected papers [52] on concurrent logic programming languages.

0376-5075/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

This paper presents a review of various models and systems proposed for exploiting OR-parallelism. It covers most of the important models and systems reported between the early 1980s and 1992. Research in this area is still rapidly evolving: some mature systems are continuously being improved and new models are emerging. The trend has been to improve the existing working systems or to extend them to support both AND- and OR-parallelism. The review first introduces the basic concepts of OR-parallelism, potential difficulties in its exploitation and the trade-offs of various approaches. The review proceeds by looking at binding management approaches, followed by task control strategies used in various models for multiprocessor implementations. The performance of some working systems is then summarised and compared. The review concludes with an overall summary.

2. OR-parallelism

2.1. Problems with exploitation

In terms of computation, OR-parallelism refers to a parallel search strategy. When a search process reaches a branch in the tree, it can start to search descendant branches in parallel. The name OR-parallelism reflects the fact that in a non-deterministic program, a query is often satisfied by any answer. In other words, when any one of the searches starting from a choice point (a non-leaf node of the tree) finds a solution, the original goal is resolved. OR-parallelism also refers to searching for all solutions in parallel; in fact, many OR-parallel systems are developed to support obtaining all solutions.
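As a loose sketch of this any-solution/all-solutions behaviour (in Python rather than in a real Prolog engine; the clause functions and their bodies are purely illustrative, not any particular system's machinery), the alternative clauses of a procedure can be tried concurrently and their bindings merged:

```python
from concurrent.futures import ThreadPoolExecutor

def clause1():
    # p(Y) <- q(Y), r(Y): q binds Y to a, but r(a) fails, so no solution
    return []

def clause2():
    # p(Z) <- s(Z): succeeds, binding Z to c
    return ["c"]

def or_parallel_solve(clauses):
    """Search all alternative clauses simultaneously; any single success
    resolves the goal, and together the branches yield all solutions."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda clause: clause(), clauses)
    return [binding for branch in results for binding in branch]

solutions = or_parallel_solve([clause1, clause2])   # all-solutions search
```

Each branch either fails (an empty list) or contributes bindings; the goal is resolved as soon as any branch succeeds.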

Consider the following example:

    ← p(X).
    p(Y) ← q(Y), r(Y).
    p(Z) ← s(Z).
    q(a).
    r(b).
    s(c).

Fig. 1. Environment stacks in evaluation of the example: (a) sequential Prolog; (b) OR-parallel system.

In sequential Prolog (Fig. 1(a)) [18], when p is called, the first clause is invoked. To solve p, the system calls q, which binds X to Y and then Y to a. So the original X is bound to a. This value makes r fail, and backtracking occurs to the second clause. At this point, X is reset to unbound. By calling s, X is successfully bound to c.

In an OR-parallel system (Fig. 1(b)), the two clauses of the procedure p are evaluated in parallel. Two different bindings, a and c, are generated for X by q and s respectively. Therefore, a major problem in exploiting OR-parallelism is the representation and management of different bindings of the same variable corresponding to different OR branches. Any implementation scheme for binding environments (or BEs) directly concerns the memory management of the parallel Prolog system, which may well dominate the overall performance of the potentially large-scale parallel execution [15].
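The multiple-bindings problem can be sketched minimally as follows (in Python; the per-branch dictionary copy stands in for whatever structure a real system uses - directories, hash windows, binding arrays - and is not any particular system's scheme):

```python
def try_branch(env, var, value):
    """Bind var in a private copy of the environment (one OR-branch).
    A full copy is the naive solution; real systems avoid it with
    auxiliary structures such as directories or hash windows."""
    branch_env = dict(env)
    branch_env[var] = value
    return branch_env

shared = {}                                # environment at the choice point
branch_q = try_branch(shared, "X", "a")    # first clause: q binds X to a
branch_s = try_branch(shared, "X", "c")    # second clause: s binds X to c
```

X remains unbound at the choice point, yet is bound differently, and conflict-free, in each branch.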

Another potential problem of OR-parallelism is the combinatorial explosion in the number of parallel tasks, which is determined by the search space of a particular application program. To prevent the explosion as well as to make maximum use of the existing resources, a cost-effective control strategy is desirable; a heavy control overhead may destroy the gains of OR-parallelism. Various implementation techniques have been adopted to solve these problems. Here we identify those techniques as binding management and task control, corresponding to the above two problems respectively.

The third problem is caused by the side-effects of built-in predicates. In sequential Prolog, execution of a cut predicate prevents further evaluation of alternatives on successful evaluation. When running in parallel, there is a danger that the operational semantics of a program may be altered by evaluating clauses on different processors which would not be evaluated when run sequentially. This leads to the issue of speculative work: ideally, processors should not waste their time on parts of the search tree which may later be pruned by a cut [29]. The techniques used for handling side-effects and speculative work are also considered task control techniques, since the issues involved are special cases in implementing task control strategies.

2.2. Levels of classification

When presenting different techniques used in various models for binding management and task control, we classify these models based on three levels of criteria: language support, model of task partitioning, and architectural features.

Language - This is concerned with whether the model is implemented to support full Prolog, including cut and side-effect predicates, with the parallel implementation returning the same answers as sequential Prolog; whether it is implemented for pure logic programs only; or whether extra annotations are added to standard Prolog to assist in exploiting parallelism, so that the implementation supports a new language that, in many cases, supersedes Prolog.

Model - Most OR-parallel systems fall into two categories in terms of their approach to partitioning tasks for parallel execution: some systems use process-based models, while others are OR-branch based (or sub-tree based). The former exploit OR-parallelism by concurrently executing OR-processes representing individual clauses, each of which is regarded as an independent resolution. Results of these OR-processes are returned to their common parent process. The latter involve multiple processors working on different OR branches of the search tree simultaneously. Each such OR branch includes not only the current resolution, but also its continuation. Therefore, OR-branch based models can exploit coarse-grain OR-parallelism.

Architecture - At the lowest level, various systems can be differentiated according to the way binding environments (BEs) are managed and the way parallel tasks are scheduled, as discussed in Section 2.1.

Binding management schemes include the centralised and distributed environments approaches. In systems using centralised environments, most of the computation state, including BEs, is shared by multiple processors, while with the distributed environments approach, processors have their own copies of the BEs. Some systems use a compromise approach by dividing the computation state into a global part and a local part. The local computation state includes BEs, which are non-shared. The global computation state includes the information for task scheduling (and global BEs in some systems), which is shared among all the processors. This is called the multiple sequences approach.

Task control strategies concern how parallel tasks should be scheduled on the available processors in a cost-effective manner so that the system is kept load-balanced. The strategies vary from broadcasting and demand-driven methods to message-passing methods, depending largely on the implementation models and binding management schemes adopted.

2.3. Trade-offs

Most systems started with the implementation of pure logic programs with depth-first or breadth-first search semantics [19,54]. Some systems have later implemented extra-logical features, typically found in standard Prolog [5,40]. Other systems use new languages, which usually support Prolog but have extensions that enable them to take advantage of parallelism [48,65].

The motivation for supporting Prolog is that the standard Prolog semantics is well established in the logic programming community and a considerable amount of code is already in existence. This approach allows existing Prolog programs to be run without modification but at higher speed. However, the speed-up can be severely limited for programs which make heavy use of side-effect predicates. More complicated implementations are required to support side-effects in OR-parallel systems, owing to Prolog's sequential semantics of a depth-first left-to-right search strategy.

Page 4: Exploiting OR-parallelism in logic programs: A review

262 K. Zhang

The motivation for developing extended Prolog languages is that standard Prolog is not suitable for highly parallel processing, as it was designed for uniprocessors in the first place. With new language features supporting parallelism, the implementations can certainly be more efficient, but existing Prolog programs cannot be executed on such systems without being partly rewritten.

The choice of the language to be supported affects, to some extent, the decision on the binding management scheme. The centralised environments approach attempts to make the best use of the existing efficient implementation of sequential Prolog [61] by extending it with the capability of handling multiple bindings for individual variables [14,55,60]. Most systems using centralised environments are based on the OR-branch implementation model so that the semantics of standard Prolog can be closely modelled. This approach has the following advantages: (a) The similarity in implementation to conventional languages, in the way that BEs are manipulated, lends the approach well to conventional parallel processing techniques for existing multiprocessors.

(b) In preventing combinatorial explosion of parallelism, the transition from parallel mode to sequential mode and vice versa can be relatively smooth.

(c) Given the existing sequential system, the parallel extension is the easiest to understand, modify and optimise.

The distributed environments approach aims at high distributability and scalability, and at implementation on non-shared-memory or message-passing architectures [20,66,69]. The execution models using this approach are typically process-based, since independent processes and the relevant environments can be distributed among many processors. BEs are usually independent of each other, and each dereference operation is performed in only one or two BEs. Parallel tasks can be controlled through message-passing between the processors involved. These models overcome the drawbacks of a centralised approach, which requires a shared memory, at the cost of extra copying and binding operations.

The third environment management approach, the multiple sequences approach, attempts to combine the advantages of the centralised and distributed environments approaches [4,40,63]. It makes the best use of the efficient sequential Prolog implementation but also enjoys high scalability. The central idea in this approach is to use multiple processors, each equipped with a sequential Prolog engine, to work simultaneously on different parts (OR-branches) of the search tree. Task switching is performed on the demand of an idle processor. This approach is therefore a natural choice for systems that support full Prolog. It exploits coarse-grain OR-parallelism effectively and has recently demonstrated the most promising speedups.

In principle, a good OR-parallel system should support high scalability. More specifically, the combination of a binding management scheme and a task control strategy should allow such a system to perform parallel operations in a time which is independent of the number of parallel tasks and the size of the terms involved in unification. These parallel operations include the allocation of environment spaces, unification, and resumption after success or failure. Gupta has identified three criteria corresponding to these three operations [25], namely:
• constant-time environment creation;
• constant-time variable access and binding; and
• constant-time task switching.
Unfortunately, it has been shown [25] that it is not possible to achieve constant-time execution for all three operations, so a compromise has to be made to achieve the best possible performance.

Shared-memory multiprocessors are best served by methods with centralised environments or multiple sequences, which sacrifice constant-time task switching, since task switching is under the control of the scheduler. Task creation and variable access are program dependent, and so cannot be effectively optimised by the implementation. This is why various scheduling policies have been investigated for the Aurora and MUSE systems, which are multiple-sequence systems developed for shared-memory multiprocessors (see Section 3.3).

Distributed-memory systems are favoured by methods with non-constant-time task creation, since the whole binding environment is made available locally by copying during task creation, to achieve environment distributability.

The rest of the paper describes individual models and systems in more detail according to the lowest level of classification, i.e. binding management and task control.

3. Binding management

3.1. Centralised environments

The main idea of the centralised environments approach is to build a virtual stack for each process so that it can share as much information as possible with its sibling processes and make copies only of the information that has to be bound uniquely by that process. The binding schemes in these models have the following concepts in common: (a) variable bindings are kept locally in individual clauses; (b) unification of a goal and the head of an applicable clause often requires access to variables which may have been bound earlier; (c) the unification of two unbound variables is realised by binding one variable to a reference to the other.

The schemes differ in the type of auxiliary structure used (e.g. directory trees or hash windows) to dereference an ancestor variable while allowing each clause to store its own copy of the variable.

3.1.1. Early models

The three earliest representatives of storage management algorithms for OR-parallel execution have been compared in a study by Crammond [21]. These are the Directory Tree algorithm, the Hash Window algorithm and the Variable Importation algorithm. The first two models use a structure-sharing approach, while the third uses a distributed approach through structure copying (see Section 3.2.1).

Ciepielewski and Haridi [14] used a number of frames - created upon the invocation of clauses - to represent a binding environment. A frame containing no unbound variables is shared among its descendant OR-processes, while only those frames which contain unbound variables are copied for a new environment. The OR search tree is represented as a directory tree, where a directory contains entries for all the frames associated with the goal. A variable is accessed by looking up the current directory in the hope that one of its frames contains the variable. If no such frame is found, the parent directory is consulted, and so on, until the variable is found.

The hash window algorithm [11] assumes that most unification operations generate bindings for only a small number of variables, so that these bindings can be represented in a hash window and easily accessed by the newly forked OR-processes. When a variable is found unbound, a similar dereference procedure (as in a directory tree) is conducted along the path to the ancestor frames. Dereferencing is made more efficient in the PEPSys model (Parallel ECRC Prolog System) by tagging each binding with a label identifying the binding position in the chain of hash windows [65].
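A minimal sketch of the hash-window chain (in Python; the class layout is illustrative, and the PEPSys position tags are omitted for brevity) shows how each forked OR-process keeps only its few new bindings and falls back to ancestor windows:

```python
class HashWindow:
    """One small binding table per OR-process, chained to its ancestor."""

    def __init__(self, parent=None):
        self.bindings = {}   # most unifications bind only a few variables
        self.parent = parent

    def deref(self, var):
        """Walk the chain of hash windows towards the root."""
        window = self
        while window is not None:
            if var in window.bindings:
                return window.bindings[var]
            window = window.parent
        return None

root = HashWindow()
root.bindings["X"] = "a"            # binding made before the fork
child = HashWindow(parent=root)     # forked OR-process: empty window
child.bindings["Y"] = "b"           # its own few new bindings
```

The child sees both its own binding and the ancestor's, while the ancestor never sees the child's, which is what keeps sibling OR-branches from interfering.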

The evaluation results [21] favour the hash window algorithm in execution speed - 20% and 32% faster than the variable importation and directory tree algorithms respectively. Its memory usage is also more efficient than that of the other two schemes. Four refined versions of the storage model of [14] have been implemented by Ciepielewski and Hausman [16]. The performance evaluation of these implementations also shows that hashing techniques perform best.

3.1.2. OR-forest

The OR-forest-based environment sharing model [54] also creates a local environment directory with each process for directing variable dereferencing. In this model, variable bindings generated at different nodes are stored in different binding records (called frames in the Directory Tree model mentioned above), and accessed by OR-parallel processes through their own directories. Binding records, however, contain only bound variables, rather than both bound and unbound variables, and thus the overhead of checking and copying unbound variables when creating new environments is avoided [55]. The binding records generated at an ancestor node can be shared by its child nodes.

The creation of a process is based on the concept of an OR-forest, instead of an OR search tree. An OR-forest is constructed out of a number of OR-trees, each having an independent AND goal as its root node. A major advantage of this form of tree construction over the conventional OR search tree is that it avoids redundant evaluations when AND goals are independent [54].

A high-level simulation study of the OR-forest model using five benchmark programs indicates that the search speed of this model is 2 to 6 times faster than that of an OR search tree model, and that both deterministic and non-deterministic programs offer parallelism for the OR-forest model, but only non-deterministic programs do so for an OR search tree model [56].

3.1.3. BOPLOG

The performance-oriented OR-parallel WAM proposed for the BBN Butterfly Parallel Processor, called BOPLOG [60], uses an extensively copied data structure in order to gain speed-up. BOPLOG's binding method makes use of a doubly-circular linked list, or binding list, to represent the values of each shared variable. Each such binding is also stamped with a time tag, corresponding to the time when the current binding was taken from a choice point. When the number of bindings to a variable increases, so does the length of the variable's binding list, and hence the access time to the variable increases proportionally.

In BOPLOG, the concept of a binding span, measured by time stamps, is used in each entry of the ancestor stack to guide the dereferencing of variables to the correct bindings. This strategy makes unwinding unnecessary, so no trail is needed. It may, however, introduce a significant overhead for the accumulation of both success bindings and failure bindings, which in turn imposes an unnecessary cost on dereferencing.
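The time-stamped binding list can be sketched as follows (a Python simplification of the BOPLOG idea, using a plain list of (timestamp, value) pairs rather than the doubly-circular list, and an illustrative span test):

```python
def bind(bindings, stamp, value):
    """Append a binding tagged with the time it left its choice point."""
    bindings.append((stamp, value))      # the list grows per OR-branch

def deref(bindings, span_end):
    """Dereference within a binding span: pick the latest binding whose
    time stamp does not exceed the span boundary."""
    candidates = [(t, v) for (t, v) in bindings if t <= span_end]
    return max(candidates)[1] if candidates else None

x_bindings = []
bind(x_bindings, 3, "a")   # binding from one branch, made at time 3
bind(x_bindings, 7, "c")   # binding from a later branch, at time 7
```

Since old entries are never unwound, no trail is needed, but note that every dereference scans the whole list, which is the proportional access cost and the accumulation overhead mentioned above.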

3.2. Distributed environments

In distributed environments schemes, the number of BEs seen by any process is restricted to one or two, and thus dereference operations are simpler than in a centralised environments scheme. But BE independence is achieved at the cost of extra copying and binding operations. The ways of achieving BE independence distinguish the distributed models.

3.2.1. Early models

The earliest distributed binding scheme was proposed by Wise in implementing his EPILOG language [66,67]. The basic operational mechanism in EPILOG is to manipulate dframes - i.e. dynamic or distributed frames. A dframe contains certain context information of a clause and the BE of variables which are local to the clause. The EPILOG scheme treats dframes as black boxes, such that information is transferred between dframes entirely through message passing.

In support of unrestricted AND-parallelism as well as BE locality, the successful resolution of a child process will cause its bound arguments to be back-unified with the original terms in the corresponding goal.

Lindstrom's variable importation scheme [39] was another early model with high locality. In this scheme, all unbound variables in the parent BE are imported into the child BE. When the last body literal has been solved, the still-unbound variables are exported to a new copy of the parent BE. Therefore, subsequent unifications can only have local effects, independently of the descendant OR-processes. This scheme is suitable for implementation on a distributed system with a hierarchical memory organisation.
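The import/export cycle can be sketched as follows (in Python; representing a BE as a dictionary where None marks an unbound variable is an illustrative simplification, not Lindstrom's actual data layout):

```python
def import_unbound(parent_be):
    """Import only the parent's unbound variables into the child BE."""
    return {v: None for v, val in parent_be.items() if val is None}

def export_back(parent_be, child_be):
    """Export the child's bindings into a fresh copy of the parent BE,
    leaving the original parent BE untouched for sibling OR-branches."""
    new_parent = dict(parent_be)
    for v, val in child_be.items():
        if new_parent.get(v) is None:
            new_parent[v] = val
    return new_parent

parent = {"X": None, "Y": "b"}      # X unbound, Y already bound
child = import_unbound(parent)      # child sees only the unbound X
child["X"] = "a"                    # unification has purely local effect
parent2 = export_back(parent, child)
```

Because the child never touches the original parent BE, sibling OR-branches stay independent, at the price of the copy made on export.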

Extensive BE copying is used in Halim's data-driven OR-parallel system [26,27]. In this model, a goal is fired for reduction once a BE is available to it. A BE is copied upon a new variable binding or a variable update, such that the initial goal triggers the computation in the form of BEs, and finally a set of BEs is produced as the solutions to the program. Details of memory management were not provided in [26,27].

3.2.2. Closed environments

Conery [20] proposed a binding scheme, known as closed environments, which is a further development towards a distributed system, particularly of the AND/OR Process Model [19].

The main idea in the closed environment scheme is to allow only a small number (actually one or two) of the binding environments to be operated upon by an active process so that a distributed binding structure can be implemented. To achieve this, Conery introduced an environment closing algorithm, which transforms the intervening binding environments into closed form. A closed environment is defined as a set of environments E such that no variable contained within E, or within the structured terms pointed to from E, is outside of E. The closing algorithm involves operations of instantiation, renaming or creating variables, in order to organise the bindings in such a way that all the information needed for unification is present in the environment of either the caller clause or the called clause. Figure 2 shows an example of a closed environment (Environment1 in Fig. 2(b)), which is obtained by applying the closing algorithm to the unification of a goal (p(a, f(X))) and a clause head (p(Y, Z)).

Fig. 2. Closing an environment after head unification: (a) after unification of p(a, f(X)) with p(Y, Z); (b) after closing Environment1 with respect to Environment0.
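The closing idea can be sketched as follows (in Python; the underscore-string representation of variables, the tuple representation of structured terms, and the renaming policy are all illustrative simplifications of Conery's algorithm):

```python
def close_environment(env, owned_vars):
    """Close env: replace every reference to an outside variable by a
    freshly created local variable, so that env (and the structured
    terms it points to) mentions no variable outside itself."""
    closed, renaming = {}, {}

    def close_term(term):
        if isinstance(term, str) and term.startswith("_"):   # a variable
            if term not in owned_vars:                       # outside var:
                renaming.setdefault(term, "_new" + term)     # create/rename
            return renaming.get(term, term)
        if isinstance(term, tuple):                          # structured term
            return tuple(close_term(arg) for arg in term)
        return term                                          # a constant

    for var, term in env.items():
        closed[var] = close_term(term)
    return closed, renaming

# After unifying p(a, f(X)) with p(Y, Z): Y -> a, Z -> f(X),
# where _X belongs to the caller's environment (Environment 0).
env1 = {"_Y": "a", "_Z": ("f", "_X")}
closed1, ren = close_environment(env1, owned_vars={"_Y", "_Z"})
```

After closing, the callee's environment can be shipped to another processor on its own; the renaming map records what must later be reconciled with the caller, which is where the scheme's extra copying and two-stage cost comes from.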

The major sources of overhead in the closed environment scheme are the extensive copying of environments and the two-stage closing algorithm. A more serious problem might be the extra operation required to close the environment of the caller clause after a successful reduction.

A variant of the closed environment approach is described by Kale et al. [33]. This approach uses a technique to identify ground structured terms so that they need not be copied in an environment closing operation; in Conery's approach, these terms would need to be copied.

3.2.3. DIALOG

DIALOG is a distributed model based on dataflow computation [70], extended from the PIM-D dataflow concepts [30,31]. A major feature of the DIALOG binding scheme is that the number of BEs operated on by a process at any instant is restricted to one, namely the process's own BE [69].

The design principle of the DIALOG scheme is based on the following fact: the evaluation of a process can be entirely independent of the precedent (sibling or parent) processes if the variables which appear in the input arguments of the process, and which were bound during the previous unification, have been substituted by their instances according to the previous BE. This is illustrated in Fig. 3, where a process is treated as a black box. The BE is local to the process and is used for instantiating the variables that have been bound earlier in the process.

Fig. 3. The DIALOG non-shared binding scheme.

Variable instantiation operations ensure that all bound variables appearing in an argument are substituted with their corresponding instances in the BE. When structured terms contain variables for substitution, they are reproduced with the variables substituted by their instances. Instantiation operations are applied to the process arguments which are due to be transferred to the next process. Argument transfer may happen either (a) between sibling processes (called interface instantiation), or (b) from a parent process to its child process after head unification (called face instantiation).
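The instantiation operation can be sketched as follows (in Python; the underscore-string variables and tuple-shaped structured terms are illustrative, not DIALOG's internal representation):

```python
def instantiate(term, be):
    """Substitute every locally bound variable in term by its instance,
    so the receiving process never needs to consult this BE."""
    if isinstance(term, str) and term.startswith("_"):
        return be.get(term, term)      # bound: substitute; unbound: keep
    if isinstance(term, tuple):        # structured terms are reproduced
        return tuple(instantiate(arg, be) for arg in term)
    return term                        # a constant

local_be = {"_X": "a"}                 # _X was bound by an earlier
arg = ("f", "_X", "_Y")                # unification in this process
outgoing = instantiate(arg, local_be)  # _Y remains unbound in transit
```

This is the copying cost the text mentions: structured terms containing substitutable variables are reproduced rather than shared, in exchange for each process needing only its own BE.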

Once the evaluation of a child process terminates, the ancestor variables that were bound in the child process are paired with their binding instances and exported back to the parent process. This export operation, together with interface instantiation, achieves the same goal as the back-unification operation in EPILOG [67]. In common with other structure-sharing schemes, the DIALOG scheme introduces some overheads on instantiation operations and structure copying.

3.3. Multiple sequences

The motivation for developing a system using multiple sequential Prolog engines is that the performance of a parallel implementation is very much determined by the performance of the underlying sequential implementation, and that multiple well-coordinated sequential engines should deliver optimal performance. This approach was developed out of the centralised environments approach, and thus shares some common features with it. More important features include that the maximum active parallelism at any given time is matched to the size of the parallel machine, and that the Prolog engine is separated from the scheduler so that either can be replaced or updated without disturbing the other. This means that any existing Prolog engine can be adapted with some modifications, while the responsibility for exploiting parallelism lies with the scheduler. Schedulers will be reviewed later, in the section on task control.

The systems using multiple sequential Prolog engines exploit coarse-grain parallelism very effectively on reasonably large-scale multiprocessors, and the execution time of a single engine is only fractionally more than that of the fastest standard Prolog system [13].

3.3.1. ANL-WAM

One of the early systems using multiple sequences is ANL-WAM (Argonne National Laboratory WAM) [23]. To retain the efficient usage of value cells in the WAM, while allowing simultaneous binding of a variable from different OR branches, ANL-WAM uses the concept of favoured binding: a binding is favoured if it is made on the left-most branch. A favoured binding can be stored in the value cell, whose address and binding are then recorded in a binding node. For every other alternative branch, an unfavoured binding is stored only in a binding node. In both cases, the entry to the binding node is kept in a hash table. Each hash table has a backup copy of its headers for each alternative branch. To dereference a variable whose binding is another variable, the corresponding hash table has to be consulted. An extra flag field must be used in a value cell to identify whether the binding of the variable is favoured or not. The experimental results indicate that the system benefits little from using favoured binding [23].
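The favoured-binding idea can be sketched as follows. This is a minimal Python illustration, not ANL-WAM code: the names (`FavouredStore`, `bind`, `deref`) are ours, a flat array stands in for the WAM value cells, and a dictionary stands in for the hash table of binding nodes.

```python
class FavouredStore:
    """One value-cell array shared by all OR branches, plus per-branch
    binding nodes kept in a hash table, after the ANL-WAM scheme."""
    UNBOUND = object()

    def __init__(self, n_cells):
        self.cells = [self.UNBOUND] * n_cells    # shared WAM value cells
        self.favoured = [False] * n_cells        # extra flag field per cell
        self.nodes = {}                          # hash table: (branch, addr) -> binding

    def bind(self, branch, addr, value, leftmost):
        if leftmost and self.cells[addr] is self.UNBOUND:
            # favoured binding: stored in the value cell itself, flag set
            self.cells[addr] = value
            self.favoured[addr] = True
        # in both cases the entry is kept in the hash table of binding nodes
        self.nodes[(branch, addr)] = value

    def deref(self, branch, addr):
        # here every access goes through the hash table; a real implementation
        # short-cuts to the value cell when the flag marks the binding favoured
        return self.nodes.get((branch, addr), self.UNBOUND)
```

Each OR branch thus sees only its own binding of a shared variable, with the left-most branch additionally caching its binding in the value cell.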

3.3.2. SRI Model and Aurora

A more efficient multiple-sequences model is the SRI Model developed by D.H.D. Warren [63]. Computation in the SRI Model is performed by a number of workers working on different OR branches of the search tree simultaneously, with the depth-first, left-to-right search strategy used in sequential Prolog. When a worker finishes its work on a branch, it switches tasks based on the scheduling policy that the topmost task on any branch should be chosen as a candidate. In order for the shared data to be read-only, each worker has a segment of the shared memory, where the control stack, global stack, local stack and trail are kept and operated in a similar way to the WAM. To handle multiple environments, two binding arrays (global and local) are introduced to store global and local variables respectively. A similar approach, also using binding arrays, was proposed independently by D.S. Warren [64].

Fig. 4. A variable with conditional binding in the SRI Model.

To bind a variable unconditionally, i.e. for a local variable binding, the binding instance is overwritten into the corresponding value cell without trailing. If the binding is conditional, i.e. for a global variable binding, the instance is written into the current working binding array, whose location is recorded in the variable's value cell, and both the variable address and the instance are trailed (only the address is trailed in the WAM), as shown in Fig. 4. Upon backtracking, the corresponding binding array location is found by fetching the pointer stored in the value cell, whose address is obtainable from the trail, and then the binding is 'unwound'.
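The conditional/unconditional distinction can be sketched as follows. This is our simplification of the SRI scheme, not Aurora code: the names (`SRIWorker`, `bind`, `unwind`) are illustrative, and the binding-array slot is simply taken to be the cell address.

```python
UNBOUND = None

class SRIWorker:
    def __init__(self, size):
        self.binding_array = [UNBOUND] * size   # private to this worker
        self.trail = []                         # (address, instance) pairs

def bind(cells, worker, addr, value, conditional):
    if not conditional:
        cells[addr] = value                 # unconditional: overwrite, no trailing
    else:
        worker.binding_array[addr] = value  # instance goes into the binding array
        cells[addr] = ('BA', addr)          # value cell records the array location
        worker.trail.append((addr, value))  # trail both address and instance

def unwind(cells, worker):
    # backtracking: the trail yields the address; the cell yields the slot
    while worker.trail:
        addr, _ = worker.trail.pop()
        _, slot = cells[addr]
        worker.binding_array[slot] = UNBOUND
```

A conditional binding is thus visible only through the worker's own binding array, so different workers can bind the same shared variable without interfering.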

The dereference operation is similar to that of other parallel systems developed from the WAM. Warren [63] estimates that binding and unbinding are about 40% more expensive, and dereferencing about 60% more expensive, than in the WAM. This model imposes less overhead on a worker while it is working than the previous two models do [14,23]. The concept of a numbered variable is introduced in the SRI Model, which simplifies the judgment of seniority when a variable is bound to another.

The main overhead of the SRI Model is the updating of the binding array when a worker switches tasks from one node to another. Whether a binding is conditional or unconditional is decided at run-time, and the promotion of a binding from conditional to unconditional leads to the need to remove the binding from the corresponding binding array.

As part of the Gigalips Project (an informal collaboration between Argonne National Laboratory, USA, Bristol University, UK, and the Swedish Institute of Computer Science (SICS), Sweden), the Aurora OR-Parallel System [40] is an implementation of the SRI Model, built on the experience from ANL-WAM, for SICStus Prolog [13], a portable sequential Prolog system developed at SICS, on a multiprocessor. Different schedulers have been developed for Aurora and will be discussed in Section 4.1.2.

The performance evaluation of the Aurora system shows that the overhead of updating binding arrays on task switching is tolerable in practice, but locking and moving around a shared part of the search tree may cause more overhead [41,42]. Nevertheless, the reported performance [57], compared with the fastest commercial implementation, is encouraging.

3.3.3. MUSE

MUSE (MUltiple SEquential Prolog engines) is another OR-parallel system, developed at SICS in parallel with Aurora [4]. The copying policy adopted in MUSE differs from that used in Aurora in that it provides a higher degree of locality of reference. In MUSE, OR-parallelism is explored by a number of workers, each executing a sequential Prolog engine and having its own choicepoint stack, environment stack, term stack and trail, which are essentially parallel versions of the WAM stacks. Workers also share a part of memory for storing global data.

When copying data from one worker to another after a worker runs out of work, workers incrementally copy parts of the WAM stacks and also share nodes with each other. The two workers involved in copying will copy only the differing parts, while the shared memory space stores information associated with the shared nodes of the search tree. The parts to be copied are always stored in the cache of the source worker. Also, all the WAM stacks are located at fixed addresses in the local address space of each worker, so that relocation of pointers is avoided when a worker copies a segment of stack to another worker. Such an incremental copying policy requires minimal modification of the WAM. Therefore, the features and advantages of sequential Prolog are best preserved in the MUSE model.
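The idea of copying only the differing parts can be sketched as follows. This is our own illustration, not MUSE code, with one WAM-style stack modelled as a Python list.

```python
def incremental_copy(src_stack, dst_stack):
    """Copy from a busy worker (src) to an idle one (dst), transferring only
    the part of the stack that differs. The common prefix (the shared part of
    the search tree) is left in place, and because both stacks live at the same
    local addresses, the copied segment needs no pointer relocation."""
    # find the length of the common prefix
    common = 0
    limit = min(len(src_stack), len(dst_stack))
    while common < limit and src_stack[common] == dst_stack[common]:
        common += 1
    # discard dst's stale tail and copy only src's differing part
    del dst_stack[common:]
    dst_stack.extend(src_stack[common:])
    return len(src_stack) - common   # number of entries actually copied
```

For two workers that diverged late in the search, the copied segment is small, which is the source of MUSE's locality advantage.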


The performance evaluation shows that the MUSE model is faster than the Aurora model when more workers are added [6]. It is suggested that this is due to the large number of non-local accesses of stack variables in Aurora. MUSE and Aurora are further compared in Section 5.

3.3.4. Versions-vector

With many similarities to the SRI Model [63], the Versions-Vector OR-parallel Model (or VV) also features very cheap variable access and expensive task switching [17]. The major difference is that the VV Model allocates a versions vector (instead of a value cell) for each conditionally bound variable. The number of components in a vector is equal to the number of processors in the system, and each component is used by just one processor to store and access the conditional bindings belonging to that processor. A vector is allocated only when a variable obtains its first conditional binding. But space is wasted for variables that are never used by some processors, because the components corresponding to these processors still have to be allocated.
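The versions-vector cell can be sketched as follows. This is a minimal illustration in Python, with names (`VVVariable`, `bind`, `deref`) of our own choosing; it shows both the lazy allocation and the per-processor component layout.

```python
class VVVariable:
    """A variable under the VV scheme: a vector with one component per
    processor, allocated lazily on the first conditional binding."""
    def __init__(self, n_procs):
        self.n_procs = n_procs
        self.vector = None      # no vector until the first conditional binding

    def bind(self, proc_id, value):
        if self.vector is None:
            # first conditional binding: allocate a full vector, one component
            # per processor (the wasted space noted in the text, since unused
            # processors' components are allocated anyway)
            self.vector = [None] * self.n_procs
        self.vector[proc_id] = value

    def deref(self, proc_id):
        # each processor reads only its own component: constant-time access,
        # no interference between processors
        if self.vector is None:
            return None
        return self.vector[proc_id]
```

Variable access is thus as cheap as an array index, at the cost of vectors whose size grows with the machine.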

4. Task control

The amount of exploitable OR-parallelism depends on the search space of the application program. When the search space becomes too large, in other words, when the degree of potential parallelism is much higher than the number of processors used, a cost-effective control strategy is needed in order to keep the system load-balanced. Various task scheduling policies have been proposed for the general control of the parallel tasks generated.

The search space may include speculative work, which is not evaluated in a standard sequential system due to the effect of a cut, but is searched undesirably in a pure OR-parallel system. Performance could be significantly improved if large amounts of speculative work could be avoided. The issues in handling speculative work and the effect of cut have been dealt with in most of the recently implemented systems, and good speed-ups have been reported when speculative work is avoided.

4.1. Task scheduling

4.1.1. Broadcasting

One of the techniques to control the parallelism is the broadcasting method, used in early models. It reduces shared-memory contention by generating, locally in busy processors, portions of the search tree upon the requests of idle processors. Two such systems are KABU-WAKE [38] and ORBIT [68].

ORBIT has a broadcasting system architecture, where a program is duplicated in each processor, and the workload is controlled by a control processor. The processes of the search tree are partitioned into a number of process bundles. An idle processor can obtain a bundle of processes from another processor under the guidance of the control processor, which keeps updated information (the depth) of each processor's control stack, where the processes are stored. The processor with the deepest control stack is chosen by the control processor for feeding the idle processor.

In the KABU-WAKE method, the partition rule is slightly different. Upon the request of an idle processor, a busy processor finds its oldest unprocessed alternative on the search tree, and then sends the portion of tasks under that alternative to the idle processor, meanwhile deleting the alternative from its own stack. Messages are exchanged directly between busy processors and idle processors, without a centralised control processor.
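The KABU-WAKE partition rule can be sketched as follows. This is our own illustration of the rule (the function name and data layout are assumptions), not KABU-WAKE code.

```python
def serve_request(choice_stack):
    """Serve an idle processor's request: hand over the busy processor's
    *oldest* unprocessed alternative and delete it from the local stack.

    choice_stack: list of choice points, oldest first; each choice point is a
    list of its unprocessed alternatives.
    Returns (choice_point_index, alternative), or None if nothing is spare."""
    for i, alternatives in enumerate(choice_stack):   # scan oldest first
        if alternatives:
            alt = alternatives.pop(0)   # give away the oldest alternative...
            return i, alt               # ...the caller ships the subtree under it
    return None   # no unprocessed alternatives: nothing to give
```

Handing out the oldest alternative tends to export the largest remaining subtree, keeping the granularity of migrated work high.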

The parallelism in this scheme is under control for both large-scale and small-scale parallel machines, since the size of a bundle depends upon the number of available processors. But the centralised control processor used in ORBIT can be a bottleneck when a large number of processors make requests to it, or report their control stack information to it, simultaneously. The KABU-WAKE system eliminates the control bottleneck by using extra tokens on the communication stream between processors to allow distributed control of the portion migration.

Another system based on broadcasting hardware is Ali's BC-Machine [3]. Compared to ORBIT [68], this system distributes most control operations to local processors by dynamically assigning a master processor to each processor group. All processors, each having a local memory, are divided into groups dynamically as the search tree is partitioned. Whenever a master creates enough jobs, determined by a threshold, it makes the local unprocessed jobs global in a centralised control memory, so that every local processor in a group will have the same memory image. Therefore the group of processors can be load-balanced effectively. The master then copies its state to all idle processors in parallel via a specialised broadcasting network.

4.1.2. Demand-driven

An alternative approach to scheduling parallel processes may be called demand-driven evaluation. Tasks are migrated to an idle processor only on demand. Each task represents an alternative clause of a choice point generated by a parent process in process-based systems, or an available sub-tree yet to be traversed in OR-branch-based systems. Provided that sufficient parallelism is available in the program, this strategy keeps the amount of parallelism just as high as necessary to fill the multiprocessor system, and likewise bounds the overhead arising from the parallelism.

Among the demand-driven strategies, an early version of Aurora [40], BOPLOG [60] and the BRAVE Abstract Machine (BAM) [22] adopt the same policy for task switching: an idle processor (or worker, as used in the related literature) attempts to obtain a choice from a choice point near the root of the entire search tree. This is illustrated in Fig. 5, where the first available worker will claim work from node B. This policy assumes that the choice closest to the root of the search tree can provide the most work to fuel the sequential processing, so that the granularity of tasks is maximised. One way to guide the idle worker to such a choice is to provide implicit temporal ordering information in the stacks so that an exhaustive search can be avoided [22].

The above task-switching strategy is referred to as the Argonne scheduler and is used in Aurora. Three other schedulers have also been developed for Aurora: the Manchester scheduler [12], the Wavefront scheduler, and the Bristol scheduler [9]. The Bristol scheduler dispatches on the bottom-most node and shares several nodes on each branch at a time; it also takes into account speculative work (see Section 6.4). The others dispatch on the top-most node and share one parallel node on each branch at a time [5]. The Manchester scheduler tries to match idle workers with the 'nearest' available outstanding task, i.e. the task which requires the least number of bindings to be updated between the current and the new positions [12]. The Wavefront scheduler links all the topmost live nodes together in a data structure known as a wavefront; workers traverse the wavefront to find available work [10].

Fig. 5. A search tree with available choice points at nodes A, B and C.

The advantage of dispatching on the top-most is that the size of the shared region is minimised, and the size of tasks is kept as large as possible. This should lead to a minimisation of the number of task switches required. However, the cost of task switching in this scheme is increased, since finding a task always involves a general search of the tree [5].

On the other hand, dispatching on the bottom-most attempts to minimise task-switching overhead, and also reduces the amount of speculative work done.
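The contrast between the two dispatching strategies can be sketched as follows. This is our own illustration (the function names and the branch representation are assumptions): a branch is modelled as a root-to-tip list of nodes, each holding a list of spare tasks.

```python
def dispatch_topmost(branch):
    """Argonne/Manchester style: take work nearest the root, maximising task
    size, at the cost of searching the tree from the top."""
    for node in branch:                  # root-to-tip scan
        if node['work']:
            return node['work'].pop(0)
    return None

def dispatch_bottommost(branch):
    """Bristol/MUSE style: take work nearest the tip, minimising the
    task-switching (binding installation or copying) cost."""
    for node in reversed(branch):        # tip-to-root scan
        if node['work']:
            return node['work'].pop(0)
    return None
```

On the same branch, the two strategies pick opposite ends: topmost favours task granularity, bottom-most favours cheap switching and less speculative work.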

The BC-Machine method mentioned earlier was later replaced by the MUSE method, which uses a bottom-most technique to match idle workers with available work as follows [4]. An idle worker always attempts to get the nearest piece of available work on the current branch. If no available work is found, it attempts to choose a busy worker to share its excess local work. If no busy workers are found, it stays idle at a suitable position on the tree. Busy workers are made visible only to the nearest idle worker, so that the task-switching cost is reduced and no more than one worker can get work from a single busy worker.

In the PEPSys Model [7,65], all tasks are initially sequential; where applicable, potential OR-parallel processes are indicated by branch points, created from user-provided OR-parallel specifiers. The Delphi model divides the search tree evenly into OR-branches for the workers when execution starts. It avoids environment copying during task switching by redirecting the worker to another OR-branch through local backtracking [7]. The decision for task switching is also made through a centralised control mechanism.

The disadvantage of the above systems is that high memory contention is unavoidable for a large-scale system, since the scheduling decision is made on the search-tree information stored in a global memory.

Such memory latency is alleviated in BRAVE [50], where the memory is organised and operated in a similar way to the SRI Model [63], but the scheduling policy is more distributed than in the SRI Model. Each worker in BRAVE keeps a task list in its local memory. A worker needs to consult other workers (rather than the shared search-tree information) only when its own task list becomes empty, i.e. when it needs more tasks to process.

BRAVE has been implemented on the reduction machine GRIP [49], on the transputer [47], and later on a message-driven machine [22].

4.1.3. Message-passing

The demand-driven control strategy described above derives from sequential implementation on either a shared-memory system or a centralised control system. Task switching in the former is decided by consulting the global search-tree information, which may be highly costly on a massively parallel machine, in addition to the cost imposed by the task switching itself. The portion migration using broadcasting in KABU-WAKE requires a high communication bandwidth, due to its duplicated copying of a large portion of information between processors and the necessary post-processing in the sender processor. Broadcasting from and reporting to the control processor in the latter system can also limit the practical number of processors.

To decentralise the control and distribute the shared information as well, message-passing models have been proposed. For example, the AND/OR Process Model [19] controls OR-parallel processes locally at the process level. In this model, processes communicate with each other by passing messages under certain state modes. When an OR-parallel process receives a start message from its parent, it attempts to unify its head with the head of each candidate clause. If the unification is successful and the descendant AND processes in the body (if any) succeed too, the OR process sends a success message to its parent and goes into the gathering mode. Otherwise, a fail message is sent to the parent. During the evaluation of the descendant AND processes, the OR process is in the waiting mode. An AND process may receive a redo message from its parent OR process after it sends a success message with a result to the parent. The redo message will cause the AND process to immediately start working on its next answer. The OR process meanwhile sends the result to its own parent process if it is in the waiting mode.
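One activation of an OR process under this protocol can be sketched as follows. This is our own simplification, not the model's specification: head unification is reduced to atom comparison, and `solve_body` stands in for spawning the descendant AND processes.

```python
def or_process(goal, clauses, send_to_parent, solve_body):
    """Handle one 'start' message in an OR process (illustrative sketch).

    clauses: list of (head, body) candidate clauses.
    send_to_parent: callable used to emit the reply message.
    solve_body: stand-in for evaluating the descendant AND processes
    (while they run, the process would be in the 'waiting' mode)."""
    for head, body in clauses:
        # head unification stand-in: we just compare atoms
        if head == goal and solve_body(body):
            send_to_parent(('success', goal))
            return 'gathering'          # await a possible 'redo' from the parent
    send_to_parent(('fail', goal))
    return 'terminated'
```

The point of the sketch is that all control flows through local parent-child messages (start, success, fail, redo), with no global search-tree structure consulted.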

The scheme can achieve control of the parallelism through local communication between processes. Therefore, it is well suited to a highly distributed large system. The main disadvantage of the scheme is the extra overhead caused by the two-way communication. How the control scheme is guided by the run-time workload of the system is also unknown.

4.2. Side-effects and speculative work

The simplest way to avoid altering the semantics of side-effect predicates such as cut, assert and retract is to forbid their use, replacing them with alternative constructs which achieve the same effect. The rationale for this approach is that Prolog was developed for uniprocessor systems and has constructs which are inappropriate for parallelism [4]. An example of using alternative constructs is BRAVE [51]. Other systems use approaches similar to that of the Aurora model. The Aurora philosophy is that current sequential Prolog programs should work unmodified on the parallel system, which makes it more difficult to implement cut, assert and retract. This approach is motivated by the observation that standard Prolog semantics are well accepted by the logic programming community, and a considerable amount of code is already in existence [28]. MUSE compromises by implementing cavalier commit instead of cut, and asynchronous assert and retract in place of synchronous (sequential) side-effects [4,5].

On the other hand, cut is used in sequential Prolog to avoid unnecessary computation, for example when the required solution of a problem is just one (or a subset) of many possible solutions. But in an OR-parallel Prolog system running on a multiprocessor architecture, when one processor finds a solution, other processors may still be working on further solutions which are not needed. Such wasted work is called speculative; it could be cut away in sequential Prolog, as shown in Fig. 6.

4.2.1. BRAVE

BRAVE is designed for all-solutions parallel execution, where cut, assert and retract are removed on the basis that these features compromise parallel execution. Alternative features are provided in order to code certain algorithms [51].

Cut is replaced by an if_then_else construction to implement conditional control. The construction uses the syntax:

    p :- q -> r ; s.

(if q then r else s). In sequential Prolog this would be coded as:

    p :- q, !, r.
    p :- s.

Note that SICStus Prolog also provides an if predicate [51], where the above would be coded as:

    if(q, r, s).

Difficulties arise when q has more than one solution. In parallel execution, the solution of q committed to will be indeterminate; as a result, the semantics may change from run to run. Two approaches are available. BRAVE continues execution unless another solution for q is found, in which case a run-time error is reported. An alternative is provided in MU-Prolog [43], which suspends q until it becomes ground. This approach can be implemented in BRAVE, since goal suspension is supported, though it requires more programmer effort.

Fig. 6. Speculative work in a search tree.

Assert and retract are handled by a database for partial results (known as lemmas). This allows meta-control of assert and retract, rather than the usual parallel approach of allowing these clauses to execute as they are encountered (asynchronous assert and retract). This again has the advantage of allowing tighter programmer control, but with the disadvantage of requiring greater programmer effort.

4.2.2. MUSE

MUSE supports a generalisation of cut, known as commit, which is a special form of pruning operator for parallel implementation and is not guaranteed to produce effects identical to sequential cut [4]. This provides simplicity, but has the danger that the sequential semantics may be altered. Full Prolog has since been implemented in MUSE, including cut and the standard side-effect predicates (e.g. read, write, assert, retract, etc.). The proposed mechanism for handling sequential side-effects is to allow them to execute only on the left-most branch of the search tree, suspending execution until that point [5].

A novel method for handling write-type side-effects (e.g. write and assert) is used in MUSE [34], exploiting the fact that such side-effects do not alter the binding environment. If a worker is unable to execute a write-type side-effect predicate, the predicate is temporarily saved in a suitable node on the search tree, and the worker then continues its execution as if that side-effect had been executed. The same method has been used to implement the findall, bagof and setof predicates when saving multiple solutions.

4.2.3. ROPM

Speculative work is pruned in a different way in ROPM [46]. A branch of the search tree pruned by a large cut is not discarded; it is restarted if the larger cut is itself pruned later by a smaller cut. This scheme wastes a large amount of memory space, because pruned regions cannot be discarded even if they will never be restarted.

4.2.4. Handling speculative work in Aurora

The Aurora implementation also supports the commit operator and synchronous side-effects [29]. In general, if a cut is in the scope of another cut, the cut with the larger scope must not prune away branches that would be pruned away by the cut with the smaller scope.

Various schemes for handling speculative work have been implemented in Aurora: the local leftmost and scope-information-based schemes for pruning speculative work, and the local-less-speculative-preferred, delayed-release and depth-first search schemes for scheduling speculative work.

4.2.4.1. Pruning. The local leftmost scheme is the simplest solution. It prunes all the branches rooted at the right siblings of a worker's sentry node, and suspends the execution of the cut until the branch becomes leftmost in the subtree belonging to the predicate containing the cut. The drawback of this scheme is the possibility of workers executing parts of the search tree which will be cut away, so that effort is wasted on this speculative work.

The scope-information-based scheme immediately prunes away all branches which would be pruned away by cuts with smaller scopes. Execution of a cut is suspended only if it is in the scope of a cut with a smaller scope. The implementation is by means of cut counters, which record the current number of cuts in a worker's continuation.
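The cut-counter bookkeeping can be sketched as follows. This is our own minimal illustration (the class and method names are assumptions), showing only the counter itself, not the pruning machinery around it.

```python
class WorkerCuts:
    """Per-worker cut counter: the number of cuts remaining in this worker's
    continuation, used to judge whether its work is speculative."""
    def __init__(self):
        self.cut_counter = 0

    def enter_clause(self, clause_has_cut):
        if clause_has_cut:
            self.cut_counter += 1   # the new work lies in one more cut's scope

    def execute_cut(self):
        self.cut_counter -= 1       # one pending cut has been resolved

    def is_speculative(self):
        # speculative while at least one pending cut could still prune the work
        return self.cut_counter > 0
```

Comparing two workers' counters gives the scheduler the scope information it needs: work inside more pending cut scopes is more speculative.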

In another approach proposed by Ali [2], a branch is delayed for execution until it can no longer be affected by cuts. While this prevents speculative work completely, it severely limits the amount of parallelism.

The scope information based approach was found to exhibit better speedups than the local leftmost scheme for more than 4 workers [29].

4.2.4.2. Scheduling. Ideally, if speculative work exists in a search tree, workers should not be committed to this area while useful work exists in other areas. The least speculative tasks are found in the leftmost branch of the subtree, and tasks become less speculative in the lower part of the subtree. Speculativeness can be assessed by counting the number of branches leading to cuts that could prune the work. Hausman proposed several schemes in an attempt to minimise the amount of time spent in executing speculative work [29].


The delayed release scheme makes speculative work available to other workers less frequently than non-speculative tasks. The delay before speculative work is made available can be either proportional to the speculativeness of the tasks, or constant. This scheme can be implemented by increasing the granularity of speculative work: work is made available after a certain number of calls, normally 10, increased by a factor when the work is speculative. Constant delay was found to give better performance, due to the overheads of counting pruning branches.
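The constant-delay variant can be sketched as follows. The threshold of 10 calls is from the text; the speculative factor is an assumption of ours, since the text does not fix a value for it.

```python
RELEASE_AFTER = 10       # calls before non-speculative work is made public (from the text)
SPECULATIVE_FACTOR = 4   # assumed factor; the text only says "increased by a factor"

def should_release(calls_done, speculative):
    """Constant-delay release rule: work goes public after a fixed number of
    calls, with a larger threshold when the work is speculative."""
    threshold = RELEASE_AFTER * (SPECULATIVE_FACTOR if speculative else 1)
    return calls_done >= threshold
```

The effect is that speculative work stays private longer, so idle workers are steered towards non-speculative tasks first, without any counting of pruning branches at release time.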

The local-less-speculative-preferred scheme has workers migrate to the leftmost branch leading to work and take the topmost available task there. Migration is controlled by having workers migrate to the branch with the least number of active workers, unless the work is speculative, in which case a worker migrates to the local leftmost branch leading to work.

The depth-first search scheme has workers attempting to take the bottom-most available task in the leftmost branch when work is speculative. Implementation of this in speculative regions is achieved by keeping all nodes in a boundary branch public, i.e. above the dispatching nodes. The general strategy is for workers searching for work to take the bottom-most available task instead of the topmost.

The combination of the depth-first and local-less-speculative-preferred strategies is considered the best, and has been used in the Bristol scheduler for Aurora [9]. An example of this scheme is shown in Fig. 7, where worker A is due to perform a cut and five other workers are working in the region to be pruned. In the Bristol scheduler, worker A will be able to identify and then interrupt workers B, E and F. Worker B will search the branch whose root S was suspended during the pruning by worker A, and will interrupt worker C. Worker D will in turn be interrupted by worker C. The five interrupted workers will look for work elsewhere, outside the pruned region.

5. Performance on multiprocessors

Of the OR-parallel systems reviewed above, several have been successfully implemented on multiprocessor machines and have shown promising speed-ups. Among these, Aurora and MUSE are the most representative in providing typical performance characteristics of an OR-parallel system. This section briefly compares Aurora and MUSE and then summarises their performance.

Fig. 7. Bristol Scheduler scheme for handling speculative work.

Both MUSE [4] and Aurora [40] exploit OR-parallelism by using a number of workers (processes or processors), each working on a different part of the Prolog search tree. As described in Section 3.3, different binding management approaches are used in the two systems: MUSE uses incremental copying of the WAM stacks, while Aurora uses the SRI Model.

The SRI Model extends the WAM by using a large binding array in each worker and modifying the trail to contain address-value pairs instead of just addresses [63]. A binding array is used in each worker to store and access variable bindings which are potentially shareable. The WAM stacks are shared by all workers. In MUSE, however, each worker has its own copies of the WAM stacks, plus some global address space shared by all workers. Workers incrementally copy parts of the stacks, and also share nodes with each other, when a worker runs out of work.

MUSE and Aurora also use different schedulers for exploiting and controlling parallelism (see also Section 4.1.2). Among the schedulers developed for Aurora, the Argonne scheduler and the Manchester scheduler have been evaluated for their performance on various machines; according to the reported results, the latter always outperforms the former. MUSE has only one scheduler. The main difference between the two Aurora schedulers and the MUSE scheduler is in the strategy used for dispatching work: the Argonne and Manchester schedulers take work from the topmost node on a branch, while the MUSE scheduler always takes the bottom-most node on a branch.

Many optimisations have been made for both Aurora and MUSE on the machines reported below [5,58]. The only optimisation that has been implemented for MUSE but not for Aurora is caching the WAM stacks on the BBN Butterfly TC2000. A detailed performance comparison between MUSE and Aurora running a knowledge-based system application is reported in [6].

5.1. On Sequent Symmetry

An early version of Aurora, using the Manchester scheduler, has been instrumented to evaluate a basic set of profiling data [57]. The benchmark suite consists of three groups of programs, classified according to their speed-ups on a Sequent Symmetry S27 multiprocessor (12 processors and 16 Mb of shared memory): programs with high speed-ups (e.g. the 8-queens problem) provide large search spaces, while programs with medium speed-ups (e.g. the zebra puzzle) and low speed-ups (e.g. the farmer-crossing-river problem) provide relatively smaller search spaces. These programs do not contain any large speculative work.

It is found that the Aurora execution time comes mainly from three sources: sequential execution plus parallel administration, task switching, and processor idle time [57]. The parallel administration overhead using the SRI binding scheme is roughly 25-30% of the sequential execution time; in other words, the ratio of the running time on Aurora with one worker to the running time on SICStus0.3 is about 1.25 to 1.30. The task-switching overhead increases considerably when the granularity decreases. The averages of the task-switching overheads do not vary among the three groups, but vary considerably for individual programs, because the frequency of scheduling operations varies from one program to another. The idle time is determined by the amount of parallelism available in the program, delays in the creation of work due to task switching, and the granularity of the exploitable parallelism.

A shared memory architecture requires centralised memory administration operations in implementing an OR-parallel system, such as locking when extending or shrinking parts of the search tree, and data migration when updating binding arrays [63]. The former accounts for 6-7% of the total overhead time and increases slightly with more workers. The latter accounts for at most 10% of the total overhead time and increases proportionally with the number of workers.
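The binding-array data migration mentioned above can be sketched as follows. This is a much-simplified illustration of the SRI scheme [63]; the node structure and the `migrate` helper are our own, not the actual implementation.

```python
# A sketch of SRI-style binding-array "data migration" on task
# switching. Each tree node records the conditional bindings made on
# the way into it; a worker moving to a new node deinstalls bindings
# up to the common ancestor, then installs bindings down to the target.

class Node:
    def __init__(self, parent=None, bindings=()):
        self.parent = parent
        self.bindings = list(bindings)   # [(var_index, value), ...]

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def migrate(binding_array, src, dst):
    """Update a worker's private binding array when moving src -> dst."""
    src_path, dst_path = path_to_root(src), path_to_root(dst)
    common = next(n for n in src_path if n in dst_path)
    node = src
    while node is not common:                    # deinstall upwards
        for var, _ in node.bindings:
            binding_array[var] = None
        node = node.parent
    down = []
    node = dst
    while node is not common:                    # collect dst's path
        down.append(node)
        node = node.parent
    for node in reversed(down):                  # install downwards
        for var, val in node.bindings:
            binding_array[var] = val

root = Node()
left = Node(root, [(0, "a")])
right = Node(root, [(0, "b"), (1, "c")])
ba = [None, None]
migrate(ba, root, left)    # worker starts at root, moves to left
migrate(ba, left, right)   # task switch: left -> right
print(ba)                  # ['b', 'c']
```

The cost of `migrate` grows with the distance between the two nodes, which is why task switching is not constant time under this scheme.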

MUSE was also evaluated on a Sequent Symmetry (S81 with 16 processors and 32 MB shared memory), but using SICStus0.6 Prolog as its engine [4]. Evaluating the same benchmark suite as used by Szeredi for Aurora [57], plus an additional set of large and real programs, produced results similar to those for Aurora, except that the absolute execution times on MUSE are about 30-50% shorter than those on Aurora. The results show almost linear speed-ups for the programs with coarse-grain parallelism, reasonable speed-ups for programs with medium-grain parallelism and low speed-ups for programs with fine-grain parallelism.

Table 1
Run-times (in seconds) of MUSE and Aurora on Sequent Symmetry

System   Benchmark   1 worker        4 workers      8 workers      15 workers      25 workers
MUSE     circuit     426.74 (1.00)   -              -              28.73 (14.9)    17.39 (24.5)
         8-queens    6.910           1.740 (3.97)   0.880 (7.85)   0.490 (14.10)   -
         zebra       4.390           1.331 (3.30)   0.840 (5.23)   0.689 (6.37)    -
         farmer      3.199           1.399 (2.29)   1.419 (2.25)   1.429 (2.24)    -
Aurora   circuit     533.69 (0.80)   -              -              36.06 (11.8)    21.83 (19.5)
         8-queens    7.831           2.000 (3.92)   1.010 (7.75)   0.559 (14.01)   -
         zebra       5.021           1.480 (3.39)   0.940 (5.34)   0.769 (6.53)    -
         farmer      3.620           2.110 (1.72)   2.110 (1.72)   2.390 (1.51)    -

It is found that copying part of a worker's state, making part of the search tree shareable, and grabbing a piece of work from a shared node are the major sources of overhead.
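The copying overhead can be illustrated with a minimal sketch of incremental copying, assuming (as MUSE does) that the idle worker already holds a valid prefix of the busy worker's stacks; the single-list stack and the `incremental_copy` helper are our own simplifications.

```python
# A sketch of MUSE-style incremental stack copying. The idle worker
# already shares a prefix of the busy worker's stack up to their
# deepest common node, so only the suffix beyond that prefix needs to
# be copied; this per-sharing cost grows as granularity shrinks.

def incremental_copy(busy_stack, idle_stack, common_depth):
    """Return the idle worker's new stack and the number of cells copied."""
    suffix = busy_stack[common_depth:]           # the part that differs
    new_stack = idle_stack[:common_depth] + suffix
    return new_stack, len(suffix)

busy = ["f0", "f1", "f2", "f3", "f4"]            # busy worker's frames
idle = ["f0", "f1", "x2"]                        # shares frames f0-f1
stack, copied = incremental_copy(busy, idle, 2)
print(stack, copied)   # ['f0', 'f1', 'f2', 'f3', 'f4'] 3
```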

Several performance comparisons have been conducted on Aurora and MUSE [4,6]. Table 1 lists the performance data selected from [4] and [6], where speed-ups are shown in parentheses. 'Circuit' is a knowledge-based program for designing circuit boards [6].

5.2. On BBN Butterfly

Performance evaluation has also been conducted on switch-based multiprocessor architectures, such as the BBN Butterfly GP1000 [41,42] and Butterfly TC2000 [42]. A switch-based machine has both local and non-local memories with different access times, as opposed to bus-based machines, such as the Sequent Symmetry, which give a uniform memory access time. An advantage of switch-based machines is their high scalability and thus the capability of running more processors. The programs described above that showed high speed-ups for up to 11 processors, such as 8-queens, do not provide sufficient parallelism for the larger-scale Butterfly machines.

The Butterfly GP1000 and TC2000 have similar architectures. The TC2000 is faster than the GP1000 and has three levels of memory hierarchy: cache, local and remote memories.

[Fig. 8 plots speed-up against the number of processors (up to 50) for 11-queens, 8-queens and tina, each under the Manchester and the Argonne scheduler.]

Fig. 8. The Manchester and Argonne schedulers on TC2000 [42].

When evaluating the performance on the Butterfly GP1000 and TC2000, the Aurora engine (based on SICStus0.3 Prolog on the GP1000, and on SICStus0.6 on the TC2000) and the binding environments are copied onto each of the processors, and shared data structures are distributed among all the processors. All the large benchmarks (e.g. the 11-queens problem) give near-linear speed-ups on up to 36 processors. The speed-ups of relatively smaller programs (e.g. the 8-queens problem and the holiday-planning program 'tina') start levelling off at 16 processors. The slowdown on the latter programs is due to the increase in non-local memory accesses and the lower cost-effectiveness of scheduling tasks with smaller granularity. The Argonne and Manchester schedulers were tested individually with the engine, with the performance shown in Fig. 8.

When evaluating the benchmark suite adopted by Szeredi [57], it was found that the programs with low and medium speedups could not show performance improvements when more than 4 processors were used on Butterfly TC2000.

Though both schedulers have shown their efficiency for programs with large and well-balanced search spaces, they do not perform well for programs with small search spaces, largely due to switch contention. However, the overall results on the 11-queens program (optimised for a better-balanced search space) show that over 600 KLIPS can be achieved on the Butterfly TC2000 with 36 processors, even though the caching capability is not fully exploited.

MUSE, however, supports caching of the WAM code stored locally in each worker. Its performance on the Butterfly TC2000, using the same set of benchmarks as for Aurora, shows that the execution time of an individual MUSE worker is mainly due to:
• sequential execution plus interrupt checking and local updating,
• waiting for work to be generated and looking for work to share, and
• making parts of the search tree shareable by other workers.

The results show that when running on one worker, MUSE is about 22% slower than SICStus0.6 Prolog, on which the MUSE engine is based. The major overhead in supporting parallel execution is due to the operations required to make work shareable with other workers. Such operations include copying the WAM stacks from the current worker to other workers and synchronising the workers involved. This overhead increases as the granularity of parallelism decreases.

The overall performance of MUSE is quite encouraging. The average real speed-up on 32 TC2000 processors over one processor is 25.4 for the programs with coarse grain parallelism. For all the benchmarks tested on TC2000, MUSE is faster than Aurora by 39% to 171% (Table 2).

Table 2
Run-times (in seconds) of MUSE and Aurora on BBN Butterfly TC2000

System   Benchmark   1 worker        10 workers     20 workers     30 workers     37 workers
MUSE     circuit     105.97 (1.00)   10.81 (9.80)   5.56 (19.1)    3.93 (27.0)    3.29 (32.2)
         11-queens   225.23          22.78 (9.89)   11.58 (19.4)   7.88 (28.6)    -
         8-queens    1.79            0.21 (8.52)    0.14 (12.8)    0.13 (13.8)    -
         zebra       0.98            0.58 (1.69)    0.61 (1.61)    0.63 (1.56)    -
         farmer      0.83            1.01 (0.82)    1.03 (0.81)    1.07 (0.78)    -
Aurora   circuit     180.55 (0.59)   22.12 (4.79)   16.02 (6.61)   13.66 (7.76)   13.79 (7.68)
         11-queens   369.14          36.92 (10.0)   18.54 (19.9)   12.47 (29.6)   -
         8-queens    2.85            0.33 (8.64)    0.21 (13.6)    0.22 (13.0)    -
         zebra       1.55            0.79 (1.96)    1.12 (1.38)    1.99 (0.78)    -
         farmer      1.14            1.80 (0.63)    2.14 (0.53)    2.33 (0.49)    -

6. Summary

Efficient binding management schemes and task control strategies for the support of OR-parallelism have been extensively investigated over the last few years. The process-based model and the OR-branch-based model are the two main models for representing parallel tasks. The most representative systems are summarised in Table 3, which shows the differences in terms of the languages supported, the implementation models, the binding management schemes and the task control strategies.

Table 3
A summary of the representative systems

System           Language          Model       BE management                        Task control
AND/OR Process   pure LP           process     distributed BE                       message-passing
ANL-WAM          pure LP           OR-branch   multi-sequence                       demand-driven
Aurora           full Prolog       OR-branch   multi-sequence (structure sharing)   demand-driven
BRAVE            modified Prolog   OR-branch   multi-sequence                       demand-driven
MUSE             full Prolog       OR-branch   multi-sequence (structure copying)   demand-driven
ROPM             modified Prolog   process     distributed BE                       message-passing
PEPSys           modified Prolog   process     centralised BE                       demand-driven

The systems based on the centralised environment scheme use a global auxiliary structure to store all the variable bindings. Though high efficiency has been achieved in some schemes, high system scalability may still be difficult to obtain. The major disadvantages of the centralised binding scheme are that dereferencing a variable bound at a very early stage can be costly, and that linking the BEs generated at different resolution stages requires a shared memory organisation to support the auxiliary structure. The scalability problem also exists for task control strategies that use a similarly centralised structure.

Among the models using centralised auxiliary structures, the directory tree method [14], in which each node has its own directory containing a number of contexts, has non-constant-time environment creation. The hashing window method reduces the task-switching overhead, but pays more for variable access [11,65]. The time-stamping method used in BOPLOG [60] sacrifices both constant-time variable access and constant-time task switching.
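The hash-window trade-off can be sketched minimally as follows, with a hypothetical dictionary-based window per node; the real schemes [11,65] hash on variable addresses, but the cost profile is the same.

```python
# A sketch of hash-window variable access. Each OR-branch node carries
# a small hash table of conditional bindings; task switching is cheap
# because no binding array has to be rebuilt, but a variable access may
# have to search every window between the current node and the
# variable's home node.

def lookup(var, node):
    """Search hash windows from the current node towards the root.

    Returns (value, windows searched); value is None if unbound."""
    steps = 0
    while node is not None:
        steps += 1
        if var in node["window"]:
            return node["window"][var], steps
        node = node["parent"]
    return None, steps

root = {"parent": None, "window": {"X": "a"}}
mid  = {"parent": root, "window": {}}
tip  = {"parent": mid,  "window": {"Y": "b"}}

print(lookup("Y", tip))   # ('b', 1)  -- bound locally, cheap
print(lookup("X", tip))   # ('a', 3)  -- walks three windows
```

The deeper the tree grows, the longer the worst-case search chain, which is exactly the non-constant variable-access cost noted above.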

Attempts have been made by other researchers to achieve better scalability and distributability by using distributed approaches. These distributed models also allow better task control, but usually pay a price in environment creation and updating when extensive structure copying is needed. Among these models, the closing environment method [20] and DIALOG [69] have non-constant-time environment creation. The variable importation method sacrifices both constant-time environment creation and constant-time variable access [39].

Yet the most promising binding management scheme has been the multiple-sequences scheme. The two best-known systems using this scheme are Aurora [40] and MUSE [4], which, though sacrificing constant-time task switching, maintain the efficiency of standard sequential Prolog while offering high scalability. Both Aurora and MUSE have been successfully implemented on a number of multiprocessor machines with large numbers of processors, and have demonstrated encouraging performance. This class of systems can be expected to show further improved performance as more efficient schedulers are developed.

Recently developed schedulers have taken into account speculative work caused by the use of cut. Speculative work represents an interesting source of potential parallelism, provided that wasted computation can be avoided gracefully. Previous studies have shown that if an effective method is found to eliminate much of the wasted speculative work, significant speed-ups can be obtained [10]. This is especially true where only a subset of the possible solutions is required.

The current trend is to design and implement working systems which exploit both OR- and AND-parallelism. Intuitively, a combined AND/OR-parallel system is expected to offer superior performance over a system which exploits only one type of parallelism.


Acknowledgments

The author is very grateful to Khayri Ali, Mehmet Orgun, Chengzheng Sun and Rong Yang for their comments and suggestions on an early draft, which clarified the use of certain terminology and significantly improved the presentation of the paper. Thanks also go to Paul English for his discussions on speculative work. The anonymous referees are thanked for their comments and suggestions.

References

[1] H. Ait-Kaci, Warren's Abstract Machine - A Tutorial Reconstruction (MIT Press, Cambridge, MA, 1991).
[2] K.A.M. Ali, A method for implementing cut in parallel execution of Prolog, in: Proc. 1987 Symp. on Logic Programming, San Francisco, USA (1987) 449-456.
[3] K.A.M. Ali, OR-parallel execution of Prolog on BC-machine, in: Proc. 5th Internat. Conf. and Symp. on Logic Programming, Seattle, WA, USA (15-19 Aug. 1988) 1531-1545.
[4] K.A.M. Ali and R. Karlsson, The Muse Or-Parallel Prolog model and its performance, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990).
[5] K.A.M. Ali and R. Karlsson, Scheduling Or-Parallelism in Muse, in: Proc. 8th Internat. Conf. on Logic Programming, Paris, France (June 1991).
[6] K.A.M. Ali and R. Karlsson, OR-Parallel Speedups in a Knowledge Based System: on Muse and Aurora, in: FGCS'92, Tokyo (1992).
[7] H. Alshawi and D.B. Moran, The Delphi Model and Some Preliminary Experiments, in: Proc. 5th Internat. Conf. and Symp. on Logic Programming, Seattle, WA, USA (15-19 Aug. 1988) 1578-1589.
[8] U. Baron et al., The Parallel ECRC Prolog System PEPSys: An overview and evaluation results, in: Proc. Internat. Conf. on Fifth Generation Computer Systems, Tokyo (28 Nov. - 2 Dec. 1988) 841-850.
[9] A. Beaumont et al., Flexible scheduling of OR-Parallelism in Aurora: The Bristol Scheduler, in: PARLE'91: Conf. on Parallel Architectures and Languages Europe (Springer, Berlin, June 1991).
[10] A. Beaumont, Scheduling strategies and speculative work, in: Proc. ICLP'91 Pre-conference Workshop on Parallel Execution of Logic Programs, Paris, France (June 1991).
[11] P. Borgwardt, Parallel Prolog using stack segments on shared-memory multiprocessors, in: Proc. 1984 Internat. Symp. on Logic Programming (Feb. 1984) 2-11.
[12] A. Calderwood and P. Szeredi, Scheduling OR-Parallelism in Aurora - the Manchester Scheduler, in: Proc. 6th Internat. Conf. on Logic Programming (June 1989) 419-435.
[13] M. Carlsson and J. Widen, SICStus Prolog User's Manual, SICS Research Report R88007B, October 1988.
[14] A. Ciepielewski and S. Haridi, A formal model for OR-parallel execution of logic programs, in: Proc. Information Processing 83 (1983) 299-305.
[15] A. Ciepielewski, B. Hausman and S. Haridi, Initial evaluation of a virtual machine for OR-parallel execution of logic programs, in: Proc. IFIP TC-10 Working Conf. on Fifth Generation Computer Architectures, J.V. Woods (ed.) (Elsevier Science, Amsterdam, 1985) 81-99.
[16] A. Ciepielewski and B. Hausman, Performance evaluation of a storage model for OR-parallel execution of logic programs, in: Proc. IEEE Symp. on Logic Programming, Salt Lake City, Utah, USA (Sep. 1986) 246-257.
[17] A. Ciepielewski, S. Haridi and B. Hausman, OR-parallel prolog on shared memory multiprocessors, J. Logic Programming 7 (1989) 125-147.
[18] W.F. Clocksin and C.S. Mellish, Programming in Prolog (Springer, New York, 1981).
[19] J.S. Conery, Parallel Execution of Logic Programs (Kluwer, Dordrecht, 1987).
[20] J.S. Conery, Binding environments for parallel logic programs in non-shared memory multiprocessors, Internat. J. Parallel Programming 17 (2) (1988) 125-152.
[21] J. Crammond, A comparative study of unification algorithms for OR-parallel execution of logic languages, IEEE Trans. Comput. C-34 (10) (Oct. 1985) 911-917.
[22] S.A. Delgado-Rannauro and T.J. Reynolds, A message driven OR-parallel machine, in: Proc. 3rd Internat. Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, USA (April 1989) 217-226.
[23] T. Disz, E. Lusk and R. Overbeek, Experiments with OR-parallel logic programming, in: Proc. 4th Internat. Conf. on Logic Programming, J-L. Lassez (ed.) (1987) 576-600.
[24] J. Gabriel, T. Lindholm, E.L. Lusk and R.A. Overbeek, A tutorial on the Warren abstract machine for computational logic, Technical Report ANL-84-84, Argonne National Laboratory, Argonne, USA, June 1985.
[25] G. Gupta and B. Jayaraman, On criteria for OR-parallel execution models of logic programs, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990) 605-623.
[26] Z. Halim, Data-driven and demand-driven evaluation of logic programs, PhD Thesis, Dept. of Computer Science, University of Manchester, 1984.
[27] Z. Halim, A data-driven machine for OR-parallel evaluation of logic programs, New Generation Comput. 4 (1986) 5-33.
[28] B. Hausman, A. Ciepielewski and A. Calderwood, Cut and side-effects in OR-parallel prolog, in: Proc. Internat. Conf. on Fifth Generation Computer Systems, Tokyo (28 Nov. - 2 Dec. 1988) 831-840.
[29] B. Hausman, Handling of speculative work in OR-parallel PROLOG: Evaluation results, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990).
[30] N. Ito and H. Shimizu, Dataflow based execution mechanisms of parallel and concurrent Prolog, New Generation Comput. 3 (1985) 15-41.
[31] N. Ito et al., The architecture and preliminary evaluation results of the experimental parallel inference machine PIM-D, in: Proc. 13th Ann. Internat. Symp. on Computer Architecture (1986) 533-541.
[32] L.V. Kale, D.A. Padua and D.C. Sehr, OR parallel execution of prolog programs with side effects, J. Supercomput. 2 (1988) 209-223.
[33] L.V. Kale, B. Ramkumar and W. Shu, A memory organisation independent binding environment for AND and OR parallel execution of logic programs, in: Proc. 5th Internat. Conf. and Symp. on Logic Programming, Seattle, WA, USA (15-19 Aug. 1988) 1223-1240.
[34] R. Karlsson, A high performance OR-parallel prolog system, PhD Thesis, The Royal Institute of Technology and Swedish Institute of Computer Science, March 1992.
[35] K. Knight, Unification: A multidisciplinary survey, ACM Comput. Surv. 21 (1) (March 1989) 93-124.
[36] P.M. Kogge, The Architecture of Symbolic Computers (McGraw-Hill, New York, 1991).
[37] R.A. Kowalski, Predicate logic as a programming language, in: Proc. Information Processing 74 (Aug. 1974) 569-574.
[38] K. Kumon et al., KABU-WAKE: A new parallel inference method and its evaluation, COMPCON Spring'86 (1986) 168-172.
[39] G. Lindstrom, OR-parallelism on applicative architectures, in: Proc. 2nd Internat. Logic Programming Conf. (July 1984) 159-170.
[40] E. Lusk, D.H.D. Warren, S. Haridi et al., The Aurora OR-Parallel Prolog System, in: Proc. Internat. Conf. on Fifth Generation Computer Systems, Tokyo (28 Nov. - 2 Dec. 1988) 819-830.
[41] S. Mudambi, Performance of Aurora on a switch-based multiprocessor, in: Proc. 1989 North American Conf. on Logic Programming, Cleveland, USA (Oct. 1989).
[42] S. Mudambi, Performance of Aurora on NUMA machines, in: Proc. 1991 Internat. Symp. on Logic Programming, San Diego, USA (Oct. 1991).
[43] L. Naish, Negation and Control in Prolog, LNCS-238 (Springer, Berlin, 1985).
[44] G.Z. Qadah and M. Nussbaum, Logic Machines: A Survey, in: AFIPS Conf. Proc. NCC, Vol. 56, Chicago, IL (June 1987) 256-278.
[45] M. Ratcliffe and J-C. Syre, The PEPSys parallel logic programming language, in: Proc. 10th Internat. Joint Conf. on Artificial Intelligence, Milano, Italy (Aug. 1987).
[46] B. Ramkumar and L. Kale, Compiled execution of the Reduce-OR process model on multiprocessors, in: Proc. 1989 North American Conf. on Logic Programming, Cleveland, USA (Oct. 1989) 313-331.
[47] T.J. Reynolds and D. Lyons, Transputers and Parallel Prolog, in: Proc. 7th Occam User Group Technical Meeting, T. Muntean (ed.), Grenoble, France (Sep. 1987) 221-228.
[48] T.J. Reynolds et al., BRAVE - A parallel logic language for artificial intelligence, Future Generation Comput. Syst. 4 (1988) 69-75.
[49] T.J. Reynolds et al., BRAVE on GRIP, in: Proc. ICL Conf., York (May 1988).
[50] T.J. Reynolds and S. Delgado-Rannauro, VLSI for parallel execution of Prolog, in: Proc. Internat. Workshop on VLSI for Artificial Intelligence, Oxford (July 1988).
[51] T.J. Reynolds and P. Kefalas, OR-Parallel Prolog and search problems in AI applications, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990).
[52] E. Shapiro, Concurrent Prolog: Collected Papers (MIT Press, Cambridge, MA, 1987).
[53] E. Shapiro, The family of concurrent logic programming languages, ACM Comput. Surv. 21 (3) (Sep. 1989).
[54] C.Z. Sun and Y.G. Ci, The OR-forest description for the execution of logic programs, in: Proc. 3rd Internat. Conf. on Logic Programming (July 1986) 710-717.
[55] C.Z. Sun and Y.G. Ci, The sharing of environment in AND-OR-parallel execution of logic programs, in: Proc. 14th Internat. Symp. on Computer Architecture (June 1987) 137-144.
[56] C.Z. Sun and Y.G. Ci, The OR-forest-based parallel execution model of logic programs, Future Generation Comput. Syst. 6 (1) (June 1990) 24-34.
[57] P. Szeredi, Performance analysis of the Aurora Or-Parallel Prolog system, in: Proc. 1989 North American Conf. on Logic Programming, Cleveland, USA (Oct. 1989).
[58] P. Szeredi, Solving optimisation problems in the Aurora OR-Parallel Prolog System, in: Proc. ICLP'91 Pre-conference Workshop on Parallel Execution of Logic Programs, Paris, France (June 1991).
[59] E. Tick and D.H.D. Warren, Towards a pipelined Prolog processor, in: Proc. IEEE Internat. Symp. on Logic Programming, Atlantic City, NJ, USA (Feb. 1984) 29-40.
[60] P. Tinker and G. Lindstrom, A performance-oriented design for OR-parallel logic programming, in: Proc. 4th Internat. Conf. on Logic Programming, J-L. Lassez (ed.) (1987) 601-615.
[61] D.H.D. Warren, Implementing Prolog - Compiling predicate logic programs, DAI Research Reports No. 39 and 40, University of Edinburgh, 1977.
[62] D.H.D. Warren, An Abstract Prolog Instruction Set, Technical Note 309, AI Centre, SRI International, August 1983.
[63] D.H.D. Warren, The SRI model for OR-parallel execution of Prolog - Abstract design and implementation issues, in: Proc. 1987 Symp. on Logic Programming (1987) 92-102.
[64] D.S. Warren, Efficient Prolog memory management for flexible control strategy, in: Proc. 1984 Internat. Symp. on Logic Programming (1984) 198-202.
[65] H. Westphal et al., The PEPSys Model: Combining backtracking, AND- and OR-parallelism, in: Proc. 1987 Symp. on Logic Programming (1987) 436-448.
[66] M.J. Wise, A Parallel Prolog: The construction of a data-driven model, in: Proc. Symp. on Lisp and Functional Programming, ACM (1982) 55-66.
[67] M.J. Wise, Prolog Multiprocessors (Prentice-Hall, Englewood Cliffs, NJ, 1986).
[68] H. Yasuhara and K. Nitadori, ORBIT: A parallel computing model of Prolog, New Generation Comput. 2 (1984) 277-288.
[69] K. Zhang and R. Thomas, A non-shared binding scheme for Parallel Prolog implementation, in: Proc. 12th Internat. Joint Conf. on Artificial Intelligence, Sydney (24-30 Aug. 1991) 877-882.
[70] K. Zhang and R. Thomas, DIALOG - A dataflow model for parallel execution of logic programs, Future Generation Comput. Syst. 6 (4) (Sep. 1991) 373-388.


K. Zhang is a Lecturer at Macquarie University, Sydney, Australia. He received his BEng in Computer Studies from Chengdu Institute of Radio Engineering (now University of Electronic Science and Technology of China) in 1982, and his PhD from Brighton Polytechnic (CNAA) in 1990. He was a Software Engineer in the CAD Section of the East-China Research Institute of Computer Technology, Shanghai, between 1982 and 1985, and then an Academic Visitor to the SEAKE Centre at Brighton Polytechnic in 1986. After his postgraduate studies, he was an SERC Postdoctoral Fellow in 1991, before joining Macquarie University. Dr. Zhang's current research interests are in the areas of parallel implementation of logic programs, program visualisation and parallel programming tools.