
Future Generation Computer Systems 9 (1993) 259-280, North-Holland

Exploiting OR-parallelism in logic programs: A review

Kang Zhang

Department of Computing, School of MPCE, Macquarie University, NSW 2109, Australia

Abstract

Different forms of parallelism have been identified in logic programs for efficient implementations on multiprocessor systems. Among these, OR- and AND-parallelism have been the focus of exploitation for the past decade. Research in such exploitation has led to the proposal and implementation of various execution models and logic programming systems. This paper attempts to summarise, in the form of a structured review, the major activities in the area of exploiting OR-parallelism. Problems arising in implementing OR-parallelism, and various models and working systems featuring different techniques for solving these problems, are discussed.

Keywords. Logic programming; Prolog; OR-parallelism; multiprocessor systems

1. Introduction

The execution speed of sequential logic programming systems has been constantly improving since D.H.D. Warren's Prolog interpreter/compiler [61] for the DEC-System 10 proved the usefulness of logic as a practical programming tool [37]. Yet, in order to meet the requirements of today's and tomorrow's applications, substantial improvements in performance are still needed. A promising approach is through the introduction of parallel evaluation strategies into the language executor. VLSI technology and parallel computer architecture advances also provide an opportunity for performance improvement. Investigations into this possibility have already led to the proposal of a number of schemes and machine architectures for parallel processing of logic programs [21,36,44].

The three major forms of parallelism exploitable in a logic program can be explained in terms of the structure of the program. Among these three forms of parallelism, low level parallelism, typically unification parallelism, may be obtained during the unification of a goal and a clause head. By executing a stream of intermediate instructions, the low level parallelism can also be exploited in a pipelined fashion [59]. A critical review of unification and its potential parallelism can be found in [35]. At a high level, when alternative clauses in a procedure are evaluated simultaneously, OR-parallelism is achieved [19]. OR-parallelism can also be exploited at a higher level in terms of the search tree, which represents the execution of the program [5,63]. The unification of more than one clause head with a given goal is treated as separate branches of the search tree. Multiple such branches, each including the clause involved and its continuation, can be executed in parallel. AND-parallelism is exploited when more than one goal in a clause body is evaluated in parallel. Each goal in the body may include multiple OR branches.

In order to exploit AND/OR parallelism more efficiently through explicit language syntax and semantics, a number of new logic programming languages supporting concurrent processing have been proposed. This is by itself an interesting research area, and is outside the scope of the present review. We refer readers to Shapiro's comprehensive survey [53] and collected papers [52] on concurrent logic programming languages.

0376-5075/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

This paper presents a review of various models and systems proposed for exploiting OR-parallelism. It covers most of the important models and systems reported between the early 1980s and 1992. Research in this area is still rapidly evolving: some mature systems are continuously being improved and new models are emerging. The trend has been to improve the existing working systems or to extend them to support both AND- and OR-parallelism. The review first introduces the basic concepts of OR-parallelism, potential difficulties in its exploitation and the trade-offs of various approaches. The review proceeds by looking at binding management approaches, followed by task control strategies used in various models for multiprocessor implementations. The performance of some working systems is then summarised and compared. The review concludes with an overall summary.

2. OR-parallelism

2.1. Problems with exploitation

In terms of computation, OR-parallelism refers to a parallel search strategy. When a search process reaches a branch in the tree, it can start to search descendant branches in parallel. The name OR-parallelism reflects the fact that in a non-deterministic program, a query is often satisfied by any answer. In other words, when any one of the searches starting from a choice point (a non-leaf node of the tree) finds a solution, the original goal is resolved. OR-parallelism also refers to searching for all solutions in parallel; in fact, many OR-parallel systems are developed to support obtaining all solutions.
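As a loose sketch of this any-solution/all-solutions behaviour (in Python rather than in a real Prolog engine; the clause functions and their bodies are purely illustrative, not any particular system's machinery), the alternative clauses of a procedure can be tried concurrently and their bindings merged:

```python
from concurrent.futures import ThreadPoolExecutor

def clause1():
    # p(Y) <- q(Y), r(Y): q binds Y to a, but r(a) fails, so no solution
    return []

def clause2():
    # p(Z) <- s(Z): succeeds, binding Z to c
    return ["c"]

def or_parallel_solve(clauses):
    """Search all alternative clauses simultaneously; any single success
    resolves the goal, and together the branches yield all solutions."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda clause: clause(), clauses)
    return [binding for branch in results for binding in branch]

solutions = or_parallel_solve([clause1, clause2])   # all-solutions search
```

Each branch either fails (an empty list) or contributes bindings; the goal is resolved as soon as any branch succeeds.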

Consider the following example:

    ← p(X).
    p(Y) ← q(Y), r(Y).
    p(Z) ← s(Z).
    q(a).
    r(b).
    s(c).

Fig. 1. Environment stacks in evaluation of the example: (a) sequential Prolog; (b) OR-parallel system.

In sequential Prolog (Fig. 1(a)) [18], when p is called, the first clause is invoked. To solve p, the system calls q, which binds X to Y and then Y to a. So the original X is bound to a. This value makes r fail, and backtracking occurs to the second clause. At this point, X is reset to unbound. By calling s, X is successfully bound to c.

In an OR-parallel system (Fig. 1(b)), the two clauses of the procedure p are evaluated in parallel. Two different bindings, a and c, are generated for X by q and s respectively. Therefore, a major problem in exploiting OR-parallelism is the representation and management of different bindings of the same variable corresponding to different OR branches. Any implementation scheme for binding environments (or BEs) directly concerns the memory management of the parallel Prolog system, which may well dominate the overall performance of the potentially large-scale parallel execution [15].
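The multiple-bindings problem can be sketched minimally as follows (in Python; the per-branch dictionary copy stands in for whatever structure a real system uses - directories, hash windows, binding arrays - and is not any particular system's scheme):

```python
def try_branch(env, var, value):
    """Bind var in a private copy of the environment (one OR-branch).
    A full copy is the naive solution; real systems avoid it with
    auxiliary structures such as directories or hash windows."""
    branch_env = dict(env)
    branch_env[var] = value
    return branch_env

shared = {}                                # environment at the choice point
branch_q = try_branch(shared, "X", "a")    # first clause: q binds X to a
branch_s = try_branch(shared, "X", "c")    # second clause: s binds X to c
```

X remains unbound at the choice point, yet is bound differently, and conflict-free, in each branch.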

Another potential problem of OR-parallelism is the combinatorial explosion in the number of parallel tasks, which is determined by the search space of a particular application program. To prevent the explosion as well as to make maximum use of the existing resources, a cost-effective control strategy is desirable; a heavy control overhead may destroy the gains of OR-parallelism. Various implementation techniques have been adopted to solve these problems. Here we identify those techniques as binding management and task control, corresponding to the above two problems respectively.

The third problem is caused by the side-effects of built-in predicates. In sequential Prolog, execution of a cut predicate prevents further evaluation of alternatives on successful evaluation. When running in parallel, there is a danger that the operational semantics of a program may be altered by evaluating clauses on different processors which would not be evaluated when run sequentially. This leads to the issue of speculative work: ideally, processors should not waste their time on parts of the search tree which may later be pruned by a cut [29]. The techniques used for handling side-effects and speculative work are also considered task control techniques, since the issues involved are special cases in implementing task control strategies.

2.2. Levels of classification

When presenting different techniques used in various models for binding management and task control, we classify these models based on three levels of criteria: language support, model of task partitioning, and architectural features.

Language - This is concerned with whether the model is implemented to support full Prolog, including cut and side-effect predicates, with the parallel implementation returning the same answers as sequential Prolog; whether it is implemented for pure logic programs only; or whether extra annotations are added to standard Prolog to assist in exploiting parallelism, so that the implementation supports a new language that, in many cases, supersedes Prolog.

Model - Most OR-parallel systems fall into two categories in terms of their approach to partitioning tasks for parallel execution: some systems use process-based models, while others are OR-branch based (or sub-tree based). The former exploit OR-parallelism by concurrently executing OR-processes representing individual clauses, each of which is regarded as an independent resolution. Results of these OR-processes are returned to their common parent process. The latter involve multiple processors working on different OR branches of the search tree simultaneously. Each such OR branch includes not only the current resolution, but also its continuation. Therefore, OR-branch based models can exploit coarse-grain OR-parallelism.

Architecture - At the lowest level, various systems can be differentiated according to the way binding environments (BEs) are managed and the way parallel tasks are scheduled, as discussed in Section 2.1.

Binding management schemes include the centralised and distributed environments approaches. In systems using centralised environments, most of the computation state, including BEs, is shared by multiple processors, while with the distributed environments approach, processors have their own copies of the BEs. Some systems use a compromise approach by dividing the computation state into a global part and a local part. The local computation state includes BEs, which are non-shared. The global computation state includes the information for task scheduling (and global BEs in some systems), which is shared among all the processors. This is called the multiple sequences approach.

Task control strategies concern how parallel tasks should be scheduled on the available processors in a cost-effective manner so that the system is kept load-balanced. The strategies vary from broadcasting and demand-driven methods to message-passing methods, depending largely on the implementation models and binding management schemes adopted.

2.3. Trade-offs

Most systems started with the implementation of pure logic programs with depth-first or breadth-first search semantics [19,54]. Some systems have later implemented extra-logical features, typically found in standard Prolog [5,40]. Other systems use new languages, which usually support Prolog but have extensions that enable them to take advantage of parallelism [48,65].

The motivation for supporting Prolog is that the standard Prolog semantics is well established in the logic programming community and a considerable amount of code is already in existence. This approach allows existing Prolog programs to be run without modification but at higher speed. However, the speed-up can be severely limited for programs which make heavy use of side-effect predicates. More complicated implementations are required to support side-effects in OR-parallel systems, owing to Prolog's sequential semantics of a depth-first left-to-right search strategy.

Page 4: Exploiting OR-parallelism in logic programs: A review

262 K. Zhang

The motivation for developing extended Prolog languages is that standard Prolog is not suitable for highly parallel processing, as it was designed for uniprocessors in the first place. With new language features supporting parallelism, the implementations can certainly be more efficient, but existing Prolog programs cannot be executed on such systems without being partly rewritten.

The choice of the language to be supported affects, to some extent, the decision on the binding management scheme. The centralised environments approach attempts to make the best use of the existing efficient implementation of sequential Prolog [61] by extending it with the capability of handling multiple bindings for individual variables [14,55,60]. Most systems using centralised environments are based on the OR-branch implementation model so that the semantics of standard Prolog can be closely modelled. This approach has the following advantages: (a) The similarity in implementation to conventional languages, in the way that BEs are manipulated, lends the approach well to conventional parallel processing techniques for existing multiprocessors.

(b) In preventing combinatorial explosion of parallelism, the transition from parallel mode to sequential mode and vice versa can be relatively smooth.

(c) Given the existing sequential system, the parallel extension is the easiest to understand, modify and optimise.

The distributed environments approach aims at high distributability and scalability, and at implementation on non-shared-memory or message-passing architectures [20,66,69]. The execution models using this approach are typically process-based, since independent processes and the relevant environments can be distributed among many processors. BEs are usually independent of each other, and each dereference operation is performed in only one or two BEs. Parallel tasks can be controlled through message-passing between the processors involved. These models overcome the drawbacks of a centralised approach, which requires a shared memory, at the cost of extra copying and binding operations.

The third environment management approach, the multiple sequences approach, attempts to combine the advantages of the centralised and distributed environments approaches [4,40,63]. It makes the best use of the efficient sequential Prolog implementation but also enjoys high scalability. The central idea in this approach is to use multiple processors, each equipped with a sequential Prolog engine, to work simultaneously on different parts (OR-branches) of the search tree. Task switching is performed on the demand of an idle processor. This approach is therefore a natural choice for systems that support full Prolog. It exploits coarse-grain OR-parallelism effectively and has recently demonstrated the most promising speedups.

In principle, a good OR-parallel system should support high scalability. More specifically, the combination of a binding management scheme and a task control strategy should allow such a system to perform parallel operations in a time which is independent of the number of parallel tasks and the size of the terms involved in unification. These parallel operations include the allocation of environment spaces, unification, and resumption after success or failure. Gupta has identified three criteria corresponding to these three operations [25], namely:
• constant-time environment creation;
• constant-time variable access and binding; and
• constant-time task switching.
Unfortunately, it has been shown [25] that it is not possible to achieve constant-time execution for all three operations, so a compromise has to be made to achieve the best possible performance.

Shared-memory multiprocessors are best served by methods with centralised environments or multiple sequences, which sacrifice constant-time task switching, since task switching is under the control of the scheduler. Task creation and variable access are program dependent, and so cannot be effectively optimised by the implementation. This is why various scheduling policies have been investigated for the Aurora and MUSE systems, which are multiple-sequence systems developed for shared-memory multiprocessors (see Section 3.3).

Distributed-memory systems are favoured by methods with non-constant-time task creation, since the whole binding environment is made available locally by copying during task creation, to achieve environment distributability.

The rest of the paper describes individual models and systems in more detail according to the lowest level of classification, i.e. binding management and task control.

3. Binding management

3.1. Centralised environments

The main idea of the centralised environments approach is to build a virtual stack for each process so that it can share as much information as possible with its sibling processes and make copies only of the information that has to be bound uniquely by that process. The binding schemes in these models have the following concepts in common: (a) variable bindings are kept locally in individual clauses; (b) unification of a goal and the head of an applicable clause often requires access to variables which may have been bound earlier; (c) the unification of two unbound variables is realised by binding one variable to a reference to the other.

The schemes differ in the type of auxiliary structure used (e.g. directory trees or hash windows) to dereference an ancestor variable while allowing each clause to store its own copy of the variable.

3.1.1. Early models

The three earliest representatives of storage management algorithms for OR-parallel execution have been compared in a study by Crammond [21]. These are the Directory Tree algorithm, the Hash Window algorithm and the Variable Importation algorithm. The first two models use a structure-sharing approach, while the third uses a distributed approach through structure copying (see Section 3.2.1).

Ciepielewski and Haridi [14] used a number of frames - created upon the invocation of clauses - to represent a binding environment. A frame containing no unbound variables is shared among its descendant OR-processes, while only those frames which contain unbound variables are copied for a new environment. The OR search tree is represented as a directory tree, where a directory contains entries for all the frames associated with the goal. A variable is accessed by looking up the current directory in the hope that one of its frames contains the variable. If no such frame is found, the parent directory is consulted, and so on, until the variable is found.

The hash window algorithm [11] assumes that most unification operations generate bindings for only a small number of variables, so that these bindings can be represented in a hash window and easily accessed by the newly forked OR-processes. When a variable is found unbound, a similar dereference procedure (as in a directory tree) is conducted along the path to the ancestor frames. Dereferencing is made more efficient in the PEPSys model (Parallel ECRC Prolog System) by tagging each binding with a label identifying the binding position in the chain of hash windows [65].
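A minimal sketch of the hash-window chain (in Python; the class layout is illustrative, and the PEPSys position tags are omitted for brevity) shows how each forked OR-process keeps only its few new bindings and falls back to ancestor windows:

```python
class HashWindow:
    """One small binding table per OR-process, chained to its ancestor."""

    def __init__(self, parent=None):
        self.bindings = {}   # most unifications bind only a few variables
        self.parent = parent

    def deref(self, var):
        """Walk the chain of hash windows towards the root."""
        window = self
        while window is not None:
            if var in window.bindings:
                return window.bindings[var]
            window = window.parent
        return None

root = HashWindow()
root.bindings["X"] = "a"            # binding made before the fork
child = HashWindow(parent=root)     # forked OR-process: empty window
child.bindings["Y"] = "b"           # its own few new bindings
```

The child sees both its own binding and the ancestor's, while the ancestor never sees the child's, which is what keeps sibling OR-branches from interfering.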

The evaluation results [21] favour the hash window algorithm in execution speed - 20% and 32% faster than the variable importation and directory tree algorithms respectively. Its memory usage is also more efficient than that of the other two schemes. Four refined versions of the storage model of [14] have been implemented by Ciepielewski and Hausman [16]. The performance evaluation of these implementations also shows that hashing techniques perform best.

3.1.2. OR-forest

The OR-forest-based environment sharing model [54] also creates a local environment directory with each process for directing variable dereferencing. In this model, variable bindings generated at different nodes are stored in different binding records (called frames in the Directory Tree model mentioned above), and accessed by OR-parallel processes through their own directories. Binding records, however, contain only bound variables, rather than both bound and unbound variables, and thus the overhead of checking and copying unbound variables when creating new environments is avoided [55]. The binding records generated at an ancestor node can be shared by its child nodes.

The creation of a process is based on the concept of an OR-forest, instead of an OR search tree. An OR-forest is constructed out of a number of OR-trees, each having an independent AND goal as its root node. A major advantage of this form of tree construction over the conventional OR search tree is that it avoids redundant evaluations when AND goals are independent [54].

A high-level simulation study of the OR-forest model using five benchmark programs indicates that the search speed of this model is 2 to 6 times faster than that of an OR search tree model, and that both deterministic and non-deterministic programs offer parallelism for the OR-forest model, but only non-deterministic programs do so for an OR search tree model [56].

3.1.3. BOPLOG

The performance-oriented OR-parallel WAM proposed for the BBN Butterfly Parallel Processor, called BOPLOG [60], uses an extensively copied data structure in order to gain speed-up. BOPLOG's binding method makes use of a doubly-circular linked list, or binding list, to represent the values of each shared variable. Each such binding is also stamped with a time tag, corresponding to the time when the current binding was taken from a choice point. When the number of bindings to a variable increases, so does the length of the variable's binding list, and hence the access time to the variable increases proportionally.

In BOPLOG, the concept of a binding span, measured by time stamps, is used in each entry of the ancestor stack to guide the dereferencing of variables to the correct bindings. This strategy makes unwinding unnecessary, so no trail is needed. It may, however, introduce a significant overhead for the accumulation of both success bindings and failure bindings, which in turn imposes an unnecessary cost on dereferencing.
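The time-stamped binding list can be sketched as follows (a Python simplification of the BOPLOG idea, using a plain list of (timestamp, value) pairs rather than the doubly-circular list, and an illustrative span test):

```python
def bind(bindings, stamp, value):
    """Append a binding tagged with the time it left its choice point."""
    bindings.append((stamp, value))      # the list grows per OR-branch

def deref(bindings, span_end):
    """Dereference within a binding span: pick the latest binding whose
    time stamp does not exceed the span boundary."""
    candidates = [(t, v) for (t, v) in bindings if t <= span_end]
    return max(candidates)[1] if candidates else None

x_bindings = []
bind(x_bindings, 3, "a")   # binding from one branch, made at time 3
bind(x_bindings, 7, "c")   # binding from a later branch, at time 7
```

Since old entries are never unwound, no trail is needed, but note that every dereference scans the whole list, which is the proportional access cost and the accumulation overhead mentioned above.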

3.2. Distributed environments

In distributed environments schemes, the number of BEs seen by any process is restricted to one or two, and thus dereference operations are simpler than in a centralised environments scheme. But BE independence is achieved at the cost of extra copying and binding operations. The ways of achieving BE independence distinguish the distributed models.

3.2.1. Early models

The earliest distributed binding scheme was proposed by Wise in implementing his EPILOG language [66,67]. The basic operational mechanism in EPILOG is to manipulate dframes - i.e. dynamic or distributed frames. A dframe contains certain context information of a clause and the BE of variables which are local to the clause. The EPILOG scheme treats dframes as black boxes, such that information is transferred between dframes entirely through message passing.

In support of unrestricted AND-parallelism as well as BE locality, the successful resolution of a child process will cause its bound arguments to be back-unified with the original terms in the corresponding goal.

Lindstrom's variable importation scheme [39] was another early model with high locality. In this scheme, all unbound variables in the parent BE are imported into the child BE. When the last body literal has been solved, the still-unbound variables are exported to a new copy of the parent BE. Therefore, subsequent unifications can only have local effects, independently of the descendant OR-processes. This scheme is suitable for implementation on a distributed system with a hierarchical memory organisation.
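The import/export cycle can be sketched as follows (in Python; representing a BE as a dictionary where None marks an unbound variable is an illustrative simplification, not Lindstrom's actual data layout):

```python
def import_unbound(parent_be):
    """Import only the parent's unbound variables into the child BE."""
    return {v: None for v, val in parent_be.items() if val is None}

def export_back(parent_be, child_be):
    """Export the child's bindings into a fresh copy of the parent BE,
    leaving the original parent BE untouched for sibling OR-branches."""
    new_parent = dict(parent_be)
    for v, val in child_be.items():
        if new_parent.get(v) is None:
            new_parent[v] = val
    return new_parent

parent = {"X": None, "Y": "b"}      # X unbound, Y already bound
child = import_unbound(parent)      # child sees only the unbound X
child["X"] = "a"                    # unification has purely local effect
parent2 = export_back(parent, child)
```

Because the child never touches the original parent BE, sibling OR-branches stay independent, at the price of the copy made on export.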

Extensive BE copying is used in Halim's data-driven OR-parallel system [26,27]. In this model, a goal is fired for reduction once a BE is available to it. A BE is copied upon a new variable binding or a variable update, such that the initial goal triggers the computation in the form of BEs, and finally a set of BEs is produced as the solutions to the program. Details of memory management were not provided in [26,27].

3.2.2. Closed environments

Conery [20] proposed a binding scheme, known as closed environments, which is a further development towards a distributed system, particularly of the AND/OR Process Model [19].

The main idea in the closed environment scheme is to allow only a small number (actually one or two) of the binding environments to be operated upon by an active process so that a distributed binding structure can be implemented. To achieve this, Conery introduced an environment closing algorithm, which transforms the intervening binding environments into closed form. A closed environment is defined as a set of environments E such that no variable contained within E, or within the structured terms pointed to from E, is outside of E. The closing algorithm involves operations of instantiation, renaming or creating variables, in order to organise the bindings in such a way that all the information needed for unification is present in the environment of either the caller clause or the called clause. Figure 2 shows an example of a closed environment (Environment1 in Fig. 2(b)), which is obtained by applying the closing algorithm to the unification of a goal (p(a, f(X))) and a clause head (p(Y, Z)).

Fig. 2. Closing an environment after head unification: (a) after unification of p(a, f(X)) with p(Y, Z); (b) after closing Environment1 with respect to Environment0.
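The closing idea can be sketched as follows (in Python; the underscore-string representation of variables, the tuple representation of structured terms, and the renaming policy are all illustrative simplifications of Conery's algorithm):

```python
def close_environment(env, owned_vars):
    """Close env: replace every reference to an outside variable by a
    freshly created local variable, so that env (and the structured
    terms it points to) mentions no variable outside itself."""
    closed, renaming = {}, {}

    def close_term(term):
        if isinstance(term, str) and term.startswith("_"):   # a variable
            if term not in owned_vars:                       # outside var:
                renaming.setdefault(term, "_new" + term)     # create/rename
            return renaming.get(term, term)
        if isinstance(term, tuple):                          # structured term
            return tuple(close_term(arg) for arg in term)
        return term                                          # a constant

    for var, term in env.items():
        closed[var] = close_term(term)
    return closed, renaming

# After unifying p(a, f(X)) with p(Y, Z): Y -> a, Z -> f(X),
# where _X belongs to the caller's environment (Environment 0).
env1 = {"_Y": "a", "_Z": ("f", "_X")}
closed1, ren = close_environment(env1, owned_vars={"_Y", "_Z"})
```

After closing, the callee's environment can be shipped to another processor on its own; the renaming map records what must later be reconciled with the caller, which is where the scheme's extra copying and two-stage cost comes from.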

The major sources of overhead in the closed environment scheme are the extensive copying of environments and the two-stage closing algorithm. A more serious problem might be the extra operation required to close the environment of the caller clause after a successful reduction.

A variant of the closed environment approach is described by Kale et al. [33]. This approach uses a technique to identify ground structured terms so that they need not be copied in an environment closing operation; in Conery's approach, these terms would need to be copied.

3.2.3. DIALOG

DIALOG is a distributed model based on dataflow computation [70], extended from the PIM-D dataflow concepts [30,31]. A major feature of the DIALOG binding scheme is that the number of BEs operated on by a process at any instant is restricted to one, namely the process's own BE [69].

The design principle of the DIALOG scheme is based on the following fact: the evaluation of a process can be entirely independent of the precedent (sibling or parent) processes if the variables which appear in the input arguments of the process, and which were bound during the previous unification, have been substituted by their instances according to the previous BE. This is illustrated in Fig. 3, where a process is treated as a black box. The BE is local to the process and is used for instantiating the variables that have been bound earlier in the process.

Fig. 3. The DIALOG non-shared binding scheme.

Variable instantiation operations ensure that all bound variables appearing in an argument are substituted with their corresponding instances in the BE. When structured terms contain variables for substitution, they are reproduced with the variables substituted by their instances. Instantiation operations are applied to the process arguments which are due to be transferred to the next process. Argument transfer may happen either (a) between sibling processes (called interface instantiation), or (b) from a parent process to its child process after head unification (called face instantiation).
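The instantiation operation can be sketched as follows (in Python; the underscore-string variables and tuple-shaped structured terms are illustrative, not DIALOG's internal representation):

```python
def instantiate(term, be):
    """Substitute every locally bound variable in term by its instance,
    so the receiving process never needs to consult this BE."""
    if isinstance(term, str) and term.startswith("_"):
        return be.get(term, term)      # bound: substitute; unbound: keep
    if isinstance(term, tuple):        # structured terms are reproduced
        return tuple(instantiate(arg, be) for arg in term)
    return term                        # a constant

local_be = {"_X": "a"}                 # _X was bound by an earlier
arg = ("f", "_X", "_Y")                # unification in this process
outgoing = instantiate(arg, local_be)  # _Y remains unbound in transit
```

This is the copying cost the text mentions: structured terms containing substitutable variables are reproduced rather than shared, in exchange for each process needing only its own BE.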

Once the evaluation of a child process terminates, the ancestor variables that were bound in the child process are paired with their binding instances and exported back to the parent process. This export operation, together with interface instantiation, achieves the same goal as the back-unification operation in EPILOG [67]. In common with other structure-sharing schemes, the DIALOG scheme introduces some overheads on instantiation operations and structure copying.

3.3. Multiple sequences

The motivation for developing a system using multiple sequential Prolog engines is that the performance of a parallel implementation is very much determined by the performance of the underlying sequential implementation, and that multiple well-coordinated sequential engines should deliver optimal performance. This approach was developed out of the centralised environments approach, and thus shares some common features with it. More important features include that the maximum active parallelism at any given time is matched to the size of the parallel machine, and that the Prolog engine is separated from the scheduler so that either can be replaced or updated without disturbing the other. This means that any existing Prolog engine can be adapted with some modifications, while the responsibility for exploiting parallelism lies with the scheduler. Schedulers will be reviewed later, in the section on task control.

The systems using multiple sequential Prolog engines exploit coarse-grain parallelism very effectively on reasonably large-scale multiprocessors, and the execution time of a single engine is only fractionally more than that of the fastest standard Prolog system [13].

3.3.1. ANL-WAM

One of the early systems using multiple sequences is ANL-WAM (Argonne National Laboratory WAM) [23]. To retain the efficient usage of value cells in the WAM, while allowing simultaneous binding of a variable from different OR branches, ANL-WAM uses the concept of favoured binding: a binding is favoured if it is made on the left-most branch. A favoured binding can be stored in the value cell, whose address and binding are then recorded in a binding node. For every other alternative branch, an unfavoured binding is stored only in a binding node. In both cases, the entry to the binding node is kept in a hash table. Each hash table has a backup copy of its headers for each alternative branch. To dereference a variable whose binding is another variable, the corresponding hash table has to be consulted. An extra flag field must be used in a value cell to identify whether the binding of the variable is favoured or not. The experimental results indicate that the system benefits little from using favoured binding [23].
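The favoured-binding idea can be sketched as follows. This is a minimal Python illustration, not ANL-WAM code: the names (`FavouredStore`, `bind`, `deref`) are ours, a flat array stands in for the WAM value cells, and a dictionary stands in for the hash table of binding nodes.

```python
class FavouredStore:
    """One value-cell array shared by all OR branches, plus per-branch
    binding nodes kept in a hash table, after the ANL-WAM scheme."""
    UNBOUND = object()

    def __init__(self, n_cells):
        self.cells = [self.UNBOUND] * n_cells    # shared WAM value cells
        self.favoured = [False] * n_cells        # extra flag field per cell
        self.nodes = {}                          # hash table: (branch, addr) -> binding

    def bind(self, branch, addr, value, leftmost):
        if leftmost and self.cells[addr] is self.UNBOUND:
            # favoured binding: stored in the value cell itself, flag set
            self.cells[addr] = value
            self.favoured[addr] = True
        # in both cases the entry is kept in the hash table of binding nodes
        self.nodes[(branch, addr)] = value

    def deref(self, branch, addr):
        # here every access goes through the hash table; a real implementation
        # short-cuts to the value cell when the flag marks the binding favoured
        return self.nodes.get((branch, addr), self.UNBOUND)
```

Each OR branch thus sees only its own binding of a shared variable, with the left-most branch additionally caching its binding in the value cell.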

3.3.2. SRI Model and Aurora

A more efficient multiple-sequences model is the SRI Model developed by D.H.D. Warren [63]. Computation in the SRI Model is performed by a number of workers working on different OR branches of the search tree simultaneously, with the depth-first, left-to-right search strategy used in sequential Prolog. When a worker finishes its work on a branch, it switches tasks based on the scheduling policy that the topmost task on any branch should be chosen as a candidate. In order for the shared data to be read-only, each worker has a segment of the shared memory, where the control stack, global stack, local stack and trail are kept and operated in a similar way to the WAM. To handle multiple environments, two binding arrays (global and local) are introduced to store global and local variables respectively. A similar approach, also using binding arrays, was proposed independently by D.S. Warren [64].

Fig. 4. A variable with conditional binding in the SRI Model.

To bind a variable unconditionally, i.e. for a local variable binding, the binding instance is overwritten into the corresponding value cell without trailing. If the binding is conditional, i.e. for a global variable binding, the instance is written into the current working binding array, whose location is recorded in the variable's value cell, and both the variable address and the instance are trailed (only the address is trailed in the WAM), as shown in Fig. 4. Upon backtracking, the corresponding binding array location is found by fetching the pointer stored in the value cell, whose address is obtainable from the trail, and then the binding is 'unwound'.
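The conditional/unconditional distinction can be sketched as follows. This is our simplification of the SRI scheme, not Aurora code: the names (`SRIWorker`, `bind`, `unwind`) are illustrative, and the binding-array slot is simply taken to be the cell address.

```python
UNBOUND = None

class SRIWorker:
    def __init__(self, size):
        self.binding_array = [UNBOUND] * size   # private to this worker
        self.trail = []                         # (address, instance) pairs

def bind(cells, worker, addr, value, conditional):
    if not conditional:
        cells[addr] = value                 # unconditional: overwrite, no trailing
    else:
        worker.binding_array[addr] = value  # instance goes into the binding array
        cells[addr] = ('BA', addr)          # value cell records the array location
        worker.trail.append((addr, value))  # trail both address and instance

def unwind(cells, worker):
    # backtracking: the trail yields the address; the cell yields the slot
    while worker.trail:
        addr, _ = worker.trail.pop()
        _, slot = cells[addr]
        worker.binding_array[slot] = UNBOUND
```

A conditional binding is thus visible only through the worker's own binding array, so different workers can bind the same shared variable without interfering.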

The dereference operation is similar to that of other parallel systems developed from the WAM. Warren [63] estimates that binding and unbinding are about 40% more expensive, and dereferencing about 60% more expensive, than in the WAM. This model imposes less overhead on a worker while it is working than the previous two models do [14,23]. The concept of a numbered variable is introduced in the SRI Model, which simplifies the judgment of seniority when a variable is bound to another.

The main overhead of the SRI Model is the updating of the binding array when a worker switches tasks from one node to another. Whether a binding is conditional or unconditional is decided at run-time, and the promotion of a binding from conditional to unconditional leads to the need to remove the binding from the corresponding binding array.

As part of the Gigalips Project (an informal collaboration between Argonne National Laboratory, USA, Bristol University, UK, and the Swedish Institute of Computer Science (SICS), Sweden), the Aurora OR-Parallel System [40] is an implementation of the SRI Model, built on the experience from ANL-WAM, for SICStus Prolog [13], a portable sequential Prolog system developed at SICS, on a multiprocessor. Different schedulers have been developed for Aurora and will be discussed in Section 4.1.2.

The performance evaluation of the Aurora system shows that the overhead of updating binding arrays on task switching is tolerable in practice, but locking and moving around a shared part of the search tree may cause more overhead [41,42]. Nevertheless, the reported performance [57], compared with the fastest commercial implementation, is encouraging.

3.3.3. MUSE

MUSE (MUltiple SEquential Prolog engines) is another OR-parallel system, developed at SICS in parallel with Aurora [4]. The copying policy adopted in MUSE differs from that used in Aurora in that it provides a higher degree of locality of reference. In MUSE, OR-parallelism is explored by a number of workers, each executing a sequential Prolog engine and having its own choicepoint stack, environment stack, term stack and trail, which are essentially parallel versions of the WAM stacks. Workers also share a part of memory for storing global data.

When copying data from one worker to another after a worker runs out of work, workers incrementally copy parts of the WAM stacks and also share nodes with each other. The two workers involved in copying will copy only the differing parts, while the shared memory space stores information associated with the shared nodes of the search tree. The parts to be copied are always stored in the cache of the source worker. Also, all the WAM stacks are located at fixed addresses in the local address space of each worker, so that relocation of pointers is avoided when a worker copies a segment of stack to another worker. Such an incremental copying policy requires minimal modification of the WAM. Therefore, the features and advantages of sequential Prolog are best preserved in the MUSE model.
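The idea of copying only the differing parts can be sketched as follows. This is our own illustration, not MUSE code, with one WAM-style stack modelled as a Python list.

```python
def incremental_copy(src_stack, dst_stack):
    """Copy from a busy worker (src) to an idle one (dst), transferring only
    the part of the stack that differs. The common prefix (the shared part of
    the search tree) is left in place, and because both stacks live at the same
    local addresses, the copied segment needs no pointer relocation."""
    # find the length of the common prefix
    common = 0
    limit = min(len(src_stack), len(dst_stack))
    while common < limit and src_stack[common] == dst_stack[common]:
        common += 1
    # discard dst's stale tail and copy only src's differing part
    del dst_stack[common:]
    dst_stack.extend(src_stack[common:])
    return len(src_stack) - common   # number of entries actually copied
```

For two workers that diverged late in the search, the copied segment is small, which is the source of MUSE's locality advantage.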


The performance evaluation shows that the MUSE model is faster than the Aurora model when more workers are added [6]. It is suggested that this is due to the large number of non-local accesses of stack variables in Aurora. MUSE and Aurora are further compared in Section 5.

3.3.4. Versions-vector

With many similarities to the SRI Model [63], the Versions-Vector OR-parallel Model (or VV) also features very cheap variable access and expensive task switching [17]. The major difference is that the VV Model allocates a versions vector (instead of a value cell) for each conditionally bound variable. The number of components in a vector is equal to the number of processors in the system, and each component is used by just one processor to store and access the conditional bindings belonging to that processor. A vector is allocated only when a variable obtains its first conditional binding. But space is wasted for variables that are never used by some processors, because the components corresponding to these processors still have to be allocated.
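The versions-vector cell can be sketched as follows. This is a minimal illustration in Python, with names (`VVVariable`, `bind`, `deref`) of our own choosing; it shows both the lazy allocation and the per-processor component layout.

```python
class VVVariable:
    """A variable under the VV scheme: a vector with one component per
    processor, allocated lazily on the first conditional binding."""
    def __init__(self, n_procs):
        self.n_procs = n_procs
        self.vector = None      # no vector until the first conditional binding

    def bind(self, proc_id, value):
        if self.vector is None:
            # first conditional binding: allocate a full vector, one component
            # per processor (the wasted space noted in the text, since unused
            # processors' components are allocated anyway)
            self.vector = [None] * self.n_procs
        self.vector[proc_id] = value

    def deref(self, proc_id):
        # each processor reads only its own component: constant-time access,
        # no interference between processors
        if self.vector is None:
            return None
        return self.vector[proc_id]
```

Variable access is thus as cheap as an array index, at the cost of vectors whose size grows with the machine.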

4. Task control

The amount of exploitable OR-parallelism depends on the search space of the application program. When the search space becomes too large, in other words, when the degree of potential parallelism is much higher than the number of processors used, a cost-effective control strategy is needed in order to keep the system load-balanced. Various task scheduling policies have been proposed for the general control of the parallel tasks generated.

The search space may include speculative work, which is not evaluated in a standard sequential system due to the effect of a cut, but is searched undesirably in a pure OR-parallel system. Performance could be significantly improved if large amounts of speculative work could be avoided. The issues in handling speculative work and the effect of cut have been dealt with in most of the recently implemented systems, and good speed-ups have been reported when speculative work is avoided.

4.1. Task scheduling

4.1.1. Broadcasting

One of the techniques to control the parallelism is the broadcasting method, used in early models. It reduces shared-memory contention by generating, locally in busy processors, portions of the search tree upon the requests of idle processors. Two such systems are KABU-WAKE [38] and ORBIT [68].

ORBIT has a broadcasting system architecture, where a program is duplicated in each processor, and the workload is controlled by a control processor. The processes of the search tree are partitioned into a number of process bundles. An idle processor can obtain a bundle of processes from another processor under the guidance of the control processor, which keeps updated information (the depth) of each processor's control stack, where the processes are stored. The processor with the deepest control stack is chosen by the control processor for feeding the idle processor.

In the KABU-WAKE method, the partition rule is slightly different. Upon the request of an idle processor, a busy processor finds its oldest unprocessed alternative on the search tree, and then sends the portion of tasks under that alternative to the idle processor, meanwhile deleting the alternative from its own stack. Messages are exchanged directly between busy processors and idle processors, without a centralised control processor.
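The KABU-WAKE partition rule can be sketched as follows. This is our own illustration of the rule (the function name and data layout are assumptions), not KABU-WAKE code.

```python
def serve_request(choice_stack):
    """Serve an idle processor's request: hand over the busy processor's
    *oldest* unprocessed alternative and delete it from the local stack.

    choice_stack: list of choice points, oldest first; each choice point is a
    list of its unprocessed alternatives.
    Returns (choice_point_index, alternative), or None if nothing is spare."""
    for i, alternatives in enumerate(choice_stack):   # scan oldest first
        if alternatives:
            alt = alternatives.pop(0)   # give away the oldest alternative...
            return i, alt               # ...the caller ships the subtree under it
    return None   # no unprocessed alternatives: nothing to give
```

Handing out the oldest alternative tends to export the largest remaining subtree, keeping the granularity of migrated work high.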

The parallelism in this scheme is under control for both large-scale and small-scale parallel machines, since the size of a bundle depends upon the number of available processors. But the centralised control processor used in ORBIT can be a bottleneck when a large number of processors make requests to it, or report their control stack information to it, simultaneously. The KABU-WAKE system eliminates the control bottleneck by using extra tokens on the communication stream between processors to allow distributed control of the portion migration.

Another system based on broadcasting hardware is Ali's BC-Machine [3]. Compared to ORBIT [68], this system distributes most control operations to local processors by dynamically assigning a master processor to each processor group. All processors, each having a local memory, are divided into groups dynamically as the search tree is partitioned. Whenever a master creates enough jobs, determined by a threshold, it makes the local unprocessed jobs global in a centralised control memory, so that every local processor in a group will have the same memory image. Therefore the group of processors can be load-balanced effectively. The master then copies its state to all idle processors in parallel via a specialised broadcasting network.

4.1.2. Demand-driven

An alternative approach to scheduling parallel processes may be called demand-driven evaluation. Tasks are migrated to an idle processor only on demand. Each task represents an alternative clause of a choice point generated by a parent process in process-based systems, or an available sub-tree yet to be traversed in OR-branch-based systems. Provided that sufficient parallelism is available in the program, this strategy keeps the amount of parallelism just as high as necessary to fill the multiprocessor system, and likewise bounds the overhead arising from the parallelism.

Among the demand-driven strategies, an early version of Aurora [40], BOPLOG [60] and the BRAVE Abstract Machine (BAM) [22] adopt the same policy for task switching: an idle processor (or worker, as used in the related literature) attempts to obtain a choice from a choice point near the root of the entire search tree. This is illustrated in Fig. 5, where the first available worker will claim work from node B. This policy assumes that the choice closest to the root of the search tree can provide the most work to fuel the sequential processing, so that the granularity of tasks is maximised. One way to guide the idle worker to such a choice is to provide implicit temporal ordering information in the stacks so that an exhaustive search can be avoided [22].

The above task-switching strategy is referred to as the Argonne scheduler and is used in Aurora. Three other schedulers have also been developed for Aurora: the Manchester scheduler [12], the Wavefront scheduler, and the Bristol scheduler [9]. The Bristol scheduler dispatches on the bottom-most node and shares several nodes on each branch at a time; it also takes into account speculative work (see Section 6.4). The others dispatch on the top-most node and share one parallel node on each branch at a time [5]. The Manchester scheduler tries to match idle workers with the 'nearest' available outstanding task, i.e. the task which requires the least number of bindings to be updated between the current and the new positions [12]. The Wavefront scheduler links all the topmost live nodes together in a data structure known as a wavefront; workers traverse the wavefront to find available work [10].

Fig. 5. A search tree with available choice points at nodes A, B and C.

The advantage of dispatching on the top-most is that the size of the shared region is minimised, and the size of tasks is kept as large as possible. This should lead to a minimisation of the number of task switches required. However, the cost of task switching in this scheme is increased, since finding a task always involves a general search of the tree [5].

On the other hand, dispatching on the bottom-most attempts to minimise task-switching overhead, and also reduces the amount of speculative work done.
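The contrast between the two dispatching strategies can be sketched as follows. This is our own illustration (the function names and the branch representation are assumptions): a branch is modelled as a root-to-tip list of nodes, each holding a list of spare tasks.

```python
def dispatch_topmost(branch):
    """Argonne/Manchester style: take work nearest the root, maximising task
    size, at the cost of searching the tree from the top."""
    for node in branch:                  # root-to-tip scan
        if node['work']:
            return node['work'].pop(0)
    return None

def dispatch_bottommost(branch):
    """Bristol/MUSE style: take work nearest the tip, minimising the
    task-switching (binding installation or copying) cost."""
    for node in reversed(branch):        # tip-to-root scan
        if node['work']:
            return node['work'].pop(0)
    return None
```

On the same branch, the two strategies pick opposite ends: topmost favours task granularity, bottom-most favours cheap switching and less speculative work.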

The BC-Machine method mentioned earlier was later replaced by the MUSE method, which uses a bottom-most technique to match idle workers with available work as follows [4]. An idle worker always attempts to get the nearest piece of available work on the current branch. If no available work is found, it attempts to choose a busy worker to share its excess local work. If no busy workers are found, it stays idle at a suitable position on the tree. Busy workers are made visible only to the nearest idle worker, so that the task-switching cost is reduced and no more than one worker can get work from a single busy worker.

In the PEPSys Model [7,65], all tasks are initially sequential; where applicable, potential OR-parallel processes are indicated by branch points, created from user-provided OR-parallel specifiers. The Delphi model divides the search tree evenly into OR-branches for the workers when execution starts. It avoids environment copying during task switching by redirecting the worker to another OR-branch through local backtracking [7]. The decision for task switching is also made through a centralised control mechanism.

The disadvantage of the above systems is that high memory contention is unavoidable for a large-scale system, since the scheduling decision is made on the search-tree information stored in a global memory.

Such memory latency is alleviated in BRAVE [50], where the memory is organised and operated in a similar way to the SRI Model [63], but the scheduling policy is more distributed than in the SRI Model. Each worker in BRAVE keeps a task list in its local memory. A worker needs to consult other workers (rather than the shared search-tree information) only when its own task list becomes empty, i.e. when it needs more tasks to process.

BRAVE has been implemented on the reduction machine GRIP [49], on the transputer [47], and later on a message-driven machine [22].

4.1.3. Message-passing

The demand-driven control strategy described above derives from sequential implementation on either a shared-memory system or a centralised control system. Task switching in the former is decided by consulting the global search-tree information, which may be highly costly on a massively parallel machine, in addition to the cost imposed by the task switching itself. The portion migration using broadcasting in KABU-WAKE requires a high communication bandwidth, due to its duplicated copying of a large portion of information between processors and the necessary post-processing in the sender processor. Broadcasting from and reporting to the control processor in the latter system can also limit the practical number of processors.

To decentralise the control and distribute the shared information as well, message-passing models have been proposed. For example, the AND/OR Process Model [19] controls OR-parallel processes locally at the process level. In this model, processes communicate with each other by passing messages under certain state modes. When an OR-parallel process receives a start message from its parent, it attempts to unify its head with the head of each candidate clause. If the unification is successful and the descendant AND processes in the body (if any) succeed too, the OR process sends a success message to its parent and goes into the gathering mode. Otherwise, a fail message is sent to the parent. During the evaluation of the descendant AND processes, the OR process is in the waiting mode. An AND process may receive a redo message from its parent OR process after it sends a success message with a result to the parent. The redo message will cause the AND process to immediately start working on its next answer. The OR process meanwhile sends the result to its own parent process if it is in the waiting mode.
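One activation of an OR process under this protocol can be sketched as follows. This is our own simplification, not the model's specification: head unification is reduced to atom comparison, and `solve_body` stands in for spawning the descendant AND processes.

```python
def or_process(goal, clauses, send_to_parent, solve_body):
    """Handle one 'start' message in an OR process (illustrative sketch).

    clauses: list of (head, body) candidate clauses.
    send_to_parent: callable used to emit the reply message.
    solve_body: stand-in for evaluating the descendant AND processes
    (while they run, the process would be in the 'waiting' mode)."""
    for head, body in clauses:
        # head unification stand-in: we just compare atoms
        if head == goal and solve_body(body):
            send_to_parent(('success', goal))
            return 'gathering'          # await a possible 'redo' from the parent
    send_to_parent(('fail', goal))
    return 'terminated'
```

The point of the sketch is that all control flows through local parent-child messages (start, success, fail, redo), with no global search-tree structure consulted.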

The scheme can achieve control of the parallelism through local communication between processes. Therefore, it is well suited to a highly distributed large system. The main disadvantage of the scheme is the extra overhead caused by the two-way communication. How the control scheme is guided by the run-time workload of the system is also unknown.

4.2. Side-effects and speculative work

The simplest way to avoid altering the semantics of side-effect predicates such as cut, assert and retract is to forbid their use, replacing them with alternative constructs which achieve the same effect. The rationale for this approach is that Prolog was developed for uniprocessor systems and has constructs which are inappropriate for parallelism [4]. An example of using alternative constructs is BRAVE [51]. Other systems use approaches similar to that of the Aurora model. The Aurora philosophy is that current sequential Prolog programs should work unmodified on the parallel system, which makes it more difficult to implement cut, assert and retract. This approach is motivated by the observation that standard Prolog semantics are well accepted by the logic programming community, and a considerable amount of code is already in existence [28]. MUSE compromises by implementing cavalier commit instead of cut, and asynchronous assert and retract in place of synchronous (sequential) side-effects [4,5].

On the other hand, cut is used in sequential Prolog to avoid unnecessary computation, for example when the required solution of a problem is just one (or a subset) of many possible solutions. But in an OR-parallel Prolog system running on a multiprocessor architecture, when one processor finds a solution, other processors may still be working on further solutions which are not needed. Such wasted work is called speculative; it could be cut away in sequential Prolog, as shown in Fig. 6.

4.2.1. BRAVE

BRAVE is designed for all-solutions parallel execution, where cut, assert and retract are removed on the basis that these features compromise parallel execution. Alternative features are provided in order to code certain algorithms [51].

Cut is replaced by an if_then_else construction to implement conditional control. The construction uses the syntax:

    p :- q -> r ; s.

(if q then r else s). In sequential Prolog this would be coded as:

    p :- q, !, r.
    p :- s.

Note that SICStus Prolog also provides an if predicate [51], where the above would be coded as:

    if(q, r, s).

Difficulties arise when q has more than one solution. In parallel execution, the solution of q committed to will be indeterminate; as a result, the semantics may change from run to run. Two approaches are available. BRAVE continues execution unless another solution for q is found, in which case a run-time error is reported. An alternative is provided in MU-Prolog [43], which suspends q until it becomes ground. This approach can be implemented in BRAVE, since goal suspension is supported, though it requires more programmer effort.

Fig. 6. Speculative work in a search tree.

Assert and retract are handled by a database for partial results (known as lemmas). This allows meta-control of assert and retract, rather than the usual parallel approach of allowing these clauses to execute as they are encountered (asynchronous assert and retract). This again has the advantage of allowing tighter programmer control, but with the disadvantage of requiring greater programmer effort.

4.2.2. MUSE

MUSE supports a generalisation of cut, known as commit, which is a special form of pruning operator for parallel implementation and is not guaranteed to produce effects identical to sequential cut [4]. This provides simplicity, but has the danger that the sequential semantics may be altered. Full Prolog has since been implemented in MUSE, including cut and the standard side-effect predicates (e.g. read, write, assert, retract, etc.). The proposed mechanism for handling sequential side-effects is to allow them to execute only on the left-most branch of the search tree, suspending execution until that point [5].

A novel method for handling write-type side-effects (e.g. write and assert) is used in MUSE [34], exploiting the fact that such side-effects do not alter the binding environment. If a worker is unable to execute a write-type side-effect predicate, the predicate is temporarily saved in a suitable node on the search tree, and the worker then continues its execution as if that side-effect had been executed. The same method has been used to implement the findall, bagof and setof predicates when saving multiple solutions.

4.2.3. ROPM

Speculative work is pruned in a different way in ROPM [46]. A branch of the search tree pruned by a large cut is not discarded; it is restarted if the larger cut is itself pruned later by a smaller cut. This scheme wastes a large amount of memory space, because pruned regions cannot be discarded even if they will never be restarted.

4.2.4. Handling speculative work in Aurora

The Aurora implementation also supports the commit operator and synchronous side-effects [29]. In general, if a cut is in the scope of another cut, the cut with the larger scope must not prune away branches that would be pruned away by the cut with the smaller scope.

Various schemes for handling speculative work have been implemented in Aurora: the local leftmost and scope-information-based schemes for pruning speculative work, and the local-less-speculative-preferred, delayed-release and depth-first search schemes for scheduling speculative work.

4.2.4.1. Pruning. The local leftmost scheme is the simplest solution. It prunes all the branches rooted at the right siblings of a worker's sentry node, and suspends the execution of the cut until the branch becomes leftmost in the subtree belonging to the predicate containing the cut. The drawback of this scheme is the possibility of workers executing parts of the search tree which will be cut away, so that effort is wasted on this speculative work.

The scope-information-based scheme immediately prunes away all branches which would be pruned away by cuts with smaller scopes. Execution of a cut is suspended only if it is in the scope of a cut with a smaller scope. The implementation is by means of cut counters, which record the current number of cuts in a worker's continuation.
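The cut-counter bookkeeping can be sketched as follows. This is our own minimal illustration (the class and method names are assumptions), showing only the counter itself, not the pruning machinery around it.

```python
class WorkerCuts:
    """Per-worker cut counter: the number of cuts remaining in this worker's
    continuation, used to judge whether its work is speculative."""
    def __init__(self):
        self.cut_counter = 0

    def enter_clause(self, clause_has_cut):
        if clause_has_cut:
            self.cut_counter += 1   # the new work lies in one more cut's scope

    def execute_cut(self):
        self.cut_counter -= 1       # one pending cut has been resolved

    def is_speculative(self):
        # speculative while at least one pending cut could still prune the work
        return self.cut_counter > 0
```

Comparing two workers' counters gives the scheduler the scope information it needs: work inside more pending cut scopes is more speculative.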

In another approach proposed by Ali [2], a branch is delayed for execution until it can no longer be affected by cuts. While this prevents speculative work completely, it severely limits the amount of parallelism.

The scope information based approach was found to exhibit better speedups than the local leftmost scheme for more than 4 workers [29].

4.2.4.2. Scheduling. Ideally, if speculative work exists in a search tree, workers should not be committed to this area while useful work exists in other areas. The least speculative tasks are found in the leftmost branch of the subtree, and tasks become less speculative in the lower part of the subtree. Speculativeness can be assessed by counting the number of branches leading to cuts that could prune the work. Hausman proposed several schemes in an attempt to minimise the amount of time spent in executing speculative work [29].


The delayed release scheme makes speculative work available to other workers less frequently than non-speculative tasks. The delay before speculative work is made available can be either proportional to the speculativeness of the tasks, or constant. This scheme can be implemented by increasing the granularity of speculative work: work is made available after a certain number of calls, normally 10, increased by a factor when the work is speculative. Constant delay was found to give better performance, due to the overheads of counting pruning branches.
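The constant-delay variant can be sketched as follows. The threshold of 10 calls is from the text; the speculative factor is an assumption of ours, since the text does not fix a value for it.

```python
RELEASE_AFTER = 10       # calls before non-speculative work is made public (from the text)
SPECULATIVE_FACTOR = 4   # assumed factor; the text only says "increased by a factor"

def should_release(calls_done, speculative):
    """Constant-delay release rule: work goes public after a fixed number of
    calls, with a larger threshold when the work is speculative."""
    threshold = RELEASE_AFTER * (SPECULATIVE_FACTOR if speculative else 1)
    return calls_done >= threshold
```

The effect is that speculative work stays private longer, so idle workers are steered towards non-speculative tasks first, without any counting of pruning branches at release time.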

The local-less-speculative-preferred scheme has workers migrate to the leftmost branch leading to work and take the topmost available task there. Migration is controlled by having workers migrate to the branch with the least number of active workers, unless the work is speculative, in which case a worker migrates to the local leftmost branch leading to work.

The depth-first search scheme has workers attempting to take the bottom-most available task in the leftmost branch when work is speculative. Implementation of this in speculative regions is achieved by keeping all nodes in a boundary branch public, i.e. above the dispatching nodes. The general strategy is for workers searching for work to take the bottom-most available task instead of the topmost.

The combination of the depth-first and local-less-speculative-preferred strategies is considered the best, and has been used in the Bristol scheduler for Aurora [9]. An example of this scheme is shown in Fig. 7, where worker A is due to perform a cut and five other workers are working in the region to be pruned. In the Bristol scheduler, worker A will be able to identify and then interrupt workers B, E and F. Worker B will search the branch whose root S was suspended during the pruning by worker A, and will interrupt worker C. Worker D will in turn be interrupted by worker C. The five interrupted workers will look for work elsewhere, outside the pruned region.

5. Performance on multiprocessors

Of the OR-parallel systems reviewed above, several have been successfully implemented on multiprocessor machines and have shown promising speed-ups. Among these, Aurora and MUSE are the most representative in providing typical performance characteristics of an OR-parallel system. This section briefly compares Aurora and MUSE and then summarises their performance.

Fig. 7. Bristol Scheduler scheme for handling speculative work.

Both MUSE [4] and Aurora [40] exploit OR-parallelism by using a number of workers (processes or processors), each working on a different part of the Prolog search tree. As described in Section 3.3, different binding management approaches are used in the two systems: MUSE uses incremental copying of the WAM stacks, while Aurora uses the SRI Model.

The SRI Model extends the WAM by using a large binding array in each worker and modifying the trail to contain address-value pairs instead of just addresses [63]. A binding array is used in each worker to store and access variable bindings which are potentially shareable. The WAM stacks are shared by all workers. In MUSE, however, each worker has its own copies of the WAM stacks, plus some global address space shared by all workers. Workers incrementally copy parts of the stacks, and also share nodes with each other, when a worker runs out of work.

MUSE and Aurora also use different schedulers for exploiting and controlling parallelism (see also Section 4.1.2). Among the schedulers developed for Aurora, the Argonne scheduler and the Manchester scheduler have been evaluated for their performance on various machines; according to the reported results, the latter always outperforms the former. MUSE has only one scheduler. The main difference between the two Aurora schedulers and the MUSE scheduler is in the strategy used for dispatching work: the Argonne and Manchester schedulers take work from the topmost node on a branch, while the MUSE scheduler always takes the bottom-most node on a branch.

Many optimisations have been made for both Aurora and MUSE on the machines reported below [5,58]. The only optimisation that has been implemented for MUSE but not for Aurora is caching the WAM stacks on the BBN Butterfly TC2000. A detailed performance comparison between MUSE and Aurora running a knowledge-based system application is reported in [6].

5.1. On Sequent Symmetry

An early version of Aurora, using the Manchester scheduler, has been instrumented to evaluate a basic set of profiling data [57]. The benchmark suite consists of three groups of programs, classified according to their speed-ups on a Sequent Symmetry S27 multiprocessor (12 processors and 16 Mb of shared memory): programs with high speed-ups (e.g. the 8-queens problem) provide large search spaces, while programs with medium speed-ups (e.g. the zebra puzzle) and low speed-ups (e.g. the farmer-crossing-river problem) provide relatively smaller search spaces. These programs do not contain any large speculative work.

It is found that the Aurora execution time comes mainly from three sources: sequential execution plus parallel administration, task switching, and processor idle time [57]. The parallel administration overhead using the SRI binding scheme is roughly 25-30% of the sequential execution time; in other words, the ratio of the running time on Aurora with one worker to the running time on SICStus0.3 is about 1.25 to 1.30. The task-switching overhead increases considerably when the granularity decreases. The averages of the task-switching overheads do not vary among the three groups, but vary considerably for individual programs, because the frequency of scheduling operations varies from one program to another. The idle time is determined by the amount of parallelism available in the program, delays in the creation of work due to task switching, and the granularity of the exploitable parallelism.

A shared memory architecture requires centralised memory administration operations in implementing an OR-parallel system, such as locking when extending or shrinking parts of the search tree, and data migration when updating binding arrays [63]. The former accounts for 6-7% of the total overhead time and increases slightly with more workers. The latter accounts for at most 10% of the total overhead time and increases proportionally with the number of workers.
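The binding-array data migration mentioned above can be sketched as follows. This is a much-simplified illustration of the SRI scheme [63]; the node structure and the `migrate` helper are our own, not the actual implementation.

```python
# A sketch of SRI-style binding-array "data migration" on task
# switching. Each tree node records the conditional bindings made on
# the way into it; a worker moving to a new node deinstalls bindings
# up to the common ancestor, then installs bindings down to the target.

class Node:
    def __init__(self, parent=None, bindings=()):
        self.parent = parent
        self.bindings = list(bindings)   # [(var_index, value), ...]

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def migrate(binding_array, src, dst):
    """Update a worker's private binding array when moving src -> dst."""
    src_path, dst_path = path_to_root(src), path_to_root(dst)
    common = next(n for n in src_path if n in dst_path)
    node = src
    while node is not common:                    # deinstall upwards
        for var, _ in node.bindings:
            binding_array[var] = None
        node = node.parent
    down = []
    node = dst
    while node is not common:                    # collect dst's path
        down.append(node)
        node = node.parent
    for node in reversed(down):                  # install downwards
        for var, val in node.bindings:
            binding_array[var] = val

root = Node()
left = Node(root, [(0, "a")])
right = Node(root, [(0, "b"), (1, "c")])
ba = [None, None]
migrate(ba, root, left)    # worker starts at root, moves to left
migrate(ba, left, right)   # task switch: left -> right
print(ba)                  # ['b', 'c']
```

The cost of `migrate` grows with the distance between the two nodes, which is why task switching is not constant time under this scheme.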

MUSE was also evaluated on a Sequent Symmetry (S81 with 16 processors and 32 MB shared memory), but using SICStus0.6 Prolog as its engine [4]. Evaluating the same benchmark suite as used by Szeredi for Aurora [57], plus an additional set of large and real programs, produced results similar to those for Aurora, except that the absolute execution times on MUSE are about 30-50% shorter than those on Aurora. The results show almost linear speed-ups for the programs with coarse-grain parallelism, reasonable speed-ups for programs with medium-grain parallelism and low speed-ups for programs with fine-grain parallelism.

Table 1
Run-times (in seconds) of MUSE and Aurora on Sequent Symmetry

System   Benchmark   1 worker        4 workers      8 workers      15 workers      25 workers
MUSE     circuit     426.74 (1.00)   -              -              28.73 (14.9)    17.39 (24.5)
         8-queens    6.910           1.740 (3.97)   0.880 (7.85)   0.490 (14.10)   -
         zebra       4.390           1.331 (3.30)   0.840 (5.23)   0.689 (6.37)    -
         farmer      3.199           1.399 (2.29)   1.419 (2.25)   1.429 (2.24)    -
Aurora   circuit     533.69 (0.80)   -              -              36.06 (11.8)    21.83 (19.5)
         8-queens    7.831           2.000 (3.92)   1.010 (7.75)   0.559 (14.01)   -
         zebra       5.021           1.480 (3.39)   0.940 (5.34)   0.769 (6.53)    -
         farmer      3.620           2.110 (1.72)   2.110 (1.72)   2.390 (1.51)    -

It is found that copying part of a worker's state, making part of the search tree shareable, and grabbing a piece of work from a shared node are the major sources of overhead.
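The copying overhead can be illustrated with a minimal sketch of incremental copying, assuming (as MUSE does) that the idle worker already holds a valid prefix of the busy worker's stacks; the single-list stack and the `incremental_copy` helper are our own simplifications.

```python
# A sketch of MUSE-style incremental stack copying. The idle worker
# already shares a prefix of the busy worker's stack up to their
# deepest common node, so only the suffix beyond that prefix needs to
# be copied; this per-sharing cost grows as granularity shrinks.

def incremental_copy(busy_stack, idle_stack, common_depth):
    """Return the idle worker's new stack and the number of cells copied."""
    suffix = busy_stack[common_depth:]           # the part that differs
    new_stack = idle_stack[:common_depth] + suffix
    return new_stack, len(suffix)

busy = ["f0", "f1", "f2", "f3", "f4"]            # busy worker's frames
idle = ["f0", "f1", "x2"]                        # shares frames f0-f1
stack, copied = incremental_copy(busy, idle, 2)
print(stack, copied)   # ['f0', 'f1', 'f2', 'f3', 'f4'] 3
```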

Several performance comparisons have been conducted on Aurora and MUSE [4,6]. Table 1 lists the performance data selected from [4] and [6], where speed-ups are shown in parentheses. 'Circuit' is a knowledge-based program for designing circuit boards [6].

5.2. On BBN Butterfly

Performance evaluation has also been conducted on switch-based multiprocessor architectures, such as the BBN Butterfly GP1000 [41,42] and Butterfly TC2000 [42]. A switch-based machine has both local and non-local memories with different access times, as opposed to bus-based machines, such as the Sequent Symmetry, which give a uniform memory access time. An advantage of switch-based machines is their high scalability and thus the capability of running more processors. The programs described above that showed high speed-ups for up to 11 processors, such as 8-queens, do not provide sufficient parallelism for the larger-scale Butterfly machines.

The Butterfly GP1000 and TC2000 have similar architectures. The TC2000 is faster than the GP1000 and has three levels of memory hierarchy: cache, local and remote memories.

[Fig. 8 plots speed-up against the number of processors (up to 50) for 11-queens, 8-queens and tina, each under the Manchester and the Argonne scheduler.]

Fig. 8. The Manchester and Argonne schedulers on TC2000 [42].

When evaluating the performance on the Butterfly GP1000 and TC2000, the Aurora engine (based on SICStus0.3 Prolog on the GP1000, and on SICStus0.6 on the TC2000) and the binding environments are copied onto each of the processors, and shared data structures are distributed among all the processors. All the large benchmarks (e.g. the 11-queens problem) give near-linear speed-ups on up to 36 processors. The speed-ups of relatively smaller programs (e.g. the 8-queens problem and the holiday-planning program 'tina') start levelling off at 16 processors. The slowdown on the latter programs is due to the increase in non-local memory accesses and the lower cost-effectiveness of scheduling tasks with smaller granularity. The Argonne and Manchester schedulers were tested individually with the engine, with the performance shown in Fig. 8.

When evaluating the benchmark suite adopted by Szeredi [57], it was found that the programs with low and medium speedups could not show performance improvements when more than 4 processors were used on Butterfly TC2000.

Though both schedulers have shown their efficiency for programs with large and well-balanced search spaces, they do not perform well for programs with small search spaces, largely due to switch contention. However, the overall results on the 11-queens program (optimised for a better-balanced search space) show that over 600 KLIPS can be achieved on the Butterfly TC2000 with 36 processors, even though the caching capability is not fully exploited.

MUSE, however, supports caching of the WAM code stored locally in each worker. Its performance on the Butterfly TC2000, using the same set of benchmarks as for Aurora, shows that the execution time of an individual MUSE worker is mainly due to:
• sequential execution plus interrupt checking and local updating,
• waiting for work to be generated and looking for work to share, and
• making parts of the search tree shareable by other workers.

The results show that when running on one worker, MUSE is about 22% slower than SICStus0.6 Prolog, on which the MUSE engine is based. The major overhead in supporting parallel execution is due to the operations required to make work shareable with other workers. Such operations include copying the WAM stacks from the current worker to other workers and synchronising the workers involved. This overhead increases as the granularity of parallelism decreases.

The overall performance of MUSE is quite encouraging. The average real speed-up on 32 TC2000 processors over one processor is 25.4 for the programs with coarse grain parallelism. For all the benchmarks tested on TC2000, MUSE is faster than Aurora by 39% to 171% (Table 2).

Table 2
Run-times (in seconds) of MUSE and Aurora on BBN Butterfly TC2000

System   Benchmark   1 worker        10 workers     20 workers     30 workers     37 workers
MUSE     circuit     105.97 (1.00)   10.81 (9.80)   5.56 (19.1)    3.93 (27.0)    3.29 (32.2)
         11-queens   225.23          22.78 (9.89)   11.58 (19.4)   7.88 (28.6)    -
         8-queens    1.79            0.21 (8.52)    0.14 (12.8)    0.13 (13.8)    -
         zebra       0.98            0.58 (1.69)    0.61 (1.61)    0.63 (1.56)    -
         farmer      0.83            1.01 (0.82)    1.03 (0.81)    1.07 (0.78)    -
Aurora   circuit     180.55 (0.59)   22.12 (4.79)   16.02 (6.61)   13.66 (7.76)   13.79 (7.68)
         11-queens   369.14          36.92 (10.0)   18.54 (19.9)   12.47 (29.6)   -
         8-queens    2.85            0.33 (8.64)    0.21 (13.6)    0.22 (13.0)    -
         zebra       1.55            0.79 (1.96)    1.12 (1.38)    1.99 (0.78)    -
         farmer      1.14            1.80 (0.63)    2.14 (0.53)    2.33 (0.49)    -

6. Summary

Efficient binding management schemes and task control strategies for the support of OR-parallelism have been extensively investigated over the last few years. The process-based model and the OR-branch-based model are the two main models for representing parallel tasks. The most representative systems are summarised in Table 3, which shows the differences in terms of the languages supported, the implementation models, the binding management schemes and the task control strategies.

Table 3
A summary of the representative systems

System           Language          Model       BE management                        Task control
AND/OR Process   pure LP           process     distributed BE                       message-passing
ANL-WAM          pure LP           OR-branch   multi-sequence                       demand-driven
Aurora           full Prolog       OR-branch   multi-sequence (structure sharing)   demand-driven
BRAVE            modified Prolog   OR-branch   multi-sequence                       demand-driven
MUSE             full Prolog       OR-branch   multi-sequence (structure copying)   demand-driven
ROPM             modified Prolog   process     distributed BE                       message-passing
PEPSys           modified Prolog   process     centralised BE                       demand-driven

The systems based on the centralised environment scheme use a global auxiliary structure to store all the variable bindings. Though high efficiency has been achieved in some schemes, high system scalability may still be difficult to obtain. The major disadvantages of the centralised binding scheme are that dereferencing a variable bound at a very early stage can be costly, and that linking the BEs generated at different resolution stages requires a shared memory organisation to support the auxiliary structure. The scalability problem also exists for task control strategies that use a similarly centralised structure.

Among the models using centralised auxiliary structures, the directory tree method [14], in which each node has its own directory containing a number of contexts, has non-constant-time environment creation. The hashing window method reduces the task-switching overhead, but pays more for variable access [11,65]. The time-stamping method used in BOPLOG [60] sacrifices both constant-time variable access and constant-time task switching.
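The hash-window trade-off can be sketched minimally as follows, with a hypothetical dictionary-based window per node; the real schemes [11,65] hash on variable addresses, but the cost profile is the same.

```python
# A sketch of hash-window variable access. Each OR-branch node carries
# a small hash table of conditional bindings; task switching is cheap
# because no binding array has to be rebuilt, but a variable access may
# have to search every window between the current node and the
# variable's home node.

def lookup(var, node):
    """Search hash windows from the current node towards the root.

    Returns (value, windows searched); value is None if unbound."""
    steps = 0
    while node is not None:
        steps += 1
        if var in node["window"]:
            return node["window"][var], steps
        node = node["parent"]
    return None, steps

root = {"parent": None, "window": {"X": "a"}}
mid  = {"parent": root, "window": {}}
tip  = {"parent": mid,  "window": {"Y": "b"}}

print(lookup("Y", tip))   # ('b', 1)  -- bound locally, cheap
print(lookup("X", tip))   # ('a', 3)  -- walks three windows
```

The deeper the tree grows, the longer the worst-case search chain, which is exactly the non-constant variable-access cost noted above.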

Attempts have been made by other researchers to achieve better scalability and distributability by using distributed approaches. These distributed models also allow better task control, but usually pay a price in environment creation and updating when extensive structure copying is needed. Among these models, the closing environment method [20] and DIALOG [69] have non-constant-time environment creation. The variable importation method sacrifices both constant-time environment creation and constant-time variable access [39].

Yet the most promising binding management scheme has been the multiple-sequences scheme. The two best-known systems using this scheme are Aurora [40] and MUSE [4], which, though sacrificing constant-time task switching, maintain the efficiency of standard sequential Prolog while offering high scalability. Both Aurora and MUSE have been successfully implemented on a number of multiprocessor machines with large numbers of processors, and have demonstrated encouraging performance. This class of systems can be expected to show further improved performance as more efficient schedulers are developed.

Recently developed schedulers have taken into account speculative work caused by the use of cut. Speculative work represents an interesting source of potential parallelism, provided that wasted computation can be avoided gracefully. Previous studies have shown that if an effective method is found to eliminate much of the wasted speculative work, significant speed-ups can be obtained [10]. This is especially true where only a subset of the possible solutions is required.

The current trend is to design and implement working systems which exploit both OR- and AND-parallelism. Intuitively, a combined AND/OR-parallel system is expected to offer superior performance over a system which exploits only one type of parallelism.


Acknowledgments

The author is very grateful to Khayri Ali, Mehmet Orgun, Chengzheng Sun and Rong Yang for their comments and suggestions on an early draft, which clarified the use of certain terminology and significantly improved the presentation of the paper. Thanks also go to Paul English for his discussions on speculative work. The anonymous referees are thanked for their comments and suggestions.

References

[1] H. Ait-Kaci, Warren's Abstract Machine - A Tutorial Reconstruction (MIT Press, Cambridge, MA, 1991).
[2] K.A.M. Ali, A method for implementing cut in parallel execution of Prolog, in: Proc. 1987 Symp. on Logic Programming, San Francisco, USA (1987) 449-456.
[3] K.A.M. Ali, OR-parallel execution of Prolog on BC-machine, in: Proc. 5th Internat. Conf. and Symp. on Logic Programming, Seattle, WA, USA (15-19 Aug. 1988) 1531-1545.
[4] K.A.M. Ali and R. Karlsson, The Muse Or-Parallel Prolog model and its performance, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990).
[5] K.A.M. Ali and R. Karlsson, Scheduling Or-Parallelism in Muse, in: Proc. 8th Internat. Conf. on Logic Programming, Paris, France (June 1991).
[6] K.A.M. Ali and R. Karlsson, OR-Parallel Speedups in a Knowledge Based System: on Muse and Aurora, in: FGCS'92, Tokyo (1992).
[7] H. Alshawi and D.B. Moran, The Delphi Model and Some Preliminary Experiments, in: Proc. 5th Internat. Conf. and Symp. on Logic Programming, Seattle, WA, USA (15-19 Aug. 1988) 1578-1589.
[8] U. Baron et al., The Parallel ECRC Prolog System PEPSys: An overview and evaluation results, in: Proc. Internat. Conf. on Fifth Generation Computer Systems, Tokyo (28 Nov. - 2 Dec. 1988) 841-850.
[9] A. Beaumont et al., Flexible scheduling of OR-Parallelism in Aurora: The Bristol Scheduler, in: PARLE'91: Conf. on Parallel Architectures and Languages Europe (Springer, Berlin, June 1991).
[10] A. Beaumont, Scheduling strategies and speculative work, in: Proc. ICLP'91 Pre-conference Workshop on Parallel Execution of Logic Programs, Paris, France (June 1991).
[11] P. Borgwardt, Parallel Prolog using stack segments on shared-memory multiprocessors, in: Proc. 1984 Internat. Symp. on Logic Programming (Feb. 1984) 2-11.
[12] A. Calderwood and P. Szeredi, Scheduling OR-Parallelism in Aurora - the Manchester Scheduler, in: Proc. 6th Internat. Conf. on Logic Programming (June 1989) 419-435.
[13] M. Carlsson and J. Widen, SICStus Prolog User's Manual, SICS Research Report R88007B, October 1988.
[14] A. Ciepielewski and S. Haridi, A formal model for OR-parallel execution of logic programs, in: Proc. Information Processing 83 (1983) 299-305.
[15] A. Ciepielewski, B. Hausman and S. Haridi, Initial evaluation of a virtual machine for OR-parallel execution of logic programs, in: Proc. IFIP TC-10 Working Conf. on Fifth Generation Computer Architectures, J.V. Woods (ed.) (Elsevier Science, Amsterdam, 1985) 81-99.
[16] A. Ciepielewski and B. Hausman, Performance evaluation of a storage model for OR-parallel execution of logic programs, in: Proc. IEEE Symp. on Logic Programming, Salt Lake City, Utah, USA (Sep. 1986) 246-257.
[17] A. Ciepielewski, S. Haridi and B. Hausman, OR-parallel prolog on shared memory multiprocessors, J. Logic Programming 7 (1989) 125-147.
[18] W.F. Clocksin and C.S. Mellish, Programming in Prolog (Springer, New York, 1981).
[19] J.S. Conery, Parallel Execution of Logic Programs (Kluwer, Dordrecht, 1987).
[20] J.S. Conery, Binding environments for parallel logic programs in non-shared memory multiprocessors, Internat. J. Parallel Programming 17 (2) (1988) 125-152.
[21] J. Crammond, A comparative study of unification algorithms for OR-parallel execution of logic languages, IEEE Trans. Comput. C-34 (10) (Oct. 1985) 911-917.
[22] S.A. Delgado-Rannauro and T.J. Reynolds, A message driven OR-parallel machine, in: Proc. 3rd Internat. Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, USA (April 1989) 217-226.
[23] T. Disz, E. Lusk and R. Overbeek, Experiments with OR-parallel logic programming, in: Proc. 4th Internat. Conf. on Logic Programming, J-L. Lassez (ed.) (1987) 576-600.
[24] J. Gabriel, T. Lindholm, E.L. Lusk and R.A. Overbeek, A tutorial on the Warren abstract machine for computational logic, Technical Report ANL-84-84, Argonne National Laboratory, Argonne, USA, June 1985.
[25] G. Gupta and B. Jayaraman, On criteria for OR-parallel execution models of logic programs, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990) 605-623.
[26] Z. Halim, Data-driven and demand-driven evaluation of logic programs, PhD Thesis, Dept. of Computer Science, University of Manchester, 1984.
[27] Z. Halim, A data-driven machine for OR-parallel evaluation of logic programs, New Generation Comput. 4 (1986) 5-33.
[28] B. Hausman, A. Ciepielewski and A. Calderwood, Cut and side-effects in OR-parallel prolog, in: Proc. Internat. Conf. on Fifth Generation Computer Systems, Tokyo (28 Nov. - 2 Dec. 1988) 831-840.
[29] B. Hausman, Handling of speculative work in OR-parallel PROLOG: Evaluation results, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990).
[30] N. Ito and H. Shimizu, Dataflow based execution mechanisms of parallel and concurrent Prolog, New Generation Comput. 3 (1985) 15-41.
[31] N. Ito et al., The architecture and preliminary evaluation results of the experimental parallel inference machine PIM-D, in: Proc. 13th Ann. Internat. Symp. on Computer Architecture (1986) 533-541.
[32] L.V. Kale, D.A. Padua and D.C. Sehr, OR parallel execution of prolog programs with side effects, J. Supercomput. 2 (1988) 209-223.
[33] L.V. Kale, B. Ramkumar and W. Shu, A memory organisation independent binding environment for AND and OR parallel execution of logic programs, in: Proc. 5th Internat. Conf. and Symp. on Logic Programming, Seattle, WA, USA (15-19 Aug. 1988) 1223-1240.
[34] R. Karlsson, A high performance OR-parallel prolog system, PhD Thesis, The Royal Institute of Technology and Swedish Institute of Computer Science, March 1992.
[35] K. Knight, Unification: A multidisciplinary survey, ACM Comput. Surv. 21 (1) (March 1989) 93-124.
[36] P.M. Kogge, The Architecture of Symbolic Computers (McGraw-Hill, New York, 1991).
[37] R.A. Kowalski, Predicate logic as a programming language, in: Proc. Information Processing 74 (Aug. 1974) 569-574.
[38] K. Kumon et al., KABU-WAKE: A new parallel inference method and its evaluation, COMPCON Spring'86 (1986) 168-172.
[39] G. Lindstrom, OR-parallelism on applicative architectures, in: Proc. 2nd Internat. Logic Programming Conf. (July 1984) 159-170.
[40] E. Lusk, D.H.D. Warren, S. Haridi et al., The Aurora OR-Parallel Prolog System, in: Proc. Internat. Conf. on Fifth Generation Computer Systems, Tokyo (28 Nov. - 2 Dec. 1988) 819-830.
[41] S. Mudambi, Performance of Aurora on a switch-based multiprocessor, in: Proc. 1989 North American Conf. on Logic Programming, Cleveland, USA (Oct. 1989).
[42] S. Mudambi, Performance of Aurora on NUMA machines, in: Proc. 1991 Internat. Symp. on Logic Programming, San Diego, USA (Oct. 1991).
[43] L. Naish, Negation and Control in Prolog, LNCS-238 (Springer, Berlin, 1985).
[44] G.Z. Qadah and M. Nussbaum, Logic Machines: A Survey, in: AFIPS Conf. Proc. NCC, Vol. 56, Chicago, IL (June 1987) 256-278.
[45] M. Ratcliffe and J-C. Syre, The PEPSys parallel logic programming language, in: Proc. 10th Internat. Joint Conf. on Artificial Intelligence, Milano, Italy (Aug. 1987).
[46] B. Ramkumar and L. Kale, Compiled execution of the Reduce-OR process model on multiprocessors, in: Proc. 1989 North American Conf. on Logic Programming, Cleveland, USA (Oct. 1989) 313-331.
[47] T.J. Reynolds and D. Lyons, Transputers and Parallel Prolog, in: Proc. 7th Occam User Group Technical Meeting, T. Muntean (ed.), Grenoble, France (Sep. 1987) 221-228.
[48] T.J. Reynolds et al., BRAVE - A parallel logic language for artificial intelligence, Future Generation Comput. Syst. 4 (1988) 69-75.
[49] T.J. Reynolds et al., BRAVE on GRIP, in: Proc. ICL Conf., York (May 1988).
[50] T.J. Reynolds and S. Delgado-Rannauro, VLSI for parallel execution of Prolog, in: Proc. Internat. Workshop on VLSI for Artificial Intelligence, Oxford (July 1988).
[51] T.J. Reynolds and P. Kefalas, OR-Parallel Prolog and search problems in AI applications, in: Proc. 1990 North American Conf. on Logic Programming, Austin, USA (Oct. 1990).
[52] E. Shapiro, Concurrent Prolog: Collected Papers (MIT Press, Cambridge, MA, 1987).
[53] E. Shapiro, The family of concurrent logic programming languages, ACM Comput. Surv. 21 (3) (Sep. 1989).
[54] C.Z. Sun and Y.G. Ci, The OR-forest description for the execution of logic programs, in: Proc. 3rd Internat. Conf. on Logic Programming (July 1986) 710-717.
[55] C.Z. Sun and Y.G. Ci, The sharing of environment in AND-OR-parallel execution of logic programs, in: Proc. 14th Internat. Symp. on Computer Architecture (June 1987) 137-144.
[56] C.Z. Sun and Y.G. Ci, The OR-forest-based parallel execution model of logic programs, Future Generation Comput. Syst. 6 (1) (June 1990) 24-34.
[57] P. Szeredi, Performance analysis of the Aurora Or-Parallel Prolog system, in: Proc. 1989 North American Conf. on Logic Programming, Cleveland, USA (Oct. 1989).
[58] P. Szeredi, Solving optimisation problems in the Aurora OR-Parallel Prolog System, in: Proc. ICLP'91 Pre-conference Workshop on Parallel Execution of Logic Programs, Paris, France (June 1991).
[59] E. Tick and D.H.D. Warren, Towards a pipelined Prolog processor, in: Proc. IEEE Internat. Symp. on Logic Programming, Atlantic City, NJ, USA (Feb. 1984) 29-40.
[60] P. Tinker and G. Lindstrom, A performance-oriented design for OR-parallel logic programming, in: Proc. 4th Internat. Conf. on Logic Programming, J-L. Lassez (ed.) (1987) 601-615.
[61] D.H.D. Warren, Implementing Prolog - Compiling predicate logic programs, DAI Research Reports No. 39 and 40, University of Edinburgh, 1977.
[62] D.H.D. Warren, An Abstract Prolog Instruction Set, Technical Note 309, AI Centre, SRI International, August 1983.
[63] D.H.D. Warren, The SRI model for OR-parallel execution of Prolog - Abstract design and implementation issues, in: Proc. 1987 Symp. on Logic Programming (1987) 92-102.
[64] D.S. Warren, Efficient Prolog memory management for flexible control strategy, in: Proc. 1984 Internat. Symp. on Logic Programming (1984) 198-202.
[65] H. Westphal et al., The PEPSys Model: Combining backtracking, AND- and OR-parallelism, in: Proc. 1987 Symp. on Logic Programming (1987) 436-448.
[66] M.J. Wise, A Parallel Prolog: The construction of a data-driven model, in: Proc. Symp. on Lisp and Functional Programming, ACM (1982) 55-66.
[67] M.J. Wise, Prolog Multiprocessors (Prentice-Hall, Englewood Cliffs, NJ, 1986).
[68] H. Yasuhara and K. Nitadori, ORBIT: A parallel computing model of Prolog, New Generation Comput. 2 (1984) 277-288.
[69] K. Zhang and R. Thomas, A non-shared binding scheme for Parallel Prolog implementation, in: Proc. 12th Internat. Joint Conf. on Artificial Intelligence, Sydney (24-30 Aug. 1991) 877-882.
[70] K. Zhang and R. Thomas, DIALOG - A dataflow model for parallel execution of logic programs, Future Generation Comput. Syst. 6 (4) (Sep. 1991) 373-388.


K. Zhang is a Lecturer at Macquarie University, Sydney, Australia. He received his BEng in Computer Studies from Chengdu Institute of Radio Engineering (now University of Electronic Science and Technology of China) in 1982, and his PhD from Brighton Polytechnic (CNAA) in 1990. He was a Software Engineer in the CAD Section of the East-China Research Institute of Computer Technology, Shanghai, between 1982 and 1985, and then an Academic Visitor to the SEAKE Centre at Brighton Polytechnic in 1986. After his postgraduate studies, he was an SERC Postdoctoral Fellow in 1991, before joining Macquarie University. Dr. Zhang's current research interests are in the areas of parallel implementation of logic programs, program visualisation and parallel programming tools.