
Models of Parallel Computation and Parallel Complexity

George Lentaris

µΠλ∀

A thesis submitted in partial fulfilment of the

requirements for the degree of Master of Science in

Logic, Algorithms and Computation.

Department of Mathematics,

National & Kapodistrian University of Athens

supervised by:

Asst. Prof. Dionysios Reisis, NKUA

Prof. Stathis Zachos, NTUA

Athens, July 2010


Abstract

This thesis reviews selected topics from the theory of parallel computation. The research begins with a survey of the proposed models of parallel computation. It examines the characteristics of each model and it discusses its use either for theoretical studies or for practical applications. Subsequently, it employs common simulation techniques to evaluate the computational power of these models. The simulations establish certain model relations before advancing to a detailed study of the parallel complexity theory, which is the subject of the second part of this thesis. The second part examines classes of feasible highly parallel problems and it investigates the limits of parallelization. It is concerned with the benefits of the parallel solutions and the extent to which they can be applied to all problems. It analyzes the parallel complexity of various well-known tractable problems and it discusses the automatic parallelization of the efficient sequential algorithms. Moreover, it compares the models with respect to the cost of realizing parallel solutions. Overall, the thesis presents various class inclusions, problem classifications and open questions of the field.

George Lentaris

Committee: Asst. Prof. Dionysios Reisis, Prof. Stathis Zachos, Prof. Vassilis Zissimopoulos, Lect. Aris Pagourtzis.

Graduate Program in Logic, Algorithms and Computation

Departments of Mathematics, Informatics, M.I.TH.E. – National & Kapodistrian University of Athens
Departments of General Sciences, Electrical and Computer Engineering – National Technical University of Athens
Department of Computer Engineering and Informatics – University of Patras

µΠλ∀


Contents

1 Introduction 3

2 Models of Parallel Computation 8
  2.1 Shared Memory Models 9
    2.1.1 PRAM 10
    2.1.2 PRAM Extensions 14
  2.2 Distributed Memory Models 18
    2.2.1 BSP 19
    2.2.2 LogP 23
    2.2.3 Fixed-Connection Networks 27
  2.3 Circuit Models 32
    2.3.1 Boolean Circuits 32
    2.3.2 FSM, Conglomerates and Aggregates 36
    2.3.3 VLSI 39
  2.4 More Models and Taxonomies 44
    2.4.1 Alternating Turing Machine 45
    2.4.2 Other Models 46
    2.4.3 Taxonomy of Parallel Computers 51

3 Model Simulations and Equivalences 55
  3.1 Shared Memory Models 56
    3.1.1 Overview 56
    3.1.2 Simulations between PRAMs 58
  3.2 Distributed Memory Models 64
    3.2.1 The simulation of a shared memory 65
    3.2.2 Simulations between BSP and LogP 68
  3.3 Boolean Circuits Simulations 73
    3.3.1 Equivalence to the Turing Machine 74
    3.3.2 Equivalence to the PRAM 76

4 Parallel Complexity Theory 81
  4.1 Complexity Classes 82
    4.1.1 Feasible Highly Parallel Computation 83
    4.1.2 Beneath and Beyond NC 100
  4.2 P-completeness 109
    4.2.1 Reductions and Complete Problems 111
    4.2.2 NC versus P 121
  4.3 The Parallel Computation Thesis 126

5 Conclusion 131


Chapter 1

Introduction

Digital parallel computers date back to the late 1950s. At that time the industry had already developed a number of sequential machines and the engineers became interested in the use of parallelism in numerical calculations. Consequently, during the 1960s and the 1970s, several shared-memory ‘multiprocessor’ systems were designed for academic and commercial purposes. Such systems consisted of a small number of processors (1 to 4) connected to a few memory modules (1 to 16) via a crossbar switch. The processors were operating side-by-side on shared data. The advances in the technology of integrated circuits and the ideas of famous architects, such as G. Amdahl and S. Cray, contributed to the evolution of supercomputers. By the early 1980s, the Cray X-MP supercomputer could perform more than 200 Million FLoating-point Operations Per Second (MFLOPS). In the mid-1980s, massively parallel processors (MPPs) were developed by connecting mass market, off-the-shelf, microprocessors. The ASCI Red MPP was the first machine to rate above 1 Tera-FLOPS in 1996 by utilizing more than 4,000 computing nodes arranged in a grid. The 1990s also saw the evolution of computer clusters, which are similar to the MPPs without, however, being as tightly coupled. A cluster consists of mass market computers, which are connected by an off-the-shelf network (e.g., Ethernet). This architecture has gained ground over the years, characterizing most of the latest top supercomputers. The Cray Jaguar cluster is the fastest parallel machine today, performing almost 2 Peta-FLOPS and utilizing more than 200,000 cores.

The purely theoretical study of parallel computation began in the 1970s [1]. The first results concerned circuit simulations relating that model to the Turing machine. In the second half of the decade various models of parallel computation were formulated, including the Parallel Random Access Machine. Also, Pippenger introduced the class of feasible highly parallel problems (NC), the theory of P-completeness was born, and researchers started identifying the inherently sequential problems. During the 1980s and the 1990s the area expanded dramatically, flooding the literature with numerous important papers and books. To this day, parallel computation constitutes an active research area, which targets efficient algorithms and architectures, as well as novel models of computation and complexity results.

Basic terms and notions

Through all these years of theory and practice certain terms and notions have been developed. To begin with, most researchers would describe parallel computing as the combined use of two, or more, processing elements to compute a function. That is, the parts of the computation are performed by distinct processing elements, which communicate during the process to exchange information and coordinate their operations. The number and the size of the parts of the computation, as well as the volume of the exchanged information, are qualitatively characterized by terms such as granularity and level of parallelism. Roughly, granularity indicates the ratio of computation to communication steps: a fine grained parallel process consists of numerous successive communication events with few local operations in between them, while a coarse grained computation is partitioned into big tasks requiring limited coordination (synchronization). Extremely coarse grained computations with almost no task interdependencies can be used for the so called ‘embarrassingly parallel problems’, which are the easiest to solve in parallel. Evidently, the granularity is related to the chosen level of parallelism: task, instruction, data, and bit level. Task level parallelism focuses on assigning entire subroutines of the algorithm (possibly very different in structure and scope) to distinct processors of the machine. At instruction level, we focus on parallelizing consecutive program instructions while preserving their interdependencies (as, e.g., in superscalar processors). Data level parallelism distributes the data to be processed across different computing nodes, which work side-by-side possibly executing the same instructions. At bit level, the machine processes concurrently several bits of the involved datum (a technique explored mostly in circuit design). Generally, the parallelism level and the granularity of the computation determine the use of either tightly or loosely coupled systems. A coarse grained, task level, parallel computation can be performed by a machine with loosely coupled components, e.g., off-the-shelf computers located at distant places (clusters). On the contrary, a fine grained, data level, computation should be implemented on a tightly coupled system with, e.g., shared memory multiprocessors.

Clearly, parallelization is used to obtain faster solutions than those offered by the sequential computation. In this direction, we define the speedup of a parallel algorithm utilizing p processors as the ratio of the time required by a sequential algorithm over the time required by the parallel algorithm, i.e., speedup = T(1)/T(p). More precisely, when the sequential time corresponds to the fastest known algorithm for that specific problem, we get the absolute speedup. Else, when the sequential time is measured with the parallel algorithm itself running on a single processor, we get the relative speedup. Note that a parallel solution cannot reduce the sequential time by a factor greater than the number of the utilized processors, i.e., speedup ≤ p. Consequently, from a complexity theoretic viewpoint, in order to significantly reduce the polynomial time of a tractable problem to, say, sublinear parallel time, we have to invest at least a polynomial number of processors.

The aforementioned linear speedup is ideal and in most cases it cannot be achieved by real world machines. Among applications, the speedup usually ranges from p to log p (with a middle ground at p/log p) [2]. Based on empirical data, the researchers propose various practical upper bounds for the speedup, the most notable being Amdahl's law. According to Amdahl, each computation includes an inherently sequential part, which cannot be parallelized. Assuming that this part is a fraction f of the entire computation, we get speedup ≤ 1/(f + (1−f)/p). This ‘sequential overhead’ f varies with respect to the problems and the algorithms. Fortunately, in some cases f is very small, and in other cases f decreases as the size of the problem increases. For instance, Gustafson's law states that increasing sufficiently the size of a problem and distributing the workload in several processors will probably result in almost linear speedup (experimentally confirmed), i.e., speedup ≤ p − fn·(p − 1). Notice the difference between the two laws: Amdahl examines the performance of the algorithm with respect to the number of processors by assuming fixed-size inputs and constant f, while Gustafson observes that f is size-dependent and probably diminishes for large datasets. Overall, in the case of fixed size problems, we have that speedup → 1/f as p → ∞. In other words, beyond a certain number of processors, adding more hardware to the parallel machine will have no actual effect on the execution time. In fact, the ratio T(1)/T(∞) is called the parallelism of the computation and it can be used either as a theoretical maximum of the speedup or as a practical limit on the number of the useful processors [3].
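To make the two bounds concrete, the following minimal C sketch evaluates them directly from the formulas above; the function names are illustrative and f stands for the sequential fraction (measured on the scaled problem in Gustafson's case):

#include <stdio.h>

/* Amdahl: speedup <= 1/(f + (1-f)/p), fixed-size problem with sequential fraction f */
double amdahl_bound(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

/* Gustafson: speedup <= p - f*(p-1), with f measured on the scaled (large) problem */
double gustafson_bound(double f, double p)
{
    return p - f * (p - 1.0);
}

int main(void)
{
    /* e.g., a 5% sequential part on 100 processors */
    printf("Amdahl:    %.2f\n", amdahl_bound(0.05, 100));     /* prints 16.81 */
    printf("Gustafson: %.2f\n", gustafson_bound(0.05, 100));  /* prints 95.05 */
    return 0;
}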

Besides speedup, we are interested in the effectiveness of the parallel computation. To be more precise, we define the parallel efficiency as the ratio speedup/p (again, we have absolute and relative metrics). The efficiency value –typically between 0 and 1– reflects the overhead of the parallelism, such as the time spent for communication and synchronization steps (uniprocessors and optimal speedup algorithms have parallel efficiency equal to 1). Alternatively, we can express the efficiency as the ratio T(1)/pT(p), which compares the sequential cost T(1) to the parallel cost p · T(p). Note here that the product processors × time is a common cost function, which takes into account both time and hardware requirements of the parallel solution. Additionally to the efficiency, we define ratios such as the redundancy = W(p)/W(1), and the utilization = W(p)/pT(p), where W(p) denotes the total number of operations performed by the p processors and is called the work (or energy) of the computation [2].
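As a small illustration (not from the thesis), these ratios can be computed directly from measured times and operation counts; T1, Tp, W1, Wp and p below are assumed inputs:

/* speedup, efficiency and related ratios, exactly as defined above */
double speedup(double T1, double Tp)               { return T1 / Tp; }
double efficiency(double T1, double Tp, double p)  { return T1 / (p * Tp); }   /* = speedup/p */
double cost(double p, double Tp)                   { return p * Tp; }          /* processors x time */
double redundancy(double W1, double Wp)            { return Wp / W1; }
double utilization(double Wp, double p, double Tp) { return Wp / (p * Tp); }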

Thesis scope and organization

The current thesis reviews parallel computation from a theoretical viewpoint. First, it is concerned with the formal description of the computation and the understanding of abstract notions such as concurrency and coordination. Hence, chapter 2 surveys the models of parallel computation that have been proposed by various researchers in the literature. All these modeling approaches show the line of thought developed by experts studying parallelism and, at the same time, summarize the design trends used in real world applications.

The diversity of the models in the literature gives rise to certain questions. Most of them regard the computational power of these models and the ability to migrate solutions between them in a mechanical fashion. That is, given the formal description of a computation for a specific model (a program or a circuit), it is important to study generalized techniques, called simulations, which allow the given computation to be performed by another model. Chapter 3 uses simulations to investigate the abilities of certain categories of models and to compare model variations with respect to the cost of realizing parallel solutions.

Having been acquainted with the basics of parallel computation, the thesis continues with a second part devoted to the study of problems. Specifically, it is concerned with fundamental questions such as: are all problems amenable to parallelization? Can we use parallelization to solve intractable problems? How can we classify problems with respect to the difficulty of parallelizing them? How are these classes related? What is the relation between parallel and sequential computation cost? Is there a way to transform algorithmically an efficient sequential solution to an efficient parallel solution? Such questions are tackled, among others, by parallel complexity theory, which is explored in chapter 4. Finally, chapter 5 concludes this thesis.


Chapter 2

Models of Parallel Computation

Prior to designing parallel solutions, analyzing algorithms, or studying the parallel complexity of problems, we must define a suitable model to describe the parallel computation. A model of parallel computation is a parameterized description of a class of machines [1]. Each machine in the class is capable of computing a specific function and is obtained from the model by determining values for its parameters (e.g., a Turing machine is obtained by determining the work tapes, the symbol sets, the states and the transition function).

Over the years, several models of parallel computation have been proposed in the literature. Although many of them are widely used until today, none was ever universally accepted as dominant. The reason is that each model serves a slightly different goal than the others, rendering its own use indisputable. To explain the diversity in modeling parallel computation, one should examine the modeling trends summarized in the following four axes: computation, concurrency, theory, and practice. That is, a parallel model should capture the notion of computation and, moreover, the notion of concurrency. Regarding the first, it is clear that we can adopt the outcome of the efforts made in the domain of sequential computation (which is already vast). Regarding the second, the model must reflect the fact that different parts of the computation are performed in parallel and, more essentially, it must include a means of communication for the coordination of the entire computation. By combining computation and communication ideas, we can derive a plethora of parallel models. In another direction, the model designer must make choices between theory and practice. That is, the designer must balance factors such as generality and technicality, ease of performance analysis and plausibility of real-world implementation, novelty and historical use, etc. Depending on the intended use and the available technology, these choices add flexibility, which inevitably leads to a proliferation of different models.

Considering the above, the current chapter describes the majority of the parallel models of the literature. The presentation of the models is organized as follows. We begin with the processor-based machines, i.e., the models capturing computation with the use of powerful components able to compute complicated functions on their own. These models are further divided in two major categories according to the employed communication method: common memory or message exchanges. The third section introduces the circuit models, which capture computation based on very low complexity components and communication lines. Within each one of the three sections, the presentation begins with a high level of abstraction (mostly theoretic models) and continues with the incorporation of real-world details to discuss practical models. Finally, the fourth section describes models beyond these categories, obsolete and novel modeling approaches, as well as architectural taxonomies of the modern parallel computers.

2.1 Shared Memory Models

The shared memory models were among the first to be proposed for the study of parallel computation (1970s). The most characteristic shared memory model is the PRAM. In fact, every other model of the category can be viewed as an extension of the original PRAM of Fortune and Wyllie. In a natural generalization of the single processor, the PRAM is a collection of independent processors with ordinary instruction sets and local memories, plus one central, shared, memory to which every processor is connected in a random access fashion (figure 2.1). Note that the processors cannot communicate directly with each other; their communication is performed through the central memory (via designated variables).

Figure 2.1: The shared memory model: processors and common memory

The original PRAM is an ideal model, which abstracts away many of the real world implementation details. As a result, it allows for a straightforward performance analysis of the parallel algorithm. However, in some cases, the disregard of real world parameters might lead to false estimations of practical situations. As a remedy, certain extensions have been added to the PRAM over the years in order to keep track of the current technology aspects. The following subsections describe the PRAM with its variations and its extensions.

2.1.1 PRAM

The Parallel Random Access Machine (PRAM) was introduced in 1978 [4]. It is based on a set of Random Access Machines (RAMs) sharing a common memory and operating in parallel. Formally,

Definition 2.1.1. A PRAM consists of an unbounded set of processors P = {P0, P1, . . .}, an unbounded global memory, a set of input registers, and a finite program. Each processor has an unbounded local memory, an accumulator, a program counter, and a flag indicating whether the processor is running or not.

All memory locations (cells) and accumulators are capable of holding arbitrarily long integers. Following the RAM notion in [5], the PRAM program is a finite sequence of instructions Π = 〈π0, π1, . . . , πm〉, where each labeled instruction πi is one of the types LOAD, STORE, ADD, SUB, JUMP, JZERO, READ, FORK, HALT. At each time instant, Pi can access either a cell of the global memory or a cell of its local memory (it cannot access the local memory of another processor).

Initially, the input to the PRAM, x ∈ {0, 1}^n, is placed in the input registers (n is placed in the accumulator of P0), all memory is cleared and P0 is started. During the computation, a processor Pi can activate any other processor Pj by using a FORK instruction to determine the contents of Pj's program counter. The processors operate in lockstep, i.e., they are synchronized to execute their corresponding instructions simultaneously (in one unit of time). Afterwards, they immediately advance to the execution of their next instruction indicated by their corresponding program counter. The program execution terminates when P0 halts and the input is accepted only when the accumulator of P0 contains “1”. The execution time is defined as the total number of instructions (steps) executed by P0 during the computation.

Definition 2.1.2. Let M be a PRAM. M computes in parallel time t(n) and processors p(n) if for every input x ∈ {0, 1}^n, machine M halts within at most t(n) time steps and activates at most p(n) processors.

M computes in sequential time t(n) if it computes in parallel time t(n) using 1 processor.

Definitions 2.1.1 and 2.1.2 form the basis of the PRAM. By elaborating on the specifics of the model, we derive a number of well-known PRAM variations. Naturally, the first question that follows the above description concerns the memory policy of the model: what happens when more than one processor tries to access the same cell of the shared memory? As shown below, simultaneous access to shared memory can be arbitrated in various ways.

Memory Policies of PRAMs

Generally, we consider that the instruction cycle separates shared memory reads from writes [1]. Each PRAM instruction is executed in a cycle with three phases: 1) a read phase –if any– from the shared memory is performed, 2) a computation –if any– associated with the instruction is done, and 3) a write –if any– to shared memory is performed. This convention eliminates read/write conflicts to the shared memory, but it does not eliminate read/read and write/write conflicts. These conflicts are resolved based on concepts such as exclusiveness, concurrency, priority, etc. More specifically, we have the following model variations [2]:

• EREW-PRAM: exclusive-read, exclusive-write PRAM. Only one processor can read the contents of a shared cell at any time instant. Similarly, only one processor can write to a shared cell at any time instant (conflicts result in execution halting and rejection).

• CREW-PRAM: concurrent-read, exclusive-write PRAM. This model corresponds to the first definition of the PRAM and it is considered as the default variation of the model. Any number of processors can read from the same memory location, but only one can write to a shared cell at any time instant.

• ERCW-PRAM: exclusive-read, concurrent-write PRAM. Only one processor can read from a cell, but many processors can write to a cell at any time instant (not a very rational convention, considering that CW technology should render CR even simpler).

• CRCW-PRAM: concurrent-read, concurrent-write PRAM. Any number of processors can read/write to any cell at any time instant. It is the most powerful of the model variations.

Besides EW, more restricted models exist, which are based on the concept of ownership [6]. These are called the owner write (EROW/CROW) models and they avoid conflicts by having each processor own one cell to which all its write operations are performed.

In the case of concurrent writes, we must further define a policy determining the exact data to be written in the requested cell [2] (a code sketch of these resolution rules follows the list):

• undefined CRCW-PRAM: an undefined value is written in the cell.

• detecting CRCW-PRAM: a special code “detected collision” is written.

• random CRCW-PRAM: one of the offered data is chosen at random.

• common CRCW-PRAM: the write succeeds only when all the offered data are equal.

• max/min CRCW-PRAM: the largest/smallest datum is written.

• reduction CRCW-PRAM: a logical/arithmetic operation (and/or/sum) is performed on the multiple data and the resulting value is written in the cell.

• priority CRCW-PRAM: based on a predetermined ordering (e.g., that of the PIDs), the processor with the highest priority writes its datum in the cell.
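The following minimal C sketch (not from the thesis; the names and the enum are illustrative) shows how a simulator might resolve a set of values offered concurrently for one shared cell under four of the above policies:

enum cw_policy { CW_COMMON, CW_MAX, CW_SUM, CW_PRIORITY };

/* vals[] holds the data offered by the n writers, pids[] their processor IDs
   (a lower PID is assumed to mean higher priority).
   Returns the value to store; for CW_COMMON, *ok is cleared when the offers differ. */
long resolve_write(enum cw_policy pol, const long *vals, const int *pids, int n, int *ok)
{
    long result = vals[0];
    int best = 0;
    *ok = 1;
    for (int i = 1; i < n; i++) {
        switch (pol) {
        case CW_COMMON:                                  /* common: all offers must agree */
            if (vals[i] != result) *ok = 0;
            break;
        case CW_MAX:                                     /* max/min: keep the largest offer */
            if (vals[i] > result) result = vals[i];
            break;
        case CW_SUM:                                     /* reduction: combine the offers */
            result += vals[i];
            break;
        case CW_PRIORITY:                                /* priority: lowest PID wins */
            if (pids[i] < pids[best]) { best = i; result = vals[i]; }
            break;
        }
    }
    return result;
}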

Computing Functions with PRAM

Besides accepting languages, PRAMs can be used to compute functions. For such computations we must slightly modify the output of the model: we equip the PRAM with a set of output registers (similar to the input registers). We say that a machine M computes a function f(x) = y with x, y ∈ {0, 1}^* if, whenever it is started with x in its input registers, it will eventually halt holding y in its output registers. Note that the use of I/O registers allows the study of sublinear time [4].


Another input/output convention [1] is to present an input x ∈ {0, 1}^n to the machine M by placing the integer n in the first shared memory cell C0 and the bits x1, x2, . . . , xn in cells C1, C2, . . . , Cn. Similarly, M will present its output y ∈ {0, 1}^m by placing m in C0 and the bits y1, y2, . . . , ym in cells C1, C2, . . . , Cm.

Definition 2.1.3. Let f be a function from {0, 1}^* to {0, 1}^*. The function f is computable in parallel time t(n) and processors p(n) if there is a PRAM M that on input x outputs f(x) in time t(n) and processors p(n).

Nondeterminism in PRAM

Another important aspect of the PRAM is nondeterminism. A nondeterministic PRAM can be defined as a collection of nondeterministic RAMs operating in parallel. To be more precise, consider that each instruction πi of the PRAM program Π is labeled. If some label appears more than once in Π, then we have a nondeterministic PRAM M [4]. Alternatively, we can envisage the nondeterministic PRAM as a program Π with uniquely labeled instructions πi, where each JUMP instruction can branch to more than one label (e.g., JUMP L1,L5,L7). This definition is analogous to the definition of the nondeterministic Turing Machine (TM), where the transition function of the DTM becomes a relation allowing the NTM to branch from a single configuration to many distinct configurations at any step of the computation. Similarly to the NTM, the input to the NPRAM is accepted only if there exists a computation path in which processor P0 halts with a “1” in its accumulator. Time and processor complexity for the NPRAM is defined as in definition 2.1.2.

SIMDAG

The Single Instruction stream Multiple DAta stream Global memory machine (SIMDAG) was introduced independently in 1978 [7]. However, it can be considered as the special case of the PRAM where the program counter is common for all processors, i.e., all active processors execute the same instruction πi at each step of the computation. Note that each processor may operate on different data values than the other processors (multiple data stream). The original SIMDAG corresponds to a CREW machine with a control unit common for all processors.


2.1.2 PRAM Extensions

The ordinary PRAM is a good model for expressing the logical structure of parallel algorithms and is very useful for their gross classification. However, the abstraction level of this model hides important performance bottlenecks that appear in real world applications. The PRAM assumes unit cost access to the shared memory (independent of the number of processors), infinite bandwidth for the memory-to-processor communication (and thus for the processor-to-processor communication), and makes many more unrealistic assumptions. Such conventions might lead to the development of algorithms with a very degraded performance when tested in practice. To reduce the theory-to-practice disparity, numerous modifications and extensions have been added to the PRAM model during the past three decades. The following paragraphs present the most prominent of these models.

PRAM with Memory Latency

The most common criticism of the simple PRAM model is the unit cost access to the shared memory. Under this assumption, the PRAM does not discourage the design of algorithms which access the shared memory continuously (nor the superfluous interprocessor communication via the shared memory). It charges the same cost to any instruction independently of the number of the active processors. However, the implementation of algorithms that read/write excessively to the shared cells would lead any real world system to memory contention. Therefore, the cost of a memory access would increase and the actual performance of the algorithm would diverge from its theoretical estimation1.

A shared memory access in real multiprocessor systems consumes tens to thousands of instruction cycles. Whichever the underlying interconnection network is, the delay increases with the number of the active processors (either because of the message contention, or because of the expansion of the hardware to support larger communication paths). Such a phenomenon should be taken into account by any theoretical model trying to capture the behavior of a parallel algorithm and lead to efficient solutions.

A first approach is to equip the PRAM with a parameter δ defining the delay of the processor-to-memory communication, measured in elementary steps of the processors (instruction cycles) [2]. Since δ is a function of the number of processors |P| and the underlying interconnection, it should be determined prior to the design or the analysis of a parallel algorithm for the PRAM. In the resulting model the cost of a read/write operation to the shared memory is δ and not unit (unless δ = 1), i.e., we have a “non-uniform memory access” model. Consequently, the algorithm designer should attempt to minimize the global communication by judicious use of the corresponding read/write instructions in the program Π.

1 Notice that the memory policies discussed in the previous subsection deal with contention for a single shared cell, not with contention for the entire shared memory.

A second approach –a generalization of the first– is given in [8]. The key idea is based on an observation regarding the memory service in real world computers. When a processor accesses a block of shared cells, typically, it takes a substantial period of time to access the first cell, but after that, subsequent cells can be accessed quite rapidly. This might occur because of the data caching techniques or because of the policies employed for congestion control (in message passing systems).

The Block Parallel Random Access Machine (BPRAM) is in many ways similar to the pure PRAM. It includes a shared memory of unbounded size and a number of processors |P| with their own private memories (also of unbounded size). The instructions and the I/O conventions are the same for both models. The basic differences lie in the number of processors, which is fixed for the BPRAM, and in the accessing of the shared memory: the LOAD-STORE instructions of the BPRAM refer to a whole block of the shared memory (a number of contiguous cells) of length b (b = 1 for a single cell). The cost of accessing such a block is l + b, where l is the latency of the memory service discussed above. The latency l and the number of processors |P| are the parameters of the BPRAM. To conclude with the definition, any number of processors may access global memory simultaneously, but in an exclusive-read-exclusive-write (EREW) fashion. In other words, blocks that are being accessed simultaneously cannot overlap.
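As a small illustration of this cost model (an inference from the l + b charge above, not a formula from [8]), the following C sketch shows how larger blocks amortize the start-up latency when n contiguous cells are transferred (assuming, for simplicity, that b divides n):

/* cost of one BPRAM access to a block of b contiguous shared cells */
long bpram_block_cost(long l, long b)
{
    return l + b;                              /* one start-up latency, then one step per cell */
}

/* cost of moving n contiguous cells using blocks of size b (n assumed divisible by b) */
long bpram_bulk_cost(long l, long b, long n)
{
    return (n / b) * bpram_block_cost(l, b);   /* = n + (n/b)*l, so larger b pays the latency less often */
}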

Asynchronous PRAM

Synchronization is an important consideration in massively parallel machines, of great concern to the hardware and software communities, but largely ignored by the theory community. The simple PRAM assumes that the processors execute in lockstep. However, the construction of a real machine with this specification becomes impractical for a large number of processors. This is because of the various problems that arise with the synchronization of the processors, which depends on the execution time of their instructions. To determine the length of the processors' lockstep, one must study worst case scenarios taking into account phenomena such as the clock skew, the interconnection network congestion, the delay of a costly instruction (e.g., a floating point multiplication), etc. All these delay parameters add up to a large overhead increasing the execution time of a single computation step and decreasing the utilization of the real machine. The synchronous approach seems even more inefficient considering the potential functionality gained by a generalization of the PRAM: the Asynchronous PRAM (APRAM) model [9].

The APRAM is defined as a PRAM without a global clock, i.e., each processor can operate at its own speed. The memory-processors convention and the set of instructions remain the same for the APRAM, with the exception of a new type of instructions called the “synchronization steps”. A synchronization step among a set P of active processors is a logical point during the computation where each processor Pi waits for all the other processors in P to arrive before continuing with its own local program. The use of these instructions divides the execution of a program Π in phases, within which each processor runs independently from the others (and thus it cannot know the progress of the others).

The execution time of an APRAM program is determined by the execution time of its phases. Each phase is completed after all processors arrive at the synchronization step. Consequently, each phase consumes time equal to that of the last processor to arrive at the step plus the execution time of the step itself, which depends on the total number of the active processors.2

2 The cost of a synchronization instruction is not unit; it is B(|P|) where B is a nondecreasing function. This function is a parameter of the model (not fixed) allowing the analysis to adapt to the characteristics of the underlying real world machine.

Similar to the PRAM, the APRAM is a family of models that differ in the types of the synchronization steps, in the memory policy, and in the cost of accessing the memory. Regarding the memory policies, the APRAM supports both exclusive and concurrent read/writes (EREW, CREW, CRCW). Its major difference from the PRAM is the inability of a processor to read a shared cell while another processor writes to it. Since the APRAM misses the lockstep, it is impossible to use the three phases (read-compute-write) described in the previous section to resolve the read/write conflict. Instead, we can only use a synchronization step involving both processors between the two accesses. With this convention, the APRAM eliminates the possibility of race conditions among processors. Regarding the access cost, the APRAM can account for a communication delay to the shared memory or not (unit cost access). A commonly used assumption is that a read operation consumes 2d time while a write operation consumes d time, where d is a parameter of the model. Regarding the types of synchronization, the APRAM can be modified to support subset synchronization where multiple disjoint sets of processors can synchronize independently and in parallel. The cost for a synchronization step among the processors in a set P′ is charged only to those processors in P′. The case P′ = P is called Phase PRAM, denoting the all-processor synchronization.
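A minimal sketch of the phase-cost accounting described above (illustrative names; B is the model's synchronization cost function, supplied by the analyst since the model leaves it open):

/* duration of one APRAM phase: the slowest processor plus the synchronization step itself.
   t[i] is the local running time of processor i within the phase; B(p) is nondecreasing. */
long apram_phase_time(const long *t, int p, long (*B)(int))
{
    long slowest = 0;
    for (int i = 0; i < p; i++)
        if (t[i] > slowest)
            slowest = t[i];       /* everyone waits for the last processor to arrive */
    return slowest + B(p);        /* plus the cost of the synchronization step */
}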

Shared Memory Divided in Modules

There are even more specialized criticisms about the shared memory assumptions of the PRAM. Consider for example the ability to access an unlimited number of memory cells simultaneously (potentially all of them). Even though it is not technologically impossible, it is quite costly to support such functionality and it is not the trend of today's industry. The most common memory designs lead to a memory module consisting of a set of individual cells, which however are grouped under a common interface. As a result, only one cell of the module can be accessed at a time. A more practical model should take into account this restriction and assume a shared memory organized in modules.

In the Module Parallel Computer (MPC) [10] the number of memory modules m is bounded3, constituting a parameter of the model along with the number of RAM processors |P|. Each module mi can be accessed by the processors in an EREW fashion. Therefore, the challenge is to distribute the logical addresses of the entire shared memory so as to minimize the potential memory conflicts between processors without stalling the execution of the algorithm. Of course, it is expected that fewer modules lead to more access delays (because of the exclusiveness restriction) even though the interconnection network is idealized in the MPC.

3 Assuming an unbounded set of memory modules brings us back to the original definition of the PRAM.

The mapping of a parallel algorithm to the MPC is called the granularity of parallel memories problem. The authors in [10] propose two generic solutions: (i) randomization, where the logical addresses are randomly distributed among the memory modules by selecting a hash function which can be efficiently computed by any processor, (ii) copies, where several copies of each logical address are stored in distinct memory modules (storage redundancy). The randomization solution is shown to keep memory contention low on average, while the copies solution decreases memory contention in the worst case.
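A minimal sketch of the randomization idea (the hash function and constant below are illustrative, not the one analyzed in [10]): every processor applies the same cheap hash to a logical address to decide which of the m modules holds it.

#include <stdint.h>

/* map a logical shared-memory address to one of m memory modules;
   any hash that every processor can evaluate cheaply and identically will do */
static unsigned module_of(uint64_t logical_addr, unsigned m)
{
    uint64_t h = logical_addr * 0x9E3779B97F4A7C15ULL;   /* multiplicative mixing */
    return (unsigned)(h % m);                            /* module index in [0, m) */
}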

In a more radical approach, [11] describes a parallel machine with a shared memory divided in modules and arranged in a tree. The leaves of the tree correspond to the processors, while the levels of the tree capture the caching techniques applied to most real world machines. The processors communicate directly with their parent memory modules and, similarly, memory modules communicate with their children and parent modules by exchanging blocks of values. The length of a block depends on the level of the tree at which the exchange takes place (typically, the length doubles at each level towards the root). The model can support several memory policies, namely EREW, CREW and CRCW. The parameters of the so called Uniform Parallel Memory Hierarchy (UPMH) define the number of the modules, the depth of the memory tree, the communicated blocksize, and even the transfer costs. In the same approach, non-uniform communication costs of various interconnection topologies can be modelled by combining several levels of a PMH. The idea to use such a model for parallel computation was inspired by the observation that the techniques employed to optimize sequential algorithms destined for memory hierarchies are similar to those for developing parallel algorithms.

2.2 Distributed Memory Models

The distributed memory model of computation assumes a set of autonomous processors with ordinary instruction sets and local memories. Additionally, each processor can send and receive finite length messages with the use of special purpose instructions. As opposed to the shared memory models, the processors here are not connected to a common memory. Rather, they are connected to a –common– communication medium (or, more precisely, interconnected via a specific network). Hence, the processors can communicate directly with each other by exchanging messages. The computation advances in a sequence of steps involving local operations, communication, and possible synchronization. Notice that the memory of the parallel machine is distributed over the local memories of its processors. Figure 2.2 illustrates such a setting.

Figure 2.2: Distributed memory model: a network of processors

The distributed memory model is widely used today for the development of supercomputers. The design and study of algorithms for this model strongly depend on the characteristics of the underlying interconnection network of the machine, i.e., on various machine-specific details. On one hand, these details define a large number of architectures allowing the engineer to develop efficient solutions tailored to the requirements of the application. On the other hand, the existence of a large number of network architectures renders the unified theoretical study of algorithms in this model quite troublesome. Naturally, “bridging” models have been proposed which abstract away the details of the network and allow the analysis of the algorithm to be based only on a limited number of parameters. In this direction, the following subsection presents the well-known BSP model.

2.2.1 BSP

The apparent need for a unifying model for parallel computation led Valiant to define the “Bulk Synchronous Parallel” model (BSP) [12]. The major purpose of such a model is to act as a standard on which people can agree. It is intended neither as a hardware nor as a programming model, but something in between, analogously to the von Neumann model in sequential computation. The von Neumann model abstracts the underlying technology of a computer. Even with rapidly changing technology and architectural ideas, hardware designers can still share the common goal of realizing efficient von Neumann machines. They are not too concerned about the software that is going to be executed. Similarly, the software industry in all its diversity can aim to write programs that can be executed efficiently on this model, without explicit consideration of the hardware. Thus, the von Neumann model is the connecting bridge that enables programs from the diverse and chaotic world of software to run efficiently on machines from the diverse and chaotic world of hardware.

The BSP is a bridging model in the realm of parallel computation. It consists of a number of components performing processing and/or memory functions and a router delivering messages point-to-point between pairs of components. The router abstracts the underlying interconnection network of the components by introducing certain parameters in the model, described below. Moreover, the BSP incorporates a synchronization mechanism similar to the one used in the shared memory APRAM (see section 2.1.2): the processing components are synchronized at regular time intervals of L time units, where L is the periodicity of the computation. Synchronization is performed by the Barrier Synchronizer of the model (a separate entity).

The structure of the BSP computation reflects the way that a parallel programmer thinks: perform some local computation, exchange information required for the next local computation, and synchronize the processes to assure the correctness of the algorithm. To be more precise, a BSP computation consists of a sequence of supersteps. In each superstep, each component is allocated a task consisting of some combination of local computation steps, message transmissions and message arrivals from other components. Conceptually [13], the superstep is divided in three phases: the local computation phase, the communication phase, and the barrier synchronization. Each processor can be thought of as being equipped with an output pool, into which outgoing messages are inserted, and an input pool, from which incoming messages are extracted. During the local computation phase, a processor may insert messages into its output pool, extract messages from its input pool, and perform operations involving data held locally. During the communication phase, every message held in the output pool of a processor is transferred to the input pool of its destination processor. The previous contents of the input pools, if any, are erased. The superstep is concluded by the barrier synchronization. Every L time units, a global check is made to determine whether the superstep has been completed by all the components (i.e., all local computations are completed and every message has reached its destination). If it has, the machine proceeds to the next superstep. Otherwise, one more time period L is allocated to the unfinished superstep. Note that [14], although the model emphasizes global barrier style synchronization, pairs of processors are always free to synchronize pairwise by, for example, sending messages to and from an agreed memory location. These message transmissions, however, would have to respect the superstep rules.

BSP parameters and characteristics

In the BSP model we assume that each local operation costs one unit of time. The task of the router is to realize an arbitrary h-relation, i.e., a superstep in which each component sends and receives at most h messages. We properly define a BSP computer by first determining three basic parameters4 [12] [15]:

p: the number of components

s: the startup cost of the h-relation realization (in time units)

g: the multiplicative inverse of the router’s throughput5

The parameters s and g describe the performance of the router. Their values depend on the underlying interconnection network and are nondecreasing functions of p. For example, the g parameter depends on the bisection bandwidth of the network, the communication protocols, the routing, the buffer management, etc. Similarly, the s parameter depends on software issues of each processor, the wait cycles for each synchronization step, and other characteristics of the network. Both s and g can be bounded in theory, but in practice they are empirically determined by running suitable benchmarks. Most of the factors affecting s and g become even more apparent as the number of processors increases. As a result, the time costs incurred by s and g increase with p. However, there is a way around this problem: the g parameter can be controlled, within limits, by adjusting the router design. It can be kept low by using more pipelining or by having wider communication channels. Keeping g low or fixed as the machine size p increases incurs, of course, other types of costs. In particular, as the machine scales up, the hardware investment for communication needs to grow faster than that for computation [12].

4 Alternatively, [14] uses the parameters p (processors), g (bandwidth) and L (periodicity).

5 More precisely, g must be regarded as the ratio of the number of local computational operations performed per second by all the processors to the total number of data words delivered per second by the router (alternatively, 1/g is the available bandwidth per processor). It is used to measure communication costs, i.e., we consider that an h-relation is realized at the cost of g · h time units for h larger than some h0.


The time complexity of a BSP algorithm is measured by summing the costs of its supersteps. The cost of a superstep can be defined in various ways, depending on specific assumptions made by the model. In the most common approach, we measure g when the router is in continuous use (i.e., the throughput of a pipelined architecture) and thus, the cost of realizing an h-relation becomes6 g · h + s. Consequently, when the communication operations do not overlap with the local computation steps at each processor, the cost of a superstep i becomes

Cost_i = g · h_i + s + m_i

where g, s are the model parameters, h_i is the number of the exchanged messages per processor (the h-relation), and m_i is the number of the computation steps per processor during superstep i. Note that h_i and m_i correspond to the maximum values over all processors.
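A minimal C sketch of this accounting (illustrative, with h[i] and w[i] taken to be the per-superstep maxima of messages and local operations over all processors):

/* total BSP running time: sum of Cost_i = g*h[i] + s + w[i] over all supersteps */
long bsp_total_cost(int supersteps, const long *h, const long *w, long g, long s)
{
    long total = 0;
    for (int i = 0; i < supersteps; i++)
        total += g * h[i] + s + w[i];   /* communication + barrier + local computation */
    return total;
}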

Before using the aforementioned cost function7, one should specify the values of its parameters according to the characteristics of the underlying –physical– machine. Intuitively [13], for sufficiently large sets of messages (h >> s/g), the communication medium must deliver p messages every g units of time. Parameter s must be an upper bound for the time required for global barrier synchronization (m_i = 0, h_i = 0). Moreover, g + s must be an upper bound to the time needed to route any partial permutation and therefore to the latency of a message in the absence of other messages (m_i = 0, h_i = 1). Practically, g and s have been measured for various real world machines and their values are given in [16].

6 The time to send h messages through the router is g · h and the time to initiate such a process is s.

7 Another approach worth mentioning is to charge each superstep as Cost_i = max(h_i + s, m_i). This function is suitable for models where the communication operations overlap with the local computations.

BSP example

A complete set of programming tools and specific C libraries has been developed for compiling and running BSP programs [17]. Below we give a simple BSP example [16]; the program bsp_sum() computes the sum of p integers on p processors (initially held in distinct processors). The computation is performed using the following logarithmic technique. At each algorithmic step, processor pi adds the value sent by pi/2 to its local partial sum, and forwards the result to p2i. At the end of the computation, the rightmost processor, p − 1, holds the result. Each algorithmic step is implemented as a BSP superstep. A total of ⌈log p⌉ supersteps is required.

int bsp_sum(int x)
{
    int i, left, right;

    bsp_pushregister(&left, sizeof(int));     /* make 'left' remotely writable */
    bsp_sync();

    right = x;                                /* local partial sum */
    for (i = 1; i < bsp_numofprocs(); i = i*2)
    {
        if (bsp_pid() + i < bsp_numofprocs())
            bsp_put(bsp_pid() + i, &right, &left, 0, sizeof(int));
        bsp_sync();                           /* end of the superstep */
        if (bsp_pid() >= i)
            right = right + left;             /* fold in the received partial sum */
    }

    bsp_popregister(&left);
    return right;
}

The above code sample is copied and executed at each BSP processor concurrently. The function bsp_pid() gives a unique ID to the processor which called it. The bsp_sync() implements the barrier synchronization of the BSP. The bsp_pushregister() declares the variable left as a storage space which will be used by other processors for writing (for sending data to the current processor). The variable right of the processor pi is initially given the input value of pi and is used during the execution to accumulate the partial sums. The bsp_put() is used for communication: it will copy the right value of processor bsp_pid() to the variable left of processor bsp_pid()+i. Finally, bsp_popregister() removes the global visibility of the variable left.

The cost of this algorithm is ⌈log p⌉(1 + g + s) + s, because there are ⌈log p⌉ supersteps for computations and one for registration. Notice that at each superstep we have one local addition (m = 1) and only one word is communicated between any pair of processors (h = 1).

2.2.2 LogP

The BSP was an inspiration for the authors in [18] to introduce a similar model named LogP (the name is derived from the four parameters of the model, which will be described next). Targeting similar goals, the LogP was designed taking into account the technology trends underlying parallel computers8. It is intended to serve as a basis for developing fast portable parallel algorithms, as well as to offer guidelines to hardware designers. LogP, like BSP, attempts to strike a balance between detail and simplicity by using certain parameters to abstract the communication between the processors.

8 In fact, the LogP authors in [18] discuss technological aspects and make accurate predictions of how Massively Parallel Processors would be designed throughout the 90's.

LogP is a distributed memory multiprocessor in which processors communicate by point-to-point messages. Contrary to the BSP, LogP is asynchronous, i.e., it features no barrier synchronization. Two processors can synchronize only by exchanging messages between them. LogP uses parameters for modeling the performance of the interconnection network, but it does not describe the structure of the network. It extends BSP by using one more parameter and by imposing a network capacity constraint, i.e., up to a maximum number of messages can be in transit concurrently.

Conceptually [13], for each processor there is an output register where the processor puts any message to be submitted to the communication medium. The “preparation” of a message requires a certain amount of time and, once submitted, the message is “accepted” by the communication medium. It will be delivered to its destination with a certain delay. When a submitted message is accepted, the submitting processor reverts to the operational state, where it continues with its local operations. Upon arrival, a message is promptly removed from the communication medium. This message can be immediately processed by the receiving processor or it can be buffered. As with the preparation of an outgoing message, the acquisition of an incoming message requires a certain amount of time, which is modeled by the new parameter of LogP, named “overhead”.

LogP parameters and characteristics

Once again, in the LogP model we assume that each local operation costs one unit of time (one “cycle”). We formally define the model by defining its four parameters:

L: the latency, the delay of a message to reach its destination

o: the overhead, the time a processor engages in a transmission or reception

g: the gap, minimum time between message transmissions (per processor)

P : the number of components (i.e., processor/memory modules)

The parameters L, o and g are measured as multiples of the processor cycle. More precisely, the “latency” is an upper bound on the time required for a single word message to travel from its source to its destination. It depends mostly on the underlying interconnection network and it can be calculated as L = H·r + ⌈M/w⌉, for an M-bit message traveling through w-wide channels, across H hops with r delay each. The “overhead” is a period during which the processor is engaged in sending/receiving messages and cannot perform any other operations. Notice that o does not include L and that it is mostly related to the underlying technology of the processor. It is regarded as the mean overhead (T_snd + T_rcv)/2. The “gap” is a parameter used to model the available per processor bandwidth. Since the maximum speed at which a processor can send messages is one every g time units, the reciprocal of g corresponds to the bandwidth. It can be calculated [15] as g = P·M/(w·W), where W is the bisection width of the network. Overall, the total time to communicate a message between two processors can be calculated as T = 2o + L = T_snd + H·r + ⌈M/w⌉ + T_rcv. In practice, the values of these parameters have been measured for several real world parallel machines, e.g., in [18].

From the above definition, it is clear that any processor can have no more than ⌈L/g⌉ of its messages traveling in the communication medium concurrently. In fact, the model assumes finite network capacity, such that no more than ⌈L/g⌉ messages can be in transit from any processor, or to any processor, at any time. If a processor attempts to transmit a message that would exceed this limit, it stalls until the message can be sent without exceeding the capacity limit. Notice here that the model does not ensure that the messages will arrive in the same order that they were sent.
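The basic LogP bookkeeping used above can be written down directly; a minimal sketch (illustrative helper names) for the cost of one small message and the per-processor capacity bound:

/* end-to-end time of a single small message: sender overhead, wire latency, receiver overhead */
long logp_msg_time(long L, long o)
{
    return 2*o + L;
}

/* capacity constraint: at most ceil(L/g) messages in transit from (or to) one processor */
long logp_capacity(long L, long g)
{
    return (L + g - 1) / g;      /* integer ceiling of L/g */
}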

The above parameters are not considered equally important in all situations [18]. For an easier algorithm analysis, it is possible to ignore one or more parameters and work with a simpler model. For example, in algorithms that communicate data infrequently, it is reasonable to ignore the bandwidth and capacity limits. In some algorithms messages are sent in long streams which are pipelined through the network, so that message transmission time is dominated by the inter-message gaps, and the latency may be disregarded. Also, notice that g and o can be merged in one parameter without altering much the results of the analysis [18]9.

⁹ Approximation by a factor of at most 2 (if we consider g = o). Arguably, o is unnecessarily inserted in the model [15] [16]. Actually, the LogP authors hope that technology (off-the-shelf processors) will eventually eliminate o [18].


LogP example

Arguably [13] [16], LogP provides a less convenient programming abstraction compared to BSP. The analysis of a LogP algorithm is somewhat less straightforward, primarily due to the lack of the synchronization barriers. As an example, we discuss here the simple problem presented in the previous section (for the BSP case): the optimal summation of P integers.

If we were to write a code sample for logp_sum(), we would certainly omit the bsp_sync() function used in the BSP case. Instead, we would take for granted that any message arrives after at most L time units.

As with BSP, the LogP processors will gradually construct a tree for communicating their values. The idea [18] is that each processor will sum a set of the input elements and then (except for the root processor) it will transmit the result to its parent as quickly as possible (we must ensure that no processor receives more than one message). The elements to be summed by a processor consist of original inputs stored in its memory, together with partial results received from its children in the communication tree. The main difference from BSP is that the LogP tree will not be binary¹⁰. It will be an unbalanced tree, with the fan-out of each node determined by the values L, o, g and the following analysis.

¹⁰ The root of the tree will have more children than other nodes nested deeper in the tree. The reason is that, as a message travels, the source processor has time to send new messages.

To specify an optimal algorithm, we must determine (off-line) the optimal schedule of communication events and then determine the distribution of the initial inputs over the P processors. We start by considering how to sum as many values as possible within a fixed amount of time T. If T < L + 2o, the optimal solution is to sum T + 1 values on a single processor, since there is not sufficient time to receive data from another processor. Otherwise, the last step performed by the root processor (at time T − 1) is to add a value it has computed locally to a value it just received from another processor. The remote processor must have sent the value at time T − 1 − L − 2o and we assume recursively that it forms the root of an optimal summation tree with this time bound. The local value must have been produced at time T − 1 − o. Since the root can receive a message every g cycles, its children in the communication tree should complete their summations at times T − (2o + L + 1), T − (2o + L + 1 + g), T − (2o + L + 1 + 2g), and so on. The root performs g − o − 1 additions of local input values between messages, as well as the local additions before it receives its first message. This communication schedule must be modified by the following consideration: since a processor invests o cycles in receiving a partial sum from a child, all transmitted partial sums must represent at least o additions.
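The recurrence just described can be turned into a small program. The sketch below is our own simplified accounting (not taken from [18]): below the threshold L + 2o a single processor adds one value per cycle, and otherwise the root recursively collects partial sums whose deadlines are spaced g cycles apart, spending o cycles of overhead plus one addition per received message and using all remaining cycles for local additions. It gives a rough estimate of how many values fit within a time bound T, for illustrative parameter values.

    #include <stdio.h>

    /* LogP parameters (illustrative values, not from the text) */
    static const long L = 6, o = 2, g = 4;

    /* Rough estimate of how many input values can be summed within T cycles
       on an unbounded number of processors, following the recurrence in the
       text under a simplified accounting (receiving a partial sum costs o
       cycles of overhead plus one addition at the root).                   */
    static long max_sum(long T) {
        if (T < L + 2 * o)
            return T + 1;                 /* one processor: initial value + one add/cycle */
        long count = 1;                   /* the value the root starts with               */
        long receptions = 0;
        for (long k = 0; ; k++) {         /* children finish at T-(2o+L+1)-k*g            */
            long deadline = T - (2 * o + L + 1) - k * g;
            if (deadline < 0) break;
            count += max_sum(deadline);   /* values carried by that partial sum           */
            receptions++;
        }
        long busy = receptions * (o + 1); /* receive overheads + adds of partial sums     */
        if (T > busy) count += T - busy;  /* remaining cycles: local additions            */
        return count;
    }

    int main(void) {
        for (long T = 1; T <= 40; T += 5)
            printf("T=%2ld  values=%ld\n", T, max_sum(T));
        return 0;
    }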

A good LogP algorithm should coordinate work assignment with data placement, provide a balanced communication schedule, and overlap communication with processing.

2.2.3 Fixed-Connection Networks

The BSP and LogP models presented above abstract the structure of the interconnection network of the processors. They introduce certain parameters to measure the communication cost during the computation, without taking into account the relative location of the processors that exchange messages. Such a uniform approach simplifies the performance analysis and, moreover, is accurate in various situations (e.g., Ethernet, or fully connected networks). However, in many applications, a parallel machine includes only a limited number of internal connections between predetermined pairs of processors. Communication is allowed only between directly connected processors, while remote-destination messages are explicitly forwarded through intermediate nodes. Here, studying the exact structure of the interconnection network becomes very important for the design of an efficient parallel machine.

In the fixed-connection network model the parallel machine is represented as a graph G, where the vertices correspond to processors and the edges correspond to direct communication links [2]. The computational power of each processor may vary, although it is common to assume that they involve low-complexity control and limited local storage [19]. Depending on the architecture, the computation can be either synchronous or asynchronous. In the first case, we assume that a global clock signal traverses the entire graph determining the steps of the computation. At each step, each processor can receive data from its neighbors, read its local storage, perform local computations, update its local memories, or generate output data/messages. In the case of an asynchronous computation, there is no global clock. Instead, a processor can send or receive messages from its neighbors at any time instant. The communication might be blocking or non-blocking (i.e., the sender suspends its operations until the receiver sends back a response, or the sender continues independently of the receiver) and the activity of the machine is coordinated via designated messages (coarse-grained synchronization).


Clearly, an important aspect in the development of an algorithm for the fixed-connection network model is the design of a scheme for the inter-processor communication within the parallel machine. In the general case, the messages have to be transferred over a number of intermediate edges before reaching their destination. The message transfer is handled by the processors between the source and the destination of the message, based on a predetermined procedure. Such a procedure is called routing and is of great importance for the performance of the machine. Usually, several routing problems have to be solved just to implement a single parallel algorithm. A routing problem is defined as a set of messages with specific destinations, which are either stored initially in distinct nodes of the network, or are generated dynamically during the computation. Targeting solutions for distinct problems and network topologies, several routing algorithms have been proposed in the literature: online or offline, deterministic or randomized, with and without queues, greedy, flooding, etc. [19].

In the network model, parallel machines differ primarily with respect to their interconnection topology. Among others, the topology of the network determines the implementation cost, the communication delays, and hence, the efficiency of the parallel algorithm. The design of the network depends on the application and, naturally, a plethora of topologies has been presented in the literature over the years. In fact, it can be shown that the machines benefit greatly from carefully tailored networks with judicious processor-to-node mappings. Overall, considering the performance analysis of each network, the available routing algorithms, and the various techniques for cross-simulating networks, one can be led to an immense area of study (beyond the scope of this thesis) [19]. In the context of this review chapter, the following subsection presents a few of the most commonly used fixed-connection networks.

Common network topologies

We begin with the definition of three parameters characterizing the topology of the network, i.e., of the graph G: the diameter, the maximum degree and the bisection width (or edge connectivity) [20] [2]. The diameter of G is the longest of the shortest paths between any pair of nodes of G (max-min path). The maximum degree corresponds to the maximum number of edges incident to any node of G. The bisection width measures the minimum number of edges that need to be removed to partition G into two disjoint subgraphs of equal size (i.e., ||G1| − |G2|| ≤ 1)¹¹. Regarding the practical effect of these parameters, the diameter of the network affects the communication delay, especially when the message routing is performed in a store-and-forward fashion (each node buffers the entire message before retransmitting it, instead of continuously forwarding its parts as smaller packets). The bisection width is related to the bandwidth of the network (it depends on the bandwidth of each link separately) and is important for communication-intensive algorithms where the message destinations follow a uniform distribution over the nodes. The degree of G has an impact on the implementation cost of the nodes. Examples for these three parameters are given below [19] [2].

¹¹ The 'bisection problem' is NP-hard, as opposed to the 'mincut' problem, which can be solved efficiently by using flow techniques (the mincut places no constraint on the sizes of the partitions).

Undoubtedly, the simplest fixed-connection network is the 1D Mesh, or linear processor array [19]. It consists of p processors P1, P2, . . . , Pp connected in a linear array, i.e., Pi is connected to Pi−1 and Pi+1. If processors P1 and Pp are additionally connected directly, with wraparound wires, we get the definition of a Ring, or 1D Torus. Note that the diameter of the 1D array is p − 1, while that of the ring is ⌊p/2⌋. The bisection width is equal to 1 for the array and equal to 2 for the ring. Both networks have maximum degree equal to 2. The construction of these networks can be easily generalized to two, three, or more dimensions (see figure 2.3).

The r-dimensional Butterfly has N = 2^r·(r + 1) nodes and r·2^(r+1) edges (undirected graph) [19]. Each node is labeled, virtually, with a unique ⟨w, i⟩ bit-string, where w is an r-bit number denoting the row of the node and i, 0 ≤ i ≤ r, denotes the stage (the column) of the node. Two nodes ⟨w, i⟩ and ⟨w′, i′⟩ are connected if and only if i′ = i + 1 and either (i) w = w′, or (ii) w differs from w′ in exactly the i′-th bit. Notice the recursive structure of the butterfly network (its butterfly subgraphs) and the uniqueness of the paths from any "input" to any "output" node of the graph (see figure 2.3).
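For concreteness, the butterfly connection rule can be checked mechanically. The hypothetical helper below tests whether two labeled nodes ⟨w, i⟩ are joined by a butterfly edge; it assumes that "the i-th bit" means the i-th least significant bit of the row label, which is one of several equivalent labeling conventions.

    #include <stdbool.h>
    #include <stdio.h>

    /* A butterfly node <w, i>: w is an r-bit row label, i the stage, 0 <= i <= r. */
    struct bfly_node { unsigned w; unsigned i; };

    /* Nodes <w,i> and <w',i'> are adjacent iff i' = i+1 and either w = w'
       or w and w' differ exactly in bit i' (bits numbered from 1, LSB first). */
    static bool bfly_adjacent(struct bfly_node a, struct bfly_node b) {
        if (b.i != a.i + 1) return false;
        unsigned diff = a.w ^ b.w;
        return diff == 0u || diff == (1u << (b.i - 1));
    }

    int main(void) {
        struct bfly_node u = { 2 /* 010 */, 1 }, v = { 0 /* 000 */, 2 };
        printf("adjacent: %d\n", bfly_adjacent(u, v));  /* rows differ in bit 2 -> 1 */
        return 0;
    }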

The butterfly is a widely used network and it features certain variations. The Wrapped Butterfly is constructed, essentially, by merging the first and the last stages of an ordinary butterfly network. That is, we merge two nodes (one input and one output node) when they belong to the same row of the initial butterfly. Hence, every node of the resulting network has degree equal to 4. Note that the two networks have similar computational power, i.e., they can simulate each other with a slowdown factor ≤ 2 (the same holds for many networks and their wraparounds: the linear array and the ring, the mesh and the torus in 2D, etc.). In another variation, the Benes network is, essentially, two butterfly networks connected back-to-back (showing a reflection symmetry, fig. 2.3).

Figure 2.3: Outline and parameters of common fixed-connection networks

The r-dimensional Hypercube has N = 2^r nodes and r·2^(r−1) edges (undirected graph) [19]. Each node is labeled, virtually, with a unique r-bit binary string. Two nodes are connected if and only if their labels differ in exactly one bit (such a connection is called a dimension-k edge, where k is the position of the differing bit within the label). Therefore, each node has degree r. Conversely, an r-dimensional hypercube can be constructed from two (r−1)-dimensional hypercubes by connecting the nodes having the same label via a dimension-r edge (the labels in the resulting hypercube will have one extra bit, '1' for denoting the nodes of the first hypercube and '0' for denoting those of the second). Note that the hypercube can be viewed as a folded-up butterfly: each butterfly row corresponds to a hypercube node (consider merging each row of the butterfly into a single node and removing the resulting edge copies).
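The hypercube adjacency rule ("labels differ in exactly one bit") is a one-liner; the hypothetical helper below also reports the dimension of the connecting edge (bits counted from the least significant one).

    #include <stdio.h>

    /* Two hypercube nodes are adjacent iff their labels differ in exactly one bit. */
    static int hcube_adjacent(unsigned a, unsigned b) {
        unsigned d = a ^ b;
        return d != 0u && (d & (d - 1u)) == 0u;   /* d is a power of two */
    }

    /* Dimension k of the edge: position of the single differing bit (1-based). */
    static int hcube_dimension(unsigned a, unsigned b) {
        unsigned d = a ^ b;
        int k = 0;
        while (d) { k++; d >>= 1; }
        return k;
    }

    int main(void) {
        unsigned u = 5 /* 101 */, v = 7 /* 111 */;
        if (hcube_adjacent(u, v))
            printf("dimension-%d edge\n", hcube_dimension(u, v));  /* prints 2 */
        return 0;
    }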

The r-dimensional Cube-Connected Cycles (CCC) network is constructed from the r-dimensional hypercube by replacing each node with a cycle of r nodes. In such a cycle, each node is connected to a distinct edge of the initial hypercube (from those incident to the initial node, fig. 2.3). Overall, the r-dimensional CCC has r·2^r nodes, each with degree 3. Note that, from a computational point of view, the CCC, the Butterfly, and the Wrapped Butterfly are identical networks (compared to the hypercube, the CCC introduces a logarithmic slowdown [19]).

The r-dimensional Shuffle-Exchange network has N = 2^r nodes and 3·2^(r−1) edges (undirected graph). Each node is labeled, virtually, with a unique binary string of r bits, and two nodes are connected if and only if (i) their labels differ only in their last bit, or (ii) their labels are left or right cyclic shifts of each other. Edges of the former kind are called exchange edges, while those of the latter kind are called shuffle edges. The Shuffle-Exchange is closely related to the de Bruijn network, which can be obtained by contracting all the exchange edges of the former (we start with a shuffle-exchange of dimension r + 1 to derive a de Bruijn network of dimension r). Both graphs share very interesting properties, which can be used even to formulate card tricks [19].
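Again for illustration, both kinds of edges can be generated directly from a node's r-bit label. The hypothetical helpers below return the exchange neighbor and the shuffle neighbor obtained by a left cyclic shift, assuming "last bit" means the least significant bit.

    #include <stdio.h>

    /* Exchange neighbor: flip the last (least significant) bit of the label. */
    static unsigned exchange(unsigned w) {
        return w ^ 1u;
    }

    /* Shuffle neighbor: left cyclic shift of the r-bit label w. */
    static unsigned shuffle(unsigned w, int r) {
        unsigned msb = (w >> (r - 1)) & 1u;           /* bit shifted out    */
        return ((w << 1) | msb) & ((1u << r) - 1u);   /* keep only r bits   */
    }

    int main(void) {
        int r = 3;
        unsigned w = 5;                                /* label 101          */
        printf("exchange(101) = %u\n", exchange(w));   /* 100 = 4            */
        printf("shuffle(101)  = %u\n", shuffle(w, r)); /* 011 = 3            */
        return 0;
    }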


2.3 Circuit Models

The study of switching circuits dates back to the late 1930s, beginning with the papers of Shannon and, later, Lupanov. Shannon used Boolean algebra to design and analyze circuits, while Lupanov worked on the question of how many gates a circuit must have to perform certain tasks [21]. Thereafter, switching circuits and logical design developed rapidly both in theory and in practice, with notable contributions by Pippenger, Borodin, Ruzzo, Savage, Cook, Allender, and many others.

Like the fixed-connection networks of processors, circuits capture parallelism due to the ability of their network nodes to operate concurrently. The major difference between the two models lies in the reduced computational power of each node-gate; instead of processors, circuits use gates implementing single operations. The following subsections present various models and variations. We start from the simplest abstraction, namely the Boolean circuits, and, by successively introducing more possibilities and details, we conclude with the VLSI circuits, which are used to model modern technology.

2.3.1 Boolean Circuits

The PRAM is a very attractive model of parallel computation due to its simplicity and its natural parallel extension of the RAM model. However, designing at such a high level raises certain feasibility concerns. For example, does the PRAM model correspond to a physically implementable device? Is it fair to allow unbounded numbers of processors and memory cells? How reasonable is it to have unbounded-size integers in memory cells? Is it sufficient to simply have a unit charge for the basic operations? To expose issues like these, it is useful to have a more primitive model that is closely related to the realities of physical implementation. A perfect candidate for this purpose is the Boolean Circuit model [22] [1] [5].

The Boolean circuit is an idealization of real electronic computing devices. It abstracts their basic principles while, at the same time, it makes a compromise between simplicity and realism by ignoring many of the implementation details. Overall, a circuit consists of gates performing elementary logical functions and wires carrying information among the gates (figure 2.4). Formally, we let Bk = {f | f : {0,1}^k → {0,1}} denote the set of all k-ary Boolean functions, and we define


Figure 2.4: A Boolean Circuit (size=16, depth=4, width=3, n=5)

Definition 2.3.1. A Boolean Circuit Cn is a labeled, finite, directed acyclic graph. Each vertex v has a type τ(v) ∈ I ∪ B0 ∪ B1 ∪ B2. A vertex of type I is called an input and has zero indegree. The inputs of Cn are given as a tuple ⟨x1, x2, . . . , xn⟩ corresponding to n distinct vertices. A vertex with zero outdegree is called an output. The outputs of Cn are given as a tuple ⟨y1, y2, . . . , ym⟩ corresponding to m distinct vertices. Finally, any other vertex with τ(v) ∈ Bi has indegree i and is called a gate.

Notice here that a Boolean circuit is memoryless. Most often, the gates used are AND, OR, and NOT, and thus, the order of the inputs of each gate is irrelevant. We consider that, by inputting ⟨x1, x2, . . . , xn⟩ and outputting ⟨y1, y2, . . . , ym⟩, Cn realizes a certain function f : {0,1}^n → {0,1}^m.

The resources of interest are the size of the circuit Cn, i.e., the total number of vertices in Cn, and the depth, i.e., the length of the longest path in Cn from an input to an output node. Additionally, we can measure the width of the circuit, which corresponds to the maximum number of gate values that need to be preserved, excluding inputs, when evaluating the circuit level by level (we consider as level i of Cn the set of vertices which are located exactly i edges away from the input nodes of Cn). A circuit is said to be synchronous if all inputs to level i are outputs of level i − 1.

Similar to the description of Turing Machines via strings of predefined format, each circuit Cn is described by a string denoted C̄n. Such a "blueprint" can be expressed in various formats but, most naturally, as a graph and a list of vertex types (labels)¹². To be precise, we will describe here the standard encoding. Assume that we use the binary alphabet augmented with parentheses and commas. The descriptor C̄n is a sequence of 4-tuples of the form (v, g, l, r), followed by two strings, (x1, . . . , xn) and (y1, . . . , ym). For each vertex v of Cn, C̄n contains a distinct 4-tuple (v, g, l, r) at some arbitrary position within the encoding. The numbers l and r correspond to the vertices connected to the left and right inputs of v, respectively (all vertices are numbered uniquely and arbitrarily in the range 1, . . . , size(Cn)^O(1)). The number g denotes the type of the vertex v. In the two concluding strings, the number xi is the vertex number of the ith input, while yj is that of the jth output of Cn. Overall, C̄n is an easy-to-generate and easy-to-manipulate string of length O(size(Cn) · log(size(Cn))).

¹² Another common format is the straight-line program, a sequence of assignments to variables (wires) using instructions based on binary operators (as in Hardware Description Languages). In fact, a circuit is the DAG of an SLP [23].
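To make the encoding concrete, here is a small C sketch (our own illustration, not the thesis' standard encoding) that stores a circuit as a list of (v, g, l, r) tuples and evaluates it in a single forward pass, assuming the tuples are listed in topological order; the gate types INPUT, NOT, AND and OR are assumptions of the example.

    #include <stdio.h>

    enum gate_type { INPUT, NOT, AND, OR };

    /* One (v, g, l, r) tuple: vertex id v, type g, left/right input vertices l, r
       (0 means "unused", e.g. the second input of a NOT gate).                   */
    struct tuple { int v; enum gate_type g; int l, r; };

    /* Evaluate a circuit given as n_tuples tuples; val[v] holds the value of
       vertex v, and the input bits must be preloaded into val[] for the INPUT
       vertices. Vertices are assumed to be listed in topological order, so one
       forward pass computes every gate.                                          */
    static void eval_circuit(const struct tuple *c, int n_tuples, int *val) {
        for (int i = 0; i < n_tuples; i++) {
            const struct tuple *t = &c[i];
            switch (t->g) {
            case INPUT: break;                                     /* already set */
            case NOT:   val[t->v] = !val[t->l];             break;
            case AND:   val[t->v] = val[t->l] && val[t->r]; break;
            case OR:    val[t->v] = val[t->l] || val[t->r]; break;
            }
        }
    }

    int main(void) {
        /* (x1 AND x2) OR (NOT x1): vertices 1,2 inputs, 3 AND, 4 NOT, 5 OR */
        struct tuple c[] = {
            {1, INPUT, 0, 0}, {2, INPUT, 0, 0},
            {3, AND,   1, 2}, {4, NOT,   1, 0}, {5, OR, 3, 4},
        };
        int val[6] = {0};
        val[1] = 1; val[2] = 0;             /* x1 = 1, x2 = 0 */
        eval_circuit(c, 5, val);
        printf("output y1 = %d\n", val[5]); /* (1 AND 0) OR (NOT 1) = 0 */
        return 0;
    }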

At this point, we can make a general observation regarding circuit models of computation (not only Boolean ones). Circuits have a characteristic which significantly differentiates them from any of the aforementioned processor-based models. A circuit consists of a fixed number of fundamental elements, namely the gates. Since a gate processes a fixed number of information bits, a circuit can only process inputs of fixed length. Hence, in order to solve a problem we must assemble an infinite collection of well-defined circuits, one for each input length (contrast this to a TM, which can input a string of arbitrary length). To be precise, consider that the output length, m, is a function of the input length, n, and let us define

Definition 2.3.2. A Family of Boolean Circuits {Cn} is a collection of circuits, with each member Cn computing a function fn : {0,1}^n → {0,1}^m(n). The function computed by {Cn} is the function fC : {0,1}* → {0,1}*, defined by fC(x) ≡ f|x|(x).

One step further, we can use circuit families to decide languages. We say that L ⊆ {0,1}* is decided by the Boolean circuit family {Cn} computing fC : {0,1}* → {0,1} when x ∈ L iff fC(x) = 1.

Notice that the algorithmic solution (on a TM, a PRAM, a BSP machine, etc.) of a problem is a finite object. On the contrary, the circuit family for a problem is an infinite collection of finite objects. Inevitably, this peculiarity of the model renders the portability and the study of circuit solutions questionable. The most common remedy is to impose a restriction on the construction of the circuit families by requiring a computationally simple rule for generating the various circuit members of {Cn}. Intuitively, we require that all the members of the family are similar in structure, but differ in input sizes (consider, e.g., a XOR-tree to decide PARITY). In general, a family is said to be uniform if there exists an efficient algorithm to output Cn upon request.

Definition 2.3.3. A family {Cn} of Boolean circuits is "logspace uniform" if the transformation 1^n → C̄n can be computed in O(log(size(Cn))) space on a deterministic Turing machine.

Logspace uniformity (Borodin–Cook uniformity) is the most widely used and, for polynomial-size circuits, it can be equivalently defined via DTMs running in space logarithmic in the –unary– input 1^n (hereafter, the two versions of the definition are used interchangeably). Even though logspace DTMs solve a small class of problems, they can describe a wide class of useful circuit families. Furthermore, we can define several types of uniformity based on the resources and the capabilities of the machine generating the circuit descriptions [24]. Our choice of the exact type of uniformity depends mostly on the intended use of the circuit and on the complexity of the class under examination (see chapter 4). Generally, we avoid allowing more computational power to the constructor than to the circuit itself. Notice that the circuit constructor serves as a single object representing the entire family.
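As an example of a very simple uniform family, the sketch below generates, for any n that is a power of two (a simplifying assumption), a description of a balanced XOR-tree deciding PARITY, as mentioned above; the generator needs only counters and simple arithmetic, in the spirit of the uniformity requirement, and the output format is our own rather than the standard encoding.

    #include <stdio.h>

    /* Print a description of a balanced XOR-tree on n = 2^k inputs.
       Vertices 1..n are the inputs; each subsequent vertex is an XOR gate
       whose two children are consecutive vertices of the previous level.
       The last vertex printed is the output (it decides PARITY).          */
    static void xor_tree(int n) {
        int next = n + 1;          /* id of the next gate to allocate      */
        int level_start = 1;       /* first vertex id of the current level */
        int level_size = n;
        for (int v = 1; v <= n; v++)
            printf("(%d, INPUT)\n", v);
        while (level_size > 1) {
            for (int i = 0; i < level_size; i += 2, next++)
                printf("(%d, XOR, %d, %d)\n", next,
                       level_start + i, level_start + i + 1);
            level_start += level_size;   /* the new level starts right after */
            level_size /= 2;
        }
        printf("output: %d\n", next - 1);
    }

    int main(void) {
        xor_tree(8);    /* member C_8 of the family */
        return 0;
    }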

Not to overlook the non-uniform families, we mention here that they are also used in the study of circuits. First, they are famous for deciding non-recursive languages (i.e., languages undecidable by TMs). Consider for example a non-recursive unary language Lu ⊆ 1* (at least one exists, because any non-recursive L can be reduced to some Lu, e.g., via binary-to-unary expansions). Strikingly, there exists a family {Cn} deciding Lu: for each n ∈ N we define Cn to be a tree of AND gates iff 1^n ∈ Lu [5]. Unfortunately, even though {Cn} exists, there is no way to construct C̄n algorithmically, because no TM can decide whether 1^n ∈ Lu when generating Cn. Second, non-uniform families can be used to prove lower bounds; if a function cannot be computed by such a family, then it cannot be computed by a weaker, uniform, family of the same size.

The aforementioned definitions give us the basis of the Boolean circuit model. Several variations have been proposed. In many cases we allow the use of gates with fan-in greater than 2, or even the use of unbounded fan-in gates. Such possibilities can "speed up" the computation by up to an O(log n) factor without increasing its size. Other variations limit the gate types to be used. For example, we can use only AND and OR gates to study monotone circuits.


The term monotone connotes here that changing any input variable from '0' to '1' cannot result in an opposite change of the output (from '1' to '0'). Note that, since {AND, OR} is not a functionally complete set of logic operators, monotone circuits cannot compute all Boolean functions. Other circuits might even introduce new types of gates, e.g., the threshold gate (which outputs '1' iff the number of its inputs that are '1' reaches some threshold) for modeling neural networks. In another direction, we can impose restrictions on the structure of the circuit graph. For example, we can confine ourselves to the study of planar circuits, i.e., circuits that can be drawn in 2D with no edge crossovers. Besides the above, many criteria and resource bounds can be used, or combined, allowing the power analysis and/or the development of circuit solutions for specialized applications.

The Boolean circuits will be examined in detail in the following chapters. In fact, we will consider them as the basic model of our parallel computation study. However, not all circuits proposed in the literature are built upon Boolean operations. We mention here the arithmetic circuits [22]. An arithmetic circuit is like a Boolean circuit except that the inputs are indeterminates ⟨x1, x2, . . . , xn⟩ and possibly constants ci ∈ F, where F is a field (e.g., the rationals Q), the internal gates are the arithmetic operations {+, −, ×, ÷}, and the outputs are elements of F(x1, x2, . . . , xn). Arithmetic circuits are used to study the complexity of multivariate polynomials or rational functions. As with their Boolean counterparts, we are interested in the depth and size of uniform arithmetic circuits, so that we can bound the cost of computing certain polynomials. Overall, arithmetic and combinatorial (Boolean) complexity are related and questions of one domain can be translated to the other; we even encounter the algebraic analog of "P vs NP", i.e., the "VP vs VNP" conundrum (VP is the class of polynomials computed by polynomial-size arithmetic circuits).

2.3.2 FSM, Conglomerates and Aggregates

The Boolean circuits were defined as acyclic graphs. Such a definition prohibits the reuse of resources during the computation, as well as the construction of memory cells. Arguably, an extension of the model allowing feedback paths within the circuit would enhance its ability to estimate hardware costs in modern technology. Before presenting the Aggregate model for this purpose, we briefly describe the state machines, which precede the aggregates without necessarily being based on circuits.


Finite State Machines

Let us mention first the Finite State Acceptor (FSA), which is a weak relative of the Turing machine. This automaton is defined similarly to the TM, including a set of states Q, initial and final states, an input alphabet A, and a transition function δ. However, the FSA can read its input tape only once, starting from the leftmost symbol and moving to the right. At each step, the state qi of the FSA changes according to a transition function of the form δ : (qi, α) → (qj), α ∈ A. If we allow the FSA to output a symbol at each step, i.e., if we define an output alphabet B and a transition function δ : (qi, α) → (qj, β), β ∈ B, we get a Finite State Machine (FSM) [25].

The FSM is a notion used beyond the context of Turing machines. For instance, we can envisage an FSM as a graph consisting of a memory node and a processing node. Synchronously, the memory stores the current state, while the processing node reads the input symbol and the current state to generate a new state and an output symbol. Note that, if we restrict the input values to some fixed length, then we can implement the above nodes by using circuits. In fact, a common application of the FSM is in the design of digital systems.

To conclude the presentation of deterministic FSMs, we define two FSM types based on the form of the output function. The first is the Mealy Machine (introduced by Mealy in 1955), where the output value is determined at each step by both the current state and the current input symbol, i.e., fout : Q × A → B. The second is the Moore Machine (introduced by E. F. Moore in 1956), where the output value is determined at each step by the current state alone, i.e., fout : Q → B. It can be shown that the two machines are computationally equivalent. Moreover, the FSMs decide exactly the class of regular languages (languages described by regular expressions) [23].
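As a toy illustration of the Moore style (the output depends on the state only), the sketch below implements a two-state parity detector over the alphabet {0,1}; the state names, encoding and output alphabet are of course our own choices.

    #include <stdio.h>

    enum state { EVEN, ODD };                /* Q = {EVEN, ODD}             */

    /* Transition function delta: Q x A -> Q, with A = {'0','1'}.           */
    static enum state delta(enum state q, char a) {
        if (a == '1') return (q == EVEN) ? ODD : EVEN;
        return q;                            /* reading '0' keeps the state */
    }

    /* Moore output function fout: Q -> B, independent of the input symbol. */
    static char fout(enum state q) {
        return (q == EVEN) ? 'E' : 'O';
    }

    int main(void) {
        const char *input = "10110";
        enum state q = EVEN;                 /* initial state               */
        for (const char *p = input; *p; p++) {
            q = delta(q, *p);
            putchar(fout(q));                /* one output symbol per step  */
        }
        putchar('\n');                       /* prints "OOEOO"              */
        return 0;
    }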

Conglomerates

In an attempt to study parallel computation with the use of state machines, Goldschlager introduced in 1977 the Conglomerates. A conglomerate is, essentially, a collection of interconnected, synchronous, deterministic FSMs [7].

To be precise, a conglomerate C = ⟨I, F, f⟩ is an infinite set F of isomorphic copies of a single FSM, Mi = ⟨Σ, S, δ, r⟩, where I is the input alphabet, Σ is the communication alphabet, S is the set of states, r is the number of inputs to each Mi, and δ is the transition function. The function f defines the interconnection network within the conglomerate by generating the indices of the predecessors of each Mi. All sets, except F, are finite.

To explain the definition: a conglomerate consists of identical FSMs. Each Mi has a predefined number r of inputs and one output, which is connected to a number of distinct Mj's. That is, the fan-in of each Mi is bounded, the fan-out is unbounded, and the interconnection network of the Mi's is fixed. By convention, the input w of the conglomerate is distributed among the first |w| FSMs (one bit per Mi). Hence, given that the number of FSMs used during the computation reflects its hardware complexity, all conglomerates have Ω(n) cost. The computation advances in steps: each Mi starts at its initial state and it successively generates internal states and output values. The total number of steps is considered as the time cost of the model. A conglomerate accepts w if at any time M0 enters a designated state qacc.

The restrictions imposed on the singularity and the output of the above FSMs might seem to limit the computational power of the model. However, it is shown that the above conglomerate can simulate, with no time loss, the variant of the model which allows many outputs and distinct (finitely many) types of FSMs. Moreover, it is shown that the SIMDAG (see p. 13) can be implemented by a simple conglomerate, the connection function of which can be computed with only logarithmic-space RAMs. Analogously to the Boolean circuit model, when studying the computational power of the conglomerate we impose a uniformity constraint on the construction of its interconnection. Arguably, a conglomerate is implementable if its connection function can be computed by polynomial-space RAMs. It is shown that the class of these conglomerates is equal to the class of polynomially space-bounded computers. More specifically, the –parallel– time of such a conglomerate is polynomially related to the –sequential– space of a Turing Machine [7].

Aggregates

By combining the Boolean circuits and the conglomerates, Dymond and Cook introduced the Aggregate model in 1980 [26]. Like bounded fan-in circuits, each aggregate consists of 2-pin Boolean gates and works for inputs of fixed length. Like the conglomerates, the gates of an aggregate work synchronously. Unlike either circuits or conglomerates, the input convention for aggregates is such that either the hardware used or the time taken in a computation can be sublinear in the input size.


Formally, an aggregate βn works on n input bits x0, x1, . . . , xn−1. It is a directed graph (not necessarily acyclic) of Boolean gates and input nodes. A configuration C of βn is an assignment of 0 or 1 to each node u. The assignment of u at Ct+1 is determined by the operation of u and by its input pins at Ct. A computation of βn is a sequence C0, C1, . . . , CT of configurations. The maximum T over all inputs x, |x| = n, is the time complexity T(βn), while the total number of nodes is the hardware complexity H(βn). Given a distinguished pair ⟨u0, u1⟩, the output of βn is equal to the value of u0 when u1 turns to 1 for the first time (all nodes are initialized to 0). A family of aggregates {βn} recognizes a set A ⊆ {0,1}* if βn recognizes An for all n. An aggregate family {βn} is uniform if each member βn can be described by a DTM in space log(T(βn) · H(βn)) on input 1^n.

The input convention of the aggregate model is analogous to that of the TM, which ignores the space cost of its input tape. Here, each input node v is associated with a unique register Rv specifying the input bit xi to be assigned to v. The register Rv is considered as a sequence of log n special-purpose gates. We can envisage v as a binary multiplexer controlled by the contents of Rv, which can change dynamically to specify any index of x. Overall, the hardware cost of node v is 1 and its assignment is delayed by log n steps. Note that if we assume that an aggregate βn examines all the information of its input, then we have T(βn) · H(βn) = Ω(n).

Regarding the computational power of the aggregates, it is shown that AG-TIME(T) = BC-DEPTH(T) and that UAG-TIME(T) = UBC-DEPTH(T). In other words, aggregates define exactly the same parallel time classes as Boolean circuits, with and without the uniformity constraints (the models are of equal power). Moreover, the time of the uniform aggregate is polynomially related to the space of the Turing Machine. In terms of hardware, we get DTM-SPACE(S) = UAG-HARDWARE(S) = UBC-WIDTH(S). Finally, uniform aggregate hardware and time are themselves polynomially related [26].

2.3.3 VLSI

The aforementioned circuit models ignore the cost of placing wires within the circuit to connect its gates. Nonetheless, such cost is far from negligible in today's technology. Given the tremendous reduction of the size of a Boolean gate, the wiring accounts for more than half the cost of an entire modern microchip [23]. The VLSI model, along with its several variations, was introduced in an attempt to include the details of microchip technology in the theoretical study of computation.

Figure 2.5: VLSI layout of a NAND gate (substrate, wells and wires shown)

Design at the Physical Layer

Very Large Scale Integration (VLSI) is the process of creating integrated circuits by combining thousands of transistor-based Boolean gates and memory cells into a single chip. As figure 2.5 shows, each gate is itself the combination of a number of semiconducting elements (forming a few interconnected transistors) occupying a few square micrometers on the wafer. Starting from a thousand transistors in the early 1970s, microelectronics today allows the integration of a billion transistors in a single chip.

In the VLSI setting, the surface of a wire within the circuit is comparable to that of a single gate. Moreover, the fabrication process will not allow more than a constant number of wires to cross each other at any point of the wafer (typically, fewer than a dozen metal layers are used). As a consequence, the interconnection of an architecture has a definitive impact on its area cost. Besides, it affects the timing of the chip, as the signals propagate through the chip following the laws of physics and not instantaneously. Clearly, the wiring is as important as any other algorithmic aspect of the design and requires extra study (compared to the Boolean circuit model).

The crossover constraint is even stricter for the gates, which cannot overlap at all. Even in the emerging technology of 3-D Integrated Circuits, this constraint is still present, preventing the design of arbitrarily "thick" circuits. Overall, a VLSI circuit is quite similar to the planar circuit of the Boolean model. To underline their similarity we use the term quasiplanar [23], which denotes the fact that the VLSI circuit allows only a constant number of crossovers (planar implies zero crossovers).

In addition to Boolean gates and wires, VLSI circuits include certain memory cells. Such a characteristic element is the flip-flop. We can envisage the flip-flop as a special-purpose gate, which can store the value of its input bit for an arbitrary amount of time. The amount of storage time is controlled by designated signals, amongst which is the clock of the design. The clock signal traverses the entire circuit, defining the time instants at which the computation advances to the next step. That is, in general, the VLSI circuit operates synchronously. The time cost of the computation is equal to the required number of steps (clock cycles) multiplied by the clock period. The clock period (the inverse of the operating frequency) is defined prior to the computation and is bounded by the propagation time of the longest asynchronous path in the circuit (a path between two successive flip-flops, through transistors and wires).

The inputs and outputs (I/O) of a VLSI circuit are communicated via designated pins. The I/O pins are wires of much greater surface than the internal wires of the circuit, and thus, they are studied separately. Analogously to the I/O conventions adopted in the other models, the VLSI model features its own I/O specifications. A port of a VLSI circuit can receive several input values, each one at a distinct time instant, e.g., at specified clock cycles. Also, an input can be supplied at several ports simultaneously. Moreover, a VLSI circuit has the possibility of reading a certain value more than once from its input ports. The VLSI computation is designed as semellective (read once) or multilective (read/request data multiple times). Usually, the I/O operations are where- and when-oblivious [23], i.e., they occur at specified pins and cycles. Similar conventions are adopted for the output ports of a circuit. In fact, any pin can be used as an input, output, or input/output of the chip. Given the above I/O possibilities, it is clear that a VLSI circuit can perform several distinct computations concurrently. In a pipeline fashion, it can input a new problem instance (a set of new values) while still processing some older values at a deeper stage within its architecture. Hence, to measure the performance of a chip we use (besides the computation time and area) the rate at which the chip can receive new instances, or the throughput: the number of instances solved, or information processed, over the total time consumed.

To complete the description of the VLSI technology we mention the parameter λ. In photolithography, a technique employed in microfabrication, the geometrical patterns comprising the integrated circuit (figure 2.5) are printed on the wafer using light-projection equipment. The projection features a certain resolution, determined by physics and the state of the art of the technology. This resolution defines the minimum separation length, λ, between the elements of the chip (wires, transistors, etc.). Typically, the width of an internal wire is a small multiple of λ and the area of a transistor is a multiple of λ². That is, the minimum feature size λ specifies the transistor density of a VLSI wafer, and to some extent, the computational power of a chip. Naturally, λ has improved significantly over the years. Starting from 10 µm in the 1970s, today we use a 0.045 µm process and we target a 0.032 µm manufacturing process¹³.

¹³ According to Moore, Intel co-founder, the transistor density doubles approximately every two years. Of course, his "law" cannot hold forever (due to atomic limitations).

VLSI Technology Abstraction and Study

In all theoretical models of VLSI, circuits are made of two basic components, namely the wires and the gates (Boolean operations, or flip-flops) [27]. The wires are embedded in the plane (quasiplanar) and have unit width and arbitrary length. They usually have unit bandwidth (one bit per transmission). Their delay is determined as follows: the synchronous model assumes unit delay independently of the length, the transmission-line model assumes that the delay is proportional to the length of the wire, whereas the diffusion model assumes that the delay is quadratic in the length [23]. Regarding the gates and the I/O pins, they are considered as nodes, which are connected via the above wires (hyperedges). Each node is characterized by a specific transmission function and it has constant area, constant delay and constant fan-in/fan-out. The space units are usually related to λ (e.g., width unit = λ, area unit = λ², and each gate, memory cell, port, or pair of crossing wires occupies λ² area). Similar assumptions are made for the time unit, in an attempt to abstract the fabrication parameters and measure the performance of the architecture independently of the technology, e.g., in terms of clock cycles and area units (the implementation values of which tend to shrink year after year). In practice, of course, an architecture is evaluated based on its implementation results, including the operating frequency and the actual layout size of the chip.
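The three wire-delay conventions above are easy to contrast numerically. The toy function below (our own illustration, in arbitrary units) returns the delay of a wire of a given length under the synchronous, transmission-line and diffusion assumptions.

    #include <stdio.h>

    enum delay_model { SYNCHRONOUS, TRANSMISSION_LINE, DIFFUSION };

    /* Wire delay as a function of wire length (arbitrary units):
       constant, linear, or quadratic in the length, per the three models. */
    static double wire_delay(enum delay_model m, double len) {
        switch (m) {
        case SYNCHRONOUS:       return 1.0;        /* unit delay   */
        case TRANSMISSION_LINE: return len;        /* proportional */
        case DIFFUSION:         return len * len;  /* quadratic    */
        }
        return 0.0;
    }

    int main(void) {
        double len = 8.0;   /* a wire of 8 length units */
        printf("synchronous: %.1f, transmission-line: %.1f, diffusion: %.1f\n",
               wire_delay(SYNCHRONOUS, len),
               wire_delay(TRANSMISSION_LINE, len),
               wire_delay(DIFFUSION, len));
        return 0;
    }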

Each VLSI model further defines the I/O, the degree of concurrency, the energy consumption, the performance, etc. [23] [27]. These definitions, more or less, choose among the aforementioned VLSI technology possibilities. We can additionally mention here that some authors assume an off-chip memory to store the I/O bits, such that multiple multilocal accesses are possible (the memory cost is usually excluded from the analysis, allowing a reduction of the chip's cost). The concurrency, c, of an architecture is equal to the number of problem instances solved simultaneously. Its area performance is equal to the total area units of the chip (smallest enclosing rectangle), divided by c. The time performance is equal to the time between the first and last I/O memory access, divided by c. For a detailed analysis, we even define the data format according to which the problem instance will be presented to the chip.

Having defined the model, one can continue with studying the computational power of VLSI circuits. First, note that every Boolean function can be implemented in VLSI because it has a planar Boolean circuit (every function has a Boolean circuit, which can be transformed into a planar one with at most a quadratic increase of its size). Second, note that the area of a VLSI circuit is proportional to the number of, say, PRAM processors that can fit in the chip¹⁴. Given the similarity of the time resource of the two models (PRAM and VLSI), we are driven to the study of the product area×time for VLSI circuits (the AT measure). Clearly, this is analogous to the product processors×time used in the PRAM case. That product constitutes a PRAM performance measure, which is lower bounded by the minimum sequential time, Ts, of the problem. Similarly, we have that AT = Ω(Ts). Note that, for low-complexity problems, we can devise circuits which are optimal in terms of AT (performing at the lower bound). However, the AT bounds are not the strongest studied in VLSI. Stronger complexity bounds are achieved for AT², which is the most common VLSI complexity measure used in the literature (A²T is also used) [23].

¹⁴ Likewise, the area is analogous to the amount of information memorized by the VLSI circuit at any time instant. Recall that this also holds for the space of a Turing machine. Intuitively, the area complexity of an algorithm is loosely coupled to its space complexity.

Various methods can be employed in order to prove an AT² lower bound for a given problem. The most common method is to use a "fooling argument", i.e., to show that a minimum amount of information must be transferred between two parts of the chip, or else one part can be fooled into thinking that the other inputs data different from the actual. Transferring such information requires either a large amount of time or a large number of wires (area), and thus, a lower bound on the AT² of the chip can be proved. Another method to prove VLSI lower bounds for a specific problem is to study the complexity of the planar circuit of that problem (recall that the VLSI circuit is already quasiplanar). It can be shown that any VLSI computation can be simulated by planar circuits of size O(AT²) and O(A²T) [23].

To report some examples, the "n-tuple prefix sum" has VLSI complexity AT = Ω(n), while the "n×n matrix-vector multiplication" has AT = Ω(n²). Both problems feature AT-optimal circuits, which are based on H-trees¹⁵. The "integer multiplication" and the "n-tuple shifting" problems have VLSI complexity AT² = Ω(n²). Moreover, they both feature AT²-optimal semellective circuits based on 2-D systolic arrays. The "n-tuple sorting" has VLSI complexity AT² = Ω(n² log n), while the "n-point FFT" has AT² = Ω(n² log² n) [27]. The "n×n matrix-matrix multiplication" has complexity A²T, AT² = Ω(n⁴). Separate bounds for A and T are also computed for many problems, as for example the A = Ω(n) bound of the "n-vector shifting" (shifting in T = O(√(n/log n)) is a characteristic case where most of the VLSI area is consumed by wires) [23].

We conclude the VLSI model by noting its area-time tradeoffs (Thompson, 1980). As mentioned above, problems feature certain lower bounds on their AT, AT², and A²T complexities. Such bounds imply that, when designing optimal VLSI circuits, in order to improve A we have to degrade T by a certain factor, and conversely. Of course, analogous quantitative or qualitative estimations hold for parallel computation in general.

¹⁵ Similar to the binary tree, except that each node has 4 children. In the binary tree, one can see a repetition of the shape Λ in a recursive fashion. In the H-tree, the repeated shape is H, rendering the mapping of the tree onto the plane more efficient in terms of area.

2.4 More Models and Taxonomies

Besides the aforementioned, the literature includes a variety of models for studying and developing parallel algorithms. A number of these models are based on extensions of the ideas described so far, while others incorporate entirely novel techniques and technology trends. To complete the survey of the current chapter, the following subsections refer to a representative set of the remaining parallel models of the literature. Moreover, we report the proposed architectural/programming taxonomies, which are used to classify parallel computers today. We begin with the details of a classical model, which is widely used in complexity theory.



2.4.1 Alternating Turing Machine

Alternating Turing Machines (ATM) were introduced in 1981 by Chandra et al. [28]. They are generalizations of Non-Deterministic Turing Machines (NDTM). Like the NDTM, the ATM is not a realistic model of computation. However, it is very useful in studying certain languages with no obvious short certificate for membership. Moreover, ATMs capture the parallel complexity of various classes of problems.

Recall that an NDTM is essentially characterized by a computation tree, the nodes of which represent the configurations and the edges represent the potential transitions of the machine [5]. In the case of an NP language, the NDTM accepts if there exists an accepting leaf in the tree. In contrast, in the case of a co-NP language, the NDTM accepts if for all leaves we get a negative response. Loosely speaking, one can see only OR (∃) nodes in the first tree and only AND (∀) nodes in the second. The ATM generalizes by allowing both types of nodes in the same computation tree.

In an ATM, each state qi ∈ Q is labeled with its own logical connective l ∈ {∨, ∧}. Consequently, each configuration of the ATM is also labeled by l. Let us define recursively a configuration to be accepting if: (i) the state of the configuration is a 'yes' state, (ii) it is labeled ∨ and at least one of its children is an accepting configuration, or (iii) it is labeled ∧ and all of its children are accepting configurations. The ATM computation accepts if the initial configuration is an accepting configuration (notice the resemblance to the evaluation of a circuit).
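The acceptance condition is literally a recursive AND/OR evaluation over the configuration tree, as the following sketch shows (the explicit tree representation is our own device; a real ATM configuration graph would be generated on the fly from the transition relation).

    #include <stdbool.h>
    #include <stdio.h>

    enum label { YES, NO, OR_NODE, AND_NODE };

    /* A node of an (already expanded) computation tree. */
    struct config {
        enum label l;
        const struct config **children;  /* NULL-terminated array, or NULL */
    };

    /* Recursive acceptance: 'yes' leaves accept, OR nodes need one accepting
       child, AND nodes need all children accepting.                          */
    static bool accepting(const struct config *c) {
        if (c->l == YES) return true;
        if (c->l == NO || c->children == NULL) return false;
        bool any = false, all = true;
        for (int i = 0; c->children[i] != NULL; i++) {
            bool a = accepting(c->children[i]);
            any = any || a;
            all = all && a;
        }
        return (c->l == OR_NODE) ? any : all;
    }

    int main(void) {
        struct config yes = { YES, NULL }, no = { NO, NULL };
        const struct config *kids[] = { &yes, &no, NULL };
        struct config and_cfg = { AND_NODE, kids };
        struct config or_cfg  = { OR_NODE,  kids };
        printf("AND(yes,no) -> %d, OR(yes,no) -> %d\n",
               accepting(&and_cfg), accepting(&or_cfg));   /* 0 and 1 */
        return 0;
    }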

The rest of the ATM is defined exactly as the NDTM, i.e., it is a seven-tuple ⟨k, Q, Σ, Γ, δ, q0, g⟩, where k is the number of tapes, Q is the finite set of states, Σ is the input alphabet, Γ is the work-tape alphabet, δ is the transition relation, q0 is the initial state and g : Q → {∨, ∧, yes, no} is a function labeling the states of the ATM. The input tape is read-only. An ATM decides a language L ⊆ Σ* when it accepts x iff x ∈ L. Let ATIME(f(n)) be the class of languages decided by ATMs all computations of which on input x terminate after at most f(|x|) steps. Similarly, ASPACE(f(n)) includes the languages decided in alternating space f(|x|). Note that, besides time and space, we use the alternations to study the complexity of an ATM algorithm; this measure counts the maximum number of changes between ∨ and ∧ states over all paths of the computation tree.

Regarding the computational power of the model, the ATMs accept exactly the same set of languages as the DTMs (the recursively enumerable languages) [28]. To be more precise, it was proved that alternating space is equal to deterministic time, only one exponential higher. Also, alternating time is polynomially related to deterministic space. We mention here that the idea of alternation can be applied in other models too. In some cases, alternation enhances the computational ability of the entire model: alternating pushdown automata are strictly more powerful than nondeterministic pushdown automata (the former accept the languages accepted by DTMs in time c^n, while the latter do not).

Regarding the interests of this thesis, the ATM can be viewed as a machine spawning processes with unbounded parallelism. That is, when a node v of the tree branches, it spawns new processes, which will run to completion and report acceptance or rejection back to the node v. Afterwards, the node v combines the answers in a simple way (AND, OR) and forwards the result to its own parent, and so on. Note that the process communication is confined only between parents and children. Overall, the ATM is closely related to the models of parallel computation. A characteristic example is the correspondence between the number of alternations and the depth of unbounded fan-in circuits. The ATM is used to study parallel complexity theory and we will come back to it in chapter 4.

2.4.2 Other Models

The k-PRAM model was introduced in 1979 by Savitch and Stimson [29]. Unlike the ordinary PRAM, this model has no shared memory. Instead, a processor communicates directly with another processor by making a call and passing parameters via some channel registers. A processor cannot communicate with an unbounded number of distinct processors; the parameter k denotes the branching factor out of each processor. The above idea is similar to that of the recursive-TM. Such a machine has two input tapes, one containing the input of the problem and one containing the output of the calling machine. That is, the TM can spawn an identical TM by passing a parameter via its output tape (to the input tape of the new TM). In this way, the TM performs a subroutine call and waits until its termination (accept/reject state) before continuing with the rest of the computation. One step further, the parallel-TM is defined by allowing two output tapes per machine and two simultaneous subroutine calls, which will spawn two new machines working in parallel.

The Hardware Modification Machines (HMM) and the Parallel Pointer Machines (PPM) are essentially two versions of the same model, which is characterized by its reconfigurability at run-time [30]. As with the Conglomerate model, HMM and PPM consist of a finite collection of FSMs pairwise connected in a direct fashion. However, as opposed to the Conglomerate, where the connections are fixed, the communication channels between the FSMs of the HMM and PPM can change during the computation. To be precise, in the PPM case, each FSM has k input lines (links, or pointers) and, at each step, it can redirect one of its links to any other FSM which is located no more than two hops away from it. These models were introduced (in the 1980s) to study the computational power of machines falling in between the two major categories of models: those with bounded and fixed inter-process communication (e.g., circuits), and those with unlimited communication links (e.g., shared memory). Interestingly, the results of the literature show that the PPM falls very close to the second and more powerful category, i.e., it can simulate sequential space S(n) in only O(S(n)) parallel time (and not O(S²(n))).

The Reconfigurable Mesh is a model capturing features from the fixed-connection networks and the VLSI model, characterized by broadcast buses and reconfigurability at run-time [31]. Specifically, it consists of a √N × √N array of processors connected via a grid-shaped reconfigurable broadcast bus. Each processor has four local switches, which allow the bus to be divided into sub-buses connecting various sets of processors, instantly. The broadcast on the bus can be performed in an exclusive-write or common-write fashion. Similar to the VLSI model, we assume Θ(N) total area cost, unit area for a link between adjacent switches, and either unit-time delay (any broadcast operation consumes Θ(1) time) or log-time delay (broadcast in Θ(log s) time, where s counts the switches traversed by the signal). Note that under the unit-time delay assumption, the reconfigurable mesh owns greater power than the ordinary circuits and the PRAM, e.g., it computes PARITY in O(1) time.

The Vector Machine is, roughly, a RAM including bit-vector registers and bitwise instructions: AND/OR/NOT operations, vector shifting, etc. [32]. These instructions are executed in one cycle and involve all bits of the register (of infinite length), i.e., we get parallel processing at the bit level. The ability of the machine to shift integers by multiple bit-positions to the left (equal to the length of the integer) in a single cycle gives extraordinary power to the model. In fact, the Vector Machine can generate exponential information in only a polynomial number of steps. This is also a characteristic of the M-RAM model, which is a RAM equipped with a unit-cost instruction for multiplying variables. It is proved that both machines can solve NP-complete problems in polynomial time [32] [5].

The CLUMPS model is a generalization of the LogP (Campbell, 1994) [15]. A set of autonomous processors communicate directly via a partitionable interconnection network. Parameters similar to those of the LogP are used for measuring the performance of the parallel algorithms. Also, a network capacity is determined. However, instead of the global parameters L and g of the LogP, the CLUMPS model uses latency and bandwidth parameters which are defined for certain regions of the machine. Each region determines a collection of processes allocated to a specific partition of the interconnection network. Inter-region communication is costed by merging the parameters of the corresponding regions. The distributed memory of the model can be viewed as a multi-level hierarchy, with each level consisting of the collective memory of a region (the local memories of each processor are located at the lowest level, while at the highest level we get the memory of the entire machine). Overall, introducing numerous regions and parameters is liable to increase the difficulty of the analysis, but it also allows algorithms to run faster if we carefully map the processes to appropriate partitions.

The Cellular Automata came into existence with the work of von Neumann, among others, during the 1940s. Today, the model is used in biology, chemistry and physics, as well as in computation theory [33]. Similar to the Conglomerates, a CA is a –possibly infinite– collection of finite state automata. These FSAs, however, are not defined here with explicit I/O or interconnections; we are interested only in their states. Their communication is performed indirectly by relating their states. Specifically, each FSA is called here a cell. The cells are placed on a discrete regular grid, i.e., on a lattice (most often, the CA is a 2D array of cells). A neighborhood formula associates each cell with a finite number of specific cells in its vicinity. The neighborhood determines the state of the cell during the computation. The computation is a synchronous evolution of configurations, i.e., of vectors describing the states of all cells at each time instant. Starting from an initial configuration, the state of each cell changes at each step according to a predefined rule (a transition function), which takes as input the previous states of its neighborhood cells. This rule is the dominant characteristic of each CA. Interestingly, one can see certain patterns evolving within the configurations during the computation (shapes often appearing in the study of CAs, or even in nature itself). Well-known problems expressed in terms of CAs are the "configuration reachability" (can we obtain picture X starting from Y) and the "predecessor existence" (deciding whether X has a certain predecessor Y). The former is undecidable for infinite 1D CAs (analogously to the halting of a TM), while the latter is NL-complete for 1D CAs and NP-complete for 2D CAs. We know that CAs can simulate TMs and vice versa.
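As a concrete and deliberately simple illustration of the synchronous evolution of configurations, the following Python sketch evolves a one-dimensional CA whose cells see the neighborhood (left, self, right); the transition rule is given as an elementary-CA rule number (here rule 110, chosen only as an example).

def ca_step(config, rule=110):
    """One synchronous step of a 1D CA with wrap-around boundary."""
    n = len(config)
    new = []
    for i in range(n):
        left, center, right = config[i - 1], config[i], config[(i + 1) % n]
        index = (left << 2) | (center << 1) | right   # neighborhood pattern 0..7
        new.append((rule >> index) & 1)               # look up the transition rule
    return new

config = [0] * 15 + [1]                # initial configuration
for t in range(8):                     # evolution of configurations
    print("".join("#" if c else "." for c in config))
    config = ca_step(config)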

Natural Computing is a field of research swarming with peculiar models of parallel computation [34]. It is concerned with the computing paradigms encountered in nature, i.e., it studies the principles and mechanisms underlying natural systems, which can inspire humans to design new algorithms and computers. Since nature works most often in parallel, most of the resulting models perform parallel computations. The cellular automata mentioned above are a first, characteristic, example of natural computing, inspired by the self-reproductive mechanisms in biological organisms. Other examples include neural networks, membrane computing, molecular (DNA) computing, evolutionary (genetic) algorithms, ant colony optimization, artificial immune systems, etc. Some of these attempts lead to heuristic algorithms (e.g., for search problems), while others can lead to the design of computational models with much greater power than the ordinary TM (e.g., the "transition P system" can solve PSPACE-complete problems in polynomial time [35]). The literature on natural computing is vast, varying from purely theoretical to experimental laboratory research.

Dynamic Multithreading is more of a programming model, which can also be used to study parallel algorithms [3]. The model assumes an ideal multiprocessor and abstracts away the need for communication protocols and load balancing (such issues are handled in practice by a concurrency platform placed between the programmer and the actual multiprocessor). A parallel program consists of ordinary instructions (add, jump, etc.) plus three special-purpose instructions: spawn, parallel, and sync. The spawn command allows for nested parallelism, as the programmer uses it to start a new subroutine (a thread) to work in parallel with the caller routine (via a call with parameters). The parallel command is used to execute certain loops in parallel without explicitly calling subroutines. The sync command determines specific time barriers within a routine, forcing the routine execution to wait until all of its subroutines have terminated and returned their results. The above three commands are concurrency keywords and are inserted in an –otherwise serial– pseudocode to express the logical parallelism of the computation, indicating which parts of the computation may proceed in parallel. Alternatively, one can construct the directed acyclic graph of the computation with nodes for instructions and edges for dependencies (fanout > 1 denotes a spawn and fanin > 1 denotes a sync). We are interested in measures such as the work of the computation (its sequential time) and the span (its parallel time, i.e., the longest path in the dag).
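The following Python sketch (an illustration, not taken from [3]) computes the work and span of the standard recursive Fibonacci computation expressed with spawn/sync; the comments mark where the concurrency keywords would appear in the pseudocode.

def p_fib(n):
    """Return (value, work, span) of the spawn/sync Fibonacci computation.
    Work counts all unit tasks; span counts the longest chain of dependent tasks."""
    if n < 2:
        return n, 1, 1
    a, w1, s1 = p_fib(n - 1)   # spawn P-FIB(n-1): runs in parallel with the next call
    b, w2, s2 = p_fib(n - 2)   # the caller continues with P-FIB(n-2)
    # sync: wait for the spawned child before combining the results
    return a + b, w1 + w2 + 1, max(s1, s2) + 1

value, work, span = p_fib(10)
print(value, work, span, work / span)   # the ratio work/span bounds the available parallelism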

Model               Complexity Measures

Turing Machine      time (steps), space (cells), reversals
Alternating TM      alternations, space (cells), time (steps)
PRAM                time (instructions), processors, processors × time
BSP                 supersteps, processors, g·hi + s + mi
Boolean Circuit     depth (gates), size (gates), width (gates)
VLSI Circuit        time (sec), area (µm²), area × time²

Table 2.1: Most common complexity measures (models' resources)


Papadimitriou and Yannakakis described in 1990 a method for evaluating the performance of a parallel algorithm via the solution of a scheduling problem [36]. PY view the parallel algorithm as a directed acyclic graph indicating the elementary computations of the algorithm and their interdependence. They do not simply measure the depth of this graph to find the time complexity of the algorithm, because such a method disregards the communication delay between the processors. Rather, they introduce a parameter τ capturing the delays of the underlying machine and they define a scheduling problem on an unbounded number of processors. As a schedule constraint, each task Ti must be completed by time t−1, where t denotes the start time of any child Tj of Ti, or by time t−1−τ when Ti and Tj are not scheduled on the same processor. The optimum makespan of this scheduling problem is regarded as the complexity of the algorithm under examination (given as a function of τ). Since scheduling is an NP-complete problem, PY devise a scheduling method to approximate the optimum makespan to within a factor of 2.
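To make the role of τ concrete, the following Python sketch (a hypothetical example, not the PY approximation algorithm) compares two extreme schedules of a task dag with unit-time tasks: all tasks on one processor (no communication, fully serialized) versus every task on its own processor (maximum parallelism, but τ charged on every edge).

def topological_order(dag):
    order, done = [], set()
    def visit(t):
        if t in done:
            return
        for p in dag[t]:
            visit(p)
        done.add(t)
        order.append(t)
    for t in dag:
        visit(t)
    return order

def serial_makespan(dag):
    """All tasks on one processor: unit-time tasks execute one after the other."""
    return len(dag)

def fully_parallel_makespan(dag, tau):
    """Each task on its own processor: a child starts tau after its latest parent finishes."""
    finish = {}
    for task in topological_order(dag):
        start = max((finish[p] + tau for p in dag[task]), default=0)
        finish[task] = start + 1
    return max(finish.values())

# dag maps each task to the list of its parents (its dependencies)
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
for tau in (0, 1, 5):
    print(tau, serial_makespan(dag), fully_parallel_makespan(dag, tau))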

We conclude the model presentation in this thesis by summarizing, in Table 2.1, the resources of interest for the most widely used models. These resources are used to measure the complexity of the implemented algorithm or circuit. Note that the first measure listed for each model, provably, captures the time complexity of the computation. Interestingly, the second measures are also related to each other: they reflect the amount of information carried during the course of the computation.


2.4.3 Taxonomy of Parallel Computers

The models mentioned so far, along with their variations, led to the development of a plethora of processor-based parallel machines. Considerable effort has been spent in identifying the commonalities of these machines and classifying them based on practical criteria (programming, architectural, etc.) [2] [37].

The first well-cited attempt was made by Flynn in the 1960s. Flynn based his taxonomy primarily on the programmer's viewpoint. Specifically, Flynn was interested in whether the processors of a system execute the same instruction at any given cycle (central control), or not. Moreover, Flynn was interested in whether each processor operates on the same data as the other processors at each cycle (common input), or not. In other words, Flynn's criteria are the instruction stream and the data stream of the system. In a "Single Instruction - Multiple Data stream" computer (SIMD) all processors execute the same command per cycle, even though each one of them processes distinct data (recall the SIMDAG). The SIMDs are also called array or systolic processors. In a "Multiple Instruction - Multiple Data stream" computer (MIMD) every processor has its own program and processes a distinct data set (recall the general PRAM). Of course, there exist uniprocessor systems (SISD), as well as specialized MISD systems. The SIMD multiprocessors were widely used in the past decades. Today, the top supercomputers are based on MIMD architectures. During the 1980s, the MIMD class was further divided into the SPMD and the MPMD subclasses by considering the diversity of the programs stored in the processors of the parallel machine. The "Single Program - Multiple Data stream" systems compose a wide and important class, which can be viewed as a compromise between SIMD and MIMD: each processor is given its own copy of the same program. As a result, each processor can follow a separate branch during the program execution, e.g., with an if-then-else command. This allows for more flexibility and less idle time per processor. However, SPMDs require synchronization points within the program (in contrast to the SIMD where, by definition, every processor knows the state of the others). Note that SPMD is nowadays the most common style of parallel programming and forms the basis of the Message Passing Interface (MPI).¹⁶

¹⁶MPI is a language-independent communication protocol used to program distributed memory systems. It emphasizes message passing among the processors. It is widely used today in computer clusters and supercomputers.


Figure 2.6: The Flynn-Johnson classification of computer systems

Following Flynn, Johnson proposed his own taxonomy in 1988 [37]. Johnson was interested solely in the MIMD class and extended his view to include architectural criteria. That is, he distinguished between shared and distributed memory systems. He also distinguished between systems that use messages for the communication of the processors and those that use shared variables. The characteristic member of the "Global Memory - Shared Variables" (GMSV) class is the PRAM, while the BSP characterizes the "Distributed Memory - Message Passing" (DMMP) class. The GMMP systems have isolated process address spaces (usually virtual, within the same memory). In the DMSV, or "hybrid", systems, the memory is distributed among processors, but communication and synchronization take place through shared variables (e.g., via a Butterfly network). Figure 2.6 illustrates the Flynn-Johnson taxonomy.

We continue with an alternative view of the MIMD class, which highlights the structure of the interconnection network and not the programming model underlying the processor communication. That is, we examine the physical connections of the processors. We distinguish between a bus architecture and a switched network. The former is a single medium connecting all processors, while the latter uses fixed connections between certain pairs of processors. In a bus architecture the bandwidth is shared, whereas in a switched network every pair has its own channel. The second criterion used here is the standard distinction between shared and distributed memory. However, the shared memory architectures are now called multiprocessors, while the distributed memory architectures are called multicomputers, to denote the autonomy usually encountered in each component of the system. Multiprocessors are considered tightly coupled systems, in the sense that their components tend to be in close proximity, with short communication delay and high bandwidth. Multicomputers, on the other hand, are considered loosely coupled systems (e.g., a number of workstations on an Ethernet LAN). Roughly, bus-based multiprocessors correspond to GMSV systems, while bus-based multicomputers correspond to DMMP systems.

Besides the above classifications, one can point out the UMA and NUMA distinction. In a "Uniform Memory Access" architecture all processors share the common memory uniformly, i.e., the penalty of accessing a non-local memory does not depend on the relative position between the processor and the memory module. If this is not the case, we have a "Non-Uniform Memory Access" architecture. The first UMA example that comes to mind is a simple PRAM, while the first NUMA example is a fixed-connection network of memory/processor components. In general, multiprocessors are considered UMA and multicomputers are considered NUMA, but the precise classification depends on the specifics of the implemented machine. Let us finally mention an architectural distinction between "Paracomputers" and "Ultracomputers" introduced by Schwartz in 1980; the former use an idealized central memory like the PRAM, while the latter use a number of memory modules and processors connected via an interconnection network (message passing).

To conclude this subsection, we report an attempt to classify entire models of computation instead of mere processor-based systems. Cook proposed in 1981 a "structural" classification [29], according to which we have the fixed and the modifiable structure models. The first is a class representing parallel models whose interconnection pattern is fixed during the computation, while the second class represents models whose communication links are allowed to vary. Examples of fixed structure models are: the ATM, the bounded fan-in uniform Boolean Circuit families, the uniform aggregates, and the k-PRAM. Examples of modifiable structure models are: the Hardware Modification Machines, the PRAM, the SIMDAG, the Vector Machine, and the unbounded fan-in uniform Boolean Circuit families. Note how the above distinction would reflect on the sequential computation models: we would get the fixed versus the modifiable storage structure models, e.g., the Turing Machine versus the RAM. Interestingly, we know that the fixed structure models are slightly less powerful than the modifiable ones: for the former we usually get DSPACE(S(n)) ⊆ F-TIME(S²(n)) ⊆ DSPACE(S²(n)), while for the latter DSPACE(S(n)) ⊆ M-TIME(S(n)) ⊆ DSPACE(S²(n)).


Chapter 3

Model Simulations and Equivalences

The literature includes a plethora of computational models and model variations, either sequential or parallel. The extent to which these models differ from each other can be explained, at some level, by the needs of the scientific community. We can describe here two opposite modeling trends. The first mandates the definition of parameters and structures which comply with the cutting-edge technology. The second abstracts as many details as possible to create "simple" models. Clearly, the first is a trend promoted by the fabrication industry, while the second is promoted by the theorists. In between, one can find the numerous model variations described in the previous chapter.

Such a plethora of models might lead to confusion regarding the scope and the computational abilities of each machine. From a theoretical viewpoint, we must settle whether these models are able to compute the same functions. From a practical viewpoint, we must find ways to "translate" programs automatically from one model to the other. Both of these goals can be achieved with the development of certain simulation techniques. Loosely speaking, we say that a machine simulates another machine when it executes a program mimicking the functionality and following the computation steps of the other machine, in order to generate the same output on any given input. If we develop a generic technique to simulate one model with another, we can run any program of the first on the second, at the expense of some extra resources (time, space, etc.). The amount of extra resources is used to compare the computational power of the models. Trivially, if two models can simulate each other, then they have the same computational ability.


3.1 Shared Memory Models

The existence of numerous PRAM variations gives rise to a common theoretical question regarding the computational power of each machine. On one hand, we expect that they are all computationally equivalent, i.e., that they can compute the same functions regardless of the required time/space/processor resources. On the other hand, we expect that certain PRAM variations require fewer steps than others when computing specific functions.

3.1.1 Overview

The PRAM variations share common instruction sets and operating principles (see section 2.1.1). Consequently, it is possible to "translate" a program, which was coded for a specific PRAM model, to a program running on a different machine (to compute the same output). Most often, only the read/write operations to the shared memory need to be restructured. Several techniques have been published in the literature for this purpose, allowing the simulation of one machine by another. These simulations prove the equivalence of the various PRAM models with respect to their computational ability. However, they also show that not all models are equally powerful in terms of time, space, or processors. The "weak" models usually require more resources in order to complete their calculations.

We say that two PRAM models are equally powerful if they can simulate each other with a constant factor slowdown [2]. A PRAM model M1 is strictly less powerful than another model M2 (denoted as M1 < M2) if there exists a problem for which the former, M1, requires significantly more computational steps than the latter, M2. For instance, the detecting-CRCW is strictly less powerful than the max-CRCW PRAM (which writes the maximum value offered): the latter can find the largest number of a size-p vector V in a single step (assuming that p is equal to the number of processors, consider a parallel execution where each processor Pi reads V[i] and writes it to the shared memory cell p+1), whereas the detecting-CRCW needs Ω(log p) steps.
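The contrast can be sketched in Python as follows (a sequential illustration of the two parallel strategies; the step counts, not the running time of the sketch itself, are the point).

def max_crcw_step(V):
    """One max-CRCW step: all p processors write V[i] to the same cell,
    and the write policy keeps the maximum value offered."""
    return max(V), 1   # value, number of parallel steps

def detecting_crcw_max(V):
    """Tournament-style reduction: pairs are compared in parallel rounds,
    so a p-element vector needs about log2(p) rounds."""
    rounds = 0
    while len(V) > 1:
        V = [max(V[i], V[i + 1]) if i + 1 < len(V) else V[i]
             for i in range(0, len(V), 2)]
        rounds += 1
    return V[0], rounds

V = [7, 3, 9, 1, 5, 8, 2, 6]
print(max_crcw_step(V))        # (9, 1)
print(detecting_crcw_max(V))   # (9, 3)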

The PRAM models can be ordered according to their power simply by examining their cross-simulations. More specifically, given two models M1 and M2, if M2 can simulate M1 with only a constant factor slowdown, whereas the converse cannot hold, then clearly M1 < M2. Such model comparisons can also involve the number of processors and/or the space required by each machine.


Original Model (#proc)                   Simulating Model (#proc)     Simulation Time

priority-/random-/common-CRCW (p)        EREW, CREW (≥ p)             Θ(log p)
priority-/random-CRCW (p)                common-CRCW (kp)             Θ(log p / (k(log log p − log k)))
priority-CRCW (p)                        random-CRCW (kp)             O(log log p / log(k+1))
CREW (p)                                 EREW (≥ 1)                   Ω(√(log p / log log p))

priority-/random-/common-CRCW (p)        EREW, CREW (≥ p + m²)        Θ(p/m)
priority-/random-CRCW (p)                common-CRCW (p)              Ω(log(p/m))
priority-CRCW (p)                        random-CRCW (kp)             Ω(log(p/m))

Table 3.1: PRAM simulations (p: processors, m: memory cells) [6]


A chain of some well-known relations among models using the same number of processors is [2]:

EREW < CREW < detecting-CRCW < common-CRCW < random-CRCW < priority-CRCW

Note that, even though all CRCW models are strictly more powerful than the EREW (the weakest model), the EREW can simulate the most powerful CRCW listed above with at most a logarithmic slowdown (assuming that extra memory cells can be used during the simulation).

Towards a more accurate view of the PRAM model relations, Table 3.1 summarizes the results of various simulations [6]. The last column of the table reports the slowdown factor for each case (the time of the simulation over the time of the original execution). The upper part of the table assumes no memory constraints, while the last three rows refer to simulations where the models (both the original and the simulating) use only m shared memory cells and the input is initially distributed among the processors' local memories. The comparisons highlight the speed of each model. The slowdown suffered in each case depends on the number of processors and not –directly– on the input size. Notice that the differences between the CW models are sub-logarithmic, i.e., they are much smaller than the logarithmic slowdown of the EW model. Also, the memory constraint can lead to significant time losses (in the EREW case it results in a potentially linear slowdown).

3.1.2 Simulations between PRAMs

To better understand the aforementioned simulation techniques, we will study the most prominent of them in the following theorems. We begin by making clear that each of the described techniques allows the simulation of the entire functionality of a PRAM model, and not of some specific algorithm. In other words, given two models M1 and M2, we are interested in a generic technique to support the execution of any M2 algorithm on the M1 machine. Most often, each technique employs certain parallel algorithms to simulate the read/write operations of the original model. Naturally, we examine the amount of resources required to simulate a model with a weaker model (the converse simulation is straightforward).


SIMULATION: priority-CRCW on EREW and CREW

The next theorem [20] establishes that any priority-CRCW algorithm with running time T(n) on p processors can be simulated by an EREW algorithm with running time Θ(T(n) · log p) on p processors. Since the only difference between the two models lies in the accessing of the shared memory, we are interested only in simulating a concurrent read –or write– operation with the EREW model. The remaining operations will be executed by the simulating algorithm in their original form (instruction copy).

Theorem 3.1.1. A concurrent read (or write) operation of a p-processor priority-CRCW PRAM can be implemented on a p-processor EREW PRAM to run in O(log p) time.

Proof. Let Q1, . . . , Qp and P1, . . . , Pp denote the processors of the priority-CRCW and the EREW PRAMs, respectively. Each Pi will simulate the operations of Qi. Moreover, we use the idea that the processors P1, . . . , Pp can cooperate to find out which Qj has the priority to write (read) on a memory cell by sorting a list of the Q1, . . . , Qp concurrent requests in Θ(log p) time. For simplicity, we assume here a SIMD algorithm where the conflict resolution procedure is started only when the program counter reaches a read/write operation. Note that for MIMD algorithms the slowdown remains Θ(log p) even with a naive application of the conflict resolution procedure: we execute the procedure at each step of the algorithm, regardless of which processors actually access the shared memory, and then continue with the next non-read/write instruction.

In a priority-CRCW PRAM each processor Qi has a unique IDi (log p bits). In an EREW we can also assign IDs, and we can reserve memory locations M1 to M2p to be used only during the conflict resolution procedure. Let us assume that during the CRCW computation Qi requests memory location Mf(i). To start the conflict resolution procedure, each EREW processor Pi

will append its IDi to its requested address Mf(i) and will store the resulting value <Mf(i), IDi> in Mi. Next, all the EREW processors will execute a parallel mergesort algorithm in Θ(log p) time [20] to sort the list formed in memory locations M1 to Mp. In this way, the requests are grouped according to their addresses (namely Mf(i)), and the first (leftmost) member of each group corresponds to the lowest IDi among the processors that requested Mf(i), i.e., to the one with the highest priority for accessing Mf(i). Afterwards, each EREW processor Pi will read the locations Mi−1 and Mi in two steps, in order to determine whether there is an access conflict (same address request). If not, Pi can perform the access described in Mi. Clearly, the accesses chosen by this procedure correspond exactly to the accesses performed by the CRCW processors. Memory locations Mp+1 to M2p can serve as a buffer for communicating the request data between the requesting processor Pi and the processor Pj which actually executes the request. For example, in a write operation Pi will temporarily store its data in Mp+i and afterwards, if Pi gets the highest priority, the data will be forwarded to Mf(i) by Pj (which, besides Mf(i), knows IDi and thus can locate Mp+i).

The running time of the above simulation is determined mainly by the mergesort execution time Θ(log p). The remaining operations can be completed in O(1) time, as the arithmetic computations and the consequent accesses to the memory locations M1 to M2p require constant time (by construction of the simulation, no conflicts delay these accesses).
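A sequential Python sketch of the conflict-resolution idea (sorting request pairs and letting only the first request per address go through; the parallel mergesort and the buffering details of the proof are omitted):

def resolve_priority_writes(requests, memory):
    """requests: list of (pid, address, value), one per processor.
    Only the lowest-pid request per address is applied, as in the priority-CRCW."""
    tagged = sorted((addr, pid, val) for pid, addr, val in requests)
    for i, (addr, pid, val) in enumerate(tagged):
        first_of_group = (i == 0) or (tagged[i - 1][0] != addr)   # compare with the previous cell
        if first_of_group:
            memory[addr] = val
    return memory

mem = {}
reqs = [(3, "A", "x3"), (1, "A", "x1"), (2, "B", "x2"), (5, "B", "x5")]
print(resolve_priority_writes(reqs, mem))   # {'A': 'x1', 'B': 'x2'}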

Using the same idea, any CREW algorithm can be simulated by an EREW within the same logarithmic bound. Moreover, since an EREW algorithm can be executed by a CREW PRAM with the exact same operations (no slowdown), we deduce that the CREW can simulate the priority-CRCW with at most a logarithmic slowdown [20].

SIMULATION: priority-CRCW on random-CRCW

The next theorem [20] [6] shows that the computational power of the random-CRCW is quite similar to the power of the priority-CRCW. Their only difference lies in the writing of the shared memory, where the priority model grants the conflicting cell to the processor with the lowest ID, while the random model grants it to an arbitrary one of the contenders.

Theorem 3.1.2. A concurrent write operation of a p-processor priority-CRCW PRAM can be simulated on a p-processor random-CRCW PRAM in O(log log p) time.

Proof. We prove that resolving the conflict of an arbitrary number of random-CRCW processors that try to access a cell MC requires O(log log p) steps.

Assume that the p processors are partitioned into √p groups of √p processors each. In the first step, the processors of group Gj that request an access to the cell MC write their IDs into cell Mj (as usual, we reserve a number of shared cells for use only in the conflict resolution procedure). A random processor Pj,i of each group Gj –or none– will succeed in this operation. In the second step, all the involved processors read their group cell Mj to identify Pj,i. This concludes the first phase of the simulation. For the remaining phases of this simulation, Pj,i will be the representative of Gj.

In the second phase, we have a group GR of the √p representative processors, along with the √p groups of the first phase (missing their representatives). In a recursive fashion, the processors of each of these groups are partitioned into √√p sub-groups and, by following the same steps as in phase 1, new representatives are selected. Again, all groups operate in parallel. The recursion continues until the processors contract into groups of two.

At this point, it is easy to decide which one of the two processors has the lowest ID (thus, the priority) and "shut down" the other. Following the recursion backwards, when a representative is "shut down", so are all the members of its group. On the other hand, if a representative is prioritized, at each backward step we compare its ID to the single other ID of its group that has survived the procedure up to that point (by working in parallel with the representative). Such a comparison requires O(1) time. Fewer and fewer representatives stay "alive", until only two processors reach their starting group Gj and decide the final winner of the procedure, i.e., the lowest ID which tried to access MC.

The problem described above is widely referred to as LEFTMOST PRISONER (find the leftmost processor participating in a specific conflict). The running time of the proposed solution clearly satisfies the recurrence relation T(p) = T(√p) + O(1). Therefore, T(p) = O(log log p).

Note: it is possible that we have more than one conflict at each priority-CRCW step. Therefore, the above procedure should be executed more than once (once for each conflicting cell). However, all of these conflict resolutions can be executed in parallel, without increasing the simulation time computed above. Since each processor is interested only in its own target cell, it will participate in only one conflict resolution, as follows. The reserved shared memory is divided into m parts, one for each original memory cell, each one sufficiently large to accommodate the execution of the above described procedure. Each processor requesting a shared cell Mx writes its ID in the reserved part which corresponds to the specific Mx conflict resolution procedure; it continues working within this specific part and at the end of the procedure it erases its ID (allowing the correct simulation of the next step of the priority-CRCW).
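The doubly logarithmic behaviour comes solely from the recurrence T(p) = T(√p) + O(1); the following small Python sketch counts these recursion levels (it does not simulate the machine itself).

import math

def recursion_levels(p: int) -> int:
    """Number of times we can take square roots of p before groups of size 2 remain."""
    levels = 0
    while p > 2:
        r = math.isqrt(p)
        p = r if r * r == p else r + 1   # ceil(sqrt(p))
        levels += 1
    return levels

for p in (16, 256, 65536, 2**32):
    print(p, recursion_levels(p), round(math.log2(math.log2(p)), 2))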


SIMULATION: priority-CRCW on common-CRCW with more processors

As expected, the simulation of a model can consume much less time if we use more than p processors (i.e., more than the original number of processors). In some cases, the weaker model can then simulate the stronger model with only a constant factor slowdown. The next theorem [20] [6] gives proof for such a case:

Theorem 3.1.3. A concurrent write operation of a p-processor priority-CRCW PRAM can be simulated on a (p log p)-processor common-CRCW PRAM in O(1) time.

Proof. We prove that the simulating model can solve the LEFTMOST PRISONER problem (i.e., find the leftmost processor participating in the conflict) for the p original processors in O(1) time. With the use of the last note of the proof of theorem 3.1.2, all conflicts arising in the simulation can be resolved in parallel, by allowing each processor to join the LEFTMOST solution procedure of its interest.

Assume that a number of priority-CRCW processors issue a request for the same cell MC. The common-CRCW processors resolve the conflict as follows. First, let us visualize a binary tree T whose leaves correspond to the p original processors. Each node of this tree T is initially assigned "0". However, if an original processor pi has requested MC, then we assign "1" to the corresponding leaf li. Moreover, we assign "1" to each ancestor aj,i of a leaf li if li is assigned "1" and is located in the left subtree of aj,i. Clearly, this coloring procedure leads to the safe conclusion that li is the leftmost leaf marked "1" if and only if every ancestor of li which contains li in its right subtree is marked "0".

Fortunately, all the above assignments can be done in parallel. Therefore, we associate log p simulating processors with each original processor and program them to place the corresponding marks on the tree T concurrently (each node of the tree T corresponds to a reserved shared cell of the common-CRCW). Specifically, each "slave" processor is associated with a specific node of a specific level of the tree T and with a "master" original processor (i.e., a leaf). In the first phase of the conflict resolution, each slave reads its master's request and marks the corresponding node of the tree T according to the aforementioned rule in O(1) time. Note here that the only possible write attempts on a shared cell regard 1's and, therefore, the common-CRCW policy will permit the write operation independently of the number of processors involved in it. In the second phase, each slave si associated with a node ni containing its master pi in the right subtree reads ni. If ni is marked "0", then the slave attempts to write "OK" at a specific cell Mi, which will afterwards be read by processor pi. On the contrary, if ni is marked "1", the slave attempts to write "notOK". Since we use a common-CRCW, the "OK" value will be written only if all slaves agree, i.e., if the master under examination is the leftmost (prioritized) processor. Clearly, both phases require O(1) time.
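A sequential Python sketch of the tree-marking argument (it checks the characterization used in the proof, rather than emulating the O(1)-time parallel phases):

def leftmost_requester(requests):
    """requests[i] is True if leaf i requested the cell; p = len(requests) is a power of two.
    Returns the leftmost requesting leaf, found via the marked binary tree."""
    p = len(requests)
    marked = {}                        # node -> 1 if its LEFT subtree contains a requester
    # nodes encoded heap-style: root = 1, children of v are 2v and 2v+1; leaf i = p + i
    for i, req in enumerate(requests):
        v = p + i
        while v > 1:
            parent = v // 2
            if req and v == 2 * parent:           # leaf i lies in the left subtree of parent
                marked[parent] = 1
            v = parent
    for i, req in enumerate(requests):
        if not req:
            continue
        v, ok = p + i, True
        while v > 1:
            parent = v // 2
            if v == 2 * parent + 1 and marked.get(parent, 0) == 1:
                ok = False                        # a requester exists to the left of leaf i
            v = parent
        if ok:
            return i
    return None

print(leftmost_requester([False, False, True, False, True, True, False, False]))   # 2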

The above simulation is not considered "efficient", since the number of processors has to increase by a factor of log p (also, the shared memory has to increase from m to Θ(pm)). There exists a more efficient simulation technique [20], which requires only p processors and O(log p / log log p) time. Note that this improved technique –which leads to the result reported in table 3.1 [6]– is characterized by a smaller processors × time product compared to the technique described in theorem 3.1.3.

SIMULATION: priority-CRCW on random-CRCW with limited memory

In general, when we limit the shared memory of the simulating model, the simulation requires significantly more time. The next theorem [6] studies such a case, where the simulation time increases exponentially when compared to the unrestricted simulation.

Theorem 3.1.4. The simulation of a concurrent write operation of a p-processor priority-CRCW PRAM with m shared memory cells requires Ω(log(p/m)) steps on a p-processor random-CRCW PRAM with m shared memory cells.

Proof. We give a lower bound for the solution of a fundamental problem on the examined model. Assume that the p original processors are divided into m equal-sized groups and that each group Gi tries to access cell Mi. The simulating model must identify the highest-priority processor of each group. This problem is a restricted version of the LEFTMOST WRITER problem (a generalization of the LEFTMOST PRISONER problem where more than one conflict occurs).

Initially, there are Ω((p/m)^m) candidate solutions: within each group, any member can be the leftmost processor of the group. We can envisage the random-CRCW solution as a step by step procedure where a number of candidate answers is eliminated until the final solution is reached. We will show that at each step, the total number of possible answers is decreased by a factor of at most 2^(2m). Therefore, no algorithm can perform fewer than Ω(log(p/m)) steps to determine the final solution of the problem.

We will use the idea of an "adversary" for the following worst-case analysis. Each processor in each Gi must have value either 0 or i (by problem definition). Our adversary will fix the values of the processors and the contents of the shared cells at each step, by choosing the processors that will succeed in their write operations (we examine a random model, thus any processor could be chosen in practice, unpredictably). More specifically, the goal of the adversary is to fix the values of certain processors after each step in a way that allows any free processor (not yet fixed) to be the leftmost writer in its group. To accomplish this, the adversary never fixes a processor –except with value 0– as long as there is a free processor of lower index in the same group. The algorithm cannot terminate unless every free processor is fixed.

Initially, the rightmost processor of each Gi is fixed to i and all other processors are left free. In the remaining steps, as the processors execute their instructions, the adversary monitors their requests and allows write operations to the shared memory as follows. It allows, if possible, only fixed processors to complete their requests. Otherwise, it allows –and fixes– a processor by choosing only from the right half of the free processors of a group. Accordingly, the number of possible answers within such a group decreases by at most a factor of 2, hence by at most 2^m over all m groups. Moreover, the adversary fixes the values of the left-half free processors of such a group at "0", to ensure that they will not write at the remaining unfixed cells. This move decreases the number of possible answers by at most another factor of 2^m. By combining these two factors, we conclude to the alleged 2^(2m) reduction factor.

3.2 Distributed Memory Models

The same features that make the PRAM an attractive model for the programmer, unfortunately, also make the PRAM unattractive for the fabrication engineer. Its global memory is difficult to implement in hardware. Instead, the distributed memory models are widely used in the industry (especially the fixed-connection network models). Consequently, it is rational to develop simulation techniques that allow the execution of PRAM algorithms on the distributed memory machines. Moreover, similar to the PRAM comparisons of the previous section, we are interested in comparing the distributed memory models in terms of computational power.

3.2.1 The simulation of a shared memory

Various techniques have been developed for simulating the shared memory of the PRAM with a distributed memory model [19]. Naturally, one can think of ways to distribute the shared data among the local memories of the processors, which will communicate during the computation to exchange the required data via messages. The data distribution might involve randomized and/or deterministic hash functions. Moreover, it might involve data replication (multiple copies stored in distinct processors). In any case, the major problem encountered in such simulations is the contention of the network.

In the remainder of this subsection we describe two distinct simulations of the PRAM. First, we consider a p-node Butterfly network using the same number of processors as the PRAM. We show that the Butterfly slowdown is at most logarithmic in p. Second, we examine the case of a BSP machine with sufficiently fewer processors than the PRAM. The slackness of the BSP results in an optimal simulation with respect to the total number of operations. In both examples, we will use probabilistic analysis, and thus the results are given in terms of expected values (which hold with high probability).

SIMULATION : PRAM on a Butterfly Network

Consider a p-node Butterfly Network (BN) and a p-processor PRAM which includes a shared memory of m cells [19]. We distribute, randomly and evenly, the m cells to the p processors of the BN by using some predetermined, pseudorandom, hash function h : [1,m] → [1,p]. Each node Ni of the BN will simulate the operations of a distinct processor Pi of the PRAM.

Assume that Pi will access, during step t of the PRAM execution, the shared cell f(i,t). Hence, during the simulation, Ni will send a message to the node h(f(i,t)). The problem now reduces to the routing of such packets with the smallest possible delay. We will use the straightforward approach of sending the packet to the first column of the BN and, afterwards, to its destination (in case of a read operation, a response packet will be sent back). Assuming an EREW PRAM, the above greedy routing algorithm results, with high probability, in an average O(log p) slowdown for each PRAM step.


Moreover, by carefully distributing the queues along the nodes of each BN row, the routing algorithm requires only constant-size queues, at the expense of some extra constant factor slowdown [19].

The above technique works equally well for CRCW PRAM simulations, provided that we are allowed to combine (e.g., concatenate) the data destined for the same shared cell. Notice that, in any case, we use no data replication. Finally, we can speed up the simulation by introducing more nodes to the BN [19]. For instance, if we use a log p dimensional BN, which has log p times more connections than the ordinary BN (more bandwidth), then the simulation results in only Θ(1) slowdown.
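The hash-based placement of the shared cells can be sketched as follows in Python (the hash function and the request pattern are arbitrary choices for illustration; real implementations would use a carefully chosen pseudorandom family):

import random

p, m = 16, 1024                        # BN nodes and PRAM shared cells
seed = 12345
def h(cell):                           # home node of each shared cell (arbitrary hash)
    return (cell * 2654435761 + seed) % p

# one EREW PRAM step: processor i accesses some cell f(i, t)
requests = [random.randrange(m) for i in range(p)]
messages = [(i, h(cell), cell) for i, cell in enumerate(requests)]

# per-node congestion determines the routing delay of this step
load = [0] * p
for _, dest, _ in messages:
    load[dest] += 1
print("max load:", max(load))          # small with high probability for a random-like h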

SIMULATION : PRAM on BSP

The CRCW PRAM can be simulated optimally (up to constant factors) by the BSP model, if we assume constant bandwidth g and sufficient parallel slackness, i.e., a sufficiently large ratio of PRAM processors per BSP processor. In particular, [12] reports that if v = p^(1+ε) for any ε > 0, a v-CRCW-PRAM can be simulated on a p-BSP, with L ≥ log p, in time O(v/p), where v and p denote the number of processors of each machine. However, the constant multiplier which is hidden in the O(v/p) time bound grows as ε → 0. Such constants can be avoided, where possible, by using simpler solutions. The following simulation provides such a solution by using randomization and sufficient slackness.

Let us begin with some notation and assumptions. We shall say that a BSP algorithm is one-optimal in computation if it performs the same number of operations, up to a multiplicative factor of 1 + o(1) as n → ∞, as a specified corresponding sequential or PRAM algorithm. Also, a PRAM simulation is one-optimal in communication when T(n) read/write operations of a PRAM can be simulated in (2gT(n)/p)(1 + o(1)) time units on the BSP¹. Let us assume that, during the simulation, the PRAM processors can be evenly distributed among the BSP processors (v > p). Moreover, since we simulate a shared memory model with a distributed memory model, we will assume that there is a –computable– hash function h that will randomize and distribute the memory requests evenly among the p memory modules of the p-BSP (the contents of a memory address m will be mapped to h(m), which we can assume random). The next theorem [14] gives the required slackness for ensuring the optimality of the PRAM simulation on the BSP.

¹The factor of 2g is necessary since each read operation is implemented as a message to a distant memory block followed by a return message, and time g is charged per message.

Theorem 3.2.1. Let wp be any function of p such that wp → ∞ as p → ∞. Then the following amounts of slack are sufficient for simulating any one step of the EREW-PRAM or CRCW-PRAM on the BSP in one-optimal expected time for communication, and optimal time in local operations (if g = O(1)).

(i) wp · p · log₂ p EREW-PRAM processors

(ii) wp · p² · log₂ p CRCW-PRAM processors

Proof. We treat the two cases separately.

(i) wp · p · log₂ p EREW-PRAM processors. After the distribution of wp log p tasks to each processor of the BSP, these tasks are completed in one superstep. The duration of the superstep will be chosen large enough to accommodate the routing of a (1 + ε)wp log p-relation, thus allowing the choice of an L as large as (1 + ε)wp log p, for an appropriate ε > 0 that depends on the choice of wp.

Under the perfect hash function assumption, all references to memory are mapped to memory modules independently of each other, and thus the probability that a memory location maps to some specific module (say that of processor i) is 1/p. We then use the bound for the right tail of the binomial distribution: the probability that the number of successes X in n independent Bernoulli trials, each with probability of success P, is greater than (1+ε)nP, for some ε (0 < ε < 1), is at most Prob(X > (1+ε)nP) ≤ e^(−ε²nP/3). Thus, the probability that, among the wp·p·log p requests to memory, more than (1+ε)wp log p requests are directed to a specific memory module is at most p^(−wp^(1/3)), for, say, ε = √3·wp^(−1/3). Since there are p such modules, the probability that more than (1+ε)wp log p requests are directed to any of these is at most p^(1−wp^(1/3)) = p^(−Ω(wp^(1/3))). It follows that reads or writes can be implemented as desired, in two or one supersteps respectively.

(ii) wp · p² · log₂ p CRCW-PRAM processors. Each processor of the BSP will simulate wp·p·log p processors of the CRCW-PRAM. Assume that each simulated processor requests one datum from some other simulated processor. We will show how this communication can be realized with the BSP in two phases, each of (1 + o(1))wp·p·log p steps. Write requests (with various conventions for concurrent writes) can be simulated in a similar fashion, in just one phase.


Suppose we number the BSP processors 0, 1, . . . , p−1. The simulation consists of the following four parts.

A) Each BSP processor sorts the up to wp·p·log p read requests it holds according to the numbers of the BSP processors to which they are destined. This can be done in time linear in wp·p·log p using bucket sort.

B) Each BSP processor hashes the actual addresses in the destination module of each request into a hash table, to identify all such address (i.e., not module) collisions. This takes linear expected time. For each destination address one request is considered "unmarked" and the rest are "marked".

C) A round-robin algorithm is used to implement all the "unmarked" requests. In the k-th of the p−1 stages, processor j transmits all its messages destined for memory module (j + k) mod p, and receives back the values read.

D) The "marked" requests –which did not participate in (C)– are now given the values acquired by their "unmarked" representative in phase (C). This takes time linear in the number of requests per processor.

We can implement each of the p−1 stages of (C) in two supersteps, where L = g(1+ε)wp log p and ε = wp^(−1/3). This is because the probability that the up to wp·p·log p read requests from a fixed processor include more than (1+ε)wp log p addressed to any fixed module is then at most exp(−Ω(wp^(1/3) log p)), by an estimation similar to part (i). Since there are p² choices of the fixed processor and module, a bound of p²·exp(−Ω(wp^(1/3) log p)) = exp(−Ω(wp^(1/3) log p)) follows. Finally, we note that instead of doing the round robin in 2(p−1) supersteps we could do it in just two, allowing L to be as large as L = g(1+ε)wp·p·log p.
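A compact sequential Python sketch of parts (A)-(D) for a single BSP processor's read requests (Python's built-in sort stands in for bucket sort, and the round-robin stages are collapsed into a loop, since only the message pattern matters here):

def route_reads(requests, memory, p, me):
    """requests: list of (module, address) read requests held by processor `me`.
    memory[module][address] holds the stored values. Returns the fetched values."""
    order = sorted(range(len(requests)), key=lambda i: requests[i])   # part (A)
    rep = {}                                                          # part (B)
    for i in order:
        rep.setdefault(requests[i], i)          # first request per address is "unmarked"
    fetched = {}
    for k in range(p):                          # part (C): round-robin stages over the modules
        target = (me + k) % p
        for (module, addr), i in rep.items():
            if module == target:
                fetched[i] = memory[module][addr]
    values = [None] * len(requests)
    for i, key in enumerate(requests):          # part (D): marked requests copy the value
        values[i] = fetched[rep[key]]           # obtained by their unmarked representative
    return values

memory = {0: {"x": 10}, 1: {"y": 20}, 2: {"z": 30}}
print(route_reads([(1, "y"), (2, "z"), (1, "y"), (0, "x")], memory, p=3, me=0))
# [20, 30, 20, 10]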

3.2.2 Simulations between BSP and LogP

This subsection compares the computational power of the two most commonly used distributed memory models. BSP and LogP are similar in construction and can be viewed as closely related variants within the bandwidth-latency framework for modeling parallel computation. Consequently, it turns out that they can compute the same functions and that their power is almost equal in theory (BSP is slightly more powerful). Moreover, it is shown that both models can be implemented with similar performance on most point-to-point networks [13].


Before describing the cross-simulations of the two models, we give some technical clarifications. First, the presented schemes assume well-behaved LogP programs. That is, (i) every LogP processor sends/receives at most one message every g_LogP time steps (we write g_LogP and g_BSP to distinguish the g parameters of the two models), (ii) every message is received at most L time units after its departure, and (iii) the capacity constraints are fully complied with. In other words, no processor stalls. Second, we set the maximum number of messages that are simultaneously in transit to be equal to L/(2g_LogP) (instead of using the original L/g_LogP). The factor 1/2 is used to avoid unrealistic situations in the underlying machine [13]. Third, to better facilitate the comparison between BSP and LogP, we omit the overhead parameter o of the LogP (in this way, we have only three parameters² to take into consideration for both models). This is a convenient approximation technique (setting g_LogP = o) leading to a discrepancy of at most 2 [18] (practically, the contribution of o to the overall cost of an algorithm is embodied in the value of the parameter g_LogP).

²Note that both models use p to denote the number of their processors, they use g to somehow describe the communication bandwidth, and they also use the parameters s and L, respectively, which are related to the latency of the network (loosely and indirectly for s, which can only serve as a time bound for the barrier synchronization in BSP).

SIMULATION : LogP on BSP

The BSP can efficiently simulate the LogP with only a constant slowdown when the router parameters of the two models have similar values [13].

Theorem 3.2.2. LogP can be simulated by BSP with slowdown O(1 + g_BSP/g_LogP + s/L). When s = Θ(L) and g_BSP = Θ(g_LogP), the slowdown becomes constant.

Proof. The i-th BSP processor mimics the activities of the i-th LogP processor and uses its own local memory to represent the contents of the local memory of its LogP counterpart. Each superstep of the BSP simulates the effects of L/2 consecutive LogP steps. Such a superstep consists of two periods. In the first period, each BSP processor interleaves the reception of the messages sent to it during the preceding superstep with the local computation and message generation prescribed by the LogP program for the current superstep. Specifically, the processor receives incoming messages every g_LogP time units, until its input pool is exhausted, and executes local computation between consecutive receptions. In the second period, all messages generated in the first period are delivered to the intended destinations. Notice that, since the LogP program is well-behaved, in L/2 consecutive steps no more than L/(2g_LogP) messages are generated by any processor, and no processor can be the destination of more than L/(2g_LogP) such messages. This implies that in the first period of a simulation cycle each BSP processor has at most L/(2g_LogP) incoming messages to read, and that the second period involves the routing of an h-relation with h < L/(2g_LogP). Hence, the overall running time of a cycle is at most L/2 + g_BSP·L/(2g_LogP) + s. Considering that a superstep corresponds to a segment of the LogP computation of duration L/2, the slowdown stated in this theorem is established by simple division.
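The superstep cost in the proof translates directly into a small Python helper (the parameter values below are arbitrary, for illustration only):

def logp_on_bsp_slowdown(L, g_logp, g_bsp, s):
    """Cost of one BSP superstep simulating L/2 LogP steps, divided by L/2."""
    superstep_cost = L / 2 + g_bsp * L / (2 * g_logp) + s
    return superstep_cost / (L / 2)   # = 1 + g_bsp/g_logp + 2*s/L

print(logp_on_bsp_slowdown(L=64, g_logp=4, g_bsp=4, s=64))   # constant when s = Θ(L) and g_bsp = Θ(g_logp)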

SIMULATION : BSP on LogP

The LogP is slightly less powerful than the BSP: in theory, we get a logarithmic slowdown when we run a BSP algorithm on the LogP (again, the comparison is made when the router parameters of the two models have similar values). However, there is a notable range of values for which the slowdown is constant, i.e., the LogP is equally powerful to the BSP. Specifically, we know the following [13]:

Theorem 3.2.3. Any BSP superstep involving at most m local operations per processor and the routing of an h-relation can be simulated with LogP in time O(m + (g_LogP·h + L) · S(L, g_LogP, p, h)), where S(L, g_LogP, p, h) is at most O(log p). When g_LogP = Θ(g_BSP) and L = Θ(s), then S(L, g_LogP, p, h) is the slowdown of the simulation. Moreover, if h = Ω(p^ε + L log p) then S(L, g_LogP, p, h) = O(1).

Proof. Each LogP processor simulates the local computation and message transmissions of the corresponding BSP processor. The execution advances in stages, with each one of them corresponding to a superstep of the BSP. The two important points that have to be examined here are the simulation of the barrier synchronization and the routing of the h-relations. These are the only true requirements for the correct execution of the BSP algorithm, given of course that the LogP processors execute faithfully the local operations of the BSP processors.

We will make use of the Combine-and-Broadcast (CB) and Prefix algorithms of LogP. Assume that we are given an associative operator op and a number of input values (x0, x1, . . . , xp−1), with each one initially held by a distinct processor. CB returns op(x0, x1, . . . , xp−1) to all p processors, while Prefix returns op(x0, x1, . . . , xi) to the i-th processor, 0 ≤ i ≤ p−1. The use of the Prefix operation is instrumental to the routing of the BSP h-relation, while the CB is used for barrier synchronization purposes.

Barrier Synchronization

A simple algorithm for CB is obtained by viewing the p processors of LogP as the nodes of a tree of degree ⌈L/(2g_LogP)⌉. At the beginning of the algorithm, a leaf processor just sends its local input to its parent. An internal node waits until it receives a value from each of its children, then combines their values and forwards the result to its parent. Eventually, the root computes the final result and starts a descending broadcast phase³. Note that the algorithm complies with the LogP capacity constraint, since no more than ⌈L/(2g_LogP)⌉ messages can be in transit to the same processor at any time.

Upon completion of its own activity for the stage, a processor enters a

boolean "1" as input to a CB computation with boolean AND as the associative operator op. A stage terminates when CB returns "1" to all processors. More specifically, the barrier can be implemented by interleaving the routing algorithm with executions of CB which count the number of messages sent and received by all processors (the barrier establishes that all the messages of the h-relation have reached their destination). Given the depth of the CB tree, the running time of this algorithm is Tsync = O(L·log p / log(2⌈L/(2g_LogP)⌉)).

³Prefix admits a very similar implementation to CB, the only difference being that an internal processor has to store the values received during the ascending phase of the messages. These will be used by the processor in the descending period, in combination with the value received from its parent, to determine the values to be sent to its children. Prefix and CB have the same running time.
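A sequential Python sketch of the CB combine phase on a d-ary tree, counting the number of tree levels traversed by the ascending (and, symmetrically, the descending) wave; here op is boolean AND, as used for the barrier, and d stands in for ⌈L/(2g_LogP)⌉:

def combine_and_broadcast(values, d, op=lambda a, b: a and b):
    """Combine `values` level by level on a d-ary tree and return (result, levels)."""
    levels = 0
    layer = list(values)
    while len(layer) > 1:
        layer = [_fold(layer[i:i + d], op)      # every group of d children is merged by its parent
                 for i in range(0, len(layer), d)]
        levels += 1
    return layer[0], levels                     # the broadcast phase retraces the same levels

def _fold(group, op):
    acc = group[0]
    for x in group[1:]:
        acc = op(acc, x)
    return acc

flags = [True] * 63 + [False]                   # one processor has not finished its stage yet
print(combine_and_broadcast(flags, d=4))        # (False, 3): ceil(log_4 64) = 3 levels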

Routing of h-relations

The difficulty here is to avoid violating the capacity constraint of the LogP, which is not imposed by the BSP. We will devise a mechanism to decompose the h-relation into sub-relations of degree at most ⌈L/(2g_LogP)⌉. Before continuing, we note that we can find the maximum r of the ri over all processors in time TCB, where ri denotes the number of messages sent by processor pi during the current superstep: the processors execute the CB described previously with op = max and xi = ri (assuming pi knows ri).

During the simulation on the LogP, the routing of an h-relation consists of the following two phases:



(A) Sorting. Each processor creates a number of dummy messages (with nominal destination p) to make the number of messages held by each processor equal to r. Then, the p·r messages are sorted in order of increasing destination, so that at the end each processor holds r consecutive messages of the sorted sequence. Such a procedure requires Tsort = O((g_LogP·r + L) log p) time when executed in LogP by using, for instance, the AKS network [13]. This time bound can be reduced to O(g_LogP·r + L) for large values of h (e.g., when r = p^ε) by using, for instance, Cubesort [13].

(B) Message Delivery. After the sorting phase, messages destined to the same processor are adjacent in the sorted sequence and, therefore, form a subsequence held by processors indexed by consecutive numbers. At this point, we will make use of the aforementioned Prefix computation⁴ to assign consecutive ranks to the messages destined to the same processor. Then, each processor partitions its messages into two sets. The first set contains all messages belonging to subsequences starting within the processor (i.e., the processor holds the message with rank one in the subsequence). The second set contains all remaining messages. Note that, in each processor, the second set contains only messages for a single destination. These messages are part of long subsequences spanning more than one processor. The actual routing of the messages is done in two stages (dummy messages are discarded). In the first stage, each processor sends the messages belonging to the first set in an arbitrary order, one message every g_LogP steps. The capacity constraint is never violated, since no two processors send messages for the same destination. In the second stage, the processors send the messages belonging to the second set according to their rank. If a processor has one message of rank i, it sends this message at time g_LogP·i. Both stages require time Tdelivery = O(g_LogP·h + L·log p / log(2⌈L/(2g_LogP)⌉)).

To sum up [13], the overall time T to simulate a BSP superstep in LogP is T = O(m + Tsync + Tdelivery + Tsort). Replacing the previously computed running times leads to the analytic form given in the statement of the theorem. Moreover, we get an explicit expression for S(L, g_LogP, p, h), which in all cases is O(log p), and for sufficiently large h is constant.

⁴To be precise, we use the "segmented prefix" [19].


3.3 Boolean Circuits Simulations

The model simulation survey takes us, inevitably, to the Boolean circuits. In fact, the circuits will be at the basis of our study for the remainder of this thesis. At first sight, the Boolean Circuit model might seem quite different in nature from the rest of the models mentioned so far. The PRAM (as well as the BSP, the LogP, etc.) consists of processors and memory cells, i.e., it uses structural units of much greater complexity than the simple AND/OR/NOT gates. Moreover, a PRAM machine is a single, finite, object (a program) solving some specific problem. That is, when given a PRAM algorithm, we can modify it offline (e.g., by paper and pencil) to generate an equivalent program for any other model. Every simulation presented so far fulfills this purpose. Instead, an "algorithm" designed for the circuit model is not a single object. Rather, it is an infinite family of objects-circuits (recall that a processor can operate on inputs of variable length, while a circuit Cn can process only fixed-length inputs). In general, it is impossible to modify offline an infinite number of circuits in order to develop a program for another machine. So, how can we compare a model with such distinct characteristics to a processor-based model like the PRAM?

Fortunately, we can deal with both of the above peculiarities of the circuit model. Let us explain intuitively before studying the corresponding simulations. First, consider that the processors and the memories that we use in our everyday life are all made of ordinary circuits. Hence, it should be clear that no processor possesses greater power than some predefined number of –carefully interconnected– boolean gates. Second, and most important, recall the uniformity constraint that we can impose on each circuit family. A uniform circuit is defined by a single, finite, object: the Turing Machine describing its structure (a constructor). Consequently, any simulation involving circuits turns, in practice, into the modification of the circuit constructor, or into the incorporation of the constructor in a PRAM program, or into the design of a constructor "following" the instructions of a PRAM program. Of course, a comparison among models (or algorithm complexities) requires further analysis regarding the characteristics of the constructed circuit (depth, size, etc.). Notice that the PRAM was used in the above discussion only as an example and that the circuit simulations can involve any other model of computation. Indeed, we begin with a comparison to the most common model, the Deterministic Turing Machine.


3.3.1 Equivalence to the Turing Machine

We will establish here a relation of the boolean circuit model to the deterministic Turing machine (DTM). More specifically, we will show that by imposing certain constraints on both models we get two equal models in terms of computational power. That is, the set of functions computed by DTMs in polynomial time is equal to the set of functions computed by logspace uniform circuits of polynomial size. The equivalence follows from the two theorems below [5] [23]:

Theorem 3.3.1. Every total function f : B* → B* computed by a polynomial-time DTM can be computed by a logspace uniform circuit family C of polynomial size.

Proof. Let M_f be the DTM which computes f. It suffices to show that there exists a logarithmic-space DTM M_c, which on input 1^n will output the circuit C_n, which on input x will output the value M_f(x), |x| = n.

We construct M_c as follows. The description of M_f is incorporated in M_c, together with the running time of M_f (a priori known). Upon input 1^n, M_c will construct C_n based on the computation table^5 of M_f. On this table, the value of each cell T_{r,j} depends only on the cells T_{r-1,j-1}, T_{r-1,j}, T_{r-1,j+1} of the previous row (including the state). Since all of these values can be represented with a constant-length identifier, M_c can construct constant-size circuits to compute the value of T_{r,j} (by constant we mean independent of the input x). The construction requires only the description of M_f. These small circuits will be connected in a cascade fashion to cover the entire computation table of M_f (we copy them cell after cell, row after row). The depth of the entire, cascaded, circuit is that of the running time of M_f, and it depends only on the length of the input 1^n (it can be calculated).

The cascade described above is actually the requested circuit C_n. The input of the cascade will be the input x of M_f and the output will be exactly M_f(x), because the circuits will perform, level after level, the same operations as M_f. The DTM space cost for constructing C_n is logarithmic in n.

^5 A table representing the entire computation of the DTM [5]. Each row r corresponds to the configuration of the DTM at step r of the computation. Each cell of the row depicts the contents of a corresponding cell of the DTM tape. Moreover, each row r includes a cell depicting the state of the DTM at step r; the position of this cell in the row corresponds to the position of the DTM head on the tape. Notice that we use blank padding to fix the length of the configurations (so that we can combine them to form the table of the computation).


The required variables are used only for positioning within the computation table which, by definition, is of polynomial size in n.
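To make the tiling idea concrete, here is a minimal Python sketch (not part of the thesis) of the locality property that the construction exploits: every cell of a row depends only on a constant number of cells of the previous row, so one constant-size "gadget" copied over the whole table simulates the machine. The gadget below operates on tableau symbols instead of individual AND/OR/NOT gates, and the toy machine (a unary incrementer with head marker 'q') is purely hypothetical.

    # A minimal sketch of the tableau ("computation table") locality used in the
    # proof: cell (r, j) is computed from the three cells (r-1, j-1..j+1), so the
    # whole table can be tiled with copies of one constant-size gadget. Here the
    # gadget is a Python function rather than an AND/OR/NOT sub-circuit, and the
    # toy machine (a unary incrementer) is hypothetical, not from the thesis.

    BLANK = '_'

    def local_rule(left, mid, right):
        """Constant-size 'gadget': computes one cell of the next configuration
        from the three cells above it (this toy rule happens to ignore `right`).
        A head marker 'q' moves right, turning the first blank it meets into '1'."""
        if mid == 'q':          # the head leaves this cell: it becomes a 1
            return '1'
        if left == 'q':         # the head arrives from the left
            return 'q' if mid != BLANK else '1'
        return mid              # far from the head: the cell is copied unchanged

    def build_table(x, steps):
        """Tile the gadget over a (steps+1) x width table, padded to a fixed
        width, exactly like the cascade of small circuits in the proof."""
        width = len(x) + steps + 2
        row = list('q' + x) + [BLANK] * (width - len(x) - 1)
        table = [row]
        for _ in range(steps):
            prev = [BLANK] + table[-1] + [BLANK]          # pad the borders
            table.append([local_rule(prev[j], prev[j + 1], prev[j + 2])
                          for j in range(width)])
        return table

    if __name__ == '__main__':
        for r in build_table('111', steps=5):
            print(''.join(r))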

Theorem 3.3.2. Every total function f : B* → B* computed by a logspace uniform circuit family C of polynomial size can be computed by a polynomial-time DTM.

Proof. Let M_c be the logspace DTM that computes the circuit family C. We design the DTM M_f to compute the function f in polynomial time.

The description of M_c is incorporated in M_f. On input x, M_f computes a unary representation 1^|x|, which it uses to simulate M_c(1^|x|) and to obtain a description of the circuit C_|x|. It then simulates C_|x| on input x to compute its output (it simply computes the value of each gate by looking up the description of C_|x|).

The unary representation is computed in time polynomial in |x|. Also, since the size of the circuit is polynomial in |x|, the evaluation of C_|x| is polynomial in the length of the input x.
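The evaluation step of the above proof amounts to a single pass over the gate list of C_|x|. The sketch below illustrates this pass in Python; the list-of-gates encoding is a hypothetical choice made here for illustration, since the thesis does not fix a concrete description format.

    # A minimal sketch of the evaluation step in the proof of Theorem 3.3.2:
    # once the description of C_|x| is available, each gate is evaluated by a
    # single look-up pass over the (hypothetical) gate list.

    def eval_circuit(gates, x):
        """gates: list of (gate_type, operand_indices); an operand index i < len(x)
        refers to input bit x[i], an index >= len(x) refers to gate i - len(x).
        Gates are assumed to be listed in topological order."""
        values = list(x)                       # input bits first, gate outputs appended
        ops = {'AND': all, 'OR': any, 'NOT': lambda v: not v[0]}
        for gate_type, operands in gates:
            values.append(int(ops[gate_type]([values[i] for i in operands])))
        return values[-1]                      # by convention, the last gate is the output

    # Example: the circuit (x0 AND x1) OR (NOT x2) on input x = 101
    gates = [('AND', [0, 1]),                  # gate 3
             ('NOT', [2]),                     # gate 4
             ('OR',  [3, 4])]                  # gate 5 (output)
    print(eval_circuit(gates, [1, 0, 1]))      # prints 0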

Theorems 3.3.1 and 3.3.2 prove an equivalence which, in the way stated above, affects only those problems within the complexity class P. However, we know that the result can be extended as follows: DTM time is polynomially related to uniform circuit size [24]. By taking a closer look at the proofs of both theorems, we see that, by relaxing the polynomial size and the logspace uniformity constraints on our circuits, we can derive

DTIME(T) ⊆ UniformSize(T log T) ⊆ DTIME(T log^3 T)

An important implication here is that every language in P has a polynomial-size circuit. A similar result establishes that TM space is polynomially related to uniform circuit depth [24], i.e.

NSPACE(S) ⊆ UniformDepth(S^2) ⊆ DSPACE(S^2)

In the general case, we cannot assert that the two models –circuits and TM– are equally powerful^6. For instance, completely removing the uniformity constraint allows even some polynomial-size circuits to decide non-recursive

^6 Any language L ⊆ {0, 1}* can be decided by a (non-uniform) family of circuits of size O(n·2^n): it suffices to implement the characteristic boolean function of L (e.g. via its disjunctive normal form). This implies that the set of circuit families is uncountably infinite (compare this to the set of TMs, which is countably infinite).


languages [38]. In this direction, note that although (non-uniform) polynomial circuits have such exquisite power, they cannot decide all recursive languages. In fact, we know that most languages do not have polynomial circuits (by a counting argument) and, moreover, we can show that there exists some language decided by an exponential-space DTM which has no polynomial-size circuit [5]. Overall, it is conjectured that NP-complete problems do not have polynomial circuits, uniform or not [5]. More on circuit complexity will be covered in the next chapter.

3.3.2 Equivalence to the PRAM

This subsection compares the computational power of the logspace uniform circuits to that of the CREW PRAM. We begin by giving a lemma, which highlights the way we simulate a specific bounded fan-in circuit (not a family) with a CREW PRAM [20].

Lemma 3.3.3. A bounded fan-in circuit of size S_n and depth D_n, where n denotes the size of its input, can be simulated by a CREW PRAM with S_n processors and S_n common memory cells in O(D_n) time.

Proof. Let us assume that any logic gate has at most k inputs (bounded fan-in). We will associate each of the S_n PRAM processors to a distinct logic gate of the circuit ("1-1" correspondence). We will do the same for the S_n memory cells of the PRAM (again, "1-1" correspondence with the gates). That is, we will enumerate the gates of the circuit –starting from the first level of the DAG– so that processor P_i will simulate the behavior of gate g_i and will write the output of g_i to the PRAM cell M_i.

Specifically, since the description of the circuit is known to us, we can program processor P_i to read successively the contents M_j from all cells which correspond to the gates g_j forwarding data to g_i. P_i will then execute the operation described by g_i (e.g. AND) and will write the outcome to the memory location M_i. Note that P_i is the only processor that will write to M_i (by definition of boolean circuits) and thus, the exclusive-write property of our PRAM suffices. Also, P_i can read from any common cell without conflicts due to the concurrent-read property of our PRAM.

A processor can be invoked either by using a FORK operation (executed from one of its predecessors in the DAG) or by using a private counter (with a value depending on the time required by its predecessors to finish). Notice that the FORK approach increases slightly the execution time, because the


first group of processors (which will read the input of the PRAM) will have to be invoked explicitly (e.g. in log n steps). The last group of processors will write their results at the output of the PRAM.

The time required for any processor to run is O(k) = O(1). Since the depth of the circuit is D_n, at most D_n processors will be invoked in a serial fashion, resulting in O(D_n) execution time. The number of processors and the number of the common memory cells is clearly S_n.
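The following Python sketch emulates the processor-per-gate assignment of the lemma sequentially: cell M_i is written only by the "processor" owning gate g_i (exclusive write), any cell may be read by several processors (concurrent read), and invocation follows the private-counter option, with P_i waking up at round depth(g_i). The gate encoding is the same hypothetical format used in the earlier sketch.

    # A minimal sketch of the processor-per-gate idea in Lemma 3.3.3, emulated
    # sequentially: processor P_i owns memory cell M_i, reads the cells of its
    # gate's predecessors and is the only writer of M_i. Invocation uses the
    # "private counter" option: P_i fires at round depth(g_i).

    def pram_simulate(gates, x, Dn):
        n = len(x)
        M = list(x) + [None] * len(gates)              # shared cells: inputs, then one M_i per gate
        depth = [0] * (n + len(gates))
        for i, (_, operands) in enumerate(gates):
            depth[n + i] = 1 + max(depth[j] for j in operands)
        ops = {'AND': all, 'OR': any, 'NOT': lambda v: not v[0]}
        for step in range(1, Dn + 1):                  # O(D_n) rounds of "parallel" work
            for i, (gate_type, operands) in enumerate(gates):
                if depth[n + i] == step:               # P_i's private counter fires now
                    M[n + i] = int(ops[gate_type]([M[j] for j in operands]))
        return M[-1]

    gates = [('AND', [0, 1]), ('NOT', [2]), ('OR', [3, 4])]
    print(pram_simulate(gates, [1, 0, 1], Dn=2))       # prints 0, in 2 "parallel" steps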

As already explained, we are interested in more generic results, which involve entire families of circuits. We showed the equivalence of the logspace uniform circuits to a well-studied, sequential, model of computation (the polynomial-time DTM). We will prove here their equivalence to a parallel model consuming subpolynomial time, which captures efficient parallel computation. As it turns out, in order to prove their equivalence to such fast machines, we have to impose upper bounds on the depth of the circuits. Specifically, the following two theorems show that the logspace uniform circuits of polylogarithmic depth are equivalent to the CREW PRAM with a polynomial number of processors running in polylogarithmic time [1] [5].

Theorem 3.3.4. Every total function f : B* → B* computed by a logspace uniform family C of circuits with polylogarithmic depth and polynomial size, can be computed by a CREW PRAM in polylogarithmic parallel time with a polynomial number of processors.

Proof. The simulation consists of two phases. During the first phase, the PRAM uses n (the size of the input) to construct the description of the circuit C_n. During the second phase, the PRAM simulates C_n on input x.

Phase 1. Assume that M_c is the logspace DTM that describes the members of the family C. The description of M_c can be incorporated in our PRAM. Since n is given as input to the PRAM (by definition), it suffices to simulate M_c(1^n) in time O(log n).

Recall that M_c uses only O(log n) space. Therefore, we have only O(n^k) possible configurations of M_c, for some arbitrary k. We can assign each processor of the PRAM to a distinct configuration of M_c. Upon input, these processors will cooperate to construct the sequence of configurations, i.e. the computation path of M_c. They can do so efficiently, by using the parallel prefix algorithm of Ladner and Fischer [1]. The description of C_n can be extracted from this sequence by locating every configuration (i.e., every processor) that writes a bit to the output tape of M_c (virtually), and then, by doing parallel list contraction to produce a single string composed of these bits. All of the above operations can be performed in time O(log n). Moreover, the number of utilized processors is O(n^k).

Phase 2. With the description of C_n available, we use Lemma 3.3.3 to simulate C_n(x) on the PRAM. According to the lemma, since the depth of the circuit is polylogarithmic, the time required is also polylogarithmic. Also, since the circuit is logspace uniform, the size of the circuit is at most polynomial and thus, the number of the processors is at most polynomial.
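Phase 1 relies on a parallel prefix computation. As an illustration only, the sketch below shows the recursive-doubling flavour of the idea (the thesis cites the Ladner-Fischer algorithm; the simpler doubling scan shown here has the same O(log n) round count but is not their construction). Each round is one full parallel step, emulated with a sequential loop over all positions.

    # A minimal sketch of a parallel prefix (scan) via recursive doubling.
    # After round r, position i holds the op-combination of the last 2^r items
    # ending at i, so ceil(log2 n) rounds give all prefixes.

    def parallel_prefix(values, op):
        cur = list(values)
        stride = 1
        while stride < len(cur):
            nxt = [op(cur[i - stride], cur[i]) if i >= stride else cur[i]
                   for i in range(len(cur))]      # all positions update "in parallel"
            cur, stride = nxt, stride * 2
        return cur

    print(parallel_prefix([1, 2, 3, 4, 5], lambda a, b: a + b))   # [1, 3, 6, 10, 15]
    print(parallel_prefix(list('abcd'), lambda a, b: a + b))      # ['a', 'ab', 'abc', 'abcd']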

Theorem 3.3.5. Every total function f : B* → B* computed by a CREW PRAM in polylogarithmic parallel time with a polynomial number of processors, can be computed by a logspace uniform family C of circuits with polylogarithmic depth and polynomial size.

Proof. Let M_pram be the PRAM which computes f(x) on some arbitrary input x. Also, let p(n) be the number of processors utilized by M_pram and t(n) the time of the M_pram computation, n = |x|. We will use a similar technique to that used in the proof of theorem 3.3.1 (where a circuit simulates a DTM). That is, we will combine circuits designed to simulate distinct operations of a PRAM, in order to simulate one potential step of M_pram. We will then connect these small circuits in a cascade fashion, layer after layer, so that the output of one layer will become the input of the next layer. With each layer of this cascade corresponding to a single step of M_pram, the entire circuit will have t(n) layers to simulate, step after step, the M_pram computation of f(x).

Similar to the configuration of the DTM, the configuration of the PRAM is a tuple which contains all the information required to describe a single step of the PRAM computation: the program counters of all processors, the contents of their local memories (registers), and the shared memory cells [5]. As in theorem 3.3.1, we must fix the length of such a configuration at some worst-case value depending on n. Again, this is done because the simulating circuit will have an input of fixed size. Notice that the binary numbers (variables) used during the M_pram computation cannot exceed the length l(n) = n + t(n) + const, const > 0: at each step, M_pram can only increase the length of its variables by using add/subtract instructions (such an instruction –and any other possibly included in the machine– can only increase the operand length by 1). Also, we know that at most t(n) local cells will be used per processor, and at most t(n) · p(n) shared cells will be used by all p(n) processors^7.


The M_pram program is finite (a total of |Π| instructions) and thus, we only need a constant number of bits for each program counter. The above statements bound the size of every element of the PRAM configuration tuple. Consequently, the configuration can be encoded using C_b(n) = O(2 · t(n)p(n) · (n + t(n)) + p(n) log |Π|) bits, which in any case is polynomial in n. Since p(n) and t(n) can be efficiently computed by any DTM^8, the configuration tuple can be fixed at size C_b(n) for further processing by the DTM (it will be padded wherever necessary). Note here that any element within the configuration can be found simply by using counters of logarithmic size.

At this point, we must construct circuits to compute one M_pram configuration from another, i.e. we must construct one layer of our aforementioned cascaded circuit. This construction is somewhat more complex than the circuits developed in the proof of theorem 3.3.1: first, because the operations of a RAM are more complex than those of a DTM, and second, because the value of one element of the configuration does not depend only on three elements of the previous configuration. However, we can estimate the size and depth of the layer as follows. Recall that we have already bounded the PRAM variables to some polynomial in n (both their size and their number). The under-construction layer will be composed of 'small' circuits simulating distinct operations of a PRAM (ADD, SUB, etc). Such circuits can be designed with polynomial size and logarithmic depth [23] [1]. Specifically, for each one of our polynomially many M_pram processors, we will use a constant number of 'small' circuits able to simulate any potential step of the processor. Such a 'small' circuit will depend, at most, on a polynomial number of local memory cells –configuration elements– and thus, we will not exceed our polynomial size limit. This holds even for those circuits simulating the operations on the shared memory cells, which themselves depend on a polynomial number of processors. All these 'small' circuits (polynomially many) will be copied side-by-side to form a circuit layer. The exact output of this layer will be controlled (e.g. with multiplexers) by the program counter –configuration element– and its corresponding instruction of the PRAM (a known correspondence, independent of the input size).

^7 For the sake of simplicity we will assume that the available address space of M_pram is t(n) · p(n). However, this is not a prerequisite of this proof (there is a way around this problem by using feasible highly parallel algorithms during the simulation to dynamically allocate memory cells from a pool, each cell tagged with its own address) [1].

^8 To be precise, this is a condition of the theorem, i.e. p(n) and t(n) must be computable in logarithmic space upon input 1^n [5].


Overall, the layer will have polynomial size and logarithmic depth.

Finally, we return to our original goal: given M_pram, we must construct a logspace DTM M_c which on input 1^n will output the circuit C_n, which on input x will output the value M_pram(x), |x| = n. Since M_pram is given, we can incorporate its description (the program Π) in M_c. Moreover, we have shown above how M_c can bound the configurations of M_pram and construct a circuit layer. On input 1^n, M_c will start with the initial layer and, by copying it in a cascade fashion, it will build the description of the entire circuit C_n. The total number of layers is equal to the t(n) steps of M_pram (fixed at the worst-case value). As a result, C_n will be polynomial in size and polylogarithmic in depth (considering all of the above analysis). Note here that M_c can complete its computation in logarithmic space, because the required variables are used only for counting and indexing through the polynomial-sized configurations and the constant-sized description of M_pram.
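A small, purely illustrative sketch of the fixed-width encoding used above: every configuration element is padded to a worst-case width that depends only on n, so a circuit layer with a fixed-size input can locate any element by position. The field layout is a hypothetical choice made here; any fixed, easily indexable layout would do.

    # A minimal sketch of padding a PRAM configuration tuple to fixed width.
    # The layout (program counters, then registers, then shared cells) is a
    # hypothetical illustration of the C_b(n)-bit encoding in the proof.

    def encode_config(pcs, registers, shared, l, pc_bits):
        """pcs: program counter per processor; registers: per-processor register
        values; shared: shared-memory values. Widths depend only on n."""
        def pad(value, width):
            bits = format(value, 'b')
            assert len(bits) <= width, "value exceeded the worst-case bound"
            return bits.zfill(width)
        fields  = [pad(pc, pc_bits) for pc in pcs]
        fields += [pad(r, l) for regs in registers for r in regs]
        fields += [pad(s, l) for s in shared]
        return ''.join(fields)             # total length is fixed, polynomial in n

    # Two processors, two registers each, four shared cells, l(n) = 8, 3-bit counters:
    print(encode_config(pcs=[3, 5],
                        registers=[[2, 7], [0, 1]],
                        shared=[9, 0, 0, 4],
                        l=8, pc_bits=3))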

Theorems 3.3.4 and 3.3.5 imply that the two models –logspace uniform circuits and PRAM– are equally powerful under the constraint of polynomial work in polylogarithmic time. Moreover, their simulations reveal a close relation of the circuit depth to the PRAM time. Especially in the first simulation, we see that depth is related to time by some constant factor (the converse is not exactly true, as the structural unit of a circuit is less powerful than that of a PRAM).

The current section has shown equivalences between certain model variations. Such equivalences allow us to study parallel computation with a specific model and, afterwards, to draw generic conclusions. Furthermore, the current section introduced the use of resource bounds and constraints on our machines/circuits. Such restrictions facilitate the detailed study of a model by creating model variations with respect to its computational ability. This is a common technique used, among others, in complexity theory, which is presented in the following chapter.


Chapter 4

Parallel Complexity Theory

Parallel complexity theory is the study of resource-bounded parallel computation. It is the branch of computational complexity theory that focuses on the classification of problems according to their parallelization difficulty.

The issues raised here concern primarily the speedup gained from our parallel machines. Common questions driving the research in this direction are the following: –Are all problems amenable to fast solutions? –What is the minimum amount of resources required for solving an important problem in parallel? –Is it possible to achieve a dramatic speedup of a sequential solution while maintaining reasonable hardware cost? –Is there a way of transforming algorithmically a sequential solution to a highly parallel one?

Such questions offer insight into the nature of parallel computation. The answers lead to certain problem classifications with respect to the difficulty and/or the impossibility of designing efficient parallel solutions. Moreover, they assist the researchers in redefining 'good' models for parallel computation and identifying the resources of interest. It is worth mentioning that, analogously to the classical "P versus NP" conundrum, some of these questions remain open to this day. The theory itself establishes relations between parallel and sequential classes, contributing in this way to the evolution of the entire computational complexity theory.

This chapter is a survey of the most important results presented over the past four decades regarding parallel complexity. It presents common complexity classes, inclusion proofs, classifications of problems and equivalences of models (or classes). Further, it discusses some of the inherent limits of parallelization together with the conjectures of the research community. Hereafter, our study is based mainly on the boolean circuits and the Turing


machines; due to certain model equivalences, the circuits and the TM can cover most of the study in the field of parallel complexity (we also report the PRAM, or any other model, wherever it is related to our results).

4.1 Complexity Classes

One of the first issues raised in classical complexity theory is the distinction between feasible and infeasible solutions. It is important to identify those problems which have a practical solution (one that can be obtained within the available resources, even when the size of the input increases significantly) and those which have not. The community adopted the polynomial time bound to mark the territory of feasible computation. That is, a solution (algorithm) is considered feasible (efficient) if it requires at most polynomial time in the length of the input. A problem is called tractable if it has such a solution. The well-known class P was defined accordingly, to capture the notion of feasible sequential computation.

Similar to the classical approach, parallel complexity theory also makes its first distinction between feasible and infeasible solutions. In the context of parallel computation though, besides time cost, the feasibility of a solution depends also on its hardware cost (the required number of processors, gates, etc). It is natural to readopt here the polynomial bound for characterizing feasible parallel computation. That is, a parallel solution is considered feasible if its implementation requires at most polynomially many hardware resources, which operate for at most a polynomial amount of time. To defend the above choice intuitively, one could point out the following real-world analogy. Exponential (superpolynomial) functions grow fast enough to reach numbers exceeding both the lifetime of our universe and the number of particles that constitute it. In other words, it is impractical not only to wait for the termination, but even to start manufacturing an exponential-cost parallel machine in our universe. Note that the class of feasible parallel problems is equivalent to the class P. Imposing restrictions on the parallel time and on the hardware resources has led to the definition of certain parallel complexity classes, which are studied below.

When it comes to parallel time, the aforementioned polynomial bound seems rather modest for a theoretical study of feasible problems (their sequential time is already polynomial). In fact, the most interesting applications of parallelization are those resulting in dramatic time savings: as shown


for certain problems, the sequential and the parallel solution are separated by an exponential time gap (e.g. polynomial to logarithmic). In general, we consider as highly parallel those algorithms which require at most polylogarithmic parallel time in the size of their input. Significant effort has been spent in identifying those members of P which can be solved in a highly parallel fashion by consuming only a polynomial amount of hardware resources. We refer to such problems as feasible highly parallel [1] and we have defined several classes to capture and study them. The class definitions are based on certain parallel time or circuit depth bounds (e.g. specific logarithms), on hardware resource bounds (e.g. polynomial amount, uniformity, gate fan-in), on alternating TM space or time, on the number of alternations, etc. Probably the most notable of the parallel classes is NC^2, which is characterized by its O(log^2 n) circuit depth bound. As it turns out, NC^2 contains a lot of the feasible highly parallel problems/functions, especially those used by engineers on a daily basis. Consequently, NC^2 is considered a perfectly good candidate for capturing the –more conservative– notion of efficient parallel computation [5].

In the following subsections we study several parallel complexity classes; we give definitions, inclusion proofs, implications, and we show some of the known relations between parallel and sequential computation. Moreover, throughout this section, certain parallel algorithms are described, common simulation techniques are employed and a representative set of parallel problems is classified. Subsection 4.1.1 is concerned with feasibility, while subsection 4.1.2 examines classes of greater complexity.

4.1.1 Feasible Highly Parallel Computation

We begin by defining the complexity classes which, by convention, contain the feasible highly parallel problems. In general, the complexity of a problem is reflected in the amount of resources consumed by its solution (time, space, etc). Therefore, to define a class of problems, we define an available amount of resources over some predefined model of computation: a problem belongs to the class if it can be solved by the model within the available resources (we always examine the worst-case consumption).

The following class definitions are based on a specific model of parallel computation, i.e., the boolean circuits. Note that, provably, the same classes can be defined by using other models, e.g., the PRAM, the ATM, etc (such class equivalences are presented later in this subsection). The definitions


place bounds on two kinds of resources simultaneously. First, to meet the feasibility requirement, we bound the hardware resources polynomially. We also impose a logspace uniformity restriction on the circuit families (consider any of the two definitions of section 2.3.1). Second, to meet the requirement for fast solutions, we bound the depth of the circuit (which captures parallel time). We do this by imposing certain logarithmic bounds. Other parameters of the model, such as the type of the gates, are handled separately in each of the definitions. Specifically [1]:

Definition 4.1.1.

NC^k  For k ≥ 1, let NC^k be the class of languages decided by logspace uniform families of boolean circuits, C_n, of depth O(log^k n) and size n^O(1). The circuits must consist only of bounded fan-in AND, OR, NOT gates.

NC    Let NC be the union of all NC^k, i.e., NC = ⋃_{k≥1} NC^k.

AC^k  For k ≥ 0, let AC^k be the generalization of NC^k where the circuits consist of unbounded fan-in AND, OR, NOT gates (logspace uniform, O(log^k n) depth, n^O(1) size circuits).

AC    Let AC be the union of all AC^k, i.e., AC = ⋃_{k≥0} AC^k.

TC^k  Let TC^k be defined as AC^k, except that the circuits must consist of NOT and MAJ gates (a boolean 'majority' gate with fan-in r outputs 1 iff at least r/2 of its inputs are 1).

TC    Let TC be the union of all TC^k, i.e., TC = ⋃_{k≥0} TC^k.

Before continuing, we must mention that the above definition is not unique throughout the literature. Even when it is based on the same model (boolean circuits), it might differ with respect to the underlying uniformity. For example, we can avoid imposing a direct polynomial bound on the number of the circuit gates, by defining logspace uniformity to measure space with respect to the input 1^n of the DTM (hence, we bound the output of the DTM by a polynomial) [39], [40]. Other authors [41], [25], use a slightly different kind of uniformity; they say that the circuit family, C_n, must be DLOGTIME-uniform. That is, they can use a logarithmic-time DTM with


random access to its input tape (via an "index" tape, on which the DTM writes a number j and then enters a state to receive the j-th bit of the input) for answering questions of the form "is there a path from node u to node v in C_n?", or "what type of gate is node u in C_n?". This definition presupposes that the size complexity of the family is polynomial. Other definitions use the notion of P-uniformity, i.e., the circuits must be constructed by polynomial-time DTMs. Notice that in any case the circuit family is required to be uniform, as this is a prerequisite of feasibility. Essentially, all these definitions, although not necessarily equivalent, give the same meaning to the classes NC, AC and TC.

To explain the nomenclature [41], Nicholas Pippenger was one of the first to study polynomial-size, polylogarithmic-depth circuits in the late 1970s, and NC was dubbed "Nick's Class"^1. The "A" in AC connotes alternation, because of the close relation between the depth of the AC circuit and the number of alternations of the ATM (see page 96). Finally, the "T" in TC denotes the use of threshold gates. Threshold gates model the technology of neural networks. The general threshold gate inputs boolean arguments, it associates numerical weights w_1, ..., w_r with each of the r inputs, it uses internally a threshold t, and it outputs a boolean value. Specifically, the gate will output '1' iff the inputs, multiplied by their corresponding weights, add up to at least t. Definition 4.1.1 uses one kind of threshold gate, the MAJ, which is the special case with w_1 = ... = w_r = 1 and t = r/2. Note that a depth-2 circuit of MAJ gates can simulate the general threshold gate [41].
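For concreteness, a minimal sketch of the two gate types just described; the weighted threshold gate and its MAJ special case are written here as plain Python functions (an illustration only, not a circuit construction).

    # A minimal sketch of a general threshold gate with weights w_1..w_r and
    # threshold t, and of MAJ as the special case w_1 = ... = w_r = 1, t = r/2.

    def threshold_gate(inputs, weights, t):
        """Outputs 1 iff the weighted sum of the boolean inputs is at least t."""
        return int(sum(w * x for w, x in zip(weights, inputs)) >= t)

    def maj(inputs):
        r = len(inputs)
        return threshold_gate(inputs, [1] * r, t=r / 2)

    print(maj([1, 0, 1]))        # 1: two of three inputs are 1
    print(maj([1, 0, 0, 0]))     # 0: only one of four inputs is 1
    print(threshold_gate([1, 1, 0], weights=[3, 1, 5], t=4))   # 1: 3 + 1 >= 4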

The little details in the definitions of NC, AC, and TC make the classes seem quite different from each other. Questions that naturally come to mind here include: –Does the unbounded interconnection offer significantly greater power? –Is the MAJ gate making any difference at all? –How are these classes related to each other? –Which are the real-world problems contained in these classes? We proceed with answering such questions by studying, first, the inclusions between parallel complexity classes.

^1 Pippenger repaid this favor by giving the name "Steve's Class" to the class of languages decided by DTMs in polynomial time and polylogarithmic space, simultaneously [5]. Specifically, SC^k = DTM[time(n^O(1)), space(O(log^k n))] and SC = ⋃_{k≥1} SC^k. SC captures the efficient algorithms consuming little space, and is equal to the class of languages decided by logspace uniform circuits of polylogarithmic width. Intuitively, NC involves "shallow" circuits, while SC involves "narrow" circuits.


Relations between parallel complexity classes

Trivially, we have NC^k ⊆ NC^{k+1}, AC^k ⊆ AC^{k+1}, and TC^k ⊆ TC^{k+1}. Hence, these classes form potential hierarchies of infinitely many distinct levels. Notably, it is not yet proved that the above inclusions are proper. For all that we know, it may be the case that, e.g., NC^1 = NC. However, this is not expected to be true and it is conjectured that the NC hierarchy is proper [5]. What we know for certain about the NC hierarchy is that it collapses at the k-th level if NC^k = NC^{k+1} (the same holds for the AC and TC hierarchies) [5]. Moreover, it is proved that monotone-NC, i.e., the analog of NC consisting only of AND/OR gate circuits, indeed forms a proper hierarchy [42].

Next, we show how the three hierarchies interleave with each other. We start by examining the relations between NC^k and AC^k [5], [41].

Theorem 4.1.1. For k ≥ 1, NC^k ⊆ AC^k ⊆ NC^{k+1}.

Proof. The first inclusion is trivial: every circuit with bounded fan-in gates can be viewed as a specific instantiation of a circuit with unbounded fan-in gates.

The second inclusion is based on a minor modification of the AC^k circuit, namely uC_n, where n = |x| is the length of the input. Recall that AC^k includes only polynomial-size circuits. Therefore, any gate of uC_n must have a number of inputs which is at most polynomial in n. Notice that we can transform any of these gates (AND/OR) to a functionally equivalent tree of 2-input gates (AND/OR). Clearly, the size of this tree is polynomial, while its depth is logarithmic. By transforming every gate of uC_n in the above way, we derive a circuit with bounded fan-in gates, namely bC_n, which has size polynomial in n. Since each gate of uC_n increases the depth of bC_n by at most O(log n), the entire depth of bC_n will be O(log^k n · log n). Finally, note that bC_n can be described by a logspace DTM; the above transformation can be executed within a modified version of the DTM describing uC_n (it requires only extra counters of size logarithmic in the size of uC_n, which is polynomial).
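The gate transformation in the proof is easy to make explicit. The sketch below expands one k-input AND/OR gate of uC_n into a balanced tree of 2-input gates of the same type, with k − 1 gates and depth ⌈log2 k⌉; the (type, operand list) encoding is the hypothetical format used in the earlier sketches.

    # A minimal sketch of replacing a k-input gate by a balanced binary tree of
    # 2-input gates of the same type.

    from math import ceil, log2

    def expand_gate(gate_type, operands, next_index):
        """Return (new_gates, output_index): a balanced tree of 2-input gate_type
        gates over `operands`, using fresh gate indices starting at next_index."""
        gates, layer = [], list(operands)
        while len(layer) > 1:
            nxt = []
            for i in range(0, len(layer) - 1, 2):      # pair up the current layer
                gates.append((gate_type, [layer[i], layer[i + 1]]))
                nxt.append(next_index)
                next_index += 1
            if len(layer) % 2:                         # an odd element passes through
                nxt.append(layer[-1])
            layer = nxt
        return gates, layer[0]

    new_gates, out = expand_gate('OR', operands=[0, 1, 2, 3, 4], next_index=5)
    print(new_gates, 'output gate:', out)              # 4 gates for a 5-input OR
    print('depth bound:', ceil(log2(5)))               # 3 levels of 2-input gates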

According to theorem 4.1.1, any language that can be decided (or function computed) by unbounded fan-in circuits can also be decided by bounded fan-in circuits, feasibly. What is more interesting, though, is that the penalty of bounding our communication is only a logarithmic slowdown.


This result contradicts one's –probable– overestimation that the unbounded fan-in provides tremendous computational power. The key point –probably– missed here is that the gate fan-in is not totally "unbounded". Rather, it is bounded by the polynomial size of the circuit. To better understand the differences between gate types, the following theorem compares the NC^k and AC^k classes to TC^k [41]:

Theorem 4.1.2. For k ≥ 0, AC^k ⊆ TC^k ⊆ NC^{k+1}.

Proof. To show the first inclusion, we describe the simulation of an arbitrary AND/OR gate using MAJ and NOT gates. We can simulate one k-input OR gate with exactly one MAJ gate. Specifically, we use a MAJ gate with 2k − 1 inputs: the k inputs of the simulated OR gate plus k − 1 inputs of constant value '1' (dummy signal). Clearly, when at least one of the k variable inputs is '1', the output of the MAJ gate is '1'. Similarly, we can simulate one k-input AND gate with one MAJ gate and k + 1 NOT gates. Again, our MAJ gate has 2k − 1 inputs: the inverted k inputs of the simulated AND gate (inverted with NOT) plus k − 1 inputs of constant value '1'. Also, we use one extra NOT gate at the output of the MAJ gate. Clearly, the above circuit outputs '1' if and only if all of the k variable inputs are '1'. Notice that a dummy signal can be easily acquired by driving any wire of the circuit to a 2-input MAJ gate, the first input of which is inverted (with NOT). Both of the above simulating circuits increase the depth of the original circuit by at most a constant factor. Moreover, they increase the size of the original circuit by at most a polynomial factor; for each k-input gate, at most k + 4 gates are required, where k is polynomially bounded in the size of the circuit's input (the size of the original circuit is itself polynomially bounded). Finally, note that such a circuit can be easily described by a DTM in logspace (the DTM incorporates the description of the original circuit family and uses only logarithmic counters), therefore AC^k ⊆ TC^k.

To show the second inclusion, we must simulate an arbitrary MAJ gate with 2-input AND/OR/NOT gates. The idea is simple. We add the k 1-bit inputs of the MAJ gate and compare the result to k/2. Both operations can be done with circuits of depth O(log k) and size O(k). As mentioned above, k is polynomially bounded in the size of the input. Therefore, substituting every MAJ gate as described above will increase the depth of the original circuit by an O(log n) factor, i.e., to O(log^k n) · O(log n) = O(log^{k+1} n), where n denotes the length of the circuit's input. The size of the new circuit will remain polynomial in n. The new circuit can be easily described by a DTM in logspace (the DTM incorporates the description of the original circuit family and uses only logarithmic counters) and thus, TC^k ⊆ NC^{k+1}.
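The first construction of the proof is simple enough to verify exhaustively for small fan-in. The following sketch (an illustration, not part of the thesis) builds OR and AND out of a single MAJ gate plus dummy '1' inputs and NOT gates, and checks them against the ordinary gates for k = 2, 3, 4.

    # A minimal sketch checking the OR/AND-from-MAJ constructions of the proof.

    from itertools import product

    def maj(bits):                       # MAJ with fan-in r: 1 iff at least r/2 inputs are 1
        return int(sum(bits) >= len(bits) / 2)

    def or_from_maj(xs):
        k = len(xs)
        return maj(list(xs) + [1] * (k - 1))          # 2k - 1 inputs in total

    def and_from_maj(xs):
        k = len(xs)
        return 1 - maj([1 - x for x in xs] + [1] * (k - 1))

    for k in (2, 3, 4):                  # exhaustive check against the ordinary gates
        for xs in product((0, 1), repeat=k):
            assert or_from_maj(xs) == int(any(xs))
            assert and_from_maj(xs) == int(all(xs))
    print('OR/AND-from-MAJ constructions verified for k = 2, 3, 4')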


Figure 4.1: Parallel classes and hierarchies


Theorem 4.1.2 proves that the use of threshold gates results in, at most, a logarithmic speedup. Once again, the ordinary bounded fan-in AND/OR gates are shown to have adequate power for studying parallel computation. Theorem 4.1.2 also shows, for all k, how the classes NC^k, AC^k, and TC^k alternate in terms of computational power. Consequently, we now see the big picture (figure 4.1, given the widely accepted conjectures), where the parent classes NC, AC, and TC are equal.

Corollary 4.1.3. NC = AC = TC

Proof. Directly from theorems 4.1.1 and 4.1.2.

Corollary 4.1.4. NC ⊆ P

Proof. Directly from theorem 3.3.2.

The above corollaries place feasible highly parallel computation within P. Indeed, we have anticipated this result since the introduction of this section.


However, corollary 4.1.4 leaves an open question: does NC equal P? That is, do all problems with feasible solutions also have feasible highly parallel solutions? This is, arguably, the most important open question in the realm of parallel computation. In fact, it is concerned with the inherent limitations of parallelization. We examine this topic in section 4.2.2.

In complexity theory, proving that a class inclusion is proper is a rare and striking result. As usual, the aforementioned inclusions are not known to be proper for arbitrary values of k (mere conjectures). However, we do have proofs of proper inclusions for the first level, k = 0, of the classes given in theorem 4.1.2. Specifically, we know that

AC^0 ⊂ TC^0 ⊆ NC^1.

We do not know whether TC^0 is strictly contained in NC^1 (we conjecture that it is). Below we describe a proof of the AC^0 ⊂ NC^1 inclusion. Note here that when studying the first level of the NC, AC, or TC hierarchy, our choice of the uniformity type of the circuit plays an important role. The rigorous approach is to use some "conservative" uniformity type, e.g. DLOGTIME, and not logspace. In the following proof we use logspace, as this result holds for any of the commonly used uniformities [21]:

Theorem 4.1.5. AC^0 ⊂ NC^1.

Proof. To show that AC^0 ⊆ NC^1 we can use the same argument as that used in the proof of theorem 4.1.1. To show proper inclusion, it suffices to find a problem which belongs to NC^1 but does not belong to AC^0. We point out the PARITY problem: "does the parity of x equal 1?". Recall the parity function f_p(x) = (x_1 + x_2 + ... + x_n) mod 2, where n = |x|.

On one hand, the authors in [43] prove that the parity function f_p(x) cannot be computed by AC circuits of constant depth and polynomial size^2. Therefore PARITY ∉ AC^0.

On the other hand, it is trivial for any DTM to describe a XOR gate using AND, OR, NOT gates: recall that A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B). On input 1^n, the DTM can output the description of a XOR tree with n leaves. The resulting circuit will have depth O(log n) and size O(n). Such a DTM is clearly logspace, because it only uses indexing variables for positioning within an object of size O(n). The XOR tree computes the parity of its input. Therefore, PARITY ∈ NC^1.

^2 Specifically, they show that polynomial-size parity circuits must have depth Ω(log* n).
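The NC^1 upper bound in the proof of theorem 4.1.5 is constructive: a balanced XOR tree over the input bits, with each XOR expanded into bounded fan-in AND/OR/NOT gates. The sketch below builds such a gate list and evaluates it, reusing the hypothetical gate encoding of the earlier sketches.

    # A minimal sketch of the XOR-tree parity circuit: each XOR is expanded as
    # (A AND NOT B) OR (NOT A AND B), giving depth O(log n) and size O(n).

    def xor_gadget(a, b, nxt):
        """Constant-size bounded fan-in sub-circuit for a XOR b; returns the new
        gates and the index of its output gate."""
        gates = [('NOT', [a]), ('NOT', [b]),                 # nxt, nxt+1
                 ('AND', [a, nxt + 1]), ('AND', [nxt, b]),   # nxt+2, nxt+3
                 ('OR',  [nxt + 2, nxt + 3])]                # nxt+4
        return gates, nxt + 4

    def parity_circuit(n):
        gates, layer, nxt = [], list(range(n)), n
        while len(layer) > 1:                                # O(log n) levels of XOR gadgets
            new_layer = []
            for i in range(0, len(layer) - 1, 2):
                g, out = xor_gadget(layer[i], layer[i + 1], nxt)
                gates += g
                new_layer.append(out)
                nxt = out + 1
            if len(layer) % 2:
                new_layer.append(layer[-1])
            layer = new_layer
        return gates

    def eval_circuit(gates, x):
        values = list(x)
        ops = {'AND': all, 'OR': any, 'NOT': lambda v: not v[0]}
        for gate_type, operands in gates:
            values.append(int(ops[gate_type]([values[i] for i in operands])))
        return values[-1]

    x = [1, 0, 1, 1, 1]
    print(eval_circuit(parity_circuit(len(x)), x), sum(x) % 2)   # both print 0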


Relations between parallel and sequential complexity classes

Much effort has been devoted in parallel complexity theory towards relating parallel to sequential computation. Arguably, the best way to establish such a connection is by studying the inclusions between parallel and sequential classes.

Naturally, we begin by comparing to the ordinary Turing machine. The weakest of the widely used TM classes are the logarithmic-space classes L and NL. Proving that these two classes are not contained in NC would severely damage the image of the parallelization power. Fortunately, not only do both L and NL fall within NC, they are contained in the lower levels of the NC hierarchy. Let us first consider the deterministic class [5]:

Theorem 4.1.6. NC^1 ⊆ L

Proof. Assume that M_c is a logspace DTM describing a circuit family C of NC^1. We will show that there is a logspace DTM M which can compute the output of C_n(x) for any input x with |x| = n.

For starters, let us assume that a binary tree-like circuit C_n is given as input to a DTM M_a, together with an input x. M_a can compute C_n(x) using space O(log n). The idea is simple. First, notice that we can mark any path within a binary tree by using a unique identifier of log n bits: all the way from the root to any leaf, '0' represents the left edge of a node and '1' the right edge. With this in mind, we can start from the root and perform an ordered walk of the entire tree by using O(log n) space as follows. We keep track of our path by using a specific variable, which we update step after step (starting with '0'). At any step, this variable corresponds to one of the unique identifiers described above. We update this tracking variable (append one bit, change the last bit, remove the last bit) by checking our location on the tree: we find the current node on the graph at the input tape of M_a (only logarithmic counters are required for this) and, if possible, we move left (depth-first search). Second, notice that the nodes here correspond to logic AND/OR operations (and NOT). This has the following impact on the aforementioned ordered walk of the circuit. When we encounter an AND node, we will continue our walk back to the upper levels of the tree (i.e. we will change/remove the last bit of our tracking variable) only if both children of the node were previously evaluated 'true'. Practically, when we encounter a 'false' node with an AND parent, we immediately evaluate the parent 'false' and update our tracking variable. Similarly, when the parent is an OR gate


and the current node is 'true', we mark the parent as 'true'. Note that we proceed to the evaluation of the right child of a node only in certain cases: when the node is AND and we have just evaluated its left child as 'true', or when it is OR and its left child is 'false'. Consequently, in any case, we can deduce the truth of a parent node simply by knowing the truth of the current node and whether it is a left/right child (we do not need to store the values of any previous nodes). By following these rules, the tree walk will finally lead to the evaluation of the root. We can implement the above recursive algorithm by reusing the space of the tracking variable and, therefore, by using only O(log n) space.

Unfortunately, not all circuits are binary tree-like circuits. Fortunately, we can transform any NC^1 circuit to an equivalent binary tree (which computes the same function) by using only logarithmic amounts of space. First, if k > 2, we can replace each k-input gate with a small equivalent binary tree (of depth log k and size 2k − 1). Second, starting from the output of our "binary" circuit, we name the paths to each input gate as described in the above paragraph. However, in this technique we use the names of these paths ('0', '01', etc) to mark the intermediate gates of the original circuit. Each of these names, together with the type of its corresponding gate, represents a new gate of the transformed circuit. We proceed by reusing space to output in this way every gate of the transformed circuit. Note that the new gates have out-degree '1' (because each original gate reachable by several paths will be represented many times), and thus, the resulting circuit is a binary tree. The two operations result in a circuit which, like our original circuit, is of logarithmic depth and polynomial size. Moreover, they can be performed in O(log n) space by a DTM M_b, because the only variables required are for positioning/counting within an object of polynomial size (also, the gate names are O(log n) wide because the original circuit has O(log n) depth).

We now return to the description of our logspace DTM M. M will use the two algorithms described above and the description of M_c (incorporated). On input x, |x| = n, our M will simulate M_c(1^n) to acquire C_n and then it will evaluate C_n(x) as described above. However, there is one last obstacle before completing the proof: the outputs of the machines M_c and M_b are polynomial in size. If we store their outputs as intermediate results in a tape of M, the space of M will not remain logarithmic. We overcome this obstacle as follows [5]. When seen as functional blocks, the input of M_a is the output of M_b and the input of M_b is the output of M_c. We must solve the general problem where two machines A and B want to exchange information


without writing down all of it at once (in a single string). Notice that, either way, machine B reads one bit at a time from the "tape" in-between A and B. Moreover, this bit is located in a specific cell beneath the head of B. The idea is that, before executing each step of B, we know the location of the head on the input tape of B and, therefore, we can simulate A until it writes that specific cell. Then, we can continue with the step of B. By interleaving the simulation of the two machines in the above way, the composite computation is performed with only one "communication" variable (the location of B's input head). In our case, the length of this variable is logarithmic, because the intermediate "tape" is polynomial (for both the M_c-to-M_b and the M_b-to-M_a "communication").

To conclude, all of the above algorithms require logarithmic space and the resulting DTM correctly evaluates the output of C_n(x).
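The ordered tree walk of the proof can be emulated directly: the only persistent state is the path identifier (a bit string of length at most the tree depth) and the value of the node just evaluated, and the short-circuiting rules are exactly those described above. The sketch below assumes, for simplicity, an AND/OR tree with the inputs at the leaves (NOT gates, which the proof allows, are omitted); the node_type/leaf_value interface is a hypothetical stand-in for reading the circuit description off the input tape.

    # A minimal sketch of the space-efficient tree walk: only the current path
    # (a bit string) and one value bit are stored, never the values of subtrees.

    def eval_tree(node_type, leaf_value):
        """node_type(path) -> 'AND' | 'OR' | 'LEAF'; leaf_value(path) -> 0 or 1."""
        path = ''
        while True:
            while node_type(path) != 'LEAF':        # descend to the leftmost leaf
                path += '0'
            value = leaf_value(path)
            while path:                             # ascend, short-circuiting
                parent, came_from = path[:-1], path[-1]
                op = node_type(parent)
                if came_from == '1':                # right child: parent's value is known
                    path = parent
                elif (op == 'AND' and value == 0) or (op == 'OR' and value == 1):
                    path = parent                   # short-circuit: skip the right child
                else:
                    path = parent + '1'             # must evaluate the right child
                    break
            else:
                return value                        # ascended past the root

    # (x0 AND x1) OR (x2 AND x3) on x = 1011
    x = [1, 0, 1, 1]
    types = {'': 'OR', '0': 'AND', '1': 'AND'}
    print(eval_tree(lambda p: types.get(p, 'LEAF'),
                    lambda p: x[int(p, 2)]))        # prints 1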

Although not yet proved, many researchers believe that NC^1 is properly included in L [41]. For this reason, using the logspace uniformity to define NC^1 is controversial (probably, it allows more computing power to the "preprocessing stage" than to the NC^1 circuits themselves). To be precise, it is still unknown whether the logspace-uniform NC^1 is equal to the DLOGTIME-uniform NC^1, or if the logspace-uniform AC^0 is equal to the DLOGTIME-uniform AC^0 (other types of uniformity are also involved in such disputes, e.g., the logspace and the P uniformity: is L equal to P?). However, we know that the class NC^k for k ≥ 2 is identical under the two definitions (respectively, AC^k for k ≥ 1) [41]. In fact, the most commonly used types of uniformity all lead to equivalent definitions of the class NC^k for k ≥ 2 [24].

We continue by examining the converse of theorem 4.1.6. Specifically, we make use of the nondeterministic version of L to prove the following [5], [23]:

Theorem 4.1.7. NL ⊆ NC^2

Proof. It suffices to show that an NL-complete problem belongs to NC^2. We point out the REACHABILITY problem: "given a graph G and two nodes s, t, is there a path from s to t?"

We start by showing that REACHABILITY is NL-complete. First, we show that REACHABILITY ∈ NL. Nondeterministically, we guess the requested path (if any). That is, starting from s, step after step we temporarily store the current edge (i, j) and we find the next one by looking at the description of G. We terminate when we discover t or when we perform more


than n steps (discover a cycle), where n denotes the size of the input. The above search requires logarithmic space (only counters). Second, we give a logspace reduction from any language A ∈ NL to REACHABILITY. Assume that the logspace NDTM M_A solves A. On input x, a logspace DTM (which incorporates the description of M_A) outputs the entire configuration graph of M_A(x). In this graph, each node corresponds to a configuration of the M_A(x) computation and each edge corresponds to a probable computation step of M_A(x). The DTM proceeds as follows: it writes down every possible pair of strings of length c log |x| (i.e. the space bound of M_A(x)) and it checks each one of them separately. Specifically, it checks whether the first string is a valid configuration of M_A(x) and whether the pair constitutes a valid computation step of M_A(x) (e.g. by simulating all possible transitions –logarithmically bounded– from the first configuration string). Notice that the DTM can write the pair of strings under examination one after the other (e.g. lexicographically) by reusing its space. In this way the DTM can output a description of the entire configuration graph of M_A(x) by using only logarithmic space (2c log |x|, plus counters). Moreover, during this transformation, we can ensure that there is only one accepting node (e.g. by connecting all accepting nodes to a new one). Clearly, M_A accepts x if and only if there is a path in the above constructed graph from the initial node to the single accepting node (solved by REACHABILITY).

We continue by showing that REACHABILITY ∈ NC^2. Assume that we are given the n × n adjacency matrix V of a graph G with n nodes. In this graph, any shortest path has length at most n − 1. Therefore, to compute the transitive closure G*, it suffices to take the adjacency matrix to the power of n by using boolean matrix multiplication [23]. After the calculation of V^n, we can answer the REACHABILITY question simply by looking at the [s, t] position in the resulting table. Notice that [44], instead of calculating all the powers of V up to n, we can square the (V + I) matrix: we add the identity matrix I to V and successively square the result, i.e. (V + I)^2, (V + I)^4, ..., until we obtain (V + I)^m for some m ≥ n. The resulting matrix (V + I)^m

suffices for answering the REACHABILITY question. Therefore, the computation of the transitive closure of V requires only ⌈log n⌉ steps. A boolean matrix multiplication can be calculated by circuits of logarithmic depth and polynomial size [23]. Specifically, since the boolean product F = R × T of the n × n matrices R, T is defined by F[i, j] = ⋁_{l=0}^{n−1} (R[i, l] ∧ T[l, j]), we will compute it by using one distinct circuit per cell [i, j]: a tree with n leaves as AND gates and n − 1 internal nodes as OR gates. To compute (V + I)^m


we will use ⌈log n⌉ layers of such multiplication circuits (each layer performs a matrix squaring and forwards its data to the next layer). This construction results in a cascaded circuit with O(log^2 n) depth and polynomial size. Finally, we will add one more layer at the top of the cascaded circuit, which will serve as a multiplexer to output the final value of the cell [s, t], where s and t are also given as inputs (recall that the content of this cell is the answer to the REACHABILITY question). Multiplexers can be constructed with circuits of logarithmic depth and polynomial size and thus, the final layer will not affect the aforementioned bounds. Clearly, the construction of the entire circuitry described above is straightforward and can be performed by any DTM in logarithmic space (only counters are required for positioning within objects of polynomial size). To conclude, REACHABILITY ∈ NC^2.
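The algorithmic core of the NC^2 membership argument is the repeated boolean squaring of (V + I). The sketch below performs exactly these ⌈log n⌉ squarings sequentially; each call to bool_square corresponds to one O(log n)-depth layer of AND/OR trees in the circuit.

    # A minimal sketch of transitive closure via repeated boolean squaring.

    from math import ceil, log2

    def bool_square(A):
        n = len(A)
        return [[int(any(A[i][l] and A[l][j] for l in range(n)))   # OR of n ANDs
                 for j in range(n)] for i in range(n)]

    def reachable(V, s, t):
        n = len(V)
        M = [[V[i][j] or int(i == j) for j in range(n)] for i in range(n)]   # V + I
        for _ in range(ceil(log2(n))):                 # ceil(log n) squarings suffice
            M = bool_square(M)
        return bool(M[s][t])

    # Directed path 0 -> 1 -> 2 -> 3, plus an isolated vertex 4
    V = [[0, 1, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
    print(reachable(V, 0, 3), reachable(V, 3, 0), reachable(V, 0, 4))  # True False False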

It is trivial to show that L ⊆ NL. By further combining theorems 4.1.6 and 4.1.7 we get corollary 4.1.8. Note that whether L ⊂ NL is still an open question, and thus, we cannot yet use this chain to infer that NC^1 is strictly contained in NC^2.

Corollary 4.1.8. NC^1 ⊆ L ⊆ NL ⊆ NC^2

The above result establishes that any logarithmic-space algorithm (deterministic or not) can be efficiently parallelized^3. The converse is not known, i.e. we cannot tell whether all NC circuits can be simulated in logarithmic space. However, we can establish an analogous result for the generalization of non-determinism, the Alternating Turing Machine (ATM). Specifically, we show that NC^k languages can be decided by ATMs which use logarithmic space and perform at most log^k n alternations [44]:

Theorem 4.1.9. For k ≥ 1, NC^k ⊆ ATM[space(O(log n)), altn(O(log^k n))].

Proof. Without loss of generality, we assume that the circuit C_n uses only 2-input AND/OR gates and also that NOT gates can be applied only at the n inputs of C_n (every NC circuit has an equivalent circuit of this form, which can be constructed in logspace [44]).

We will construct an ATM N to simulate C_n(x) on input x, n = |x|. N will incorporate the description of the circuit family C. Consequently,

^3 A RAM algorithm (besides the multiple- or single-tape TM) can also be used to measure space and check the aforementioned logarithmic-space "criterion". As shown in [45], TM and RAM can simulate one another within only a constant factor of extra space.


N can find any gate g_i within C_n, together with its two predecessors, in logarithmic space (without alternating). We program N to do the following operations. N will start by locating the output gate g_o of C_n. If g_o is an AND (OR) gate, then N will jump to a special-purpose state labeled 'AND' ('OR'). This state will have two outgoing transitions, one corresponding to the evaluation of the left child of g_o and one to the right. The choice will be taken by N nondeterministically. The computation will continue in this fashion, recursively, until N reaches an input of C_n (possibly through a NOT gate); at that point, N enters the 'yes' or 'no' state depending on the value of x at the position under evaluation.

Note that the only nondeterministic moves included in N are those taken in the special-purpose states (labeled 'AND', 'OR'). Everything else (e.g., gate or wire searching) is performed deterministically with no alternations between 'AND' and 'OR' labels. Also note that the computation tree characterizing N has one branch per gate, for every gate of the circuit C_n. In fact, we can view each branching after a node/configuration of the computation as the creation of two new "subprocesses" (each subprocess searches the predecessors of a gate without alternating, and then spawns two children). Clearly, the computation tree of N reflects the structure of C_n and thus, it will output the same result as C_n on input x.

The machine N requires O(log n) space for the above operations: each "subprocess" uses only enumeration variables to locate a specific gate g_i –and its incoming wires– during the construction of C_n (which takes place several times during the computation by reusing space). Since C is logspace uniform, O(log n) space suffices for the enumeration. Moreover, N will alternate at most O(log^k n) times, because the number of alternations is bounded by the length of the maximum path within C_n (by construction, no more AND/OR transitions can take place in N).
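The computation tree built by N can be emulated with a recursive evaluator in which OR gates branch existentially and AND gates branch universally. The sketch below uses a hypothetical nested-tuple encoding of the circuit (the proof, of course, obtains the gates from the logspace constructor rather than from an explicit data structure).

    # A minimal sketch of the alternating evaluation: OR = existential branching
    # (some child must accept), AND = universal branching (all children must
    # accept), and the recursion bottoms out at the possibly negated input bits.
    # Node encoding: ('AND'|'OR', left, right) | ('NOT', i) | ('IN', i).

    def atm_accepts(node, x):
        kind = node[0]
        if kind == 'IN':
            return bool(x[node[1]])
        if kind == 'NOT':                       # NOT gates appear only at the inputs
            return not x[node[1]]
        left, right = node[1], node[2]
        if kind == 'OR':                        # existential configuration
            return atm_accepts(left, x) or atm_accepts(right, x)
        return atm_accepts(left, x) and atm_accepts(right, x)   # universal configuration

    # C_4 computing (x0 OR x1) AND (NOT x2 OR x3)
    C4 = ('AND', ('OR', ('IN', 0), ('IN', 1)),
                 ('OR', ('NOT', 2), ('IN', 3)))
    print(atm_accepts(C4, [0, 1, 1, 1]))   # True
    print(atm_accepts(C4, [0, 1, 1, 0]))   # False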

Theorem 4.1.9 proves only one direction of a well-known class equivalence. It is also shown [44] that NC^{k+1} ⊇ ATM[space(O(log n)), altn(O(log^k n))]. Consequently, the languages decided by ATMs with logarithmic space and polylogarithmic alternations are exactly those languages in NC, i.e. the feasible highly parallel languages

NC = AC = ATM[space(O(log n)), altn(log^O(1) n)]

Compared to NC, the AC equivalence noted above is even more closely related to the ATM alternations. When we use unbounded fan-in gates in the


aforementioned simulation^4, the depth of the simulating circuit decreases by a logarithmic factor. Hence, it turns out that the languages in AC^k are exactly those languages decided by a logspace ATM which performs O(log^k n) alternations

AC^k = ATM[space(O(log n)), altn(O(log^k n))]

Such a close relation to the ATM can be established also for the NC^k circuits [24]. In this case, we have a direct correspondence between the depth of the circuit and the running time of the (random access input tape) ATM. Specifically, the languages in NC^k are exactly those languages decided by a logspace ATM which performs O(log^k n) steps

NC^k = ATM[space(O(log n)), time(O(log^k n))]

These results are analogous to those in section 3.3, which show the equivalence of logspace uniform circuits to the polynomial-time DTM and to the PRAM (when polylogarithmic-time constraints are also imposed).

NC = PRAM[proc(n^O(1)), time(log^O(1) n)]

By combining the above equivalences of NC, we deduce an interesting equivalence of the ATM and the PRAM models. Specifically, we deduce that

ATM[space(O(log n)), altn(log^O(1) n)] = PRAM[proc(n^O(1)), time(log^O(1) n)]

The picture given above shows that the ATM captures parallelism in various ways and it justifies the wide use of the model throughout the literature for studying parallel complexity^5. Moreover, we can now justify the use of other models of parallel computation for defining the NC class. For instance, it is not uncommon to define NC^k as the class of languages decided by PRAM algorithms running in O(log^k n) time and utilizing a polynomial number of processors [20], [29].

^4 To simulate an ATM, as in the case of a DTM, we have to stack a number of layers of circuits (one on top of the other). Such a layer computes straightforward boolean functions, which require O(log n)-depth circuits when designed with bounded fan-in gates or O(1)-depth circuits when designed with unbounded fan-in gates.

^5 Other important results comparing the power of ATM and DTM [39]:
For f(n) ≥ log n, ASPACE(f(n)) = DTIME(2^O(f(n))).
For f(n) ≥ n, ATIME(f(n)) ⊆ DSPACE(f(n)) ⊆ ATIME(f^2(n)).
Consequently, AL = P, AP = PSPACE, APSPACE = EXPTIME.
Note that [28] the introduction of alternation shifts the deterministic hierarchy L ⊆ P ⊆ PSPACE ⊆ EXPTIME ⊆ EXPSPACE ⊆ ... by exactly one level.


Figure 4.2: Sequential and parallel class relations (known and conjectured)

We conclude this subsection by coming back to the ordinary DTM and, especially, to the SC class (languages decided in polynomial time and polylogarithmic space, simultaneously). SC seems like a perfect candidate for relating sequential and parallel computation. In fact, the definition of SC is quite similar to that of NC: it bounds two kinds of resources simultaneously, one by a polynomial and one by a polylogarithmic function. Moreover, as we have already seen in section 3.3.1, (i) polynomial-time DTMs are computationally equivalent to polynomial-size circuits, and (ii) polylogarithmic-space DTMs are computationally equivalent to polylogarithmic-depth circuits. At first sight, these facts might lead falsely to the conclusion that NC = SC. The key point missed in this conclusion is the simultaneity of the restrictions imposed by the class definitions. To be precise, the above result (i) was established without limiting the depth of the simulating circuit, or the space of the simulating DTM. Similarly, the result (ii) was not concerned with the circuit size, or the DTM time. In other words, preserving one resource bound during the simulation might substantially alter the amount of another resource. Note that this observation applies to many class comparisons, as for example in the SC versus polyL ∩ P case (polyL stands for polylogarithmic space): a language in polyL ∩ P has certainly one polyL and one P algorithm, but does it have one simultaneously? At this point, it


At this point, SC and NC appear to be incomparable [40]. One of their most important differences seems to be their relation to the class NL. While NL ⊆ NC (see theorem 4.1.7), it is unclear whether NL ⊆ SC; the NL-complete problems, e.g. REACHABILITY, are prime candidate members of NC−SC. The SC versus NC question is yet another open problem of the field. Figure 4.2 illustrates the relations between the parallel and sequential classes as they are understood today (using proved and conjectured inclusions) [5].

Examples of NC problems

The following classification table exemplifies the NC hierarchy by pointing out some of its characteristic members. Its purpose is to convey the practical meaning of each class by associating it with the complexity of specific, well-known problems.

Classification of well-known feasible highly parallel problems

AC0:  addition, subtraction, matrix addition
TC0:  multiplication, division, counting, majority, parity
NC1:  matrix multiplication, FVP, summation, prefix sum, sorting, tree isomorphism
NC2:  matrix rank, matrix inversion, matrix determinant, minimum spanning tree, SWP, reachability, PBPM, LFMISdeg≤2, MIS, SLE, FFT, JS1, CFL
NC3:  MBS, MPCVP

The acronyms used in the above lists are as follows: FVP (boolean Formula Value Problem), SWP (Shortest Weighted Paths), PBPM (Polynomially Bounded Perfect Matching), LFMIS (Lexicographically First Maximal Independent Set), MIS (Maximal Independent Set), SLE (System of Linear Equations), FFT (Fast Fourier Transform), JS1 (Job Scheduling on one machine), CFL (Context Free Language), MBS (Maximal Bipartite Set), MPCVP (general Monotone Planar Circuit Value Problem).


Typically, the entries of this table refer to decision problems involving integer numbers6. Also, since tight lower bound proofs are rare, the above classification is based mostly on the fastest algorithms published in the literature to date. In other words, the entries of the table are amenable to leftward column transitions, i.e., a problem may move to a smaller class if a faster parallel algorithm is discovered (except parity, see the proof of theorem 4.1.5). Moreover, since only the AC0, TC0, and NC1 classes have proper inclusion proofs, the remaining columns of the table are amenable to merging (not expected, though).
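
As a minimal illustration of this decision-problem convention (our own sketch; the ADD predicate is spelled out in footnote 6, and the convention that bit 0 is the least significant bit is our assumption), the decision version of integer addition can be written as follows.

    # Illustrative only: ADD(x, y, i) = 1 iff the i-th bit of x + y is 1
    # (bit 0 is taken to be the least significant bit).
    def ADD(x, y, i):
        return (x + y) >> i & 1

    print(ADD(5, 6, 0))   # 5 + 6 = 11 = 0b1011, bit 0 is 1 -> 1
    print(ADD(5, 6, 2))   # bit 2 of 0b1011 is 0 -> 0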

As usually observed in classical complexity theory, imposing constraints on the input set of a problem can significantly reduce its parallel complexity. As a characteristic example, we point out the FVP and the MPCVP. Both problems are special cases of the CVP (Circuit Value Problem), in which we are given a boolean circuit with a specific input and we are asked whether it will output '1'. In the general MPCVP, the circuit consists only of AND/OR gates (i.e., it is monotone) and its edges do not cross when the circuit is depicted in 2-D (i.e., it is a planar graph). In the FVP, instead of a circuit, we are given a boolean formula in fully parenthesized form; hence, the graph computing its value contains nodes with out-degree ≤ 1. Note that, compared to formulas, circuits are considered more economical in expressing boolean functions because of their ability to reuse subexpressions within their representation. As it turns out, this is a significant difference between general circuits (unbounded gate fan-out) and boolean formulas (gate fan-out ≤ 1): the CVP is a P-complete problem (inherently sequential), while the FVP is in NC1 (very efficiently parallelized). Similarly, the difficulty of evaluating a circuit is reduced to NC3 by simultaneously restricting to planar and monotone circuits. If we further restrict to upward stratified circuits (all edges point in the same direction), then the complexity of the MPCVP reduces to NC2 [46]. To understand the impact of the above restrictions, we mention that the CVP remains P-complete in the following three individual cases: i) gates with fan-out ≤ 2, ii) monotone circuits, and iii) planar circuits.
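
To make the FVP concrete, here is a small sequential sketch (our own; the '&'/'|' string syntax is an assumption). Because every subformula feeds exactly one parent, a plain recursive walk suffices; an actual NC1 algorithm would additionally rebalance the parse tree to achieve O(log n) depth, which this sketch does not attempt.

    # Illustrative only: evaluating a fully parenthesized boolean formula
    # such as "((1&(0|1))|(0&1))".  Fan-out 1 means a simple recursive
    # descent visits every subformula exactly once.
    def eval_formula(s):
        pos = 0

        def parse():
            nonlocal pos
            if s[pos] == '(':
                pos += 1                 # consume '('
                left = parse()
                op = s[pos]; pos += 1    # '&' or '|'
                right = parse()
                pos += 1                 # consume ')'
                return (left and right) if op == '&' else (left or right)
            val = s[pos] == '1'
            pos += 1
            return val

        return parse()

    print(eval_formula("((1&(0|1))|(0&1))"))   # True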

6 e.g., the decision version of the "integer addition" problem is stated as follows: ADD(x, y, i) = '1' iff the i-th bit of the result (x + y) is '1', where x, y ∈ N.

In another direction, the difficulty of a problem can be significantly increased by elaborating on the properties of the requested solution. Consider here the MIS and the LFMIS problems. In the MIS, we seek to extract any maximal independent set from the given graph. MIS is efficiently parallel. However, if we start searching for solutions with specific properties, then the problem becomes harder. For instance (see p. 118), requesting the lexicographically first maximal independent set renders the problem P-complete (even worse, recall that the maximum independent set is NP-complete). The complexity of the problem is reduced, once again, by imposing constraints on its input set. The special case LFMISdeg≤2, which is confined to graphs of degree ≤ 2, remains in NC2.
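
To see why LFMIS resists parallelization, consider the obvious greedy algorithm for it (a sketch with our own input format): the decision for vertex v depends on all earlier decisions, and it is exactly this left-to-right chain that makes the problem P-complete, whereas MIS escapes it only via very different (randomized and then derandomized) techniques.

    # Illustrative only: the natural greedy algorithm for the
    # lexicographically first maximal independent set.  Vertices are
    # 0..n-1 in their given order; adj maps a vertex to its neighbours.
    def lex_first_mis(n, adj):
        in_set = [False] * n
        for v in range(n):
            # v enters iff no smaller-numbered neighbour is already in
            if not any(in_set[u] for u in adj.get(v, ()) if u < v):
                in_set[v] = True
        return [v for v in range(n) if in_set[v]]

    # a 5-cycle 0-1-2-3-4-0
    adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
    print(lex_first_mis(5, adj))   # [0, 2]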

Although not listed above, members of higher NCk classes do exist. Such an example is the problem of finding a Hamiltonian Cycle in dense graphs with N nodes of degree ≥ N/2. The decision version of this problem was proved to be in NC4. Moreover, finding Hamilton Cycles in robustly expanding digraphs was recently shown to be in NC5 (recall that, for general graphs, the HC problem is NP-complete).

4.1.2 Beneath and Beyond NC

At this point, we have formed a clear picture of the most important parallel complexity class, NC. Hereafter, the study continues beneath and beyond NC, either to shed more light on the lowest levels of the hierarchy, or to frame the problems defying feasible highly parallel solutions.

Finer gradation within the NC hierarchy

We start by examining the possibility of further dividing each NCk+1 class into subclasses, similar to the ACk and TCk. As shown in the previous subsection (NC examples), such class gradations allow the creation of a more detailed classification of the parallel problems according to their complexity. Recall that the major difference between the AC and NC circuits lies in the fan-in of their gates: the AC gates have unbounded fan-in, while the NC gates feature certain bounds. There is yet another category of circuits that uses both bounded and unbounded fan-in gates: the "semi-unbounded" fan-in circuits [21]. Let us define the SACk class to contain exactly those languages decided by logspace uniform families of circuits with O(log^k n) depth, which use bounded fan-in AND gates, unbounded fan-in OR gates, and NOT gates only at their input level. In other words, the SACk class is defined as the ACk class with a restriction on the AND gates and the NOT gates. Trivially, we have SACk ⊆ SACk+1, i.e., we get a hierarchy of SACk classes (not known to be proper). Again, we define the union of these classes as SAC = ⋃k≥0 SACk. Simply by looking at the fan-in of the gates we deduce that a SACk circuit is a special case of an ACk circuit. Moreover, by using de Morgan's laws, we have that NCk ⊆ SACk. Therefore, SAC = NC = AC = TC. More specifically, we have a finer gradation within the NC hierarchy:

NCk ⊆ SACk ⊆ ACk ⊆ TCk ⊆ NCk+1

Towards the analysis of the efficient parallel classes, significant research effort has focused on the lowest levels of the NC hierarchy. This effort has led to the generation of numerous individual complexity classes. Note however that dividing the NC1 and NC2 into multiple subclasses based solely on the running time of the algorithms is a controversial task [40]. This trend is likely to obscure the real bottleneck of parallel computation, which is the number of the processors. For instance, an algorithm with n^2 processors and log n time (NC1) might turn out, in practice, to be slower than an algorithm with n processors and log^2 n time (NC2). In the real world, we have to take into account factors such as the communication overhead, etc. Overall, many believe that the product processors × time leads to more accurate estimations in practical situations than the numerous subclasses of NC2 and NC1.
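
A back-of-the-envelope calculation (with an arbitrary example size of our choosing) shows how the processors × time product can favor the "slower" class in practice.

    # Illustrative arithmetic only: processors x time ("work") of the two
    # hypothetical algorithms mentioned above, for n = 10**6.
    import math

    n = 10**6
    log_n = math.log2(n)

    work_nc1 = (n**2) * log_n        # n^2 processors, log n time
    work_nc2 = n * (log_n**2)        # n   processors, log^2 n time

    print(f"{work_nc1:.3e}")         # ~2.0e13
    print(f"{work_nc2:.3e}")         # ~4.0e8
    print(round(work_nc1 / work_nc2))  # ~50000x more total work for the NC1 one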

The class LOGCFL is defined to contain exactly those decision problems that are logspace reducible to a context-free language [40]. Recall that deciding membership in a context-free language is in NC2 (CFL problem, examples' subsection). Therefore, since L ⊆ NC2, we have LOGCFL ⊆ NC2. By using alternative characterizations of the class LOGCFL, based on nondeterministic auxiliary pushdown automata and ATMs, one can deduce that NL ⊆ LOGCFL ⊆ AC1. LOGCFL was proved to be closed under complement (this also holds for NCk and ACk trivially, as they are deterministic classes). Overall, it is shown that LOGCFL = SAC1. Within LOGCFL we have LOGDCFL (D stands for deterministic), which lies in NC ∩ SC. Besides the above, we can define more classes based on reductions to famous problems. Another common example is the DET class, which is defined to contain exactly those problems that are logspace reducible to the "integer matrix determinant" problem (in NC2, examples' subsection). We know that NL ⊆ DET ⊆ NC2.

To take a closer look inside the class NC1, we have to use more conservative types of uniformity than logspace (as explained before, this is not necessary for the rest of the NC hierarchy). Common uniformity choices are DLOGTIME (see p. 84) and ALOGTIME (with alternating TMs). Note that AC0 properly includes the DLOGTIME class of decision problems [40]. The definitions/results of this paragraph assume DLOGTIME-uniformity.


Figure 4.3: Class inclusions at the lower levels of NC

Extending the idea of augmenting our circuits with MAJ gates (see TC circuits), we introduce the MOD gates: an unbounded fan-in MOD(k) gate outputs '1' iff the number of its non-zero inputs is congruent to 0 mod k [40]. The MOD gates were introduced to tackle the PARITY problem with AC circuits of constant depth (recall the proof of theorem 4.1.5). This goal was achieved with MOD(2) gates. However, as it turns out, such circuits cannot solve all problems of NC1. Let us elaborate by defining, first, the class AC0[k] to contain exactly those languages decided by polynomial-size, constant-depth, unbounded fan-in boolean circuits augmented with MOD(k) gates. Trivially, any AC0[k] class contains the AC0 class.

Furthermore, for all primes p, the class AC0[p] is strictly contained in NC1. Specifically, AC0[p] cannot solve the problem of determining whether (x mod q) = 0, x ∈ N, for any prime q ≠ p (PARITY is in AC0[2], but not in AC0[3]). The classes AC0[p] are the largest known proper subclasses of NC1. It is worth noting that, for any composite number k, it is not yet known whether AC0[k] is properly included in NC1: it may as well be the case that AC0[6]=NP. We define the class ACC0 to be the union of all AC0[k], k > 1, and we summarize the above as follows (ACC stands for "AC with counters"):

AC0 ⊂ ACC0 ⊆ TC0
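
The input/output behaviour of the MOD(k) gate defined above can be rendered in one line (a sketch only; the circuit-complexity content lies in the constant-depth wiring, not in this sequential function).

    # Illustrative only: an unbounded fan-in MOD(k) gate fires iff the
    # number of its 1-inputs is congruent to 0 modulo k.
    def mod_gate(k, bits):
        return int(sum(bits) % k == 0)

    # MOD(2) itself fires on an even number of ones; PARITY is then
    # expressible with such gates, which is how AC0[2] captures it.
    print(mod_gate(2, [1, 0, 1, 1, 1]))   # 1  (four ones, 4 mod 2 == 0)
    print(mod_gate(3, [1, 0, 1, 1, 1]))   # 0  (4 mod 3 != 0)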

Other parallel complexity classes have been defined by adding new criteria or by modifying the criteria mentioned so far. For instance, by replacing the unbounded fan-in OR gates of the SAC circuits with unbounded parity (XOR) gates we come up with the "parity SAC" classes ⊕SACk. Or, we can divide the AC0 class into subclasses based on the number of alternations, i, between the AND and OR gates from the input to the output of the circuit (i.e., across its O(1) depth). The classes AC0_i form a non-collapsing hierarchy within AC0.

In another direction, we can replace the depth criterion of the NC circuits with a width criterion. The class BWk is defined as the NCk, except that the O(log^k n) constraint refers to the width of the circuits [40]. We know that BW = SC. The class BW0 is also divided into BW0_i subclasses, where i counts the absolute width of the circuit. We have that AC0_i ⊆ BW0_i for every i ∈ N and that AC0 ⊂ BW0, because PARITY is in BW0. In fact, it is proved that the BW0_i hierarchy collapses and that BW0_4 = BW0 = NC1 (this holds for both uniform and non-uniform circuits with gate fan-in 2). To sum up, the Hasse diagram in figure 4.3 depicts the class inclusions (proper or not) that we encounter at the lower levels of the NC hierarchy [40] [47].

Beyond NC

As we already mentioned in the previous chapter, any language can be decided by non-uniform circuits of size O(n·2^n), e.g., via the disjunctive normal form of its characteristic function. In fact, this bound can be lowered to O(2^n/n) [48]. However, the vast majority of boolean functions require circuits with Ω(2^n/n) gates. Even worse, if we place a uniformity condition on the circuit families, the uniform circuit complexity of some languages exceeds 2^n. Such exponential numbers reveal the immense area beyond the NC hierarchy, which uses only polynomial-size circuits.

In section 3.3.1 we established an equivalence of the polynomial DTM (i.e., the class P) to the logspace uniform circuits. Analogous equivalences between the TM and the circuit model can be established for classes of greater complexity. For instance, we know that the Polynomial Hierarchy (see footnote, p. 106) is also defined by DC-uniform boolean circuits of constant depth (in the Direct Connect uniformity we can find the size, a gate or an edge of a circuit by using a polynomial-time DTM) [48]. These circuits use 2^(n^O(1)) gates of unbounded fan-in and the NOT gates appear only at their input level. Moreover, if we drop the constant-depth restriction then we obtain exactly the EXP class (i.e., DTIME(2^(n^O(1)))). Overall, by carefully controlling the imposed restrictions on our circuits (gate fan-in, input routing, etc.) we can get alternative characterizations for various TM classes (NP, PSPACE, etc.) [49]. The complexity of these classes increases by allowing more resources to the circuits (e.g., depth). Notice that the above circuits are infeasible (exponential size), but they have a succinct description: any basic information regarding the circuit can be extracted in polynomial time.


One step further, the following paragraphs examine the consequences of completely dropping the uniformity constraint that is usually imposed on each circuit family. To begin with, we have seen that non-uniform circuits have the extraordinary ability of "deciding" undecidable languages (recall, however, that describing such circuits algorithmically is itself an undecidable problem). So, how can we relate these circuits to the well-studied TM classes? Is such a comparison worthwhile?

Before continuing to answer the above questions, we must define the Turing Machines which "take advice". This notion was introduced to capture non-uniform complexity classes. We can envisage such a TM as a machine with two inputs: the first, x, is the string for which we wish to decide whether x ∈ L for some language L. The second, a_n, is a string which is given as an "advice" to the TM, in order to facilitate the computation. Formally [48],

Definition 4.1.2. Let T, w: N → N be functions. The class of languages decidable by time-T(n) TMs with w(n) advice, denoted DTIME(T(n))/w(n), contains every L such that there exists a sequence {a_n}, n ∈ N, of binary strings with |a_n| = w(n) and a TM M satisfying

∀x ∈ {0, 1}^n : M(x, a_n) = 1 ⇔ x ∈ L

On input (x, a_n) the machine M runs for at most O(T(n)) steps.

The languages decided in this way may well not be recursive, i.e. decidable by regular Turing Machines. Consider the example of an advice-TM deciding a unary language L with a single bit of advice: a_n = 1 iff 1^n ∈ L. Clearly, L may stand here for any unary language, recursive or not, because the above definition imposes no restrictions on the advice sequence {a_n} (e.g. on its construction).
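
A toy rendering of this unary-language example (all names and the table format are our own): the "machine" merely consults one advice bit per input length, so any advice sequence, even a non-computable one, yields a decider in this sense.

    # Illustrative only: deciding a unary language with one advice bit
    # per input length.  The advice table is simply handed to us;
    # nothing requires it to be computable.
    def decide_with_advice(x, advice):
        if set(x) - {"1"}:            # reject strings that are not unary
            return False
        return advice[len(x)] == 1    # a_n is one bit: is 1^n in L?

    advice = {0: 0, 1: 1, 2: 0, 3: 1, 4: 1}    # an arbitrary toy advice table
    print(decide_with_advice("111", advice))   # True: advice says 1^3 is in L
    print(decide_with_advice("11", advice))    # False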

Based on the definition of the advice-TM, various classes have beenstudied in the literature. They are characterized by the time spent to ac-cept/reject a string (polynomial, nondeterministic-polynomial, exponential)and by the width of the advice string (polynomial, logarithmic). We pointout here the P/poly class, for which we give the following two definitions andwe prove that they are equivalent [48] [23].

Definition 4.1.3. The class P/poly contains exactly those languages decidedby polynomial-time TMs which use polynomial-size advices.

Definition 4.1.4. The class P/poly contains exactly those languages decided by families of circuits of polynomial size.


Theorem 4.1.10. The definitions 4.1.3 and 4.1.4 of the class P/poly areequivalent.

Proof. We will use bidirectional simulations, similar to those of section 3.3.1.

For the first direction, assume that a language L has a TM M running in polynomial time and using polynomial-size advices (in n = |x|). That is, there exists a family of advices {a_n} and a TM M, such that M on input (x, a_n) will decide whether x ∈ L in a polynomial number of steps. Since M exists, we know how to construct a circuit C_n to simulate M(x, a_n) (as in the proof of theorem 3.3.1). The resulting circuit will have polynomial size, because the computation of M is itself polynomially bounded. Moreover, since the family {a_n} exists, we know that there exists a family of circuits {C^a_n} such that C^a_n outputs a_n without requiring any input (constant output): every member C^a_n is simply a hardwiring of gates, according to the string a_n (we are interested only in the existence of C^a_n, not in the difficulty of constructing it). The size of C^a_n is bounded by the length of a_n, i.e. it is also polynomial in n. By appending C^a_n to the aforementioned C_n we get a polynomial-size circuit and, moreover, we assemble a family of circuits to decide L (notice that this family is rendered non-uniform, because of the argument used in the construction of C^a_n).

For the second direction, assume that L has a polynomial circuit. That is, there exists a family of circuits, {C_n}, such that C_n(x) = 1 ⇔ x ∈ L, n = |x|. We will use the advice sequence {a_n}, where a_n = <C_n> is the description of the n-th circuit of {C_n} (the difficulty to construct <C_n> is irrelevant here). A DTM can input a_n and x to simulate C_n(x) in polynomial time, because the size of C_n is polynomially bounded in |x|.

It is straightforward to see from definition 4.1.3 that any language within P also resides within P/poly. In fact, the class inclusion is proper, as there are many problems in P/poly which are not members of P: e.g. all the non-recursive languages of P/poly (recall the previous example of the unary-language decider). This conclusion is in accordance with the discussion of section 3.3.1, if we consider definition 4.1.4. Thus,

P ⊂ P/poly

The question that arises naturally here concerns the relation of NP to P/poly. Is NP a subset of P/poly? We know that the converse is false (due to the existence of non-recursive problems in P/poly), but what if this inclusion does not hold either?


The answer to this conundrum might even resolve the P versus NP conflict: if NP ⊄ P/poly, then P ≠ NP. For this reason, circuit complexity is often considered as a means of separating important complexity classes. Indeed, there is evidence that the NP class is not a subclass of P/poly: according to the Karp-Lipton theorem [50], such an "odd" inclusion would imply that the entire Polynomial Hierarchy collapses to its second level7. This result also strengthens the conjecture mentioned in section 3.3.1, i.e. that no NP-complete problem has a polynomial-size circuit. We give below a proof sketch of the Karp-Lipton theorem [48].

Theorem 4.1.11. If NP ⊆ P/poly then PH = Σ^p_2.

Proof (sketch). To show that PH = Σ^p_2 it suffices to show that Π^p_2 ⊆ Σ^p_2 (and thus, Π^p_2 = Σ^p_2). To show such an inclusion, it suffices to prove that a Π^p_2-complete problem belongs in Σ^p_2.

The problems in Π^p_2 are described by using the syntax ∀x∃yφ(x, y), while those in Σ^p_2 by using the syntax ∃x∀yφ(x, y). The goal of this proof is to transform a problem of the former syntax to an equivalent one of the latter syntax in polynomial time. We point out the Π^p_2-SAT language (Π^p_2-complete), which consists of all unquantified Boolean formulas φ(x, y) satisfying

∀x∃y φ(x, y) = 1,   x, y ∈ {0, 1}^n      (4.1)

By the assumption NP ⊆ P/poly, there exists a family of polynomial-size circuits {C_n} deciding SAT (i.e. Σ^p_1-SAT). Specifically, C_n decides whether for some given string x and formula φ(x, y) the following holds: ∃y φ(x, y) = 1, for y ∈ {0, 1}^n. Moreover, it is known how to modify a decision algorithm/circuit so that it can output the solution of the problem (if any). Such a circuit C'_n will input a formula φ and a string x to output a string y (if any exists). Note that the circuits of the family {C'_n} will still have polynomial size.

Let us denote by w a potential description of the circuit C'_n (by assumption, C'_n exists). Consider the language consisting of all Boolean formulas φ(x, y) satisfying

∃w∀x ( w = <C'_n> ∧ φ(x, C'_n(φ, x)) = 1 )      (4.2)

If 4.1 is true (false) for a specific formula ψ(x, y) then, by the definition of C'_n, 4.2 is also true (false) for ψ(x, y). Hence, we have reduced the problem described by 4.1 to the problem described by 4.2. Moreover, the reduction is performed with Σ^p_2 requirements: the circuit C'_n is constructed by using a SAT oracle (which inputs ψ and x) and the evaluation of C'_n, as well as its construction, requires at most polynomial time. Consequently, if NP ⊆ P/poly, then Π^p_2-SAT ∈ Σ^p_2.

7 That is, any problem within PH (and probably within PSPACE, as PH ⊆ PSPACE) could be solved by a polynomial-time NTM with a polynomial number of queries to an oracle solving some NP-complete problem (Σ^p_2 = NP^NP). Recall [5] that PH is a sequence of classes, which are defined recursively, based on TMs using oracles. Specifically, at the zeroth level we have the classes ∆^p_0 = Σ^p_0 = Π^p_0 = P. At the first level we have ∆^p_1 = P, Σ^p_1 = NP, Π^p_1 = coNP. The classes in each of the upper levels are defined based on a polynomial-time TM using an oracle from the previous level: ∆^p_{i+1} = P^{Σ^p_i}, Σ^p_{i+1} = NP^{Σ^p_i}, Π^p_{i+1} = coNP^{Σ^p_i}. Note that each level includes all of the previous levels and that PH = ⋃_{i≥0} Σ^p_i. An equivalent way to define PH is to let Σ^p_i (respectively Π^p_i) denote those languages accepted by polynomial-time ATMs that make fewer than i alternations starting from an existential (respectively universal) state [48].

The even more extreme assumption, that NP is contained in P/log (a strict subset of P/poly), implies that PH collapses to its first level, i.e. P = NP [50]. Similar results given in [50] show that it is rather unlikely for deterministic-exponential-time or polynomial-space problems to have polynomial-size circuits. Specifically,

If EXP ⊆ P/poly, then EXP = Σ^p_2.
If PSPACE ⊆ P/poly, then PSPACE = Σ^p_2 ∩ Π^p_2.
PSPACE ⊆ P/log if and only if PSPACE = P.

Note, as a corollary of the first result, that EXP ⊆ P/poly and P = NP can never hold together. These assumptions, when combined, contradict the Time Hierarchy Theorem [48]: if P = NP, then PH collapses and thus P = Σ^p_2 = EXP.

Figure 4.4: Known class relations (single arrows denote proper class inclusions)

We conclude the current section by referring to a randomized complexity class of parallel computation, namely the RNC class. The RNC class is the union of all RNCk classes, k > 0, where each RNCk is defined as the NCk, except that the circuit is probabilistic [1] [5]. More specifically, we have a boolean circuit which inputs a number of extra bits (intuitively, the random bits). The number of extra bits is bounded by a polynomial p(n), n = |x|. If x ∈ L, the circuit outputs '1' with probability at least 1/2, i.e., at least half of the 2^p(n) possible random strings lead to the correct output. Otherwise, if x ∉ L, the circuit outputs '0' with probability 1 (zero-sided error). The introduction of randomization to the NC class has the same effect as in the RP case (randomized P) [5]. As we will see in section 4.2, the first important class beyond NC is, most probably, the class of P-complete problems (still inside P). Randomization is a potential way of attacking these problems, analogous to attacking NP-complete problems with randomized polynomial algorithms. A characteristic example is the "perfect matching" problem, which is shown to be in RNC [5]. More specifically, it lies in RNC ∩ co-RNC, i.e., there is a feasible parallel algorithm that runs in expected polylogarithmic time for any input, always giving the correct output [40]. Unfortunately, perfect matching is not yet proved to be either in NC or P-complete8. In fact, as with the P versus NP case, no problem in RNC was ever proved to be P-complete; most researchers believe that randomization is not adequate for solving such difficult problems, neither P-complete (with RNC) nor NP-complete (with RP). Regarding the relations of RNC to the other classes, we expect that RNC is in P (a mere conjecture). Also, it is straightforward to show that NC ⊆ RNC, because the use of the random bits is not mandatory. We know that RNC ⊆ BPP (problems solvable by randomized sequential algorithms with two-sided error in expected polynomial time), which in turn is contained in P/poly according to Adleman's theorem [47]. Figure 4.4 summarizes most of the aforementioned proved class relations with a Hasse diagram (double arrows denote class inclusions and single arrows denote proper class inclusions).
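
To convey the flavour of the randomized perfect-matching test mentioned above, here is a rough sequential sketch of the classical random-substitution test on the Tutte matrix (our own rendering, with an arbitrary prime field): if the graph has no perfect matching the determinant is identically zero, so 'no' instances are never misclassified, while 'yes' instances succeed with high probability. The actual parallelism (not shown here) comes from the fact that the determinant itself is computable in NC.

    # Illustrative only (sequential): randomized Tutte-matrix test.
    # For each edge {i,j} put a random x at (i,j) and -x at (j,i); the
    # determinant of this skew-symmetric matrix is identically zero iff
    # the graph has no perfect matching (Tutte), so a random evaluation
    # over a large prime field errs only on "yes" instances, rarely.
    import random

    P = (1 << 31) - 1                     # a large prime

    def det_mod_p(m, p=P):
        """Determinant of a square matrix modulo prime p (Gaussian elimination)."""
        a = [row[:] for row in m]
        n, det = len(a), 1
        for col in range(n):
            pivot = next((r for r in range(col, n) if a[r][col] % p), None)
            if pivot is None:
                return 0
            if pivot != col:
                a[col], a[pivot] = a[pivot], a[col]
                det = -det
            det = det * a[col][col] % p
            inv = pow(a[col][col], p - 2, p)
            for r in range(col + 1, n):
                factor = a[r][col] * inv % p
                for c in range(col, n):
                    a[r][c] = (a[r][c] - factor * a[col][c]) % p
        return det % p

    def probably_has_perfect_matching(n, edges):
        t = [[0] * n for _ in range(n)]
        for i, j in edges:
            x = random.randrange(1, P)
            t[i][j], t[j][i] = x, -x
        return det_mod_p(t) != 0

    square = [(0, 1), (1, 2), (2, 3), (3, 0)]        # 4-cycle: has a matching
    print(probably_has_perfect_matching(4, square))  # almost surely True
    path3 = [(0, 1), (1, 2)]                         # 3 vertices: cannot have one
    print(probably_has_perfect_matching(3, path3))   # always False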

4.2 P-completeness

Are all problems in the class P equally difficult to solve? A first, negative, answer can be obtained by using their sequential time as a comparison measure: the Time Hierarchy Theorem [5] establishes that some problems require strictly greater amounts of time than others, i.e. the DTIME(n^k) hierarchy is proper. There are other viewpoints, though, from which we cannot answer the above question with certainty9. As it turns out, we cannot even tell whether all P problems are easy enough to be solved in narrow space (polyL), or in moderate parallel time bounds (NC). Separating P from such classes is an issue addressed, among others, by the theory of P-completeness.

8 Recall that the special case PBPM of this problem is in NC2 (examples' subsection).

9 e.g., we also know, from the Space Hierarchy Theorem, that the DSPACE(log^k n) hierarchy is proper. However, we cannot use this result to further divide P with respect to space, because we expect that polyL ⊄ P (the distinguishing problems of each polyL level might lie beyond P). In other words, we cannot tell whether SC is a proper hierarchy.

The most direct method for separating two classes is, arguably, by showing that a specific member of one class does not reside in the other.

To find such distinctive members we naturally search among the most difficult problems of a class (after all, we are working with difficulty classifications). Hence, the theory of P-completeness focuses on identifying the "hardest" problems of P. Indeed, as practice has shown, the P-complete problems are the prime candidate members of P−NC, of P−L, etc. They seem to lack feasible highly parallel, as well as polylogarithmic-space, solutions.

To identify the “hard” problems of a class we use reductions. In general,we say that a problem A is reduced to a problem B, if we can solve anyinstance of A by transforming it to an instance of B, and then, solving B. Thecomplexity of transforming the instances determines the relative difficulty ofthe two problems. Intuitively, if the transformation is easy enough (comparedto the complexity of B), then solving A is at most as difficult as solvingB. Note that we can choose among many types of transformations (andreductions), depending on the underlying complexity class and the intendedapplication. By using reductions, we can partially order the problems of aclass according to their difficulty. The “hardest” problems are those to whichwe can reduce any problem of their class. We call such a problem completeand we can use it as a representative of its class.

Several problems have been identified as P-complete [1]. Even though it is conjectured that they all lie beyond NC, none of them was ever proved to do so. Such a striking result would imply that all P-complete problems lie beyond NC and, moreover, it would separate NC from P once and for all (the same holds for the 'P versus L' conundrum). Decades of failed attempts reveal the magnitude of the challenge and in no case do they invalidate the merit of the theory. Besides provably separating the classes, P-completeness gives us some insight into the limits of parallelization and it expands our knowledge of difficult, or seemingly difficult, problems. In practice, a large set of known P-complete problems can actually guide the designer in solving his own problem efficiently in parallel. For example, identifying a P-complete subproblem suggests that the decomposition of the larger problem should be reformulated. Or, it suggests that some other method should be applied (e.g. approximation). In other cases, there exist problems featuring both highly parallel and inherently sequential solution strategies; the designer can benefit by consulting the theory.

Historically, the notion of P-completeness appeared at roughly the same


time as that of NP-completeness (1970’s) [1]. The motivation was to studythe relations between sequential time and space. Cook was the first to raisethe question of whether everything computable in polynomial time is alsocomputable in polylogarithmic space (‘P versus polyL’, still open). Cookalso gave the first P-completeness result (the “Path Systems” problem), byusing logarithmic-space reducibility. Thereafter, many problems were shownto be P-complete, including the “Circuit Value” by Ladner (1975). Theintroduction of the class NC (Pippenger, 1978) and the importance of P-completeness in the study of parallel computation (Goldschlager et al., 1982)led to several new results by using NC reducibility. Nonetheless, there stillexist numerous problems that are not known to be either P-complete or NC.Such a characteristic example is the Greatest Common Divisor, which, asthe rest of the open problems, is not expected to be in NC (compare to theFactoring problem, which is not expected to be in P). Today, the general–but unproven– consensus is that the P-complete problems are inherentlysequential (i.e. they are feasible but not feasible highly parallel).

In the remainder of this section we study P-completeness for the purposes of parallel computation. We describe the notion of reducibility and its uses. We define the P-complete problems and we give various examples. Moreover, we explain the NC ⊂ P conjecture and we consider its consequences.

4.2.1 Reductions and Complete Problems

One of the central ideas used in P-completeness is the reduction of one prob-lem to another. That is, the idea of finding the solution of one problemby solving another problem. For example, sometimes, we can tell whetherx ∈ LA by transforming the instance x to a new instance f(x) and answeringwhether f(x) ∈ LB. The existence and the complexity of such functions,f , help us order/classify the problems according to their own complexities.Intuitively, if f exists and can be computed with less or equal complexity tothat of deciding LB, then deciding LA cannot be more difficult than decidingLB.

The computational complexity of a function is formalized and studied with the same methods presented in the previous section. Recall that the class NC was defined there only for decision problems. Let us define here the function analog of this class, namely FNC. FNC is the set of functions which can be computed by boolean circuits of polynomial size and polylogarithmic depth [1]. As with the NC hierarchy, FNC is the union of all FNCk classes (circuit depth O(log^k n)). The gates of the FNC circuits have bounded fan-in and the circuit families are logspace uniform. FNC captures the functions that can be computed in a feasible highly parallel manner. Note that, by definition, NC ⊆ FNC. Other important function classes are FP (functions computed by polynomial-time DTMs), FL (by logarithmic-space DTMs), the nondeterministic FNP and FNL, the non-uniform FNL/poly (by NDTMs with polynomial-size advices), etc.

We continue by formalizing the notion of reductions [1].

Definition 4.2.1. A language LA is many-one reducible to a language LB, written LA ≤m LB, if there is a function f such that x ∈ LA if and only if f(x) ∈ LB.
We say that LA is P many-one reducible to LB, written LA ≤P_m LB, if and only if the function f is in FP.
We say that LA is logspace many-one reducible to LB, written LA ≤L_m LB, if and only if the function f is in FL.
We say that LA is NCk many-one reducible to LB, for k ≥ 1, written LA ≤NCk_m LB, if and only if the function f is in FNCk.
We say that LA is NC many-one reducible to LB, written LA ≤NC_m LB, if and only if the function f is in FNC.

We call the above reductions many-one because many distinct instances ofthe original problem A can be mapped to a single instance of the new problemB. A generalization of the many-one reducibility is the Turing reducibility.The idea is that we are not limited in a single transformation f(x), andthus, to a single question of the form f(x) ∈ LB in order to answer whetherx ∈ LA. Instead, we are allowed to use a machine, which can issue multiplequestions to an oracle giving answers for the problem B. In practice, multiplequeries are more useful when we turn from decision to search problems10.We will define Turing reducibility here by using PRAMs equipped with anextra instruction-call to query their oracle (the B-oracle). Notice that, ifwe assume a unit cost for the instruction-call, the complexity of the Turingreduction captures the difficulty of reducing A to B independently of the truecomplexity of B. Formally [1],

Definition 4.2.2. A search problem A is Turing reducible to a search problemB, written A ≤T B, if and only if there is a B-oracle-PRAM that solves A.

10i.e. find a value (a string, a path, etc) with a specific property. When the answer toa search problem is unique (for any input), then we have a function problem.


We say that A is P Turing reducible to B, written A ≤P_T B, if and only if the B-oracle-PRAM on inputs of length n uses n^O(1) time and n^O(1) processors.
We say that A is NC Turing reducible to B, written A ≤NC_T B, if and only if the B-oracle-PRAM on inputs of length n uses log^O(1) n time and n^O(1) processors.

Similar to the PRAM extension with oracles, we define the B-oracle circuit families. These circuits include oracle gates for solving B. Such a gate is considered as a vertex with k inputs and l outputs. The k-bit input will be an encoding of an instance of the problem B, and the l-bit output will be the encoding of the corresponding solution to B. By convention, the oracle gate has depth log(k + l) and size (k + l). Oracle circuit families serve the same purpose as oracle PRAMs. They are introduced to study a more detailed reduction than the ≤NC_T defined above. Specifically [1],

Definition 4.2.3. A search problem A is NCk Turing reducible to a search problem B, for k ≥ 1, written A ≤NCk_T B, if and only if there is a uniform B-oracle circuit family {Cn} that solves A, and each member Cn has depth O(log^k n) and size n^O(1).

As already mentioned, reductions are often used to determine whether a problem is more difficult than another (either in terms of complexity or computability). Furthermore, we can use reductions to determine whether two problems are equally difficult [1].

Definition 4.2.4. Suppose that for some reducibility ≤ we have A ≤ B andB ≤ A. Then we say that A and B are equivalent under ≤.

The above reductions have certain properties, which can be used to reach conclusions more easily and quickly. For instance, we know that they are transitive, i.e. if A ≤ B and B ≤ Γ, then A ≤ Γ. To prove such a claim for a many-one reduction, we must compose the two functions transforming the instances of the two given problems. For this, notice that the classes FP, FL, FNCk and FNC are closed under composition [1] (this can be shown with straightforward simulations, except for FL, which requires the technique used in the proof of theorem 4.1.6 for passing the "intermediate result" [5]). To prove the claim for a Turing reduction, it is straightforward to construct a machine (circuit) utilizing the second given oracle and simulating the operations of the two given machines (e.g. by interleaving their computations). Overall [1],


Proposition 4.2.1. All the reductions given in definitions 4.2.1, 4.2.2 and4.2.3 are transitive.

Moreover, since Turing reducibility is a generalization of many-one reducibility, it is straightforward to show that [1]

Proposition 4.2.2. If A ≤m B, then A ≤T B. If A ≤P_m B, then A ≤P_T B. If A ≤NC_m B, then A ≤NC_T B.

The aforementioned reductions are useful in parallel computation for spe-cific reasons. In some cases, we want our transformation to be in FNC (orin L) so that we can solve our problem highly parallel (i.e., transform it andthen solve another highly parallel problem). In other cases, we might simplywant to show the complexity class of a specific problem (by comparing toother problems of that class). In most cases, we are interested in orderingproblems within P, and NC, according to their complexity. Let us elaborateon the latter two situations, which involve comparisons. When comparingproblems through reductions, one should be careful with the design of thetransformation function or the oracle machine: the reduction should be nomore complex than the problem itself (e.g., the exponential cost of a falselychosen transformation could mask the polynomial cost of the problems underexamination and thus, mislead the comparison). Therefore, before continu-ing, we define the compatibility of a reduction and we explain why the afore-mentioned reductions are suitable for the purposes of P-completeness [1].

Definition 4.2.5. Let ≤ be a resource bounded reducibility and let C be acomplexity class. We say that ≤ is compatible with C if and only if for allproblems A and B, when A ≤ B and B ∈ C, then A ∈ C.

It is easy to see that the resource-bounded many-one reductions given in definition 4.2.1 are compatible with P. That is, if we reduce a problem A (of unknown complexity) to a problem B ∈ P by using one of the ≤P_m, ≤L_m, ≤NCk_m, ≤NC_m reductions, we actually deduce that A ∈ P (the transformation of the input x can be computed in polynomial time, as well as the solution of B). Similarly, the many-one reductions ≤L_m and ≤NC_m are compatible with NC, and the many-one reduction ≤NCk_m is compatible with NCk. The Turing reductions ≤P_T and ≤NC_T are also compatible with P and NC respectively (we can replace the oracle instruction by an actual machine, as we know that the number of the oracle "executions" is bounded by the running time of the original oracle-PRAM). The ≤NC1_T reduction is compatible with NCk for each k ≥ 1, as it preserves circuit depth. Note finally that, for any reduction ≤ compatible with C and for any problems A, B, if A ≤ B and A ∉ C, then B ∉ C. This is an immediate consequence of definition 4.2.5, which can be used for proving that a problem does not reside in a particular class C.

Having defined the above tools for ordering and classifying problems, wecan now give the notion of a complete problem. Assume that for some classC and some problem B ∈ C, we know that every problem A ∈ C can bereduced to B by using a reduction compatible with C (∀A,A ≤ B). That is,we know that B is at least as “difficult” as any other problem in C and thatit also resides in C. Such a problem is called C-complete11. A C-completeproblem can be viewed as a representative of the class C. In certain cases,proving a complexity result for that problem is as good as proving a result forevery problem within C. For instance, consider that in order to show a classinclusion C1 ⊆ C2, it suffices to show that a C1-complete problem resides inC2 (because no problem in C1 is more difficult than the C1-complete problem,which itself can be solved under C2 constraints). Further, when C1 ⊆ C2, toshow that the two classes are equal, it suffices to show that a C2-completeproblem resides in C1 (which implies that C2 ⊆ C1). Overall, when a C1-complete problem resides in C2, and at the same time a C2-complete problemresides in C1, then C1 = C2.

As the title of this section indicates, we will focus on a specific set ofcomplete problems: the P-complete problems. Formally [1],

Definition 4.2.6.
A language LB is "P-hard under NC reducibility" if LA ≤NC_T LB for every LA ∈ P.
A language LB is "P-complete under NC reducibility" if LB ∈ P and LB is P-hard.

Definition 4.2.6 can be extended, in the obvious way, to include function problems (FP-complete) and search problems (quasi-P-complete). Note that one can give P-completeness results using reductions different from the NC one (weaker forms of reducibility, possibly down to NC1 many-one). However, we avoid P reducibility due to the NC ⊂ P conjecture (we avoid allowing more computational power to the transformation than to the "complete" problem itself).

11 Not all classes have complete problems. For instance, no complete problems are known for "semantic" classes like RP and BPP (Randomized Polynomial-Time and Bounded-Error Probabilistic Polynomial-Time). Moreover, it is believed that some classes do not have complete problems at all. Such an example is the Polynomial Hierarchy (a non-semantic class), for which we know that the existence of a PH-complete problem implies the collapse of PH to some finite level [5]. Note that the same holds for the NC hierarchy, which would collapse at the level of the alleged complete problem (it also collapses if NC=P).

Proofs and examples of P-complete problems

To exemplify the use of the above, we present a number of P-complete problems and some commonly used proof techniques. We begin with the Circuit Value Problem (CVP), which is probably the most famous member of the class. CVP is a decision problem: given the description of a boolean circuit Cn and a bit vector v = [v1, . . . , vn], is Cn(v) = 1?

Theorem 4.2.3. CVP is P-complete

Proof. It suffices to show that CVP ∈ P and that A ≤L_m CVP for every A ∈ P (note that ≤L_m is compatible with P and that it is a weaker form of reducibility than ≤NC_T, which was used in definition 4.2.6).

It is straightforward to evaluate the output of the circuit in polynomial time: read its description and evaluate every gate in a bottom-up fashion.

To show that every A ∈ P can be reduced to CVP, recall Theorem 3.3.1. Since problem A has a polynomial-time DTM MA, we can construct a circuit C^A_n to simulate MA(x), |x| = n, by using a logarithmic-space DTM. Therefore, any input x of A can be transformed into a string <x, C^A_n> such that x ∈ A if and only if <x, C^A_n> ∈ CVP. Since the above transformation requires only logarithmic space, we deduce that A ≤L_m CVP for every A ∈ P.
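
The easy half of the proof, CVP ∈ P, amounts to the following one-pass evaluator (a sketch; the tuple-based circuit encoding is our own assumption, not the encoding used in the text).

    # Illustrative only: bottom-up evaluation of a circuit given as a
    # topologically ordered list of gates; each gate is ("INPUT", i),
    # ("NOT", j), ("AND", j, k) or ("OR", j, k), with j, k indexing
    # earlier gates.  One linear pass computes every gate value.
    def circuit_value(gates, x):
        val = []
        for g in gates:
            kind = g[0]
            if kind == "INPUT":
                val.append(x[g[1]])
            elif kind == "NOT":
                val.append(1 - val[g[1]])
            elif kind == "AND":
                val.append(val[g[1]] & val[g[2]])
            else:  # "OR"
                val.append(val[g[1]] | val[g[2]])
        return val[-1]                       # value of the output gate

    # (x0 AND x1) OR (NOT x2)
    gates = [("INPUT", 0), ("INPUT", 1), ("INPUT", 2),
             ("AND", 0, 1), ("NOT", 2), ("OR", 3, 4)]
    print(circuit_value(gates, [1, 0, 0]))   # 1  (NOT x2 is true)
    print(circuit_value(gates, [1, 0, 1]))   # 0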

Having proved that CVP is P-complete, we can use it to show other P-completeness results. Specifically, to establish that A ∈ P is P-complete,it suffices to give a compatible reduction CVP ≤ A. Such a result impliesthat any problem B ∈ P can be reduced to A by composing the two trans-formations (B ≤ CVP and CVP ≤ A), e.g., by using the “communication”technique used in the proof of Theorem 4.1.6. In fact, CVP plays the samerole in P-completeness theory that the Satisfiability Problem (SAT) does in


NP-completeness theory12. Analogous to SAT, CVP is most often used toshow that other problems are P-complete. Moreover, as with SAT, manyvariants of the CVP are also P-complete. This holds even for restricted CVPvariants, which can sometimes simplify a proof. Such a case is examined next:the Monotone CVP. MCVP is the same problem, except that the given circuitconsists only of AND and OR gates (no NOT gates). Nonetheless [1],

Theorem 4.2.4. MCVP is P-complete

Proof. First, note that, as CVP ∈ P, so is MCVP ∈ P. To show that MCVP is P-hard, we will reduce CVP to it.

We construct a DTM which inputs an instance of CVP, i.e. <x, Cn>, and constructs an instance of MCVP, i.e. <x′, C′m> with |x′| = m = 2n = 2|x|. First, the DTM inverts every bit of x and by concatenation constructs x′ = <x; x̄>. This action will allow the removal of any NOT gate from the first level of Cn (by using a connection directly to the inverted bit of the input). We extend this idea to the remaining levels of Cn by using a classical technique of circuit design, called double-rail logic (the idea is that for every signal-edge within Cn, there will also be a new signal of the opposite value, without using NOT gates). The DTM reads the description of Cn and, when it encounters an AND gate vi with vi = vj ∧ vk, it outputs two gates: vi and v̄i = v̄j ∨ v̄k (the bar here denotes only a symbol, not an operation). Similarly, for an OR gate vi = vj ∨ vk it outputs vi and v̄i = v̄j ∧ v̄k. For a NOT gate vi = ¬vj it outputs vi = 1 ∧ v̄j and v̄i = 0 ∨ vj.

Basically, this is a recursive definition of the circuit construction. It is easy to show by induction that, starting from x and x̄, for any signal si of Cn there is a signal s̄i of C′m such that s̄i = ¬si (the new circuit C′m is double the size of Cn). This allows C′m to have the same functionality as Cn, but without using any NOT gates (when necessary, a node connects directly to the inverted signal). We are interested in the output gate of C′m, which has the same value as Cn; that is, <x, Cn> ∈ CVP iff <x′, C′m> ∈ MCVP. It is clear that the above computations can be performed by the DTM in logspace (simple counters and indexing variables are required) and thus CVP ≤L_m MCVP.
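
The double-rail step can be sketched on the gate encoding of the previous CVP example (everything about the encoding is our assumption, and we replace the proof's constant gates for NOT by a monotone "copy" gate, which has the same effect). Rail 2i+1 carries the original value of gate i and rail 2i its complement, so the output value ends up on the last rail; the sketch reuses circuit_value from the CVP example above.

    # Illustrative only: double-rail transformation.  Gate i of the input
    # circuit becomes rails (2i, 2i+1) in the output: 2i = complement,
    # 2i+1 = value.  The new input vector is x followed by its bitwise
    # complement, so NOT gates become wiring to the opposite rail.
    def double_rail(gates, n_inputs):
        out = []
        for kind, *args in gates:
            if kind == "INPUT":
                i = args[0]
                out += [("INPUT", n_inputs + i), ("INPUT", i)]   # NOT x_i, x_i
            elif kind == "AND":
                j, k = args
                out += [("OR", 2 * j, 2 * k), ("AND", 2 * j + 1, 2 * k + 1)]
            elif kind == "OR":
                j, k = args
                out += [("AND", 2 * j, 2 * k), ("OR", 2 * j + 1, 2 * k + 1)]
            else:  # "NOT": swap the two rails of its argument (monotone copies)
                j = args[0]
                out += [("OR", 2 * j + 1, 2 * j + 1), ("OR", 2 * j, 2 * j)]
        return out      # the original output value sits on the last rail

    gates = [("INPUT", 0), ("INPUT", 1), ("INPUT", 2),
             ("AND", 0, 1), ("NOT", 2), ("OR", 3, 4)]
    x = [1, 0, 1]
    mono = double_rail(gates, 3)
    print(circuit_value(mono, x + [1 - b for b in x]))   # 0, same as before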

12 SAT is a well-known decision problem: given a boolean expression φ (usually in conjunctive normal form, CNF), is it satisfiable? SAT is computationally equivalent to the Circuit-SAT problem (i.e., for a given circuit Cn, is there an input assignment so that Cn will output '1'?) [5].


Other variants of the CVP that are P-complete are [1]:

¦ NAND-CVP: The circuit consists only of NAND gates. Reductions areoften simplified when only one type of gate needs to be simulated.

¦ Top-CVP: Topological CVP, i.e. the vertices in the circuit are numberedand listed in topological order.

¦ P-CVP: Planar CVP, where the circuit is a planar graph, i.e. it can bedrawn in the plane with no edges crossing. Note that Monotone-Planar-CVP ∈ NC.

¦ AM2-CVP: Alternating Monotone Fanin 2, Fanout 2 CVP, where on anypath from an input to an output, the gates are required to alternatebetween OR and AND.

¦ Arith-CVP: Arithmetic CVP, where we are given an arithmetic circuit withdyadic operations +,−, ∗, together with inputs x1, . . . , xn from a ring.

There are numerous P-complete problems, which are related to circuit complexity, graph theory, searching graphs, combinatorial optimization and flow, local optimality, logic, formal languages, algebra, geometry, real analysis, games, etc. [1]. For each of the above categories, we list one representative P-complete problem below:

¦ CVP: Circuit Value Problem (see above).

¦ LFMIS: Lexicographically First Maximal Independent Set. Given agraph G with ordered vertices and a designated vertex v, is v in thelexicographically first maximal independent set of G?

¦ BDS: Breadth-depth Search. Given an undirected graph G with anumbering on the vertices, and two designated vertices u and v, isvertex u visited before vertex v in the breadth-depth first search of Ginduced by the vertex numbering?

¦ LP: Linear Programming. Given an integer n× d matrix A, an integern × 1 vector b, and an integer 1 × d vector c, find a rational d × 1vector x such that Ax ≤ b and cx is maximized. Typically, this is anFP-complete problem.


¦ UNAE3SAT: Unweighted, Not-all-equal Clauses, 3SAT, FLIP. Givena 3-CNF boolean formula φ, find a locally optimal assignment for φ.A truth assignment satisfies a clause under the not-all-equals criterionwhen it has at least one true and one false literal. The cost of theassignment is the number of the satisfied clauses. An assignment s islocally optimal if it has maximum cost among its neighbors, where theneighbors are assignments that can be obtained from s by flipping thevalue of one variable. Typically, this is an FP-complete problem.

¦ HORN-SAT: The CNF SAT with the special case of Horn clauses, i.e.each clause is a disjunction of literals having at most one positive literal.

¦ Memb-CFG: Context-free Grammar Membership. Given a context-freegrammar G and a string x, is x ∈ L(G)?

¦ IM: Iterated Mod. Given the integers a and b1, . . . , bn, is (. . . ((a mod b1) mod b2) . . . ) mod bn = 0? (A one-line sketch of this check follows the list.)

¦ PHULL: Point Location on A Convex Hull. Given an integer d, a setS of n points in Qd, and a designated point p ∈ Qd, is p on the convexhull of S?

¦ IIRF: Inverting An Injective Real Function. Given an NC real function f defined on [0, 1], compute x0 with error less than 2^−n, such that f(x0) = 0. The function is increasing and has the property that f(0) < 0 < f(1) (it has a unique root). A real function f is in NC if an approximation to f(x) with error less than 2^−n, for x ∈ [−2^n, +2^n], can be computed in NC.

¦ LIFE: Game of Life. Given an initial configuration of the “Game ofLife”, a time bound T expressed in unary, and a designated cell c ofthe grid, is cell c live at time T?
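
As the list suggests, some P-complete problems look deceptively simple. A minimal sketch of the IM check mentioned above (our own code): the loop is a chain of data-dependent steps, and it is exactly this left-to-right dependence that resists parallelization.

    # Illustrative only: the Iterated Mod (IM) decision from the list above.
    def iterated_mod_is_zero(a, bs):
        for b in bs:          # each step depends on the previous result
            a = a % b
        return a == 0

    print(iterated_mod_is_zero(100, [7, 5, 2]))   # 100%7=2, 2%5=2, 2%2=0 -> True
    print(iterated_mod_is_zero(100, [7, 5, 3]))   # 2%3=2 -> False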

We conclude this subsection with a P-complete problem which can give us some insight into the forthcoming 'NC versus P' question. In the Generic Machine Simulation Problem (GMSP) [1], we are given the description <M> of a Turing machine, an integer T coded in unary, and an input string x. We are asked whether M accepts x within T steps. The solution to this problem is essentially a Universal Turing Machine (UTM) equipped with a counter for keeping track of the time T. The reason that T is given in unary is to avoid the exponential time cost of the UTM13. We give the following proof by using NC1 Turing reducibility [1]:

Theorem 4.2.5. GMSP is P-complete

Proof. First, note that GMSP ∈ P. A modified UTM′ will simulate T steps ofthe given machine. For each one of them it will simply read the description<M > of the given machine together with its input x. The execution time ispolynomially bounded by the length of the UTM′ input, because the inputincludes both T (in unary) and <M ; x>.

We now show how to reduce any A ∈ P to the GMSP. Assume that MA decides A and that p(n) is an easily computable function denoting an upper bound on the running time of MA. Both MA and p(n) are considered known for the purposes of this proof. Hence, when given an input x for A, with |x| = n, we can transform it to the string fA(x) = x; <MA>; 1^p(n). Clearly, x ∈ A iff fA(x) ∈ GMSP.

We argue here that fA(x) can be computed with NC1 circuits. The first part of fA(x) is a plain copy of the input x to the output of the circuit (i.e. depth O(1), size O(|x|)). The second part is a straightforward generation of a description string which does not depend on x (hardwired circuitry of constant size and depth). The third part involves an actual computation, that of p(n). Recall that this is a uniform circuit, described by a DTM. The DTM must count up to n, compute p(n) and generate the corresponding number of gates, all in logspace. Notice that p(n) need not be a tight bound on MA, just a polynomial that is easy to compute. Therefore, we can choose p(n) = 2^(k log n) for some appropriate constant k. This computation results in a '1' followed by k·log n 0's, i.e. it can be stored in logspace. It only remains to 'expand' this number to its unary representation at the output of the DTM (in the form of gates). This can be done by successively decreasing the computed number by one and appending a gate description to the output tape (the subtraction is also logspace). The resulting circuit has constant depth and polynomial size. Also, the family of the above circuits is highly uniform.

13Algorithms that run polynomially in the length of their input just because their inputis represented in unary are called pseudopolynomial. This characterization is related tothe notion of strong NP-completeness; we know that, unless P=NP, strong NP-completeproblems do not have even pseudopolynomial algorithms [5].


4.2.2 NC versus P

The ‘NC versus P’ question is analogous to the ‘P versus NP’ conundrum,which is encountered in the theory of NP-completeness. There, we askwhether all problems with feasible verifiers14 also have feasible solutions(sequential polynomial time). Here, we ask whether all problems with fea-sible solutions also have feasible highly parallel solutions (polynomial workin polylogarithmic time). Most researchers believe that the answers to bothquestions are negative, but up until today both of them remain open.

According to the previous subsection, to show that NC=P it suffices to show that a P-complete problem resides in NC. From a circuit point of view (see CVP), we must come up with a technique to compute the value of a polynomial-size circuit in polylogarithmic time by using a polynomial number of processors. From a DTM point of view (see GMSP), we must come up with an interpreter/compiler which inputs sequential code and produces equivalent feasible highly parallel code. The obstacles encountered in both directions seem rather impossible to overcome. Many theoretical efforts have failed, while at the same time the approaches followed in industry lead only to moderate speedups. Everyday experience shows that the P-complete problems are quite difficult to parallelize and, most likely, they are inherently sequential, as we say. Next, we give some of the evidence that NC ≠ P and we discuss some of the implications.

Evidence that NC ≠ P

Let us begin by reporting that monotone-NC ≠ monotone-P [42]. That is, provably, we can separate the two classes if we confine ourselves to monotone functions (f is monotone if flipping a bit from '0' to '1' in any argument of f cannot cause the value of f to change from '1' to '0'). However, the separation of the general NC and P classes is a more difficult task15. As mentioned above, if we could just discover how to simulate efficiently in parallel every polynomial DTM, then every feasible sequential computation could be translated automatically into a highly parallel form. Besides the DTM, this observation holds for any programming language.

14a feasible verifier is a polynomial-time sequential algorithm, which can verify thecorrectness of an answer to a problem if it is given a succinct certificate (witness, proof)for that specific answer [48].

15 The same holds for P versus NP, for which we also know that monotone-P ≠ monotone-NP.


However, such a highly parallel interpreter is not likely to exist. This machine should be able to deduce nontrivial properties of programs just by reading their code. In many cases, finding these properties is provably an intractable, or even undecidable, problem. Modern compilers exploit only properties that tend to be very syntactic, local, or relatively simplistic. Their optimization techniques rarely make radical alterations to the set of intermediate values computed by the program (either to the method or to the computation order). Overall, state-of-the-art parallelizing compilers generate code with modest execution speedup (some constant factor, of great importance in the industry, but rather insignificant in our theoretical endeavors) [1].

The evidence that general simulations cannot be performed fast is strengthened by taking a closer look at known highly parallel algorithms. In many cases, all such algorithms are strikingly different from good sequential algorithms for the same problem. Therefore, their automatic generation from their sequential counterparts seems far beyond the capabilities of modern parallelizing compilers [1].

Despite the above –empirical– observations, the literature includes many efforts to speed up the execution of a general sequential machine. Alas, all of them require parallel machines with exponentially many processors. For instance, we know that any DTM running in time T can be simulated by a CREW-PRAM in O(√T) time, but with 2^{ω(√T)} processors (by precomputing in parallel the huge table of the DTM transition function and then performing rapid table look-ups). Similar results have been given for general RAM simulations. For circuits, we know that any bounded fan-in boolean circuit with size S has an equivalent circuit with depth O(S/log S) and size 2^{Ω(S/log S)}. We have similar results for circuit simulations of general DTMs. Note that, when we impose a feasibility constraint on the aforementioned techniques, we get only a constant speedup [1].
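To get a feel for why such simulations are infeasible, consider a rough counting sketch of the table-lookup idea (the phase length √T is an illustrative choice made here, not the exact construction analyzed in [1]). Cut the T-step run into √T phases of √T steps each; within one phase the head visits a window of only O(√T) cells, so the outcome of a phase is determined by the current state and that window:

    distinct (state, window) pairs ≈ |Q| · |Γ|^{Θ(√T)} = 2^{Θ(√T)},
    parallel time ≈ O(√T) to fill the table (one processor per entry) + √T look-ups of O(1) time each = O(√T).

The parallel time indeed drops to O(√T), but only because the precomputed table, and hence the processor count, is already exponential in √T.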

Indeed, efficient parallel general simulations seem impossible. What if we drop the ‘generality’ requirement? As expected, such a cutback allows us to design feasible highly parallel simulators for weaker relatives of the polynomial-time DTM. We can simulate machines solving only specific subclasses of P, depending on their allowed resources. Unfortunately, these subclasses seem to be separated from P by an exponential gap. This exponential gap, hard as it is to bridge, is yet another piece of evidence that NC ≠ P.

We will mention here [1] some of these machines together with their inferiority to the –general– polynomial-time DTM. Usually, to obtain such a convenient machine we have to impose a constraint on two kinds of resources simultaneously: time and space, time and processors, space and tree size, etc. First, consider the examples of a Finite State Machine and of a logspace DTM. Both machines decide only a portion of P (to be precise, for the second we know that L ⊆ NC² and we believe that L ≠ P –open problem). Both can be simulated highly parallel due to the following observation. Intuitively, they carry little ‘state’ information. Therefore, the first and second halves of their computation are only loosely coupled. We can afford to simulate the two halves (recursively) by trying all midpoint ‘states’ in parallel, and selecting the correct one once the first half is known. Note here that we have to supply these machines with an exponentially greater amount of resources in order to decide any problem in P. As another example, consider a generalization of the logspace DTM: a logspace and 2^{log^{O(1)} n}-time bounded auxiliary Push Down Automaton, or a logspace and 2^{log^{O(1)} n}-tree-size bounded ATM. As before, the limited space and the loosely connected subtrees of the computation allow for highly parallel simulators. Note, however, that a nearly exponential gap remains before we are certain that this machine can decide any problem in P. More examples exist, which reveal the gap between the generic NC simulation and the generic P simulation. In all cases, if we relax one of the resource constraints (the gap), the known highly parallel simulation degrades to a polynomial-time one. Given the above picture, one could say that NC=P will follow only if we find out that a nearly exponential increase in some resource adds no extra power to a model [1].
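As a toy illustration of the midpoint technique (a sketch written for this discussion, not an algorithm taken from [1]), the routine below evaluates a deterministic finite automaton by guessing the state at the middle of the input and recursively verifying both halves. In an actual parallel implementation every candidate midpoint state would be checked by its own group of processors, so the O(log n) recursion depth corresponds to the parallel time; here the parallel branches are simply written as a loop.

    # Divide-and-conquer evaluation of a DFA: 'try all midpoint states in parallel,
    # and select the correct one once the first half is known'. The loop over
    # candidate states stands in for independent parallel branches.

    def run_segment(delta, states, q_start, word, lo, hi):
        """State reached from q_start after reading word[lo:hi]."""
        if hi - lo == 1:                       # base case: a single transition
            return delta[(q_start, word[lo])]
        mid = (lo + hi) // 2
        for q_mid in states:                   # candidate midpoint states
            if run_segment(delta, states, q_start, word, lo, mid) == q_mid:
                return run_segment(delta, states, q_mid, word, mid, hi)
        raise ValueError("a deterministic automaton always reaches some midpoint state")

    # Example: parity automaton over {0,1}, accepting iff the number of 1s is even.
    states = {"even", "odd"}
    delta = {("even", "0"): "even", ("even", "1"): "odd",
             ("odd", "0"): "odd",   ("odd", "1"): "even"}
    print(run_segment(delta, states, "even", "1101001", 0, 7))   # -> 'even'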

As a final piece of evidence that NC ≠ P, we mention that natural approaches provably fail [1]. Many approaches to designing a general highly parallel simulation come “naturally” to those who have studied the problem. One can prove that their failure is intrinsic. That is, one can formulate an abstract model embodying all these ideas and prove that it is impossible to achieve a result strong enough to show that NC equals P.

All of the evidence given in this subsection may be quite compelling, but it is ultimately inconclusive. It does not prove that a feasible highly parallel simulator does not exist. For all that we know, it is still possible that NC equals P, or NP, or even PH. However, any such result would be a surprise to the community.

Inherently sequential, strong & strict completeness, approximability

We have seen that the P-complete problems are the most difficult problems of P and that they most probably do not reside in NC (if one is not in NC, then none is). Such a concession necessitates the use of alternative, feasible, methods for speeding up their solutions: randomization or approximation. Are all difficult problems amenable to these methods? The answer to this question requires further study of the P-complete class and it involves tools and notions, which are briefly described in the following paragraphs.

Inherently sequential is a notion introduced to permit the classification of algorithms with respect to their potential parallelization [1]. It targets those algorithms for which parallelization does not result in significant time savings. Note that the term is also used to characterize problems, namely those which have feasible sequential solutions but have –or seem to have– no feasible highly parallel solution. To highlight the difference between the two usages of the term we mention that, even though we know of no provably inherently sequential problem (recall that NC =? P is open), we know of many provably inherently sequential algorithms. Further, we know that some problems have both an inherently sequential algorithm and a feasible highly parallel one. Let us elaborate by giving the exact definition of the inherently sequential RAM algorithm. The definition examines the “nature” of the intermediate operations of the algorithm. Intuitively, the dependencies between the algorithm's intermediate values determine the degree to which we can parallelize the operations. Technically, a RAM algorithm Π is inherently sequential if the language L_Π = { x#i#j | bit i of f_Π(x) is j } is P-complete; the flow function f_Π(x) inputs x (the input of Π) and outputs a string containing every intermediate value of the computation Π(x) in order. In other words, an algorithm is considered inherently sequential when its flow function is at least as difficult to compute as a P-complete problem. Some of the algorithms that have been proved to be inherently sequential are the standard depth-first search, the greedy algorithm for computing a lexicographically first maximal path, the Gaussian elimination with partial pivoting for solving a system of equations, etc [1]. Notice though that the maximal path can be computed highly parallel by using randomization and that solving a system of equations is in NC. After all, proving that an algorithm is inherently sequential is only a clue that the problem might not be efficiently parallelized. A general result is that any polynomial-time algorithm computing the solution to a P-complete problem is indeed inherently sequential. As with the DTMs of P-complete problems, it is widely believed that the inherently sequential algorithms cannot be automatically parallelized by compilers (even when they solve an NC problem).
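The flow-function formalism becomes clearer with a toy instrumentation (an arbitrary illustration, not the formal construction of [1]): run the algorithm, record every intermediate value in order, and let the language ask for single bits of that transcript. The sketch below does this for a naive first-fit packing routine; the algorithm and the encoding of the transcript are arbitrary choices made for the demonstration.

    # Toy illustration of a flow function: record every intermediate value of a
    # computation, in order, and expose single bits of the resulting transcript
    # (the analogue of asking whether x#i#j belongs to L_Pi).

    def greedy_first_fit(sizes, capacity):
        """Greedy first-fit packing; yields every intermediate bin configuration."""
        bins = []
        for s in sizes:
            for b in bins:
                if sum(b) + s <= capacity:
                    b.append(s)
                    break
            else:
                bins.append([s])
            yield [list(b) for b in bins]      # one intermediate value per step

    def flow_function(sizes, capacity):
        """Concatenate binary encodings of all intermediate values, in order."""
        transcript = ";".join(str(state) for state in greedy_first_fit(sizes, capacity))
        return "".join(f"{byte:08b}" for byte in transcript.encode())

    def bit_of_flow(sizes, capacity, i):
        """Return bit i of the flow function (the toy analogue of deciding L_Pi)."""
        return flow_function(sizes, capacity)[i]

    print(bit_of_flow([4, 3, 5, 2], capacity=7, i=10))

For an inherently sequential algorithm, deciding such bit queries is as hard as a P-complete problem; the toy only serves to make the object L_Π concrete.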

Another type of algorithm studied in P-completeness is the pseudo-NC algorithm [1]. It is called pseudo because it features NC performance only when given its input in unary encoding. Specifically, in number problems, replacing the binary input with unary results in an exponential increase of the length of the input. Hence, a problem might be solved efficiently in terms of this input length. Another characteristic of such problems is that, even with binary inputs, there exist efficient solutions under the assumption that the numbers involved in the computation are restricted to small integers. For these reasons, problems with pseudo-NC algorithms hold a special place within the FP-complete class (they are considered border cases, in a sense easier than the others). This distinction gives rise to the notion of strong P-completeness. By definition, a problem is strongly P-complete if there exists a polynomial p such that the problem remains P-complete even when the numbers contained in its input (its instances x) have magnitudes of at most p(|x|). That is, a problem is strongly P-complete when it cannot be solved by a pseudo-NC algorithm, unless of course NC=P. Note that problems which do not involve numbers are all strongly P-complete. Examples of strongly P-complete problems are the CVP, the Linear Inequalities problem (a variation of LP), First Fit Decreasing Bin Packing, etc [1].
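Writing the definition symbolically (the restriction Π_p and the notation max(x), for the largest magnitude of a number occurring in instance x, are introduced here only for convenience):

    Π_p = { x ∈ Π : max(x) ≤ p(|x|) },    Π is strongly P-complete ⟺ ∃ polynomial p such that Π_p is P-complete.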

Strongly P-complete problems cannot be solved by a fully NC approximation scheme, unless NC=P [1]. One way of attacking a problem once it is proven P-complete is to approximate it. Roughly, assume that we are given a combinatorial optimization problem Π and that we try to maximize/minimize some quantity. Can we obtain a near optimal solution by using an NC algorithm, instead of using an inherently sequential one to obtain the optimal? In some cases, this is plausible. Formally, for an instance I ∈ Π, let OPT(I) denote the optimal solution and A(I) the solution provided by some NC-approximation algorithm (running in parallel time log^{O(1)}(|I|) and using |I|^{O(1)} processors). R_A(I) = A(I)/OPT(I) defines the approximation ratio of A for that specific instance I (generally, we are interested in the infimum of R_A over all instances of the problem Π). We say that an NC algorithm A with inputs ε > 0 and I ∈ Π is an NC approximation scheme for Π if and only if it delivers a candidate solution on instance I with ratio R_A(I) ≤ 1 + ε. Moreover, we call this algorithm a fully NC approximation scheme for Π (FNCAS) if the NC constraints are also met against the quantity 1/ε besides the length |I| (i.e. polylogarithmic parallel time and polynomial processors in 1/ε). The difference between the two approximations (simple and fully) is that when we try to approximate the solution very closely (ε → 0) the former might require exponential resources, while the latter only NC. Provably, the existence of a FNCAS for a problem Π implies the existence of a pseudo-NC algorithm for Π [1]. This is actually the contraposition of the first statement of the paragraph, which establishes that not all problems are approximable. Such an example is the LFMIS-size problem (LFMIS –section 4.2.1– but asking for the size of the set). Examples of fully NC approximable problems are the First Fit Decreasing Bin Packing, the 0-1 Knapsack, the Makespan, the List Scheduling, etc^16 [1].
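To summarize the two notions side by side (a restatement of the constraints just given; |I| is the instance length and ε the accuracy parameter):

    NC approximation scheme:   for every fixed ε > 0 and every I ∈ Π, R_A(I) ≤ 1 + ε within log^{O(1)}|I| parallel time on |I|^{O(1)} processors;
    FNCAS:                     the same ratio guarantee, with the parallel time polylogarithmic and the processor count polynomial in both |I| and 1/ε.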

In another direction, when it is too much to ask for a shift from polynomial to polylogarithmic time, we must raise the question of a limited polynomial parallel speedup (e.g. speedup = √n). We already have a positive answer on that for some problems in P, but not for all of them [1]. The theory of strict P-completeness studies P under this prism [52]. It tries to identify complete problems that exhibit some parallel speedup, and for which any further performance improvement would imply that all problems in P can have a limited polynomial speedup. The key idea is to use a reduction preserving speedup. That is, find NC reductions requiring ‘little’ transformation resources; with such reductions, any speedup of the P-complete problem could also have an impact on the reduced problem (the running time is not ‘masked’ by the reduction itself). Roughly, we define as strict T(n)-P-complete a problem with parallel running time Õ(T(n)) when^17, for any problem A ∈ P, the value T(f_A(x)) is bounded by the sequential RAM time of A (f_A is the many-one NC reduction of A). The first strict √n-P-complete problem was the Square Circuit Value Problem (boolean, monotone, alternating, synchronous, fan-in = 2 and square, i.e. the number of gates at every level equals its depth).
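Informally restated (the precise formulation is given in [52]; seqtime_A(x) stands here for the running time of the best sequential RAM algorithm for A on input x):

    B is strict T(n)-P-complete when
    (i)  B is solvable in parallel time Õ(T(n)), and
    (ii) for every A ∈ P there is an NC many-one reduction f_A with T(|f_A(x)|) = Õ(seqtime_A(x)).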

^16 NC-approximation schemes can also be used for NP- and PSPACE-hard problems [51].

^17 “soft-Oh” notation: g(n) ∈ Õ(T(n)) if and only if ∃ c > 0, k > 0, n₀ ∈ N s.t. ∀n > n₀, g(n) ≤ c · T(n) · (log n)^k. We use Õ when we wish to ignore polylogarithmic factors.

4.3 The Parallel Computation Thesis

The previous sections present results and give proofs relating certain complexity classes. Note that each of these classes is defined on top of a specific model of computation. In other words, each of the presented results relates only two models (when we overlook potential class/model equivalences). Are there any generic statements that include ‘all’ models of parallel computation? This section discusses such a statement, which is analogous to the extended Church-Turing Thesis.

The Church-Turing Thesis (CTT) is a consensus over what should be considered computable by the computability/complexity theorists [53]. It expresses their common belief that the standard Turing machine can compute whatever was once vaguely referred to as “effectively calculable”. Specifically, the CTT asserts that “a function is effectively calculable if and only if it is recursive”. Provably [25], this statement is equivalent to “… if it is λ-definable” and to “… if it is Turing-computable”, i.e. computed by a TM. The belief that the CTT is true stems from the fact that every model of computation proposed so far was eventually shown to have power equal to that of the TM. The CTT itself cannot be proved, as it attempts to relate the vague notion of “effectively calculable” to the mathematical –well defined– notions of recursion, TM, etc (some consider the thesis as a definition for the computable functions, while others think of it as a natural law).

Over the years, the CTT has appeared in various forms. One example is the extended CTT (ECTT), which makes the stronger assertion that “the TM is as efficient as any computing device can be”. That is, if a function is computable by some device in time T(n), then it is computable by a TM in time (T(n))^{O(1)}. This polynomial relation also expresses the belief that P truly captures the notion of efficiently solvable problems, i.e. that P contains all those problems solved in polynomial time by any model (polynomial-time CTT [41]). Note that, if we replace ‘time’ by ‘work’ (= time × processors), then the ECTT also includes the parallel models^18.

^18 in some cases, the ECTT (and the CTT in general) has been called into question. The criticism is based mainly on models which are quite different from the paper-and-pencil framework of Turing, as for example the Quantum Computer [48] [53].

The ECTT can be rephrased as “time on all reasonable sequential models of computation is polynomially related” (Sequential Computation Thesis) [54]. Before making a similar statement in the context of parallel computation, we elaborate on the characterization “reasonable”. The term itself does not have a universally accepted definition, even though its purpose is somewhat clear. It serves as a filter excluding models with extraordinary abilities, e.g. deciding non-recursive languages (see non-uniform circuits), or generating exponentially large words in only polynomially many steps (see RAM with unit-cost multiplication instruction, or P-systems with membrane division [35]). Roughly, we will characterize a model as reasonable if it can be physically implemented with a moderate amount of resources^19. To continue, the Parallel Computation Thesis (PCT) asserts that [28]:

On reasonable models, sequential space is polynomially related to parallel time.

That is, if a function can be computed sequentially in space f(n), then it can be computed in parallel time O(f^k(n)), for some constant k, and vice versa, i.e. Seq-SPACE(f^{O(1)}(n)) = Par-TIME(f^{O(1)}(n)). The thesis associates sequential and parallel machines. Moreover, it implies that time on all reasonable parallel models is polynomially related.
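Two instantiations convey the intent of the equality above (both are readings of the thesis itself; the polynomial case is provably true for ATMs, since alternating polynomial time equals PSPACE [28]):

    f(n) = log n:      polylogarithmic sequential space ↔ polylogarithmic parallel time,
    f(n) = n^{O(1)}:   PSPACE ↔ parallel polynomial time.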

We have already seen evidence in the previous sections that the PCT (and its implication) holds. Recall the inclusions NC¹ ⊆ L ⊆ NC² proved in section 4.1.1. In the same section, we also showed the relation of general NC circuits to the logarithmic-space ATM. Recall that we had already proved the equivalence of NC circuits and polylogarithmic-time PRAMs in section 3.3.2. These two equivalences also implied (section 4.1.1) that polylogarithmic-time PRAMs are equivalent to logarithmic-space ATMs. The discussion in section 3.3.1 reported that for general uniform boolean circuits we have NSPACE(S) ⊆ UniformDepth(S²) ⊆ DSPACE(S²). Notice that any of the simulations between parallel machines described in the previous chapter suffered –at most– a polynomial time slowdown.

Besides the above, various results have been published in the literature relating parallel time and sequential space. For instance, we know that for uniform aggregates, for conglomerates and for ATMs the following inclusions hold (X represents here the UAG, the CONG, or the ATM) [29]

DSPACE(f(n)) ⊆ X-TIME(f²(n)) ⊆ DSPACE(f²(n))

A similar result is known for the CREW-PRAM, for the Hardware Modification Machines (see p. 46) and for the SIMDAGs (see p. 13) when f(n) ≥ log n (Y represents here the HMM, the SIMDAG, or the CREW-PRAM) [29]

DSPACE(f(n)) ⊆ Y-TIME(f(n)) ⊆ DSPACE(f²(n))

^19 in a circular definition, one could denote by reasonable those models which satisfy the sequential and parallel computation theses. In other words, a new model should be considered reasonable if it is able to simulate some existing reasonable model (e.g. the TM) and vice versa [29].


For Vector Machines (see p. 47) we know that when f(n) ≥ log n [32]

DSPACE(f(n)) ⊆ VM-TIME(f²(n)) ⊆ DSPACE(f⁴(n))

We can summarize the commonalities of the proofs that establish results of the aforementioned kind as follows [29]. First, consider the simulation of the sequential space (inspired by Savitch's theorem [5]). Let M be an S(n) space bounded machine which, as a consequence, has at most 2^{O(S(n))} configurations. Construct the transition matrix or the transition graph of these configurations and compute its transitive closure by repeatedly squaring the matrix (e.g. see proof of theorem 4.1.7) or by path doubling within the graph. Note that this step of the simulation should require logarithmic time in the number of the configurations and hence, polynomial time in S(n). The simulation accepts if and only if there is a transition from the initial to the accepting configuration of M. Second, consider the simulation of parallel time. Perform a depth first search on the instructions executed by the parallel machine (e.g. see proof of theorem 4.1.6). Note that the search is executed in a recursive fashion and that the depth of the recursion is bounded by the time consumed by the simulated machine, say T(n). As a result, no more space than polynomial in T(n) should be required. To make a decision about the input, during the simulation we check whether a final state or an accepting instruction is reached.
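The first half of this recipe is easy to picture on a small example. The sketch below (a toy with an explicit graph, rather than one generated from a machine's configurations) decides reachability by repeatedly squaring a boolean adjacency relation, so ⌈log₂ N⌉ squarings cover paths of every length up to N; each squaring is one boolean matrix product, which is exactly the step a parallel machine can perform with one processor per (u, w, v) triple.

    # Reachability in a (configuration) graph via repeated squaring: after k
    # squarings, reach[u][v] is True iff v is reachable from u in at most 2**k
    # steps, so ceil(log2(N)) squarings suffice.

    from math import ceil, log2

    def reachable(adj, start, accept):
        n = len(adj)
        # paths of length <= 1 (self-loops make the relation reflexive)
        reach = [[adj[u][v] or u == v for v in range(n)] for u in range(n)]
        for _ in range(ceil(log2(n)) if n > 1 else 0):
            reach = [[any(reach[u][w] and reach[w][v] for w in range(n))
                      for v in range(n)] for u in range(n)]    # one boolean 'square'
        return reach[start][accept]

    # A tiny graph: 0 -> 1 -> 2 -> 3, plus a dead end 1 -> 4.
    adj = [[False] * 5 for _ in range(5)]
    for u, v in [(0, 1), (1, 2), (2, 3), (1, 4)]:
        adj[u][v] = True
    print(reachable(adj, 0, 3), reachable(adj, 3, 0))   # True False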

The PCT can serve as a “rule of thumb” to aid the engineer in setting certain performance targets when designing her/his algorithm. For instance, if she/he is aware of a polylogarithmic-space sequential algorithm for a given problem, then she/he can expect that a highly parallel solution also exists for that problem, whichever –reasonable– parallel machine she/he has at her/his disposal (a reasonable model should be neither too powerful nor too weak). Notice, however, that the PCT examines only the cases where a single kind of resource is bounded at each domain (sequential space ↔ parallel time). That is to say, the PCT is indifferent to the amount of parallel processors that it involves. Recall that, in order to avoid impractical solutions, we simultaneously bound the parallel time and the hardware cost. Of course, dealing with simultaneity is a more difficult task, especially when formulating general statements to include several distinct models. Nonetheless, parallel computation is, most often, concerned with simultaneity, and thus, we should “sharpen” our thesis accordingly.

In this direction, the PCT features two more formulations: an extension and a generalization [54]. The extended PCT asserts that, “on any reasonable parallel machine, time and hardware are simultaneously polynomially related to Turing Machine reversals and space”, where a reversal is said to occur when any head of the TM changes direction. The term hardware refers here to the product space×wordsize of a parallel machine and is considered as a good measure when memory cost dominates the cost of the processing elements (e.g. elements with minimal instruction sets). Equivalently, the term might refer to the width of a uniform circuit (the circuit-width, compared to the circuit-size, is sometimes considered as a more accurate measure of the circuit cost, because it captures the amount of hardware that is ‘activated’ at each time instant). The PCT and its extension are not true for massively parallel computers, which utilize an exponential number of processors (such machines can decide any NP language in sub-logarithmic time) [54]. To include such models, the generalized PCT asserts that “on a parallel machine, time is related within a constant to the alternations of an Alternating Turing Machine, while at the same time, address complexity is related polynomially to the time of that ATM”. The term address complexity refers here to the number of bits required to describe an individual unit of hardware (e.g. it refers to the wordsize of a shared memory machine, to the logarithm of the size of a circuit, etc). Similar to the PCT, cross simulations between shared-memory machines and ATMs, as well as simulations between networks of processors and DTMs, give support to the generalized and the extended PCT [54]. As a final note, the aforementioned theses also highlight the theoretical efforts to capture the inherent correspondence of sequential cost to parallel cost: sequential-space to parallel-time, alternations-number to parallel-time, alternating-time to address-complexity, etc.


Chapter 5

Conclusion

This thesis has reviewed selected topics from the theory of parallel computation. It includes two main parts: one concerned with the formal description of parallel computation and one devoted to the thorough study of its complexity.

Following the diachronic efforts of the research community to capture both computation and communication, and to balance between theory and practice, this thesis has presented the plethora of models proposed in the literature for studying parallel computation. The presentation was divided between the processor-based models and the circuit-based models. A second distinction was made between shared-memory and message-passing models. Starting with the highly abstract models and continuing to the technology-oriented ones, this survey highlighted the differences of the resulting machines and discussed their uses and abilities. Also, from a slightly different standpoint, it reported an architectural taxonomy of the parallel machines, which is widely referenced in applied computer science.

Subsequently, the thesis examined the computational power of the described models by using common simulation techniques. It showed that all shared memory models feature similar power (logarithmic slowdown factors) and that the distributed memory models can simulate the PRAM optimally when given sufficient parallel slackness. Moreover, it proved that, under certain resource bounds, the uniform families of Boolean circuits compute the same set of functions as the Turing machines and the PRAM.

In the second part, the thesis reviewed the most important achievements in the area of parallel complexity theory. It defined the complexity classes capturing feasible highly parallel computation and it proved various inclusions among them. Furthermore, it related these classes to well-known classes of sequential computation. It reported a number of tractable problems and it classified them according to the efficiency of their parallel solutions. Beyond the realm of feasible computation, the thesis investigated the effects of parallelization and gave alternative, circuit-based, characterizations to classes of higher complexity. As a means of studying the exact limits of efficient parallelization, the theory of P-completeness was explained. Reductions and completeness proofs were given, as well as a list of practical P-complete problems. Subsequently, the famous ‘NC versus P’ conundrum was discussed, along with the implications and the possibility of separating the two classes. Finally, the Parallel Computation Thesis was presented to relate the solutions offered by distinct models of computation.


Bibliography

[1] Raymond Greenlaw, H. James Hoover, and Walter L. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, first edition, 1995.

[2] Behrooz Parhami. Introduction to Parallel Processing: Algorithms and Architectures. Kluwer Academic, first edition, 1999.

[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Multithreaded Algorithms, chapter 27 of Introduction to Algorithms. MIT Press, third edition, 2009.

[4] Steven Fortune and James Wyllie. Parallelism in random access machines. In Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114–118, 1978.

[5] Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, first edition, 1994.

[6] John H. Reif. Synthesis of Parallel Algorithms. Morgan Kaufmann, first edition, 1993.

[7] Leslie M. Goldschlager. A universal interconnection pattern for parallel computers. Journal of the ACM, 29:1073–1086, Oct 1982.

[8] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. On communication latency in PRAM computations. In Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, pages 11–21, 1989.

[9] Phillip B. Gibbons. A more practical PRAM model. In Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, pages 158–168, 1989.


[10] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Informatica (Springer Berlin), 21:339–374, Nov 1984.

[11] Bowen Alpern, Larry Carter, Ephraim Feig, and Ted Selker. The uniform memory hierarchy model of computation. Algorithmica (Springer New York), 12:72–109, Sep 1994.

[12] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103–111, Aug 1990.

[13] Gianfranco Bilardi, Kieran T. Herley, Andrea Pietracaprina, Geppino Pucci, and Paul Spirakis. BSP vs LogP. In Proceedings of the eighth annual ACM symposium on Parallel Algorithms and Architectures, pages 25–32, 1996.

[14] Alexandros Gerbessiotis. Topics in Parallel and Distributed Computation. PhD thesis, Harvard University, Cambridge, Massachusetts, 1993.

[15] Duncan K.G. Campbell. A survey of models of parallel computation. Technical report, University of York, CiteSeerX, 1997.

[16] D. B. Skillicorn, Jonathan M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming (IOS Press), 6(3):249–274, 1997.

[17] Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob H. Bisseling. BSPlib: The BSP programming library. Parallel Computing (Elsevier), 24(14):1947–1980, Dec 1998.

[18] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 1–12, Aug 1993.

[19] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, first edition, 1991.


[20] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, first edition, 1992.

[21] Heribert Vollmer. Introduction to circuit complexity: a uniform approach. Springer-Verlag Berlin, first edition, 1999.

[22] Allan Borodin. On relating time and space to size and depth. SIAM Journal on Computing, 6(4):733–744, Dec 1977.

[23] John E. Savage. Models of Computation: Exploring the Power of Computing. Addison-Wesley, first edition, 1997.

[24] Larry Ruzzo. On uniform circuit complexity. Journal of Computer and System Sciences (Academic Press), 22(3):365–383, Jun 1981.

[25] Stathis Zachos. Computability and complexity, course notes. National Technical University of Athens, 2004.

[26] Patrick W. Dymond and Stephen A. Cook. Complexity theory of parallel time and hardware. Information and Computation (Elsevier), 80(3):205–226, Mar 1989.

[27] Clark D. Thompson. The VLSI complexity of sorting. IEEE Transactions on Computers, (12):1171–1184, Dec 1983.

[28] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. Alternation. Journal of the ACM, 28(1):114–133, Jan 1981.

[29] Raymond Greenlaw and H. James Hoover. Parallel Computation: Models and Complexity Issues, chapter 45 of Algorithms and Theory of Computation Handbook (editor M. Atallah). CRC Press LLC, 1999.

[30] Stephen A. Cook and Patrick W. Dymond. Parallel pointer machines. Computational Complexity (Birkhauser, Springer), 3(1):19–30, Mar 1993.

[31] Russ Miller, V. K. Prasanna-Kumar, Dionisios I. Reisis, and Quentin F. Stout. Parallel computations on reconfigurable meshes. IEEE Transactions on Computers, 42(6):678–692, Jun 1993.


[32] Vaughan R. Pratt and Larry J. Stockmeyer. A characterization of the power of vector machines. Journal of Computer and System Sciences (Elsevier), 12(2):198–221, Apr 1976.

[33] Klaus Sutner. On the computational complexity of finite cellular automata. Journal of Computer and System Sciences (Academic Press), 50(1):87–97, Feb 1995.

[34] Lila Kari and Grzegorz Rozenberg. The many facets of natural computing. Communications of the ACM, 51(10):72–83, Oct 2008.

[35] Gheorghe Paun. Introduction to Membrane Computing, chapter 1 of Applications of Membrane Computing. Springer-Verlag, 2006.

[36] Christos H. Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal on Computing, 19(2):322–328, Apr 1990.

[37] Eric E. Johnson. Completing an MIMD multiprocessor taxonomy. ACM SIGARCH Computer Architecture News, 16(3):44–47, Jun 1988.

[38] Ravi B. Boppana and Michael Sipser. The complexity of finite functions, chapter from the Handbook of Theoretical Computer Science (vol. A). MIT Press, 1991.

[39] Michael Sipser. Introduction to the Theory of Computation. Thomson, second edition, 2006.

[40] D. S. Johnson. A catalog of complexity classes, chapter 2 of Handbook of Theoretical Computer Science: Algorithms and Complexity (editor J. van Leeuwen). MIT Press/Elsevier, 1990.

[41] Eric Allender, Michael C. Loui, and Kenneth W. Regan. Complexity Classes, chapter 27 of Algorithms and Theory of Computation Handbook (editor M. Atallah). CRC Press LLC, 1999.

[42] Ran Raz and Pierre McKenzie. Separation of the monotone NC hierarchy. Combinatorica (Springer Berlin), 19(3):403–435, Mar 1999.

[43] Merrick Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Theory of Computing Systems (Springer New York), 17(1):1432–4350, Dec 1984.


[44] Dexter C. Kozen. Theory of Computation (Texts in Computer Science). Springer-Verlag London, 2006.

[45] C. Slot and P. van Emde Boas. On tape versus core: an application of space efficient perfect hash functions to the invariance of space. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 391–400, 1984.

[46] Nutan Limaye, Meena Mahajan, and Jayalal M. N. Sarma. Upper bounds for monotone planar circuit value and variants. Computational Complexity (Birkhauser Basel, Springer), 18(3):377–412, Oct 2009.

[47] Scott Aaronson, Greg Kuperberg, and Christopher Granade. Complexity zoo, http://qwiki.stanford.edu/wiki/Complexity Zoo, May 2010.

[48] Sanjeev Arora and Boaz Barak. Computational Complexity: A Modern Approach. Cambridge University Press, 2009.

[49] H. Venkateswaran. Circuit definitions of nondeterministic complexity classes. SIAM Journal on Computing, 21(4):655–670, Aug 1992.

[50] Richard M. Karp and Richard J. Lipton. Some connections between nonuniform and uniform complexity classes. In Proceedings of the twelfth annual ACM symposium on Theory of computing, pages 302–309, 1980.

[51] Harry B. Hunt, Madhav V. Marathe, Venkatesh Radhakrishnan, S. S. Ravi, Daniel J. Rosenkrantz, and Richard E. Stearns. NC-approximation schemes for NP- and PSPACE-hard problems for geometric graphs. Journal of Algorithms (Elsevier), 26(2):238–274, Feb 1998.

[52] Anne Condon. A theory of strict P-completeness. Journal of Computational Complexity (Birkhauser Basel), 4(3):220–241, Sep 1994.

[53] Andrew Chi-Chih Yao. Classical physics and the Church–Turing thesis. Journal of the ACM, 50(1):100–105, Jan 2003.

[54] Ian Parberry. Parallel Complexity Theory. Pitman, first edition, 1987.
