
Fault tolerance and configurability in DSM coherence protocols



IEEE Concurrency, April–June 2000. 1092-3063/00/$10.00 © 2000 IEEE

Fault Tolerance and Configurability in DSM Coherence Protocols

Brett D. Fleisch, University of California, Riverside
Heiko Michel, University of Kaiserslautern
Sachin K. Shah, IBM Corporation
Oliver E. Theel, Darmstadt University of Technology

To make complex computer systems more robust and fault tolerant, data must be replicated for high availability. The level of replication must be configurable to control overhead costs. Using an application suite, the authors test several distributed shared memory coherence protocols under different workloads and analyze the operation costs, fault tolerance, and configurability of each.

Distributed Systems

Potentially malfunctioning components in large distributed shared memory systems require highly available services that can be configured according to expected failure rates in the environment. Although several coherence protocols have been developed for DSM systems,1 few address configurability and fault tolerance. To address these aspects, the DSM coherence protocol must offer increased redundancy, decreased reliance on centralized data and control, support for servicing requests locally, and control over the degree of data availability on a per-data-unit basis. In a page-based DSM system, as assumed in this article, the unit of interest is a DSM page. Object-based systems can use the same protocols. The Boundary-Restricted (BR) class of coherence protocols2 satisfies these properties and offers highly available access to shared data at low operation costs.

In the past, our work focused on a refinement of BR, called Dynamic BR,3 and investigated BR's as well as DBR's properties using analytical techniques for a single DSM page. In this article, we use a shared-memory simulator and a DSM application suite to further study the BR class of protocols. Our goal is to investigate the trade-offs between the degree of fault tolerance, operational costs, and configurability for various DSM coherence protocols such as Write-Invalidate, Write-Invalidate with Downgrading, Write-Broadcast, and various instances of the BR class. We have chosen to execute real-world applications in a simulation environment. These programs generally use more than a single DSM page, and not all of these pages are equally distributed among the participating sites as execution proceeds. Thus, the behavior of a single DSM page under a certain coherence protocol might differ from the results reported here; however, our results provide a realistic judgment of the pros and cons of the coherence protocols evaluated.

Impact of the coherence protocol

Data replication can improve performance in distributed systems by reducing the latency to access data and enabling more requests to access data concurrently. Consequently, a system can tolerate more malfunctioning components because several copies of the data exist. Hence, a direct consequence of replication is greater data availability: the probability that at an arbitrary point in time a system can access the data of a DSM page. Although it can potentially improve fault tolerance, replication also leads to a fundamental problem: the difficulty of ensuring that all replicas are mutually consistent and that the sites are not accessing out-of-date (or stale) data. As a result, each time a shared page is modified, mechanisms must ensure that all existing copies are consistent. Page-based DSM systems usually do this by either transmitting the modified page or an invalidation message to the various sites, or by transmitting only the modified section of the page. The cost of maintaining mutual consistency in replicated systems can be very high.

Restricting the number of replicas lowers operation costs because fewer expensive updates or invalidations must be made to ensure consistency. However, fewer replicas also decrease concurrency and the level of data availability. In developing robust systems, we try to maximize data replication and minimize the costs of consistency-related message transmission. Clearly, there is a strong relation between the DSM coherence protocol, operation costs, and data availability. Some systems have addressed fault tolerance for DSM,4 but most do not use data replication to provide high availability. On the other hand, our work does not consider operational availability, that is, the operations required to recover a system and its applications so that replica fail-over occurs and operational fault tolerance is provided. This latter topic is beyond the scope of this research.

WRITE-INVALIDATE PROTOCOL

The Write-Invalidate (WI) protocol permits multiple readers or a single writer to access a DSM page, but not both simultaneously. Typically, any number of readers can concurrently access a page that is being accessed in read mode. When a process tries to write to the page, the system multicasts an invalidation to all other sites storing the page. When a site storing the page receives the invalidation, the site discards the DSM page and acknowledges the multicast. In addition, one site transfers the latest copy of the data to the site where the write request originated. In a careful implementation of the protocol, after all sites respond to the sender, the write is permitted to complete. Consequently, processes are prevented from reading stale data because they do not store a replica when a writer is present. The effect is to process all updates to the item sequentially, while reads can proceed in parallel.
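The invalidation sequence described above can be sketched as a toy model; the class and method names here are illustrative, not from the paper or any real DSM implementation:

```python
# Toy model of the Write-Invalidate (WI) protocol for a single DSM page.
# DsmPageWI and its methods are hypothetical names used for illustration only.

class DsmPageWI:
    def __init__(self):
        self.holders = set()  # sites currently caching a replica
        self.data = None

    def read(self, site):
        # Any number of readers can cache the page concurrently.
        self.holders.add(site)
        return self.data

    def write(self, site, value):
        # Multicast an invalidation to every other holder; each discards its
        # copy and acknowledges. Only after all acknowledgments arrive is the
        # write permitted to complete (modeled here by clearing the set).
        invalidations = len(self.holders - {site})
        self.holders = {site}  # the writer holds the only copy
        self.data = value
        return invalidations

page = DsmPageWI()
page.read("R1")
page.read("R2")
assert page.write("R3", "v1") == 2   # R1 and R2 are invalidated
assert page.holders == {"R3"}        # a single copy exists during the write
```

The sketch makes visible why data availability is poor during writes: every write leaves exactly one replica in the network.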

WRITE-INVALIDATE WITH DOWNGRADING PROTOCOL

The WI Protocol with Downgrading (WID) is a modification of the WI protocol used in DSM systems such as Mirage5 and Mirage+.6 The primary difference between WI and WID occurs after a write. During the next shared read access from another site, the write copy is not invalidated as in WI. Instead, the page remains stored and readable at the former writer's site. The data from the former writer is then transferred to the new reader. Therefore, in WID, a minimum of two readers are typically using the page in read mode. Subsequently, a write to one of the pages that is in read mode might involve a mere upgrade to write mode along with an invalidation to the other site.

The WID protocol works well when a reader and a writer try to access the same page from different sites. Although a cost must be paid to transfer the data from a former writer to a new reader, the reverse is not true, because in WID a mere upgrade and invalidation can be performed when one of the two readers writes to the page. For these situations, WID can be much less expensive to use.
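The downgrade behavior can be sketched in the same hedged style (again with invented names): a read after a write leaves two readable copies, and a subsequent write by one of those readers costs only an upgrade plus one invalidation.

```python
# Toy model of WI with Downgrading (WID): after a write, the next remote read
# downgrades the writer's copy to read mode instead of invalidating it.
# WidPage is a hypothetical name used for illustration only.

class WidPage:
    def __init__(self):
        self.readers = set()
        self.writer = None

    def read(self, site):
        if self.writer is not None:
            # Downgrade: the former writer keeps a readable copy.
            self.readers = {self.writer, site}
            self.writer = None
        else:
            self.readers.add(site)

    def write(self, site):
        # A reader upgrading to write mode invalidates the other copies.
        self.readers.discard(site)
        invalidations = len(self.readers)
        self.readers = set()
        self.writer = site
        return invalidations

p = WidPage()
p.write("R4")               # R4 is the current writer
p.read("R2")                # downgrade: R4 and R2 both hold read copies
assert p.readers == {"R4", "R2"}
assert p.write("R2") == 1   # a mere upgrade plus one invalidation to R4
```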

COMMON PROPERTIES OF WI AND WID

Both WI and WID are multiple-readers/single-writer protocols. During writes in both WI and WID, data availability is poor because only one copy of the data is present in the network. Furthermore, the application's read–write ratio will govern the degree of availability: if it is low (that is, many writes occur compared with the number of reads), availability will be low. When control transfers from the writing process to other reading processes, the data becomes much more available. However, the read–write ratio of an application is a tenuous property on which to base availability.

From a cost perspective, WI and WID are suited for applications where the number of successive writes between two reads is high, as well as applications that exhibit a high degree of per-site locality of reference.2 Additionally, there is no stable system state (when the number of copies of a page and their location remain constant), because replicas are continually being invalidated whenever a write request is executed.

WRITE-BROADCAST PROTOCOL

The Write-Broadcast (WB) coherency protocol is an update-based protocol. When a site obtains write access to a page, the process updates the page locally and multicasts the changes to all other sites possessing replicas. Sites that store replicas incorporate the updates that are received and send acknowledgments. Stored replicas are never deleted in WB. Once a site obtains a copy of the page as a result of a read or write request, the item continues to remain at the site for the duration of the program. Because every write operation requires the multicast of update (or control) messages for consistency, write operations are expensive,5 particularly when components are not failure-free and write operations have to be aborted when a site cannot be contacted due to site or communication failures. However, read operations are extremely inexpensive after the initial cost of transferring the page to the site, because reads are local and do not require remote communication or cooperation with other sites. In addition to allowing multiple readers, several processes may write the same page at the same time in WB; this is known as multiple-readers/multiple-writers sharing. Once all sites cache a copy of the data item, the system is said to have reached a stable system state.2
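The update-based behavior contrasts with WI in a few lines; `WbPage` is a hypothetical name, and real WB implementations also handle acknowledgments and failures that this sketch omits:

```python
# Toy model of the Write-Broadcast (WB) protocol: replicas are never deleted,
# and every write multicasts the change to all replica holders.

class WbPage:
    def __init__(self):
        self.replicas = {}  # site -> local copy of the page

    def read(self, site):
        if site not in self.replicas:
            self.replicas[site] = None  # initial page transfer
        return self.replicas[site]      # later reads are purely local

    def write(self, site, value):
        self.replicas[site] = value
        updates = 0
        for other in self.replicas:
            if other != site:
                self.replicas[other] = value  # multicast update
                updates += 1
        return updates  # write cost grows with the number of replicas

p = WbPage()
for s in ("R1", "R2", "R3"):
    p.read(s)
assert p.write("R1", "v") == 2   # update cost grows as replicas accumulate
assert len(p.replicas) == 3      # stored replicas are never deleted
```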



The BR coherence protocol class

The hybrid protocol BR combines the advantages of WI and WB by using a multiple-readers/multiple-writers approach to enhance concurrency and parallelism. BR typically reduces the number of replicated copies during write operations but, unlike WI, lets more than one replica exist during writes. Therefore, BR provides greater fault tolerance in terms of data availability than WI. Unlike WB, you can specify a maximum number of replicas, which could otherwise grow in proportion to the number of sites accessing the page. This controls update costs.

DESIGN GOALS AND FUNCTIONAL MECHANISM

We believe that a coherence protocol for a large, error-prone DSM system must exhibit the following design properties:2

1. Limited workload dependability: the number of copies of a page should have limited dependence on the workload, that is, the sequence of read and write operations.

2. Lower bound in the number of cached copies: the number of cached copies of a page preferably should never be reduced to one. Single copies make the system vulnerable to component failures, since only one copy of the data exists. Nonetheless, while increasing the lower bound on the number of cached copies results in higher data availability, it also increases the management cost.

3. Upper bound in the number of cached copies: the number of cached copies should never result in greater management cost than overall benefit. The protocol should avoid situations where all clients cache replicas that must be updated during writes. Consequently, an arbitrary linear increase in the number of sites caching a page must also be avoided, since the probability that a write operation can successfully complete decreases significantly in this case.

The BR protocol, which has been designed along these three design properties, operates as follows. On receiving the requested page, a client maps the data into its memory and uses it according to the granted mode. The item at a client site is considered cached in a particular mode with respect to a particular request. A mode is a 2D tuple consisting of a read attribute in the first dimension and a write attribute in the second dimension. The mode determines how subsequent read and write operations function. Table 1 gives the various modes used by the functional model.

If the client reads (writes) the data of a page cached with a local read (local write) mode attribute, then this results in a local read operation (local write operation) carried out at the client site. If a particular page or item is cached in a mode including a global read (global write) attribute, then a read (write) request needs to be submitted to the DSM server. This request leads to a global read operation (global write operation). Sequential consistency5 is the basis for our protocol's memory coherence policy, but weaker forms of consistency can be supported as well.

The BR protocol can be in one of two phases at a time. A so-called read phase starts with the execution of the first global read operation the DSM server receives after a global write operation and lasts until the next global write operation. A second phase, the write phase, starts with the execution of the first global write operation the DSM server receives after a global read operation and ends with the next global read operation. So, in a read phase, no global write operations are executed, whereas in a write phase, no global read operations occur. Local operations do not change the phase state of the BR protocol. A detailed description of BR and its behavioral characteristics is available elsewhere.2
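The phase rules reduce to a small state machine. This sketch assumes the protocol starts in a read phase, which the paper does not specify:

```python
# Minimal state machine for BR's two phases: a global read starts (or keeps)
# a read phase, a global write starts (or keeps) a write phase, and local
# operations never change the phase. The starting phase is an assumption.

class BrPhases:
    def __init__(self):
        self.phase = "read"  # assumed initial phase

    def apply(self, op):
        if op == "global_read":
            self.phase = "read"
        elif op == "global_write":
            self.phase = "write"
        # local_read / local_write leave the phase unchanged
        return self.phase

br = BrPhases()
assert br.apply("global_write") == "write"
assert br.apply("local_read") == "write"   # local ops do not end a write phase
assert br.apply("global_read") == "read"
```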

Table 1. Modes (2D tuples) associated with DSM pages at client sites.

Mode                         Description                                         Sample situation
(Local read, local write)    Data can be read and modified locally.              A client has the only cached copy of the DSM page.
(Local read, global write)   Data can be read locally, but write requests        The client site is one of many active readers of
                             must be submitted to the DSM server.                the DSM page.
(Global read, global write)  Data cannot be read or written locally.             The requested DSM page is not cached at the
                             Requests must be submitted to the server.           client site.
(Global read, local write)   Not used                                            None

Table 2. Boundary settings of different coherence protocols. N defines the maximum number of sites in the network, and Rmin, Rmax, Wmin, and Wmax are determined by the parametric settings supplied to BR.

Phase                     WI               WID              BR                 WB
Read   Minimum copies     1                2                Rmin               1
       Maximum copies     N                N                Rmax               N
Write  Cached copies      Always deleted   Always deleted   Updated,           Never deleted,
                          or invalidated   or invalidated   possibly deleted   always updated
       Minimum copies     1                1                Wmin               1
       Maximum copies     1                1                Wmax               N


PROTOCOL INVARIANTS

The BR coherence protocols exhibit boundary settings as shown in Table 2. The DSM server enforces these constraints while serving global read and global write requests that are sent to it. During a global read operation, BR guarantees the number of read copies of a page, Nr:

Rmin ≤ Nr ≤ Rmax, with Rmin, Nr, Rmax ∈ {1, 2, …}   (1)

where Rmin and Rmax represent the minimum and maximum number of cached copies in a read phase. Further, BR also guarantees that during a write phase, the number of copies of a page, Nw, is

Wmin ≤ Nw ≤ Wmax, with Wmin, Nw, Wmax ∈ {1, 2, …}   (2)

where Wmin and Wmax represent the minimum and maximum number of cached copies during a write phase.

An instance of the BR coherence protocol class is addressed as BR(Rmin, Rmax, Wmin, Wmax). Practical settings for these parameters are discussed elsewhere.2 In the scope of this article, we focus on a particularly interesting subclass of BR, denoted as BR(w, n). We obtain this subclass from the general BR coherence protocol class by setting w = Wmin = Wmax, n = Rmax, and Rmin = w + 1. Restricting Wmin and Wmax in this way helps to keep costs during a write phase very low while still providing an acceptable level of availability if w > 1. Setting w > 1 avoids the undesirable situation where the number of copies drops to just one. By setting Rmin to w + 1 and Rmax to n, w + 1 up to n copies of a page can exist during a sequence of global read operations (n is introduced to simplify the notation). In particular, at the beginning of a read phase only a single additional copy must be installed, thereby minimizing the costs for this operation. Creating at least one additional copy at the beginning of a read phase ensures higher availability without causing excessive costs. Thus, the BR(w, n) subclass takes into consideration that maintaining a certain level of data availability during a read phase costs less than maintaining the same level of availability during a write phase.

Note that if the parameter w is set to 1 and n to N (the number of sites in the network), then the resulting BR(1, N) protocol is equivalent to the WID protocol.
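The BR(w, n) parameterization can be captured as a tiny helper (a hypothetical function, not part of the paper's notation):

```python
# Derive the full BR(Rmin, Rmax, Wmin, Wmax) settings from the BR(w, n)
# subclass described above: w = Wmin = Wmax, Rmin = w + 1, Rmax = n.

def br_subclass(w, n):
    assert 1 <= w < n, "need w + 1 <= n so a read phase can add a copy"
    return {"Rmin": w + 1, "Rmax": n, "Wmin": w, "Wmax": w}

# BR(2, 5) keeps exactly two copies in a write phase and three to five
# copies in a read phase, matching the example used later in the article.
assert br_subclass(2, 5) == {"Rmin": 3, "Rmax": 5, "Wmin": 2, "Wmax": 2}

# With w = 1 and n = N sites, BR(1, N) behaves like the WID protocol.
assert br_subclass(1, 10) == {"Rmin": 2, "Rmax": 10, "Wmin": 1, "Wmax": 1}
```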

From Table 2 it is clear that the WI, WID, and WB protocols violate design property number 2 given earlier by permitting a single copy to exist during read and write phases. WB violates design property number 3 when all clients cache replicas during the write phase. On the other hand, BR protocols restrict the number of cached copies to lie between defined limits. Ideally, the boundary setting can be different for every page, leading to a different degree of availability per item and different operation costs.

The behavior of WI, WID, WB, and BR is best understood by examining each protocol during read and write operations. Consider the first operation performed on some DSM page p. Assume it is a read operation. WI, WID, and WB will satisfy the request by sending page p to the requesting site; at this point there is only one copy of p in the network. On the other hand, BR immediately stores a replica at Rmin different sites. Any failures that occur at the requesting site can be tolerated using BR because of the other existing replicas. This is not the case in WI, WID, or WB.

Examining the protocols during write operations, WI, WID, and WB occupy extremes of a spectrum. At the lower end is the lack of replication offered by WI and WID. At the higher end is the maximal level of replication exhibited by WB. BR occupies the entire area and can be configured to lie between extremes: never minimal and never maximal. It is this region between extremes that presents a suitable compromise between the desired level of replication and operation costs.

Figure 1 illustrates the dependency between site availability p and data availability Adata(p) = 1 − (1 − p)^a, with a > 0 being the number of copies of a particular page. If all sites of a network (or at least those sites caching a copy of a certain page) are available with a probability of, for example, p = 0.75, and if only a single copy of a page is present in the system, then the data availability of this page is, of course, Adata(0.75) = 75%. If the number of copies always available is increased to a = 2, then the associated data availability is increased by approximately 18%. Thus, when using WI, WID, or WB, minimal data availability would be 75% in the present example, whereas by using BR(2, n) with n ≤ N, minimal data availability would be approximately 93% and, by using BR(3, n), approximately 98%. Various other examples can be derived from the given graphs. All these examples demonstrate the superiority of BR coherence protocols in terms of data availability if Wmin and Wmax are appropriately set.
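The percentages quoted above follow directly from the availability formula and are easy to verify (the function name is ours, not the paper's):

```python
# Data availability for a page with a cached copies on sites that are each
# independently available with probability p: Adata(p) = 1 - (1 - p)**a.

def data_availability(p, a):
    return 1 - (1 - p) ** a

# With p = 0.75: one copy gives 75%, two copies 93.75% (the roughly 18-point
# increase mentioned in the text), and three copies about 98%.
assert data_availability(0.75, 1) == 0.75
assert round(data_availability(0.75, 2), 4) == 0.9375
assert round(data_availability(0.75, 3), 4) == 0.9844
```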

Figure 1. Data availability depending on the number of cached copies and site availability. (The graph plots data availability A, the probability that a copy can be accessed, against the number of copies a, for site availabilities p = 0.95, 0.90, 0.85, 0.80, 0.75, and 0.50.)

A comparison

Figure 2 shows the behavior of BR(2, 5) in a network of five sites with a sample reference trace. All sites request or reference the same page. Recall that BR(w, n) guarantees that w replicas exist during a write phase and that r (w + 1 ≤ r ≤ n ≤ N) replicas exist during the read phase. In BR, two types of messages are transmitted across the network: control and data messages. Control messages contain page invalidations, page downgrades, page upgrades, or page updates. Data messages contain actual page data being transferred to a requesting client.

The example begins with the first operation, a read operation issued by R2, which requires R2 to receive a single data message containing the new page. BR(2, 5) guarantees that at least three read copies are present during a read phase; hence two more page replicas must be stored. Three replicas are shown in the first step of Figure 2. The system uses a total of three data messages to install the three replicas, as shown in Table 3. As a postcondition of this operation, a page copy at site R2 is compulsory, but two other copies must exist at two other sites. In our example, these sites are R1 and R4. (The current implementation of BR uses a policy that randomizes where the additional replicas are stored, but to optimize performance, system designers may select other policies based on page usage statistics.) Table 3 shows the total number of copies of a page existing after the execution of the corresponding operation, on a per-operation basis.

The next read request at site R5 requires that a page be transmitted to site R5. The local read operation at site R2 requires no messages. The next write request at site R4 takes place in two phases. BR(2, 5) must maintain only two page copies during a write phase. As a result, the system first transmits two control messages to invalidate the page copies at R2 and R5 (again, picked randomly in this example). Only then does the write operation proceed at R4. Second, the system transmits a single control message to site R1 to update the replica. The next operation, a write request at site R3, requires that a page be sent to R3 and an invalidation control message be sent to randomly chosen site R4. Finally, the system sends a single control message to update the replica at R1 due to the write operation at R3. The last operation in the example, a local read at site R1, takes place without any message costs. Only two copies of the page exist in the final phase of the example, because the page mode is (local read, global write); the request is satisfied locally, and the server is never contacted. The system does not enforce the invariant in Equation 1 because it is still in a write phase (local reads do not cause the system to enforce the read-phase invariant while in a write phase).

Figure 2. Location of copies of a single page for the BR(2, 5) coherence protocol. "Read request R2," for example, means that site R2 requested a DSM page for (global) reading and the request has been granted. (The figure shows client sites R1 through R5 and, over time, which sites cache a page copy after each operation: Read request R2, Read request R5, Local read R2, Write request R4, Write request R3, Local read R1.)

Table 3. Comparing the number of messages transmitted and the number of replicated copies available for different coherence protocols on a five-site network. In BR(3, 5), R5 is randomly picked to cache an additional copy of the page.

                              Number of messages                Total number of copies
Operation                WI   WID  BR(2,5) BR(3,5)  WB     WI   WID  BR(2,5) BR(3,5)  WB
Read R2        Control    0    0      0       0      0
               Data       1    1      3       4      1      1    1      3       4      1
Read R5        Control    0    0      0       0      0
               Data       1    1      1       0      1      2    2      4       4      2
Local read R2  Control    0    0      0       0      0
               Data       0    0      0       0      0      2    2      4       4      2
Write R4       Control    2    2      3       3      2
               Data       1    1      0       0      1      1    1      2       3      3
Write R3       Control    1    1      2       3      3
               Data       1    1      1       1      1      1    1      2       3      4
Read R1        Control    0    1      1       0      0
               Data       1    1      0       0      1      1    2      2       3      5
Breakup        Control    4    4      5       6      5
               Data       5    5      5       5      5
Total                     9    9     10      11     10      8    9     17      21     17
Avg. per operation      1.5  1.5    1.6     1.8    1.6    1.3  1.5    2.8     3.5    2.8

In addition, Table 3 compares the behavior of WB, WI, WID, BR(2, 5), and BR(3, 5) for the same sequence of read and write operations as shown in Figure 2, including the number of messages transmitted and replicated copies available.

Table 4 presents a similar comparison but with a larger network of 10 sites. It extends the sample sequence of read and write operations to present a more complete analysis. According to our observations, the class of BR protocols proved to be most effective in providing the following design properties.

• Competitive operation costs. Table 3 shows that in terms of the total number of messages sent, BR costs about as much as WI and WB. In a larger network (see Table 4), BR transmits fewer messages than WB and only a few more control messages than WI.

• Better fault tolerance in terms of high data availability. In the smaller network (Table 3), WB, WI, and WID permit a single copy of a DSM page to exist at times. In WB, sufficient replicas usually do not exist during application startup. With WI and WID, this problem persists throughout the execution of the application. On the other hand, BR never permits the number of page copies to be less than Wmin.

• Better scaling than WB. Observe the behavior of WB for the Write R8 and Write R6 operations in Table 4. The system is approaching a stable system state; numerous replicated copies of the same page are being stored, which in turn requires numerous control messages to keep the replicas consistent. Clearly, in large networks, as the system reaches a stable state, WB incurs very high operation costs for write operations, making it scale poorly. BR, on the other hand, can be set to maintain Wmin ≤ w ≤ Wmax copies, thus limiting the number of messages transmitted. Consequently, BR is configurable to maintain fixed levels of operational costs even with a large increase in the number of participating sites.

• Configurability. Table 4 shows that WB maintains too many replicas that must be kept mutually consistent. WI and WID, on the other hand, provide too few replicas and thus cannot provide high data availability. The class of BR protocols lets us control the level of replication by varying Wmin and Wmax as required by the fault tolerance requirements of the system. The analytical results in Table 4 confirm this conjecture.

The simulator and simulator applications

While the comparison example described earlier confirmed that BR is cost effective and provides fault-tolerant data access, a goal of this work is to examine BR using real shared-memory programs. We created a program-driven simulator that executes commonly studied DSM programs.9,10 The simulator lets us compare and analyze WI, WID, WB, and BR in actual practice.

The simulations use AINT (Alpha Interpreter),7 a software tool for analyzing shared-memory systems. AINT simulates parallel programs on uniprocessor workstations and provides a program-driven simulation environment. In AINT, a simulation application is executed until it generates a memory reference. AINT then transfers control to the back-end, which simulates the desired coherence protocol in response to that memory reference event. The back-end, coded in C, enforces the coherence protocols we are studying, such as WI, WID, WB, and BR. Also, AINT maintains an array of structures that store the state of the DSM system and related results, for example, the number of messages and transmission costs. The simulator works on DEC Alpha workstations running DEC/OSF1 and is upward-compatible from version 2.0.

AINT and the back-end, together, simulate a loosely coupled DSM system. Messages transmitted across the network are either control or data messages. Control messages typically contain DSM state information, page invalidation, page upgrade, page downgrade, or page update information. Data messages contain actual page data being transferred to a requesting client. Data messages are considerably larger (1 to 4 Kbytes) than control messages (96 bytes) and often larger than the network transfer unit of 1 Kbyte. The system maintains coherence in the simulator by transmitting control messages or data messages to sites that maintain or require copies of the page, respectively. Table 5 lists the protocols we modeled in our simulation study and the parameters used in the simulator.

Table 4. Extending the protocols over a larger network (N = 10 sites) and then comparing the number of messages transmitted and the number of copies available.

                                  Number of messages                Total number of copies
Operation                    WI   WID  BR(2,5) BR(3,5)  WB     WI   WID  BR(2,5) BR(3,5)  WB
Total from previous example   9    9     10      11     10      8    9     17      21     17
Read R10        Control       0    0      0       0      0
                Data          1    1      1       1      1      2    3      3       4      6
Write R8        Control       2    3      3       4      6
                Data          1    1      1       1      1      1    1      2       3      7
Read R7         Control       1    1      0       0      0
                Data          1    1      1       1      1      1    2      3       4      8
Write R6        Control       1    2      3       4      8
                Data          1    1      1       1      1      1    1      2       3      9
Read R9         Control       1    1      0       0      0
                Data          1    1      1       1      1      1    2      3       4     10
Breakup         Control       9   11     11      14     19
                Data         10   10     10      10     10
Total                        19   21     21      24     29     14   18     30      39     57
Avg. per operation          1.7  1.9    1.9     2.2    2.6    1.3  1.6    2.7     3.5    5.2

As new technology emerges, we can adjust the simulation parameters to reflect the characteristics of leading-edge hardware and network environments. This, in turn, will let us project the effects of new technology on protocol behavior. As described, the simulation uses applications to generate memory reference events. An application for a DSM system is characterized by its memory access patterns. These patterns vary widely, as shown by several measurements we made (see Table 6).

The same DSM applications, written by different programmers, could potentially result in varied levels of parallelization of the problem9 and exhibit different communication, memory, and synchronization behaviors. This is why our DSM application suite consists of the following programs, each representing varied problems, with a broad spectrum of memory access patterns, locality, problem size, and synchronization behavior:8

• Parallel Matrix Multiply represents computationally intensive problems.

• Quicksort represents a class of problems that require a high level of coordination, synchronization, and management between processes, while at the same time performing a large number of local operations.

• Water-NSquared is indicative of programs with more activity located at the participating clients. After the clients solve their assigned subproblems, these solutions are finally combined to solve the entire problem.10

Table 6 summarizes each program's characteristics. The data represents the mean over 20 application executions in an environment consisting of four processors. The table shows the total number of operations (reads and writes to memory), the total number of operations to shared memory, the number of read and write operations to shared memory, their sums and percentages, as well as the read–write ratio of operations to shared memory. Quicksort exhibits an extremely low read–write ratio, whereas the other two applications have a very high read–write ratio; that is, reads from shared memory are much more frequent than writes to it.
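The read–write ratios reported in Table 6 can be derived directly from a memory reference trace. A minimal sketch follows; the tuple-based trace format is our own illustration, not the simulator's actual event format:

```python
from collections import Counter

def shared_memory_profile(trace):
    """Summarize a trace of (operation, is_shared) events, where
    operation is "read" or "write" -- a hypothetical trace format."""
    counts = Counter((op, shared) for op, shared in trace)
    reads = counts[("read", True)]
    writes = counts[("write", True)]
    return {
        "shared_reads": reads,
        "shared_writes": writes,
        # Shared reads divided by shared writes; Table 6 reports
        # 44, 1.3, and 51 for the three applications.
        "read_write_ratio": reads / writes if writes else float("inf"),
    }

# Toy trace: 13 shared reads, 10 shared writes, 2 private accesses.
trace = [("read", True)] * 13 + [("write", True)] * 10 + [("read", False)] * 2
profile = shared_memory_profile(trace)
print(profile["read_write_ratio"])  # 1.3, a Quicksort-like mix
```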

Experiments and results

The simulations involved running the programs from our application suite while varying the underlying memory coherence protocol. In each simulation experiment, we ran WI, WID, WB, and the permissible instances of BR(w, n) protocols, varying the DSM page size from 512 bytes to 4 Kbytes. The message model used in the simulation is more sophisticated than the one used to explain the comparison example. In particular, every message was acknowledged by a 96-byte message. These acknowledgments were counted as control messages. The size of a message transmitted over the network was fixed at 1 Kbyte. Therefore, depending on the size of the DSM page, up to four 1-Kbyte messages can be sent. All results presented are averaged over three simulation runs with identical settings and parameters. Parallel Matrix Multiply exhibits reproducible execution patterns on different runs. However, the Quicksort and Water-NSquared applications have less deterministic behavior.
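This accounting can be sketched as follows; the function is our own illustration of the model just described (pages split into transfer-unit-sized data messages, each acknowledged by a control message):

```python
import math

MTU = 1024     # maximum network transfer unit (1 Kbyte)
ACK_SIZE = 96  # every message is acknowledged by a 96-byte message

def page_transfer_messages(page_size):
    """Messages needed to ship one DSM page under the simulation's
    message model: the page is fragmented into MTU-sized data
    messages, and each of them is acknowledged by a control message."""
    data = math.ceil(page_size / MTU)  # up to four 1-Kbyte messages
    acks = data                        # acknowledgments count as control messages
    return {"data": data, "control": acks}

for size in (512, 1024, 2048, 4096):
    print(size, page_transfer_messages(size))
# A 4-Kbyte page costs four data messages plus four acknowledgments.
```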

PERFORMANCE AND FAULT TOLERANCE

Our first set of experiments examines the performance of the different DSM coherence protocols with respect to a varied degree of fault tolerance in terms of data availability.

Figure 3 shows the total number of messages transmitted (sum of data and control messages) in a DSM application using the three coherence protocols with page sizes of 512 bytes, 1 Kbyte, 2 Kbytes, and 4 Kbytes in a network of N = 4 sites. We count acknowledgments as control messages; this generally doubles the number of messages regardless of which coherence protocol is used. Therefore, no particular coherence protocol is favored. These observations let us qualify the total amount of traffic generated by a distributed application using DSM. The values help us estimate overhead in terms of aggregate operations to maintain memory coherence.

Figure 4 shows the number of data messages per DSM application using the three coherence protocols with page sizes of 512 bytes, 1 Kbyte, 2 Kbytes, and 4 Kbytes in a network of N = 4 sites. Because the number of data messages varies significantly among the coherence protocols, we used a logarithmic scale. We did not count the messages used for initialization of the coherence protocols, because they are control messages rather than data messages. Furthermore, their number is negligible.

Figures 3a and 4a show the results for Parallel Matrix Multiply. Here, for all coherence protocols, we expect that the total number of messages decreases slightly when the DSM page size is increased. We also observe this with data messages: the number of data messages decreases as the DSM page size increases. This is because, in the case of larger page sizes, a single data request results in the sending of a larger portion of the DSM segment to the issuing process. If the process writes in a sequential fashion to DSM, as Parallel Matrix Multiply does when creating the result matrix, then this leads to a decrease of data-issuing operations, thereby reducing the number of data messages. Note that with a larger page size, up to four network messages must be transmitted to install a single DSM page. Figure 4a

Table 5. Modeled protocols and simulation parameters.

MODELED PROTOCOLS
N = 4 SITES: WI, WID, WB, BR(2, 4), BR(3, 4)
N = 8 SITES: WI, WID, WB, BR(2, 8), BR(3, 8)

SIMULATOR SETTINGS
PARAMETER                              RANGE
Page size (for data messages)          512 bytes; 1, 2, or 4 Kbytes
Control size (for control messages)    96 bytes
Maximum network transfer unit          1 Kbyte
Network size N                         4 sites, 8 sites

April–June 2000

shows that for WI and WID, the message traffic is more significantly influenced by page installations. In WB and BR(w, n), page installations are fewer than in WI or WID, because these protocols use control messages to send updates. The vast majority of messages sent by BR(w, n) and WB therefore consist of short 96-byte control messages.

The reason why BR(w, n) coherence protocols produce higher traffic than WI, WID, and WB coherence protocols (see Figure 3a) is that Parallel Matrix Multiply's read–write ratio, as stated in Table 6, is very high. In this particular case, it is 44. Thus, read operations are frequent, leading to a decrease in the number of available copies in WI and WID, thereby reducing fault tolerance and the number of coherency-related messages. WB has superior performance compared to BR(w, n) because the processors all work separately. Only a few additional copies are requested and installed over a long period of execution time, and only at the end of the application execution does the number of copies increase and the application require updates with write operations. In BR(w, n), all w installed copies have to be updated throughout the entire execution, consequently accounting for its higher costs: BR(2, 4) and BR(3, 4) coherence protocols are required to store at least two and three copies, respectively, of any DSM page at any time of the program execution. This leads to increased message traffic in case of a high read–write ratio, but also to a highly increased data availability, since a certain minimal number of copies is available at all times independent of the application's current read–write pattern.
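The cost asymmetry described here can be captured in a toy per-write model. This is our own simplification for illustration; it ignores data messages, acknowledgments, and the downgrading variant:

```python
def write_cost(protocol, copies, w=None):
    """Control messages triggered by one write in a simplified model:
    invalidation-based protocols (WI, WID) invalidate every other
    cached copy, the update-based protocol (WB) updates every other
    cached copy, and BR(w, n) updates the w - 1 other copies it is
    required to keep coherent at all times."""
    if protocol in ("WI", "WID"):
        return copies - 1  # invalidate all other cached copies
    if protocol == "WB":
        return copies - 1  # update all other cached copies
    if protocol == "BR":
        return w - 1       # update the mandated w-copy write set
    raise ValueError(protocol)

# With 8 cached copies, WB sends 7 updates per write, while BR(3, 8)
# sends only 2 -- but WI leaves just one copy, sacrificing availability.
print(write_cost("WB", 8), write_cost("BR", 8, w=3), write_cost("WI", 8))
```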

The Quicksort application exhibits a very low read–write ratio (see Table 6). With a read–write ratio of 1.3, write operations to DSM are almost as frequent as read operations. Additionally, Quicksort requires a high degree of synchronization and cooperation among the participating processes. (We use the terms process and site synonymously in this analysis, because for these simulations, the processes were located at different sites.) Figures 3b and 4b show the results.

Because write operations are very likely and dispersed, all sites cache copies of many if not all DSM pages even in an early state of the program execution. Thus, many copies need to be updated at many sites, resulting in a large number of control messages if the DSM coherence protocol exhibits a WB-like behavior—as is the case for BR(3, 4) and, of course, WB itself. In these cases, the number of data messages is quite low. DSM coherence protocols that behave rather WI-like (such as WI, WID, and BR(2, 4)) require fewer control messages but more data messages, because certain copies must be installed at the sites multiple times. Quicksort is a good example, showing that BR coherence protocols can naturally bridge the gap between WI and WID coherence protocols on one side and WB on the other side in terms of message costs. Because many copies are requested and installed at the beginning of the application execution, WB does not perform as well as the other coherence protocols: the read–write ratio of the application is small, and the large, increasing number of replicas must be updated frequently. Note that although the BR(w, n) coherence protocols produce lower message costs than WB in this scenario, the former guarantee a minimal degree of fault tolerance at all times, whereas the latter does not.

Figures 3c and 4c give the results of the Water-NSquared application. In Water-NSquared, once a processor receives a requested DSM page, the page usually remains in the possession of the requesting processor. The processors perform computations individually and partially accumulate information once, at the end. The read–write ratio of Water-NSquared is 51—that is, very high (see Table 6). We see no substantial difference in this application in terms of total number of messages when the page size is varied, primarily because sharing is limited. The total number of messages transmitted reaches its maximum when using BR(3, 4), whereas BR(2, 4) and WB are comparable. WI and WID produce substantially lower message costs. The high message costs of BR(3, 4) stem from the fact that even when using WB, not as many DSM copies are cached at the participating sites as for BR(3, 4). For BR(3, 4), due to the fault-tolerance requirements, at least three copies of any DSM page are cached at any time.

With respect to DSM page installations and reinstallations, Water-NSquared exhibits behavior comparable to Parallel Matrix Multiply: the larger the DSM page size, the fewer data messages are issued. In terms of the number of data messages, BR(w, n) basically lies between WI and WID on one side and WB on the other. Although BR(2, 4) and WB produce nearly the same total message costs, Figure 4c shows that BR(2, 4) reinstalls more DSM pages than WB, issuing approximately 9 to 15 times as many data messages as the latter.

As a general heuristic, we can say that according to these measurements, BR(w, n) transmits more messages than WI or WID. Depending on the read–write ratio and the application's behavior, fewer or more messages are transmitted with respect to WB. WI and WID transmit the least total number of messages. This is because of their conservative behavior—reducing the cached copies to one during a write operation.

Table 6. Measured characteristics in the DSM application suite.

                    KEY                      PROBLEM            TOTAL        SHARED       SHARED       SHARED     READ–WRITE
APPLICATION         PROPERTIES               SIZE               OPERATIONS   OPERATIONS   READS        WRITES     RATIO
Parallel Matrix     Computationally          64-by-64           567 × 10³    540 × 10³    528 × 10³    12 × 10³   44
Multiply            intensive                integer matrices                (95%)        (98%)        (2%)
Quicksort           Much coordination        1,000 integers     150 × 10³    14,275       8,166        6,109      1.3
                    and synchronization                                      (9.5%)       (57%)        (43%)
Water-NSquared      Much local activity,     512 molecules      605 × 10⁶    157 × 10⁶    154 × 10⁶    3 × 10⁶    51
                    few global activities                                    (26%)        (98%)        (2%)

Page 9: Fault tolerance and configurability in DSM coherence protocols

18 IEEE Concurrency

As a result, these protocols do not incur the high costs of keeping multiple copies of data up-to-date, but they do suffer from being highly vulnerable to failure. On the other hand, the performance in terms of message costs of the higher end of BR(w, n) coherence protocols drops (approaches the performance of WB) when the read–write ratio becomes smaller, implying that writes are occurring more frequently. Consider, for example, a BR(7, 8) coherence protocol and WB. While WB will continue to increase the number of cached copies as new sites request pages, it may never cache copies at a majority of the sites, depending on the particular workload. BR(7, 8), on the other hand, will force at least seven copies of each page to exist in the system throughout the application's execution.

It is important to emphasize that neither WI, WID, nor WB guarantees that replicated copies exist at all times. WB potentially allows either each site to cache the same page, which is excessive, or just a single page copy to exist in the network, which is minimal. Clearly, any overhead incurred by BR(w, n) is offset by the provision of better fault tolerance in terms of high data availability. Consequently, BR(w, n) incurs competitive operation costs.
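The contrast can be stated as a compact invariant. The clamp function below is our own sketch; the real protocol enforces the bound through its read and write operations rather than through an explicit clamp:

```python
def clamp_copy_set(requested, w, n):
    """BR(w, n) invariant: the number of cached copies of a page
    always stays between w and n, regardless of the application's
    access pattern. w and n are the protocol parameters."""
    if not 1 <= w <= n:
        raise ValueError("need 1 <= w <= n")
    return max(w, min(n, requested))

# A read burst cannot push BR(2, 4) past four copies, and writes
# cannot shrink it below two -- unlike WB, which may hold anywhere
# from 1 to N copies, and WI/WID, which drop to a single copy.
assert clamp_copy_set(1, 2, 4) == 2
assert clamp_copy_set(9, 2, 4) == 4
assert clamp_copy_set(3, 2, 4) == 3
```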

CONFIGURABILITY AND SCALABILITY

With faster networking technology and larger systems on the horizon, any mechanism or protocol that scales poorly with an increasing number of sites will result in unacceptable performance. A direct consequence of having a larger number of clients is increased communication. As a result, an important parameter to observe when investigating scalability is the total number of messages transmitted.

As already noted, WI, WID, and WB provide no mechanism for controlling the level of data availability provided. This contrasts with the class of BR coherence protocols, which allows control over the degree of fault tolerance (also called the level of replication). Consequently, it is often possible to find an instance of BR


Figure 3. Number of messages sent by WI, WID, WB, and BR(w, n) coherence protocols in a network of N = 4 sites for (a) Parallel Matrix Multiply—BR(w, n) may be more expensive than WB in terms of messages transmitted if the read–write ratio is high; (b) Quicksort—BR typically occupies the range between WI and WB if the read–write ratio is low; and (c) Water-NSquared—varying the page size does not result in a different number of messages being transmitted, because sharing is minimal.


that provides the desired degree of fault tolerance at an acceptable cost. In general, operation costs increase as the level of replication increases. The configurability offered by BR is invaluable when this class of protocols is used to provide, for example, a higher (but not excessive) level of replication—BR(4, 8)—for pages integral to the system's functioning, and a lower level of replication—BR(2, 8)—for pages of lesser importance. Such configurability helps maximize the degree of fault tolerance and minimize the associated operation costs.

Configurability is provided in BR by establishing an upper bound on the number of replicas, thereby limiting the effects of scale and controlling the maximum level of replication for each page. In the second set of experiments, we varied the number of participating sites and DSM page sizes in Parallel Matrix Multiply and Quicksort and examined the behavior of WI, WID, WB, and BR(w, n). We configured the experiments to limit the upper bound on the number of copies to a level we would expect to see in real environments; our previous work3 showed that a realistic number of replicas for a page is likely to be eight or less (see Figure 1). A level of replication exceeding that increases operation costs but has a negligible impact on data availability. Thus, using this small-scale but likely configuration, our results provide some general conclusions concerning scaling the number of replicas for BR systems.
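The diminishing return beyond roughly eight replicas follows the shape of a standard independent-failure availability model. The sketch below illustrates the effect; the 95% per-site availability is a purely illustrative value, not a figure from our earlier analysis:

```python
def read_availability(k, p_up=0.95):
    """Probability that at least one of k replicas is reachable,
    assuming sites fail independently with per-site availability
    p_up (an illustrative assumption, not a measured value)."""
    return 1 - (1 - p_up) ** k

for k in (1, 2, 4, 8, 16):
    print(k, read_availability(k))
# Availability climbs steeply for the first few replicas and is
# already within 4e-11 of 1.0 at k = 8, so further replicas mainly
# add update cost -- matching the observation above.
```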

Figure 5 and Figure 6 present our findings. In each graph, the x-axis indicates the coherence protocol being examined and the variation in the number of participating sites—either four or eight (each participating site hosts exactly one participating process). For each coherence protocol, the left bar presents a network with four participating sites, whereas the right bar indicates a network of eight participating sites. The y-axis indicates the number of messages transmitted for each coherence protocol and network. The numbers listed above each bar of an eight-site network represent the percentage increase in the number of transmitted messages when increasing the network size from


Figure 4. Number of data messages sent by WI, WID, WB, and BR(w, n) in a network of N = 4 sites for (a) Parallel Matrix Multiply—WI and WID exhibit higher values due to repeated DSM page installations; (b) Quicksort—BR coherence protocols behave either like the WI coherence protocols (for example, BR(2, 4)) or more like the WB coherence protocol (for example, BR(3, 4)) with respect to DSM page reinstallations, depending on their configuration; and (c) Water-NSquared—BR coherence protocols install or reinstall more DSM pages than WB but fewer pages than WI and WID.


four to eight while preserving the employed DSM coherence protocol.

In examining BR, we see that the number of cached copies during a write operation can be fixed in a certain interval (by controlling Wmin and Wmax). As a consequence, increasing the number of sites without modifying BR's parameters w and n results in x ∈ {w, …, n} copies of a particular DSM page being allocated at a subset of sites Ix ⊆ {1, …, N}, with Ix having a cardinality of x. Depending on the application and its access pattern, x as well as Ix might change during execution. If—due to the application's access pattern—many subsequent instances of Ix differ substantially, then the cost increase is very high, because the coherence protocol must invalidate and install or reinstall a large number of DSM page copies.

If, on the other hand, w remains fixed while n is increased in correspondence with N, then the range of possible values for x increases from {w, …, n} to {w, …, N}. Thus, the potential for subsequent instances of Ix to be substantially different is higher. This results in a further cost increase for an application whose access pattern is characterized by a sequence of substantially different x-values spanning a wide subrange of {w, …, N} during the execution.
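The cost of moving from one instance of Ix to the next can be sketched as the size of the set differences between successive copy sets. The helper below is a hypothetical illustration, not part of the protocols:

```python
def transition_cost(old_sites, new_sites):
    """Per-page coherence work when the copy set moves from
    old_sites to new_sites: the page must be installed at the
    newly joining sites and invalidated at the leaving ones
    (a simplified sketch that counts one message per site)."""
    installs = len(new_sites - old_sites)
    invalidations = len(old_sites - new_sites)
    return installs + invalidations

# Widening n lets successive copy sets differ more, so churn grows:
stable = transition_cost({1, 2, 3}, {1, 2, 4})     # small overlap change
scattered = transition_cost({1, 2, 3}, {6, 7, 8})  # disjoint copy sets
print(stable, scattered)  # 2 vs 6
```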

Figure 5 shows the results for Parallel Matrix Multiply. Although the total number of messages of BR(w, n) coherence protocols is higher than for WI, WID, and WB (due to reasons discussed earlier), the percentage increase in the total number of messages transmitted when increasing the network size from four to eight and correspondingly increasing the n parameter for BR(w, n) is substantially lower than for the other coherence protocols: for BR(w, n), the percentage increase is either 3% or 8%, whereas it is 20% for WI and WID and 26% for WB. The BR(w, n) coherence protocols seem to naturally support the application's access pattern: various copies are distributed among the participating sites in read phases, whereas only a few copies are maintained during write phases. Since the application can basically be characterized as a single long "read phase" (when the rows and columns of the two input matrices are read) followed by a single short "write phase" (in which the result matrix is written), local (that is, inexpensive) read operations are realized during the "read phase," and update costs are reduced by the invalidation of excessive copies at the beginning of the "write phase." This behavior results in only a slight overall cost increase—that is, in good scaling behavior.

For Quicksort, the results are summarized in Figure 6. Here, the percentage increase is 50% to 58%, compared to 92%, 94%, and 105% for WI, WID, and WB, respectively. This behavior occurs because, through its mechanism, BR(w, N) has in most cases already installed copies of a DSM page by the time the Quicksort application needs it. Thus, DSM page invalidations and installations are few. As a result, read operations that occur after a write operation are very likely to be local (that is, inexpensive). This leads to a reduced overall cost increase when the network (and therefore the DSM application) is scaled.

As an interesting result, for DSM applications with characteristic access patterns such as Parallel Matrix Multiply and Quicksort, BR(w, n) coherence protocols scaled significantly better than WI, WID, and especially WB for a realistic number of cooperating processes and a number of page copies likely to be used in DSM systems. In addition, BR(w, n) coherence protocols were the most configurable of the protocols.

MEMORY COHERENCE PROTOCOLS greatly affect the behavior of DSM systems and govern the operation costs and number of messages transmitted. Fault-tolerance and recovery methods generally depend on replicated copies of the shared data, which are available due to the coherence protocols.

Overall, our investigation showed that BR provides a mechanism to control the level of replication and hence the degree of fault tolerance that can be provided. The related overhead in terms of the total number of messages transmitted depends on the application's behavior. As a rule of thumb, BR(w, n)'s overhead is more than WI's and less than WB's if the DSM application exhibits a low read–write ratio—a spectrum representing a compromise between the desired degree of fault tolerance and cost. If the read–write ratio is high, BR(w, n)'s overhead might be even higher than the overhead of WB, especially if the application does not homogeneously reference the various DSM pages during execution. In any case, and irrespective of the application's access pattern, BR(w, n) is the only DSM coherence protocol that guarantees a given degree of fault tolerance in terms of data availability throughout the entire application execution.

WB scaled poorly, and WI and WID scaled reasonably well. BR(w, n) scaled better than WI, WID, or WB in the samples we examined while also being more configurable. Thus, if you are willing to pay for a certain degree of fault tolerance by using an instance of a BR(w, n) coherence protocol, then decreasing the DSM application's execution time by scaling the application to a higher number of processors will be rewarded by a gentle increase in terms of transmitted messages.

Figure 5. Scalability of WI, WID, WB, and BR coherence protocols using Parallel Matrix Multiply with a 1-Kbyte DSM page size. The percentages indicate the increase in the total number of transmitted messages when the number of participating sites increases from four to eight.

Since we conceived this article, considerable research has progressed in this area. We are implementing and measuring BR in a new, portable DSM system. We are also examining policies for better selection of how and when to make copy adjustments in both the DBR coherence protocol and its extension, DBRpc.11

ACKNOWLEDGMENTS
This research was sponsored in part by US NSF grant number CCR-9704015, the UC Micro Program, and an equipment grant from HP Research Labs.

References
1. V. Lo, "Operating Systems Implementations of Distributed Shared Memory," Advances in Computers, Vol. 39, 1994, pp. 197–237.
2. O. Theel and B.D. Fleisch, "The Boundary-Restricted Coherence Protocol for Scalable and Highly Available Distributed Shared Memory Systems," The Computer J., Vol. 39, No. 6, Aug. 1996, pp. 496–510.
3. O. Theel and B.D. Fleisch, "A Dynamic Coherence Protocol for Distributed Shared Memory Enforcing High Data Availability at Low Cost," IEEE Trans. Parallel and Distributed Systems, Vol. 7, No. 9, Sept. 1996, pp. 915–927.
4. C. Morin and I. Puaut, "A Survey of Recoverable Distributed Shared Virtual Memory Systems," IEEE Trans. Parallel and Distributed Systems, Vol. 8, No. 9, Sept. 1997, pp. 959–969.
5. M. Singhal and N. Shivaratri, "Distributed Shared Memory," Chapter 8, Advanced Concepts in Operating Systems, McGraw-Hill, New York, 1994, pp. 241–244.
6. B.D. Fleisch, N.C. Juul, and R.L. Hyde, "Mirage+: A Kernel Implementation of Distributed Shared Memory for a Network of Personal Computers," Software Practice and Experience, Vol. 23, No. 10, Oct. 1994, pp. 569–591.
7. A. Paithankar, AINT: A Tool for Simulation of Shared-Memory Multiprocessors, master's thesis, Univ. of Colorado, 1995.
8. S.K. Shah and B.D. Fleisch, "A Comparison of DSM Coherence Protocols Using Program Driven Simulations," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications (PDPTA), Vol. 3, CSREA Press, Athens, Ga., July 1998, pp. 1546–1553.
9. J.P. Singh, W.-D. Weber, and A. Gupta, "Splash: Stanford Parallel Applications for Shared Memory," Computer Architecture News, Vol. 20, No. 1, Mar. 1992, pp. 5–44.
10. S.C. Woo et al., "Splash-2 Programs: Characterization and Methodological Considerations," Proc. 22nd Annual Int'l Symp. Computer Architecture, ACM Press, New York, June 1995, pp. 24–37.
11. J. Turk and B. Fleisch, "DBRpc: A Highly Adaptable Protocol for Reliable DSM Systems," Proc. 19th IEEE Int'l Conf. Distributed Computing Systems, May 1999, pp. 340–348.

Brett D. Fleisch is an associate professor in the Department of Computer Science and Engineering at the University of California, Riverside. His research interests are in operating systems, distributed shared memory, fault tolerance, reliability, and availability; his dissertation was entitled Distributed Shared Memory in a Loosely Coupled Environment. He received his PhD from UCLA, his MS from Columbia University, and his BA from the University of Rochester, all in computer science. Fleisch is a member of the ACM, the IEEE Computer Society, and Usenix. Contact him at Univ. of California, Riverside, Riverside, CA 92521; [email protected].

Heiko Michel is working towards his PhD at the Institute of Microelectronic Systems at the University of Kaiserslautern, Germany. His research interests include digital signal processing, especially efficient implementations of channel decoding algorithms for use in mobile communications. He received the Dipl.-Ing. degree in electrical engineering from Darmstadt University of Technology, Darmstadt, Germany. Contact him at the Univ. of Kaiserslautern, Inst. of Microelectronic Systems, Erwin-Schroedinger-Strasse, D-67663 Kaiserslautern, Germany; [email protected].

Sachin K. Shah is a member of technical staff at IBM Corporation in Pittsburgh (formerly Transarc Corp.). He received his MS in computer science from the University of California, Riverside; his thesis was entitled "Fault Tolerance and Scalability in DSM Coherence Protocols." He completed his B.Tech in computer engineering at Thadomal Shahani Engineering College, which is part of Bombay University, India. While there, Shah chaired the IEEE Computer Society chapter. Contact him at [email protected].

Oliver E. Theel is a faculty member in the Computer Science Department at the Darmstadt University of Technology, Germany, where he earned his MSc and PhD in computer science. He spent 1994–1995 as a visiting researcher at the University of California, Riverside. His research interests are distributed systems, fault tolerance, replication techniques, properties of distributed algorithms, control theory, and self-stabilization. He is a member of the IEEE Computer Society. Contact him at Darmstadt Univ. of Technology, Alexanderstrasse 10, D-64283 Darmstadt, Germany; [email protected].


Figure 6. Scalability of WI, WID, WB, and BR coherence protocols using Quicksort with a 1-Kbyte DSM page size. The percentages indicate the increase in the total number of transmitted messages when the number of participating sites increases from four to eight.