
TreadMarks: Shared Memory Computing on Networks of Workstations

C. Amza, A. L. Cox, S. Dwarkadas, P.J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel
Rice University

INTRODUCTION

• Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space

• Key issue in building a software DSM is minimizing the amount of data communication among the workstation memories

Why bother with DSM?

• Key idea is to build fast parallel computers that are
– Cheaper than shared memory multiprocessor architectures
– As convenient to use

Conventional parallel architecture

[Figure: several CPUs, each with its own cache, connected to a single shared memory]

Today’s architecture

• Clusters of workstations are much more cost-effective
– No need to develop complex bus and cache structures
– Can use off-the-shelf networking hardware
• Gigabit Ethernet
• Myrinet (1.5 Gb/s)
– Can quickly integrate the newest microprocessors

Limitations of cluster approach

• Communication within a cluster of workstations is through message passing
– Much harder to program than concurrent access to a shared memory
• Many big programs were written for shared memory architectures
– Converting them to a message-passing architecture is a nightmare

Distributed shared memory

[Figure: the workstations' individual main memories appear as one shared global address space (DSM = one shared global address space)]

Distributed shared memory

• DSM makes a cluster of workstations look like a shared memory parallel computer
– Easier to write new programs
– Easier to port existing programs
• Key problem is that DSM only provides the illusion of having a shared memory architecture
– Data must still move back and forth among the workstations

Munin

• Developed at Rice University
• Based on software objects (variables)
• Used the processor's virtual memory to detect accesses to the shared objects
• Included several techniques for reducing consistency-related communication
• Only ran on top of the V kernel

Munin main strengths

• Excellent performance
• Portability of programs
– Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimal number of changes ("dusty decks")

Munin main weakness

• Very poor portability of Munin itself
– Depended on some features of the V kernel
• Not maintained since the late 80's

TreadMarks

• Provides DSM as an array of bytes
• Like Munin,
– Uses release consistency
– Offers a multiple-writer protocol to fight false sharing
• Runs at user level on a number of UNIX platforms
• Offers a very simple user interface
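The Tmk interface used in the examples

For reference, the subset of the interface that the following examples use can be summarized as below. The signatures are assumptions reconstructed from the calls shown on these slides, not the official Tmk header.

/* Sketch of the TreadMarks interface as used in these slides.
   Signatures are inferred from the example code (assumptions). */
extern unsigned Tmk_proc_id;          /* id of this process          */
extern unsigned Tmk_nprocs;           /* total number of processes   */

void  Tmk_startup(void);              /* join the computation        */
void  Tmk_exit(void);                 /* leave the computation       */
void *Tmk_malloc(unsigned size);      /* allocate shared memory      */
void  Tmk_barrier(unsigned id);       /* wait for all processes      */
void  Tmk_lock_acquire(unsigned id);  /* enter a critical section    */
void  Tmk_lock_release(unsigned id);  /* leave a critical section    */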

First example: Jacobi iteration

• Illustrates the use of barriers
– A barrier is a synchronization primitive that forces processes accessing it to wait until all processes have reached it
• Forces processes to wait until all of them have completed a specific step

Jacobi iteration: overall organization

• Operates on a two-dimensional array
• Each processor works on a specific band of rows
– Boundary rows are shared

[Figure: the rows of the array are partitioned into horizontal bands, one per processor, from Proc 0 to Proc n-1]

Jacobi iteration: overall organization

• During each iteration step, each array element is set to the average of its four neighbors
– Averages are stored in a scratch matrix and copied later into the shared matrix

Jacobi iteration: the barriers

• Mark the end of each computation phase
• Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed"
• Include an implicit release() followed by an implicit acquire()
– To be explained later

Jacobi iteration: declarations

#define M
#define N
float *grid;           // shared array
float scratch[M][N];   // private array

Jacobi iteration: startup

main() {
    Tmk_startup();
    if (Tmk_proc_id == 0) {
        grid = Tmk_malloc(M*N*sizeof(float));
        initialize grid;
    } // if
    Tmk_barrier(0);
    length = M/Tmk_nprocs;
    begin = length*Tmk_proc_id;
    end = length*(Tmk_proc_id + 1);

Jacobi iteration: main loop

    for (number of iterations) {
        for (i = begin; i < end; i++)
            for (j = 0; j < N; j++)
                scratch[i][j] = (grid[i-1][j] + … + grid[i][j+1])/4;
        Tmk_barrier(1);
        for (i = begin; i < end; i++)
            for (j = 0; j < N; j++)
                grid[i][j] = scratch[i][j];
        Tmk_barrier(2);
    } // main loop
} // main
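Jacobi iteration: a compilable sketch

Putting the three fragments together, a minimal compilable version might look as follows. It assumes the interface summarized earlier, picks example values for M, N, and the iteration count (the slides leave them blank), and fixes the loop bounds so that boundary rows stay untouched.

#define M 1024               /* example value; elided on the slides */
#define N 1024               /* example value; elided on the slides */
#define ITERATIONS 100       /* assumed iteration count             */

float *grid;                 /* shared array (row-major, M x N)     */
float scratch[M][N];         /* private scratch array               */

int main(int argc, char **argv) {
    int i, j, iter, length, begin, end;

    Tmk_startup();
    if (Tmk_proc_id == 0) {
        grid = Tmk_malloc(M * N * sizeof(float));
        /* initialize grid here; the slides also gloss over how the
           other processes learn the value of the grid pointer */
    }
    Tmk_barrier(0);                    /* wait until grid is ready   */

    length = M / Tmk_nprocs;           /* band of rows per process   */
    begin  = length * Tmk_proc_id;
    end    = length * (Tmk_proc_id + 1);

    for (iter = 0; iter < ITERATIONS; iter++) {
        for (i = begin; i < end; i++) {
            if (i == 0 || i == M - 1)
                continue;              /* boundary rows stay fixed   */
            for (j = 1; j < N - 1; j++)
                scratch[i][j] = (grid[(i-1)*N + j] + grid[(i+1)*N + j] +
                                 grid[i*N + (j-1)] + grid[i*N + (j+1)]) / 4;
        }
        Tmk_barrier(1);                /* everyone has read old grid */
        for (i = begin; i < end; i++) {
            if (i == 0 || i == M - 1)
                continue;
            for (j = 1; j < N - 1; j++)
                grid[i*N + j] = scratch[i][j];
        }
        Tmk_barrier(2);                /* new values are installed   */
    }
    Tmk_exit();
    return 0;
}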

Second example: TSP

• Traveling salesman problem
– Finding the shortest path through a number of cities
• Program keeps a queue of partial tours
– Most promising at the end

TSP: declarations

queue_type *Queue;
int *Shortest_length;
int queue_lock_id, min_lock_id;

TSP: startup

main() {
    Tmk_startup();
    queue_lock_id = 0;
    min_lock_id = 1;
    if (Tmk_proc_id == 0) {
        Queue = Tmk_malloc(sizeof(queue_type));
        Shortest_length = Tmk_malloc(sizeof(int));
        initialize Heap and Shortest_length;
    } // if
    Tmk_barrier(0);

TSP: while loop

    while (true) {
        Tmk_lock_acquire(queue_lock_id);
        if (queue is empty) {
            Tmk_lock_release(queue_lock_id);
            Tmk_exit();
        } // if
        Keep adding to the queue until a long, promising tour appears at the head
        Path = Delete the tour from the head
        Tmk_lock_release(queue_lock_id);

TSP: end of main

        length = recursively try all cities not on Path, find the shortest tour length
        Tmk_lock_acquire(min_lock_id);
        if (length < *Shortest_length)
            *Shortest_length = length;
        Tmk_lock_release(min_lock_id);
    } // while

} // main

Critical sections

• All accesses to shared variables are surrounded by a pair:

Tmk_lock_acquire(lock_id);
…
Tmk_lock_release(lock_id);
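As a minimal illustration of the pattern, here is a hedged sketch (the counter and the function name are invented for this example) of updating a shared value under a TreadMarks lock:

#define COUNTER_LOCK 2        /* lock ids are small integers, as in TSP  */

int *counter;                 /* shared, allocated by process 0          */

void add_to_counter(int amount) {
    Tmk_lock_acquire(COUNTER_LOCK);  /* also imports the latest *counter */
    *counter += amount;
    Tmk_lock_release(COUNTER_LOCK);  /* exports the update to the next
                                        acquirer of this lock            */
}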

Implementation Issues

• Consistency issues
• False sharing

Consistency model (I)

• Shared data are replicated at times
– To speed up read accesses

• All workstations must share a consistent view of all data

• Strict consistency is not possible

Consistency model (II)

• Various authors have proposed weaker consistency models
– Cheaper to implement
– Harder to use in a correct fashion
• TreadMarks uses software release consistency
– Only requires the memory to be consistent at specific synchronization points

SW release consistency (I)

• Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables
– P(&mutex) and V(&mutex)
– lock(&csect) and unlock(&csect)
– acquire( ) and release( )

• Unprotected accesses can produce unpredictable results

SW release consistency (II)

• SW release consistency only guarantees correctness of operations performed within an acquire/release pair

• No need to export the new values of shared variables until the release

• Must guarantee that a workstation has received the most recent values of all shared variables when it completes an acquire

SW release consistency (III)

First process:

shared int x;
acquire( );
x = 1;
release( ); // export x = 1

Second process:

acquire( ); // wait for new value of x
x++;
release( ); // export x = 2

SW release consistency (IV)

• Must still decide how to propagate updated values
– TreadMarks uses lazy release:
• Delays propagation until an acquire is issued
– Its predecessor Munin used eager release:
• New values of shared variables were propagated at release time

SW release consistency (V)

[Figure: timelines contrasting eager release, which propagates updates to all processes at release time, with lazy release, which propagates them only when another process performs an acquire]

False sharing

[Figure: two processes access different variables, x and y, that reside on the same page]

The page containing x and y will move back and forth between the main memories of the workstations.

Multiple write protocol (I)

• Designed to fight false sharing• Uses a copy-on-write mechanism• Whenever a process is granted access to write-

shared data, the page containing these data is marked copy-on-write

• First attempt to modify the contents of the page will result in the creation of a copy of the page modified (the twin).

Creating a twin

Multiple write protocol (II)

• At release time, TreadMarks
– Performs a word-by-word comparison of the page and its twin
– Stores the diff in the space used by the twin page
– Informs all processors having a copy of the shared data of the update
• These processors will request the diff the first time they access the page

Creating a diff

[Figure: before the first write access, the page holds x = 1, y = 2 and an identical twin is created; after the write, the page holds x = 3, y = 2 while the twin still holds x = 1, y = 2; comparing the page with its twin yields the diff "new value of x is 3"]
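The word-by-word comparison can be sketched as follows. The (offset, value) representation and the names are illustrative assumptions for this sketch; the actual system stores diffs as a run-length encoding in the space of the twin.

#include <stddef.h>

#define PAGE_WORDS (4096 / sizeof(unsigned))

struct diff_entry {
    size_t   offset;   /* word index within the page */
    unsigned value;    /* new value of that word     */
};

/* Compare a modified page with its twin; write one entry per changed
   word into out[] (capacity PAGE_WORDS) and return the entry count. */
size_t make_diff(const unsigned *page, const unsigned *twin,
                 struct diff_entry *out) {
    size_t i, n = 0;
    for (i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i]) {      /* word-by-word comparison */
            out[n].offset = i;
            out[n].value  = page[i];
            n++;
        }
    return n;
}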

Multiple write protocol (III)

• TreadMarks could, but does not, check for conflicting updates to write-shared pages

The TreadMarks system

• Runs entirely at user level
• Links with programs written in C, C++, and Fortran
• Uses UDP/IP for communication (or AAL3/4 if the machines are connected by an ATM LAN)
• Uses the SIGIO signal to speed up processing of incoming requests
• Uses the mprotect( ) system call to control access to shared pages
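The write-detection mechanism behind twin creation can be sketched with standard POSIX calls: write-protect a shared page with mprotect( ), catch the SIGSEGV delivered on the first write, make the twin, then unprotect so the write can proceed. This is a simplified single-page illustration, not the actual TreadMarks fault handler (which must, among other things, locate the faulting address and be careful about signal safety).

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *shared_page;   /* one write-shared page; assumed to be
                               page-aligned (e.g., from mmap)         */
static char *twin;          /* copy made on the first write access    */

/* SIGSEGV handler: the first write to the protected page lands here.
   (malloc/memcpy are not async-signal-safe; fine for a sketch only.) */
static void on_write_fault(int sig) {
    (void)sig;
    twin = malloc(getpagesize());
    memcpy(twin, shared_page, getpagesize());       /* create the twin */
    /* re-enable writes; the faulting instruction is then restarted */
    mprotect(shared_page, getpagesize(), PROT_READ | PROT_WRITE);
}

void protect_shared_page(void) {
    signal(SIGSEGV, on_write_fault);
    /* read-only: the next write triggers twin creation */
    mprotect(shared_page, getpagesize(), PROT_READ);
}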

Performance evaluation (I)

• Long discussion of two large TreadMarks applications

Performance evaluation (II)

• A previous paper compared the performance of TreadMarks with that of Munin
– Munin performance was typically within 5 to 33% of the performance of hand-coded message-passing versions of the same programs
– TreadMarks was almost always better than Munin, with one exception:
• A 3-D FFT program

Performance evaluation (III)

• The 3-D FFT program was an iterative program that read some shared data outside any critical section
– Doing otherwise would have been too costly

• Munin used eager release, which ensured that the values read were not far from their true values

• Not true for TreadMarks!

Other DSM Implementations (I)

• Sequentially-consistent software DSM (IVY):
– Sends messages to other copies at each write
– Much slower

• Software release consistency with eager release (Munin)

Other DSM Implementations (II)

• Entry consistency (Midway):
– Requires each variable to be associated with a synchronization object (typically a lock)
– Acquire/release operations on a given synchronization object only involve the variables associated with that object
– Requires less data traffic
– Does not handle dusty decks well

Other DSM Implementations (III)

• Structured DSM systems (Linda):
– Offer the programmer a shared tuple space accessed using specific synchronized methods
– Require a very different programming style

CONCLUSIONS

• Can build an efficient DSM entirely in user space
– Modern UNIX systems offer all the required primitives
• The software release consistency model works very well
• Lazy release is almost always better than eager release