New York University MILAN

Bridging the Gap Between Distributed Shared Memory and Message Passing

Holger Karl
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University

Partially supported by DARPA/Rome Laboratory and Deutsche Forschungsgemeinschaft




Overview

Motivation

Related approaches

Charlotte

Annotating Charlotte

Some experiments

Summary

Motivation

Scenario: distributed computing using the World Wide Web
Use Java to overcome heterogeneity, security concerns, and program installation problems
Idea: volunteer computing
– Applets can participate in distributed applications

Challenges:
Only make use of standard Web browsers/Java virtual machines
Consider the Web's and Java's idiosyncrasies
– Inherent faultiness; machines come and go
– The host-of-origin policy imposes a star-like communication topology
Provide a simple programming interface, e.g., DSM
No hardware support (e.g., page protection) is available
Do not forget efficiency

Related Approaches

System       Standard browser  Programming interface   Fault tolerance
JPVM         No                Message passing         Possible
ATLAS        No                Work stealing           Yes
JavaParty    Yes (with RMI)    Remote objects          No
ParaWeb      Yes               Message passing / DSM   No / No
Java/DSM     No                DSM                     No
Javelin      Yes               Infrastructure support  Possible
Charlotte    Yes               DSM                     Yes
(PDCS '96)

Charlotte - Programs

Alternating sequential and parallel steps
Sequential steps executed by a manager application
Parallel steps executed by worker applets
Routines are defined in parallel steps
Routines are methods of objects, derived from a Charlotte class Droutine

public class Matrix extends Droutine {
    // define routine
    public void drun (int n, int Id) {
        // do computation
    }
    ...
    public void run () {
        ...
        parBegin();
        addRoutine (this, Size);
        parEnd();
        ...
    }
}

Charlotte - Memory Semantics

Memory partitioned into private and shared parts
Shared memory with CRCW-common semantics
Implemented at the object level
Charlotte provides classes for every primitive type: int → Dint, float → Dfloat, etc.
Manager has the master copy of memory
Shared objects have get() and set() methods
– get() brings invalid data from the manager if necessary; the cost is amortized by bringing "pages" of shared data from the manager
– set() marks data as modified, to be flushed to the manager at the end of the parallel step
Updates from different routines are incorporated atomically at the end of the parallel step
Allows eager scheduling for fault tolerance
Routines can be executed multiple times without violating exactly-once semantics
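As a minimal, self-contained sketch, a Dint-style shared object might implement these semantics roughly as follows. The class name, the in-process managerCopy map, and the per-object flags are illustrative simplifications; the real Charlotte Dint talks to the manager over the network and transfers whole pages of data.

```java
import java.util.HashMap;
import java.util.Map;

public class DintSketch {
    // Stand-in for the manager's master copy of shared memory.
    static Map<Integer, Integer> managerCopy = new HashMap<>();

    final int id;      // index of this shared int in the master copy
    int value;
    boolean valid;     // false until fetched from the manager
    boolean modified;  // set() marks data dirty for the end-of-step flush

    DintSketch(int id) { this.id = id; }

    int get() {
        if (!valid) {  // fetch invalid data from the manager on first access
            value = managerCopy.getOrDefault(id, 0);
            valid = true;
        }
        return value;
    }

    void set(int v) { value = v; valid = true; modified = true; }

    // At the end of a parallel step, modified data is flushed back.
    void flush() {
        if (modified) { managerCopy.put(id, value); modified = false; }
    }
}
```

Because updates only reach the master copy at the end of a parallel step, re-executing a routine simply overwrites the same values, which is what makes eager scheduling safe.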

Charlotte - Matrix Multiplication

public class Matrix extends Droutine {
    // define routine
    public void drun (int n, int i) {
        for (int j = 0; j < Size; j++) {
            int sum = 0;
            for (int k = 0; k < Size; k++)
                sum += A[i][k].get() * B[k][j].get();
            C[i][j].set(sum);
        }
    }
    ...
}

Charlotte - Pros and Cons

Pros
– Simple programming model
– Well-defined DSM semantics
– Adaptive parallelism
– Fault tolerant with respect to worker crashes

Cons: efficiency problems
– Shared data is accessed via method invocations
– Loading data from the manager incurs latency
– Choosing a good page size is difficult
– Sending modified data back to the manager requires inspection of status flags

Reason: the runtime system does not know which data is read or modified by routines
⇒ Make this information explicit in Charlotte programs!

Annotations

Specify read and write locations for routines
Use methods of class Droutine:
– dloc_read()
– dloc_write()
First step: correctness-preserving annotations
Only send read data at the beginning of a routine

public class Matrix extends Droutine {
    public void drun (int n, int i) {...}

    public Locations dloc_read (int n, int i) {
        Locations loc = new Locations();
        loc.add (B);
        loc.add (A[i]);
        return loc;
    }
    ...
}
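Charlotte's actual Locations class is not shown on the slides; a minimal stand-in supporting the add() calls above might look like the sketch below. The size() and contains() helpers are assumptions added for illustration, not part of Charlotte's documented interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal Locations container: records which shared
// objects a routine declares it will read or write.
public class Locations {
    private final List<Object> entries = new ArrayList<>();

    // Register a shared object (or an array of shared objects).
    public void add(Object shared) { entries.add(shared); }

    public int size() { return entries.size(); }

    public boolean contains(Object shared) { return entries.contains(shared); }
}
```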

Relying on Annotations

Second step: rely on annotations
Data transfer between manager and worker happens only according to the annotations
– No if-statement in get() necessary
– No status update in set() necessary
– No latencies for requesting data updates
– But: wrong annotations lead to wrong results
Use Uint (unchecked) instead of Dint, etc.
– Identical interface
– Remaining overhead: method invocation for data access
Charlotte's distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed
Simple to switch back and forth
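As a sketch, an unchecked shared integer in the spirit of Uint could be as small as this (the class name and field are illustrative; the point is that with correctness-assuming annotations, get() and set() reduce to plain accessors):

```java
// Hypothetical simplification of an unchecked shared int (Uint).
// The interface is identical to Dint, so the two can be freely mixed.
public class UintSketch {
    private int value;

    // No validity check, no fetch from the manager: the annotations
    // guarantee the data is already present at the worker.
    public int get() { return value; }

    // No "modified" flag: the write set is known from the annotations.
    public void set(int v) { value = v; }
}
```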

Sharing Primitive Types

Third step: use primitive types (int) instead of objects (Uint)
– Possible since get() and set() are trivial for Uint
– Possible since annotations describe data movement completely
Avoids method invocation overhead
But: different interface
– No get()/set() ⇒ syntactic changes
– Call-by-reference vs. call-by-value
Particularly suited to arrays of distributed types
Freely mix Dint, Uint, and shared int
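The third step can be illustrated with a self-contained sketch of drun over plain int arrays. Size, A, B, and C are made small static fields here so the example runs stand-alone; in a Charlotte program they would be shared data whose movement is described by the annotations (not shown).

```java
public class MatrixPrim {
    static final int Size = 2;
    static int[][] A = {{1, 2}, {3, 4}};
    static int[][] B = {{5, 6}, {7, 8}};
    static int[][] C = new int[Size][Size];

    // Row i of C is computed with direct array accesses: no get()/set()
    // method invocations, since the runtime learns the read set (A[i], B)
    // and write set (C[i]) from the routine's annotations.
    static void drun(int n, int i) {
        for (int j = 0; j < Size; j++) {
            int sum = 0;
            for (int k = 0; k < Size; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}
```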

Additional Optimizations

Manager keeps track of the workers' valid data sets
Use routines' read sets to select which routine to give to a worker
– Choose the routine with the minimal amount of missing data
– Example: 3 routines, 2 workers

Data can be cached at workers between parallel steps
– Requires data to be declared unmodified at the beginning of a parallel step

[Figure: Worker 1 has data 1-100 valid, Worker 2 has 150-200 valid; the three routines' read sets are 20-80, 90-150, and 160-210]
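The read-set-based selection can be sketched as follows. The interval representation, the missing() helper, and the class name are illustrative assumptions, not Charlotte's actual scheduler API; data sets are modeled as inclusive [lo, hi] ranges as in the example figure.

```java
public class Scheduler {
    // Amount of a routine's read set (inclusive [lo, hi]) that is
    // not covered by the worker's valid data range.
    static int missing(int[] readSet, int[] valid) {
        int overlapLo = Math.max(readSet[0], valid[0]);
        int overlapHi = Math.min(readSet[1], valid[1]);
        int overlap = Math.max(0, overlapHi - overlapLo + 1);
        return (readSet[1] - readSet[0] + 1) - overlap;
    }

    // Pick the routine that requires shipping the least missing data
    // to this worker.
    static int pick(int[][] readSets, int[] valid) {
        int best = 0;
        for (int r = 1; r < readSets.length; r++)
            if (missing(readSets[r], valid) < missing(readSets[best], valid))
                best = r;
        return best;
    }
}
```

With the figure's numbers, Worker 1 (valid 1-100) is given the routine reading 20-80, and Worker 2 (valid 150-200) the routine reading 160-210.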

Discussion

Gradual incorporation of semantic knowledge into Charlotte programs
Large flexibility to mix different levels of annotation semantics in one program
Easy to switch back and forth among these levels
Close to message-passing behavior
Charlotte's nice properties are preserved (fault tolerance, adaptive parallelism)
Is it efficient?

[Diagram: pure DSM → DSM + hints (annotations with correctness check) → DSM + unchecked objects (correctness-assuming annotations) → DSM + shared primitives (primitive types instead of objects) → message passing; the whole path is a gradual program transformation]

Annotating DSM Code

Munin: annotating data with expected access patterns; the system chooses appropriate consistency protocols
Aurora: object-based system in C++; different consistency models are dynamically selected at runtime
CRL, Cid: library of C functions; explicit mapping of shared memory into local memory
Jade: annotations used to extract parallelism from sequential code

All lack the flexibility to mix different guarantee levels

Experiments - Setup

Setup: Multiplication of 200x200 integer matrices

PentiumPro 200 at NYU with FastEthernet (100MBit/sec)

Pentium 90 at Humboldt University Berlin

Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux

Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200)

Ping between NYU and HUB: typically 130 msec.

Experiments

Multiply two 200x200 integer matrices
Runtimes for:
– Standard Charlotte (Dint)
– Standard Charlotte plus correctness-preserving annotations (Dint+A)
– Unchecked classes (Uint)
– Shared primitive types (int)
– A message passing implementation (mes.pas.)
Shared primitive types are competitive with message passing
Reasonable absolute speedups (compared to the sequential runtime)

[Charts: runtime in seconds and absolute speedup vs. number of workers (1-4) for Dint, Dint+A, Uint, int, and mes.pas.]

Experiments (contd.)

Ratios between the various optimizations:
– Annotations/presending give a factor of about three
– Getting rid of objects is another factor of two
– Altogether: up to nine times faster than standard Charlotte
– And better scalability

[Chart: ratio vs. number of workers (1-4) for Dint/Dint+A, Dint/Uint, and Dint/int]

Experiments (contd.)

Manager at NYU, worker at HU Berlin
Similar behavior, if with a slightly smaller improvement

[Charts: runtime in seconds (for Dint, Dint+A, int, and mes.pas.) and ratios (Dint/Dint+A, Dint/int, Dint/mes.pas.) vs. number of workers (1-2)]

Experiments (contd.)

Colocation and caching
Example: multiply matrix A by two different matrices B1 and B2
Small impact in a LAN environment
Up to 25% improvement for high-latency connections
Colocation results in a smaller standard deviation of runtimes

[Chart: runtime in seconds for Dint+Ann., Caching, and Colocation, with worker locations NYU (comp.), NYU, Berlin (comp.), and Berlin]

Future Work

Investigating more elaborate examples
– Data with both regular and irregular access patterns
Compiler-generated annotations
New programming interface
– No nested parallelism
– parBegin()/parEnd() makes some annotations awkward
Overlapping communication and computation
– Multiple worker threads in one virtual machine
Applying annotations to Calypso (a page-based DSM system)
Down the road: use annotations for QoS

Summary

Annotations describe the read and write sets of parallel routines
– Correctness-preserving
– Correctness-sensitive
– Or directly sharing primitive types
Possible to mix these different levels in a single program and to switch back and forth between them
Big flexibility for the programmer
Maintains Charlotte's advantages such as adaptive parallelism and fault tolerance
Performance competitive with message passing programs