
Page 1: War of the Worlds -- Shared-memory vs. Distributed-memory

War of the Worlds -- Shared-memory vs. Distributed-memory

• In the distributed world, we have heavyweight processes (nodes) rather than threads

• Nodes communicate by exchanging messages
– We do not have shared memory

• Communication is much more expensive
– Sending a message takes much more time than sending data through a channel
– Possibly non-uniform communication

• We only have 1-to-1 communication (no many-to-many channels)

Page 2: War of the Worlds -- Shared-memory vs. Distributed-memory

Initial Distributed-memory Settings

• We consider settings where there is no multithreading within a single MPI node

• We consider systems where communication latency between different nodes is
– Low
– Uniform

Page 3: War of the Worlds -- Shared-memory vs. Distributed-memory

Good Shared Memory Orbit Version

[Diagram: a shared task pool hands chunks of points [x1, x2, …, xm] to worker threads, each applying the generators f1–f5; hash server threads 1–3 own the orbit parts O1 = {z1,z4,z5,…, zl1}, O2 = {z2,z3,z8,…, zl2} and O3 = {z6,z7,z9,…, zl3}, and feed newly discovered points back into the pool.]
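To make the picture concrete, here is a minimal HPC-GAP-style sketch of the two thread roles, assuming the channel primitives CreateChannel/SendChannel/ReceiveChannel and a plain sorted list standing in for the real hash table; the helper names (taskPool, toHash, inbox, myOrbitSlice) are illustrative, not the deck's actual code.

# Sketch only: a worker thread pulls chunks from the shared pool and
# routes each image point to the hash server thread that owns it.
WorkerThread := function(taskPool, toHash, gens, op, f)
  local chunk, x, g, y;
  while true do
    chunk := ReceiveChannel(taskPool);     # blocks until work is available
    if chunk = fail then return; fi;       # poison pill: shut down
    for x in chunk do
      for g in gens do
        y := op(x, g);
        SendChannel(toHash[f(y)], y);      # f(y) picks the owning server
      od;
    od;
  od;
end;

# Sketch only: a hash server thread records new points and turns them
# into fresh tasks on the shared pool.
HashServerThread := function(inbox, taskPool, myOrbitSlice)
  local y;
  while true do
    y := ReceiveChannel(inbox);
    if y = fail then return; fi;
    if not y in myOrbitSlice then
      AddSet(myOrbitSlice, y);             # previously unseen point
      SendChannel(taskPool, [y]);          # becomes a new chunk of work
    fi;
  od;
end;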

Page 4: War of the Worlds -- Shared-memory vs. Distributed-memory

Why is this version hard to port to MPI?

• Single task pool!
– Requires a shared structure to which all of the hash servers write data and from which all of the workers read

• Not easy to implement using MPI, where we only have 1-to-1 communication

• We could have a dedicated node that holds the task queue (sketched below)
– Workers send messages to it to request work
– Hash servers send messages to it to push work
– This would make that node a potential bottleneck, and would involve a lot of communication
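For a sense of why the dedicated node hurts, here is a rough sketch of its loop, written in the style of the OrbSendMessage/OrbGetMessage helpers shown on the later code slides; the message tags "push", "getwork" and "nowork" are made-up names for this illustration. Every chunk of work passes through this single node twice (pushed in, then handed out), so all traffic serializes here.

# Sketch only: a dedicated task-queue node.
QueueNode := function()
  local pool, msg;
  pool := [];
  while true do
    msg := OrbGetMessage(true);            # block for the next request
    if msg[1] = "push" then                # a hash server pushes new tasks
      Append(pool, msg[2]);
    elif msg[1] = "getwork" then           # a worker asks for a task
      if Length(pool) > 0 then
        OrbSendMessage(Remove(pool), msg[2]);
      else
        OrbSendMessage("nowork", msg[2]);  # worker must retry later
      fi;
    fi;
  od;
end;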

Page 5: War of the Worlds -- Shared-memory vs. Distributed-memory

MPI Version 1

• Maybe merge workers and hash servers?

• Each MPI node acts both as a hash server and as a worker

• Each node has its own task pool

• If a node's task pool is empty, it tries to steal work from some other node (sketched below)
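A rough sketch of one Version 1 node's main loop, under assumptions: localTable and localPool are this node's hash-table slice and task pool, f maps a point to the node that owns it, and RandomOtherNode plus the "steal" tag stand in for whatever stealing protocol is actually used (handling of steal requests and replies is elided).

# Sketch only: merged worker/hash-server node (MPI Version 1).
NodeLoop := function(gens, op, f)
  local msg, t, g, y;
  while true do
    # Drain incoming points; new ones become local work.
    msg := OrbGetMessage(false);           # non-blocking
    while msg <> fail do
      for y in msg do
        if not y in localTable then
          AddSet(localTable, y);
          Add(localPool, y);
        fi;
      od;
      msg := OrbGetMessage(false);
    od;
    # Process one local task, routing images to their owning nodes.
    if Length(localPool) > 0 then
      t := Remove(localPool);
      for g in gens do
        y := op(t, g);
        OrbSendMessage([y], f(y));
      od;
    else
      OrbSendMessage(["steal", processId], RandomOtherNode());
    fi;
  od;
end;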

Page 6: War of the Worlds -- Shared-memory vs. Distributed-memory

MPI Version 1

[Diagram: three MPI nodes, each holding one part of the orbit ({z1,z4,z5,…, zl1}, {z2,z3,z8,…, zl2}, {z6,z7,z9,…, zl3}), its own task pool ([x11,x12,…, x1m1], [x21,x22,…, x2m2], [x31,x32,…, x3m3]) and its own copy of the generators f1–f5.]

Page 7: War of the Worlds -- Shared-memory vs. Distributed-memory

MPI Version 1 is Bad!

• Bad performance, especially for smaller numbers of nodes

• The same process does hash table lookups and applies generator functions to points
– It cannot do both at the same time => something has to wait
– This creates contention

Page 8: War of the Worlds -- Shared-memory vs. Distributed-memory

MPI Version 2

• Separate hash servers and workers, after all

• Hash server nodes
– Keep parts of the hash table
– Also keep parts of the task pool

• Worker nodes just apply generators to points

• Workers obtain work from hash server nodes using work-stealing

Page 9: War of the Worlds -- Shared-memory vs. Distributed-memory

MPI Version 2

[Diagram: worker nodes apply the generators f1–f5 to chunks [x11,x12,…, x1m1], [x21,x22,…, x2m2], [x31,x32,…, x3m3]; hash server nodes own the orbit parts O1 = {z1,z4,z5,…, zl1}, O2 = {z2,z3,z8,…, zl2} and O3 = {z6,z7,z9,…, zl3}, each with its own task pool T1, T2, T3 from which the workers fetch work.]

Page 10: War of the Worlds -- Shared-memory vs. Distributed-memory

MPI Version 2

• Much better performance than MPI Version 1 (on low-latency systems)

• The key is separating hash lookups and generator application into different nodes

Page 11: War of the Worlds -- Shared-memory vs. Distributed-memory

Big Issue with MPI Versions 1 and 2 -- Detecting Termination!

• We need to detect the situation where all of the hash server nodes have empty task pools, and no new work will be produced by the hash servers!
– Even detecting that all task pools are empty and all hash servers and all workers are idle is not enough, as there may be messages in flight that will create more work!
– Woe unto me! What are we to do?

• Good ol’ Dijkstra comes to the rescue - we use a variant of the Dijkstra-Scholten Termination Detection Algorithm

Page 12: War of the Worlds -- Shared-memory vs. Distributed-memory

Termination Detection Algorithm

• Each hash server keeps two counters
– Number of points sent (my_nr_points_sent)
– Number of points received (my_nr_points_rcvd)

• We enumerate the hash servers H0 … Hn

• Hash server H0, when idle, sends a token to hash server H1
– It attaches a token count (my_nr_points_sent, my_nr_points_rcvd) to the token

• When a hash server Hi receives a token
– If it is active (has tasks in its task pool), it sends the token back to H0
– If it is idle, it adds its own counters to the count attached to the token and sends the token to Hi+1 (with Hn wrapping around to H0)
– If the received token count was (pts_sent, pts_rcvd), the new token count is (my_nr_points_sent + pts_sent, my_nr_points_rcvd + pts_rcvd)

• If H0 receives the token back, and the token count (pts_sent, pts_rcvd) satisfies pts_rcvd = num_gens * pts_sent, then termination is detected (see the sketch below)
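In the style of the deck's GAP code, the token handling might look roughly like this; H0, nextServer, myTaskPool, workerIds and the "token"/"abort" tags are illustrative names, and only the counting logic follows the slide.

# Sketch only: run when hash server Hi receives ["token", pts_sent, pts_rcvd].
HandleToken := function(pts_sent, pts_rcvd)
  local w;
  if Length(myTaskPool) > 0 then
    # Active: cancel this round; H0 retries when it is idle again.
    OrbSendMessage(["abort"], H0);
  elif processId = H0 then
    # The token survived a full circuit of idle servers;
    # H0's own counters were attached when the round started.
    if pts_rcvd = num_gens * pts_sent then
      # Counts balance: no queued work anywhere, no points in flight.
      for w in workerIds do
        OrbSendMessage(["finish"], w);   # matches the check in GetWork
      od;
    fi;
  else
    # Idle: fold in our counters and pass the token along the ring.
    OrbSendMessage(["token", pts_sent + my_nr_points_sent,
                             pts_rcvd + my_nr_points_rcvd],
                   nextServer);          # Hi+1, with Hn wrapping to H0
  fi;
end;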

Page 13: War of the Worlds -- Shared-memory vs. Distributed-memory

MPIGAP Code for MPI Version 2

• Not trivial (~400 lines of GAP code)

• Explicit message passing using low-level MPI bindings
– This version is hard to implement using the task abstraction

Page 14: War of the Worlds -- Shared-memory vs. Distributed-memory

MPIGAP Code for MPI Version 2

Worker := function(gens, op, f)
  local g, j, n, m, res, t, x;
  n := nrHashes;
  while true do
    t := GetWork();
    if IsIdenticalObj(t, fail) then return; fi;
    # One output bucket per hash server; points are batched by destination.
    m := QuoInt(Length(t) * Length(gens) * 2, n);
    res := List([1 .. n], x -> EmptyPlist(m));
    for j in [1 .. Length(t)] do
      for g in gens do
        x := op(t[j], g);
        Add(res[f(x)], x);            # f(x) in [1..n] picks the owning server
      od;
    od;
    for j in [1 .. n] do
      if Length(res[j]) > 0 then
        OrbSendMessage(res[j], minHashId + j - 1);
      fi;
    od;
  od;
end;

Page 15: War of the Worlds -- Shared-memory vs. Distributed-memory

MPIGAP Code for MPI Version 2

GetWork := function()
  local msg, tid;
  tid := minHashId;                       # ask the first hash server for work
  OrbSendMessage(["getwork", processId], tid);
  msg := OrbGetMessage(true);             # block until the server replies
  if msg[1] <> "finish" then
    return msg;                           # a chunk of points to process
  else
    return fail;                          # global termination detected
  fi;
end;

Page 16: War of the Worlds -- Shared-memory vs. Distributed-memory

MPIGAP Code for MPI Version 2

OrbGetMessage := function(blocking)
  local test, msg, tmp;
  if blocking then
    test := MPI_Probe();                  # wait for a message
  else
    test := MPI_Iprobe();                 # just check whether one is pending
  fi;
  if test then
    msg := UNIX_MakeString(MPI_Get_count());
    MPI_Recv(msg);
    tmp := DeserializeNativeString(msg);
    return tmp;
  else
    return fail;
  fi;
end;

OrbSendMessage := function(raw, dest)
  local msg;
  msg := SerializeToNativeString(raw);
  MPI_Binsend(msg, dest, Length(msg));
end;
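The deck only shows the worker side. A hash server's main loop might look roughly as follows, matching the message shapes used by Worker and GetWork above (either a list of points, or ["getwork", workerId]); hashTable, taskPool, chunkSize and idleWorkers are assumed names, with a plain sorted list standing in for the real hash table. The my_nr_points_sent / my_nr_points_rcvd updates are the counters the termination algorithm relies on.

# Sketch only: hash server node for MPI Version 2.
HashServer := function()
  local msg, y, k, chunk;
  while true do
    msg := OrbGetMessage(true);
    if msg[1] = "getwork" then
      # A worker asks for a chunk of points to process.
      if Length(taskPool) > 0 then
        k := Minimum(chunkSize, Length(taskPool));
        chunk := taskPool{[1 .. k]};
        taskPool := taskPool{[k + 1 .. Length(taskPool)]};
        OrbSendMessage(chunk, msg[2]);
        my_nr_points_sent := my_nr_points_sent + k;
      else
        Add(idleWorkers, msg[2]);        # answer later, or steal from a peer
      fi;
    else
      # A batch of freshly computed points from a worker.
      my_nr_points_rcvd := my_nr_points_rcvd + Length(msg);
      for y in msg do
        if not y in hashTable then       # keep only previously unseen points
          AddSet(hashTable, y);
          Add(taskPool, y);
        fi;
      od;
    fi;
  od;
end;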

Page 17: War of the Worlds -- Shared-memory vs. Distributed-memory

Work in Progress - Extending MPI Version 2 To Systems With Non-Uniform Latency

• Communication latencies between nodes might be different

• Where to place hash server nodes? And how many?

• How to do work distribution?
– Is work stealing still a good idea in a setting where the communication distance between a worker and different hash servers is not uniform?

• We can look at the shared memory + MPI world as a special case of this
– Multithreading within MPI nodes
– Threads from the same node can communicate fast
– Nodes communicate much more slowly