Transparently Composing CnC Graph Pipelines
on a Cluster
Hongbo Rong, Frank Schlimbach
Programming & Systems Lab (PSL)
Software Systems Group (SSG)
7th Annual Concurrent Collections Workshop
9/8/2015
Problem
A productivity program running on a cluster:
- The programmer is a domain expert, but not a tuning expert
- The program calls distributed libraries
- Library functions are not composable:
  - Black box
  - Independent
  - Context-unaware
  - Barrier at the end
How to compose these non-composable library functions automatically?
2
Flow Graphs
3
Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Traditional (bulk-synchronous parallel) vs. pipelined & asynchronous communication
Basic Idea
User program: as usual, assume sequential, global shared-memory programming
C = A + B
E = C * D
Library: add a CnC graph for each library function
(A, B) -> [+] -> C      (C, D) -> [*] -> E
Compiler/interpreter: compose the corresponding graphs of a sequence of library calls
(A, B) -> [+] -> C, then (C, D) -> [*] -> E; both graphs use the identical memory for C
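The composed pipeline can be sketched outside of CnC as follows. This is an illustrative stand-in (plain Python threads and a hypothetical per-element tile granularity), not the CnC API: the intermediate C lives in one shared buffer, and each step of the second graph fires as soon as its tile of C exists, with no copy of C and no global barrier between the two graphs.

```python
from concurrent.futures import ThreadPoolExecutor

A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
D = [2, 2, 2, 2]
C = [None] * 4   # shared buffer: stand-in for the intermediate item collection
E = [None] * 4

def add_step(i):
    # graph 1: one step per tile of C = A + B
    C[i] = A[i] + B[i]
    return i

def mul_step(i):
    # graph 2: consumes C[i] in place; no copy or broadcast of C
    E[i] = C[i] * D[i]

with ThreadPoolExecutor() as pool:
    for i in range(4):
        f = pool.submit(add_step, i)
        # per-tile dependency: the multiply fires when its own tile is ready
        f.add_done_callback(lambda fut: mul_step(fut.result()))
```

The only synchronization is the per-tile dependency C[i] -> E[i]; there is no point where all of C must exist at once.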
Basic Idea
Execution: let CnC do the distribution:
mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
Hello World
User program:
dgemm(A, B, C)  # C = A*B
dgemm(C, D, E)  # E = C*D

Leverage the library:
             Graph 1                       Graph 2
Process 1:   (A1, B) -> Multiply -> C1;    (C1, D) -> Multiply -> E1
...
Process 100: (A100, B) -> Multiply -> C100; (C100, D) -> Multiply -> E100
             all Ei together form E
No barrier/copy/message between graphs/steps unless required; e.g., no bcast/gather of C.
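A per-row sketch of the two composed dgemm graphs (plain Python with a hypothetical row granularity, not the CnC API): each "process" owns a block of rows of A, and the row of C it produces feeds the second Multiply directly, so C is never broadcast or gathered.

```python
def matmul_row(row, M):
    # one Multiply step: (row of left operand, full right operand) -> row of product
    return [sum(row[k] * M[k][j] for k in range(len(row)))
            for j in range(len(M[0]))]

A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]   # identity, so C == A
D = [[2, 0], [0, 2]]   # scales by 2, so E == 2*A

C, E = [], []
for a_row in A:                      # "process i" owns row i of A
    c_row = matmul_row(a_row, B)     # graph 1: one row of C = A*B
    C.append(c_row)
    E.append(matmul_row(c_row, D))   # graph 2 fires per row: E = C*D
```

Each row of C is consumed where it was produced; only the final E would need to be assembled.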
Code skeleton
6
User code:
dgemm(A, B, C)
dgemm(C, D, E)

The compiler rewrites the user code to:
initialize_CnC()
dgemm_dgemm(A, B, C, D, E)
finalize_CnC()

Context (compiler-generated):
struct dgemm_dgemm_context {
    item_collection *C_collection;
    tuner row_tuner, col_tuner;
    Graph *graph1, *graph2;
    dgemm_dgemm_context(A, B, C, D, E) {
        create C_collection
        graph1 = make_dgemm_graph(A, B, C);
        graph2 = make_dgemm_graph(C, D, E);
    }
}

Interface (compiler-generated):
void dgemm_dgemm(A, B, C, D, E) {
    dgemm_dgemm_context ctxt(A, B, C, D, E);
    ctxt.graph1->start();
    ctxt.graph2->start();
    ctxt.wait();
    ctxt.graph2->copyout();
}

Domain-expert written, in the host language (C):
class dgemm_graph {
    tuner *tunerA, *tunerB, *tunerC, *tunerS;
    item_collection *A_collection, *B_collection, *C_collection;
    tag_collection tags;
    step_collection *multiply_steps;
    dgemm_graph(_A, _B, _C) {
        create A/B/C_collection based on A/B/C
        define dataflow graph
    }
}
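The skeleton above is pseudocode. A minimal runnable mimic in Python (dicts standing in for item collections; the names mirror the slide, everything else is an assumption of this sketch) shows the key wiring: the context hands graph2 the very collection object that graph1 writes, so C is shared rather than copied between the graphs.

```python
class DgemmGraph:
    # Stand-in for the domain-expert graph: C = A * B, one step per row.
    # Collections are dicts keyed by row index.
    def __init__(self, A_coll, B, C_coll):
        self.A, self.B, self.C = A_coll, B, C_coll

    def start(self):
        for i, row in self.A.items():
            self.C[i] = [sum(row[k] * self.B[k][j] for k in range(len(row)))
                         for j in range(len(self.B[0]))]

class DgemmDgemmContext:
    # Mimics the compiler-generated context: graph2 reads the same
    # C_collection object that graph1 writes, so there is no copy of C.
    def __init__(self, A, B, D):
        self.C_collection = {}
        self.E_collection = {}
        self.graph1 = DgemmGraph(A, B, self.C_collection)
        self.graph2 = DgemmGraph(self.C_collection, D, self.E_collection)

def dgemm_dgemm(A, B, D):
    # Mimics the compiler-generated interface: start both graphs, return E.
    ctxt = DgemmDgemmContext(A, B, D)
    ctxt.graph1.start()
    ctxt.graph2.start()
    return ctxt.E_collection
```

Here A is passed as `{row_index: row}`. Real CnC graphs would run their steps concurrently and `wait()` on the context; this sequential mimic elides that.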
Key points
Compiler:
- Generates a context and an interface for a dataflow
- Connects expert-written graphs into a pipeline
- Minimizes communication with step tuners (static scheduling) and item collection tuners (static data distribution)
Domain-expert-written graphs:
- High-level algorithms for library functions
- Input/output collections can be from outside
7
Advantages
- Useful for any language
- Mature work in compilers/interpreters
  - Dataflow analysis, pattern matching, code replacement
- Extends a scripting language to distributed computing implicitly
  - Transparent to users
  - Transparent to the language
  - Transparent to libraries
- Heavy lifting done in CnC and in graph writing by domain experts
8
Open questions
- Minimize communication
  - Item collections: consumed_on
  - Step collections: computed_on
- Scalability
- Applications: there might not be many long sequences of library calls
9
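The tuner hooks named above can be thought of as pure functions of a step or item tag. A toy Python sketch (the block-row distribution and fixed process count are assumptions of this sketch, not the CnC API):

```python
NPROCS = 4     # assumed process count
NROWS = 100    # assumed number of row tiles

def computed_on(row_tag):
    # step tuner: static scheduling. The step for row i runs on the
    # rank that owns row i under a block distribution.
    return row_tag * NPROCS // NROWS

def consumed_on(row_tag):
    # item tuner: static data distribution. Row i of C goes only to
    # the rank(s) that consume it, so no broadcast/gather is needed.
    return [computed_on(row_tag)]
```

When the producer's and consumer's answers coincide, as here, the item never leaves its rank, which is exactly the no-bcast/no-gather property the Hello World example relies on.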