Transparently Composing CnC Graph Pipelines on a Cluster

Hongbo Rong, Frank Schlimbach

Programming & Systems Lab (PSL), Software Systems Group (SSG)

7th Annual Concurrent Collections Workshop, 9/8/2015

Problem

A productivity program running on a cluster

The programmer is a domain expert, but not a tuning expert

Call distributed libraries

Library functions are not composable

Black box

Independent

Context-unaware

Barrier at the end

How to compose these non-composable library functions automatically?

Flow Graphs

Traditional: Bulk-synchronous Parallel

[Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel]

Pipelined & asynchronous

[Figure label: Communication]
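The contrast between the two flow graphs can be sketched with a toy timing model. This is only an illustration under stated assumptions, not a measurement or anything taken from the talk: each process runs two stages, stage 2 on a process depends only on that process's own stage 1 output, communication cost is ignored, and both input vectors are non-empty and of equal length. The function names `bsp_time` and `pipelined_time` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bulk-synchronous: a barrier separates the stages, so stage 2 starts only
// after the slowest process has finished stage 1.
double bsp_time(const std::vector<double>& stage1, const std::vector<double>& stage2) {
    return *std::max_element(stage1.begin(), stage1.end()) +
           *std::max_element(stage2.begin(), stage2.end());
}

// Pipelined and asynchronous: each process starts stage 2 as soon as its own
// stage 1 is done, so the total is the slowest per-process sum.
double pipelined_time(const std::vector<double>& stage1, const std::vector<double>& stage2) {
    double total = 0.0;
    const std::size_t n = std::min(stage1.size(), stage2.size());
    for (std::size_t i = 0; i < n; ++i) {
        total = std::max(total, stage1[i] + stage2[i]);
    }
    return total;
}
```

With per-process stage times {3, 1} and {1, 3}, the barrier version takes 3 + 3 = 6 time units while the pipelined version takes max(3 + 1, 1 + 3) = 4.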

Basic Idea

User program: As usual, assume sequential, global shared-memory programming

C = A + B

E = C * D

Library: Add a CnC graph for each library function

[Figure: one graph with inputs A, B and output C; one graph with inputs C, D and output E]

Compiler/interpreter: Compose the corresponding graphs of a sequence of library calls

Both graphs use the identical memory for C

[Figure: composed graph in which A and B produce C, and C and D produce E]

Execution: Let CnC do the distribution

mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl

mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py

mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
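The composition step can be illustrated with a small, library-free sketch. It assumes that `C = A + B` is an element-wise matrix sum and `E = C * D` a matrix product, and it processes one row at a time so that each row of C is consumed right after it is produced, in the same storage. The helper names (`add_row`, `mul_row`, `add_then_mul`) are hypothetical and are not part of CnC.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;
using Matrix = std::vector<Row>;

// Stage 1: one row of C = A + B.
Row add_row(const Row& a, const Row& b) {
    Row c(a.size());
    for (std::size_t j = 0; j < a.size(); ++j) c[j] = a[j] + b[j];
    return c;
}

// Stage 2: one row of E = C * D (row vector times matrix).
Row mul_row(const Row& c, const Matrix& d) {
    Row e(d.empty() ? 0 : d[0].size(), 0.0);
    for (std::size_t k = 0; k < c.size(); ++k)
        for (std::size_t j = 0; j < e.size(); ++j) e[j] += c[k] * d[k][j];
    return e;
}

// Composed pipeline: row i of C feeds stage 2 immediately, so no point in
// the program requires all of C to be complete first.
Matrix add_then_mul(const Matrix& a, const Matrix& b, const Matrix& d, Matrix& c) {
    Matrix e(a.size());
    c.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = add_row(a[i], b[i]);  // first graph writes row i of C
        e[i] = mul_row(c[i], d);     // second graph reads the same storage
    }
    return e;
}
```

The sequential loop stands in for what a dataflow runtime would schedule asynchronously; the point is only that the intermediate C is shared rather than copied between the two operations.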

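Hello World

User program:

dgemm(A, B, C ) # C = A*B

dgemm(C, D, E ) # E = C*D

[Figure: Graph 1 and Graph 2 chained on each of Process 1 through Process 100. On Process 1, A1* and B feed a Multiply step (Graph 1) that produces C1*; C1* and D feed a Multiply step (Graph 2) that produces E1*. Process 100 does the same with A100*, C100*, and E100*. The blocks E1* through E100* make up E.]

No barrier/copy/msg between graphs/steps unless required. E.g., no bcast/gather of C.

Leverage the library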
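The per-process pipeline in the figure can be mimicked in a single program. The sketch below is an assumption-laden stand-in: a plain loop over ranks replaces real processes, a triple-loop `matmul` replaces a tuned library dgemm, rows of A are split into contiguous blocks, and `nprocs` is at least 1. Each rank multiplies its block of A by B, then multiplies the resulting block of C by D, and only the blocks of E are collected. The names `matmul` and `pipelined_dgemm_dgemm` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Plain triple-loop product, standing in for a tuned library dgemm.
Matrix matmul(const Matrix& x, const Matrix& y) {
    const std::size_t n = x.size(), m = y.size();
    const std::size_t p = y.empty() ? 0 : y[0].size();
    Matrix z(n, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < m; ++k)
            for (std::size_t j = 0; j < p; ++j) z[i][j] += x[i][k] * y[k][j];
    return z;
}

// E = (A * B) * D, computed block by block. The loop over `rank` simulates
// independent processes; the block of C never leaves the rank that made it.
Matrix pipelined_dgemm_dgemm(const Matrix& a, const Matrix& b, const Matrix& d,
                             std::size_t nprocs) {
    Matrix e;
    const std::size_t n = a.size();
    for (std::size_t rank = 0; rank < nprocs; ++rank) {
        const auto begin = static_cast<std::ptrdiff_t>(rank * n / nprocs);
        const auto end = static_cast<std::ptrdiff_t>((rank + 1) * n / nprocs);
        Matrix a_block(a.begin() + begin, a.begin() + end);
        Matrix c_block = matmul(a_block, b);  // Graph 1 on this rank
        Matrix e_block = matmul(c_block, d);  // Graph 2 reuses the local C block
        e.insert(e.end(), e_block.begin(), e_block.end());  // only E is gathered
    }
    return e;
}
```

Because every row of E depends on a single row of C, the row-block split needs no broadcast or gather of C, which is the property the slide highlights.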

Code skeleton

User code:

    dgemm(A, B, C)
    dgemm(C, D, E)

Compiler

User code generated by the compiler:

    initialize_CnC()
    dgemm_dgemm(A, B, C, D, E)
    finalize_CnC()

Context:

    struct dgemm_dgemm_context {
        item_collection *C_collection;
        tuner row_tuner, col_tuner;
        Graph *graph1, *graph2;

        dgemm_dgemm_context(A, B, C, D, E) {
            // create C_collection
            graph1 = make_dgemm_graph(A, B, C);
            graph2 = make_dgemm_graph(C, D, E);
        }
    };

Interface:

    void dgemm_dgemm(A, B, C, D, E) {
        dgemm_dgemm_context ctxt(A, B, C, D, E);
        ctxt.graph1->start();
        ctxt.graph2->start();
        ctxt.wait();
        ctxt.graph2->copyout();
    }

Domain-expert-written graph:

    class dgemm_graph {
        tuner *tunerA, *tunerB, *tunerC, *tunerS;
        item_collection *A_collection, *B_collection, *C_collection;
        tag_collection tags;
        step_collection *multiply_steps;

        dgemm_graph(_A, _B, _C) {
            // create A/B/C_collection based on A/B/C
            // define dataflow graph
        }
    };

[Side labels on the slide: Host language; C]

Key points

Compiler

Generates a context and an interface for a dataflow

Connects expert-written graphs into a pipeline

Minimizes communication with step tuners (static scheduling) and item collection tuners (static data distribution)

Domain-expert written graphs

High-level algorithms for library functions

Input/output collections can be from outside
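The compiler's job of connecting graphs can be pictured as pattern matching over the call sequence. The sketch below is a simplified assumption of how such a pass might look, not the authors' compiler: calls are represented as a flat list, a `dgemm` call is assumed to have the argument order (input, input, output), and two adjacent `dgemm` calls are fused when the output of the first is the first input of the second. The types and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A library call as the compiler sees it: a function name and argument names.
struct Call {
    std::string name;
    std::vector<std::string> args;  // for dgemm: {in1, in2, out}
};

// Replace each adjacent pair dgemm(x, y, z); dgemm(z, w, v) with a single
// dgemm_dgemm(x, y, z, w, v) call; all other calls pass through unchanged.
std::vector<Call> fuse_dgemm_pairs(const std::vector<Call>& calls) {
    std::vector<Call> out;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        const Call& first = calls[i];
        if (i + 1 < calls.size()) {
            const Call& second = calls[i + 1];
            const bool both_dgemm = first.name == "dgemm" && second.name == "dgemm" &&
                                    first.args.size() == 3 && second.args.size() == 3;
            if (both_dgemm && second.args[0] == first.args[2]) {
                Call fused{"dgemm_dgemm",
                           {first.args[0], first.args[1], first.args[2],
                            second.args[1], second.args[2]}};
                out.push_back(fused);
                ++i;  // the second call has been absorbed into the fused one
                continue;
            }
        }
        out.push_back(first);
    }
    return out;
}
```

A real pass would also need dataflow analysis to confirm that nothing else reads or writes the intermediate between the two calls; this sketch checks adjacency only.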

Advantages

Useful for any language

Mature work in compiler/interpreter

– Dataflow analysis, pattern matching, code replacement

Extends a scripting language to distributed computing implicitly

Transparent to users

Transparent to the language

Transparent to libraries

Heavy lifting done in CnC and graph writing by domain experts

Open questions

Minimize communication

Item collections: consumed_on

Step collections: computed_on

Scalability

Applications

There might not be many long sequences of library calls
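The communication question can be made concrete with a small model. The structs below are hypothetical stand-ins that borrow the slide's names `consumed_on` and `computed_on`; they are not the CnC tuner classes, and the block distribution is an assumption. The sketch counts how many rows of an intermediate matrix would have to move between processes when the producing step and the consuming item placement disagree.

```cpp
#include <cassert>

// Hypothetical block distribution of `rows` rows over `nprocs` processes.
struct RowBlockDistribution {
    int rows;
    int nprocs;
    int owner(int row) const {
        const int block = (rows + nprocs - 1) / nprocs;  // rows per process, rounded up
        return row / block;
    }
};

// Stand-in for a step tuner: where the step that produces a row runs.
struct StepTuner {
    RowBlockDistribution dist;
    int computed_on(int row) const { return dist.owner(row); }
};

// Stand-in for an item collection tuner: where a row is needed next.
struct ItemTuner {
    RowBlockDistribution dist;
    int consumed_on(int row) const { return dist.owner(row); }
};

// Rows that are produced on one process but consumed on another.
int remote_transfers(const StepTuner& producer, const ItemTuner& consumer, int rows) {
    int count = 0;
    for (int r = 0; r < rows; ++r) {
        if (producer.computed_on(r) != consumer.consumed_on(r)) ++count;
    }
    return count;
}
```

When both tuners use the same distribution the count is zero, which corresponds to the no-copy pipeline in the Hello World example; mismatched distributions make the count grow with the number of rows.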
