Transparently Composing CnC Graph Pipelines
on a Cluster
Hongbo Rong, Frank Schlimbach
Programming & Systems Lab (PSL)
Software Systems Group (SSG)
7th Annual Concurrent Collections Workshop
9/8/2015
Problem
A productivity program running on a cluster:
- The programmer is a domain expert, but not a tuning expert
- The program calls distributed libraries
- Library functions are not composable:
  - Black box
  - Independent
  - Context-unaware
  - Barrier at the end
How to compose these non-composable library functions automatically?
2
Flow Graphs
3
Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Traditional (bulk-synchronous parallel) vs. pipelined & asynchronous communication
Basic Idea
User program: as usual, assume sequential, global shared-memory programming
C = A + B
E = C * D
Library: add a CnC graph for each library function
(A, B) -> [+] -> C      (C, D) -> [*] -> E
Compiler/interpreter: compose the corresponding graphs of a sequence of library calls
(A, B) -> [+] -> C, then (C, D) -> [*] -> E; both graphs use the identical memory for C
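The composed pipeline can be sketched outside of CnC as follows. This is an illustrative stand-in (plain Python threads and a hypothetical per-element tile granularity), not the CnC API: the intermediate C lives in one shared buffer, and each step of the second graph fires as soon as its tile of C exists, with no copy of C and no global barrier between the two graphs.

```python
from concurrent.futures import ThreadPoolExecutor

A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
D = [2, 2, 2, 2]
C = [None] * 4   # shared buffer: stand-in for the intermediate item collection
E = [None] * 4

def add_step(i):
    # graph 1: one step per tile of C = A + B
    C[i] = A[i] + B[i]
    return i

def mul_step(i):
    # graph 2: consumes C[i] in place; no copy or broadcast of C
    E[i] = C[i] * D[i]

with ThreadPoolExecutor() as pool:
    for i in range(4):
        f = pool.submit(add_step, i)
        # per-tile dependency: the multiply fires when its own tile is ready
        f.add_done_callback(lambda fut: mul_step(fut.result()))
```

The only synchronization is the per-tile dependency C[i] -> E[i]; there is no point where all of C must exist at once.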
Basic Idea
Execution: let CnC do the distribution:
mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
Hello World
User program:
dgemm(A, B, C)  # C = A*B
dgemm(C, D, E)  # E = C*D

Leverage the library:
             Graph 1                       Graph 2
Process 1:   (A1, B) -> Multiply -> C1;    (C1, D) -> Multiply -> E1
...
Process 100: (A100, B) -> Multiply -> C100; (C100, D) -> Multiply -> E100
             all Ei together form E
No barrier/copy/message between graphs/steps unless required; e.g., no bcast/gather of C.
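A per-row sketch of the two composed dgemm graphs (plain Python with a hypothetical row granularity, not the CnC API): each "process" owns a block of rows of A, and the row of C it produces feeds the second Multiply directly, so C is never broadcast or gathered.

```python
def matmul_row(row, M):
    # one Multiply step: (row of left operand, full right operand) -> row of product
    return [sum(row[k] * M[k][j] for k in range(len(row)))
            for j in range(len(M[0]))]

A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]   # identity, so C == A
D = [[2, 0], [0, 2]]   # scales by 2, so E == 2*A

C, E = [], []
for a_row in A:                      # "process i" owns row i of A
    c_row = matmul_row(a_row, B)     # graph 1: one row of C = A*B
    C.append(c_row)
    E.append(matmul_row(c_row, D))   # graph 2 fires per row: E = C*D
```

Each row of C is consumed where it was produced; only the final E would need to be assembled.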
Code skeleton
6
User code:
dgemm(A, B, C)
dgemm(C, D, E)

The compiler rewrites the user code to:
initialize_CnC()
dgemm_dgemm(A, B, C, D, E)
finalize_CnC()

Context (compiler-generated):
struct dgemm_dgemm_context {
    item_collection *C_collection;
    tuner row_tuner, col_tuner;
    Graph *graph1, *graph2;
    dgemm_dgemm_context(A, B, C, D, E) {
        create C_collection
        graph1 = make_dgemm_graph(A, B, C);
        graph2 = make_dgemm_graph(C, D, E);
    }
}

Interface (compiler-generated):
void dgemm_dgemm(A, B, C, D, E) {
    dgemm_dgemm_context ctxt(A, B, C, D, E);
    ctxt.graph1->start();
    ctxt.graph2->start();
    ctxt.wait();
    ctxt.graph2->copyout();
}

Domain-expert written, in the host language (C):
class dgemm_graph {
    tuner *tunerA, *tunerB, *tunerC, *tunerS;
    item_collection *A_collection, *B_collection, *C_collection;
    tag_collection tags;
    step_collection *multiply_steps;
    dgemm_graph(_A, _B, _C) {
        create A/B/C_collection based on A/B/C
        define dataflow graph
    }
}
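The skeleton above is pseudocode. A minimal runnable mimic in Python (dicts standing in for item collections; the names mirror the slide, everything else is an assumption of this sketch) shows the key wiring: the context hands graph2 the very collection object that graph1 writes, so C is shared rather than copied between the graphs.

```python
class DgemmGraph:
    # Stand-in for the domain-expert graph: C = A * B, one step per row.
    # Collections are dicts keyed by row index.
    def __init__(self, A_coll, B, C_coll):
        self.A, self.B, self.C = A_coll, B, C_coll

    def start(self):
        for i, row in self.A.items():
            self.C[i] = [sum(row[k] * self.B[k][j] for k in range(len(row)))
                         for j in range(len(self.B[0]))]

class DgemmDgemmContext:
    # Mimics the compiler-generated context: graph2 reads the same
    # C_collection object that graph1 writes, so there is no copy of C.
    def __init__(self, A, B, D):
        self.C_collection = {}
        self.E_collection = {}
        self.graph1 = DgemmGraph(A, B, self.C_collection)
        self.graph2 = DgemmGraph(self.C_collection, D, self.E_collection)

def dgemm_dgemm(A, B, D):
    # Mimics the compiler-generated interface: start both graphs, return E.
    ctxt = DgemmDgemmContext(A, B, D)
    ctxt.graph1.start()
    ctxt.graph2.start()
    return ctxt.E_collection
```

Here A is passed as `{row_index: row}`. Real CnC graphs would run their steps concurrently and `wait()` on the context; this sequential mimic elides that.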
Key points
Compiler:
- Generates a context and an interface for a dataflow
- Connects expert-written graphs into a pipeline
- Minimizes communication with step tuners (static scheduling) and item collection tuners (static data distribution)
Domain-expert-written graphs:
- High-level algorithms for library functions
- Input/output collections can be from outside
7
Advantages
- Useful for any language
- Mature work in compilers/interpreters
  - Dataflow analysis, pattern matching, code replacement
- Extends a scripting language to distributed computing implicitly
  - Transparent to users
  - Transparent to the language
  - Transparent to libraries
- Heavy lifting done in CnC and in graph writing by domain experts
8
Open questions
- Minimize communication
  - Item collections: consumed_on
  - Step collections: computed_on
- Scalability
- Applications: there might not be many long sequences of library calls
9
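The tuner hooks named above can be thought of as pure functions of a step or item tag. A toy Python sketch (the block-row distribution and fixed process count are assumptions of this sketch, not the CnC API):

```python
NPROCS = 4     # assumed process count
NROWS = 100    # assumed number of row tiles

def computed_on(row_tag):
    # step tuner: static scheduling. The step for row i runs on the
    # rank that owns row i under a block distribution.
    return row_tag * NPROCS // NROWS

def consumed_on(row_tag):
    # item tuner: static data distribution. Row i of C goes only to
    # the rank(s) that consume it, so no broadcast/gather is needed.
    return [computed_on(row_tag)]
```

When the producer's and consumer's answers coincide, as here, the item never leaves its rank, which is exactly the no-bcast/no-gather property the Hello World example relies on.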