PSWEEP: A Lightweight Pattern for Distributed Computational Experiments
Christopher Mueller and Andrew Lumsdaine
Open Systems Lab, Indiana University
Introduction
Parameter Sweeps are common cluster applications
Approaches
  - Scripts (sh, perl: ssh, mpi)
  - Low-level applications (C++, Fortran: MPI)
  - Parameter sweep applications (e.g., Nimrod)
Problems
  - Custom solutions become tangled quickly
  - Applications are not available on all platforms
How do we use our clusters?

Job ID          Username Queue    Jobname    SessID NDS TSK Memory  Time S  Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
882576.aviss.av silin    iq       SL_DBJ014Q  14636   2   4     -- 200:0 R 109:4
890917.aviss.av baikgrp  bg       DA_NPJ001V  27673   1   2     -- 168:0 R 83:32
890932.aviss.av baikgrp  bg       DA_NPJ002V  18006   1   2     -- 168:0 R 87:31
959929.aviss.av rllord   iq       RL1_NCQ02V  11982   1   2     -- 120:0 R 56:27
960044.aviss.av shawnli  bg       Hairy2b     13703   1   1     -- 100:0 R 42:52
960045.aviss.av shawnli  bg       Xxbp1       21294   1   1     -- 100:0 R 42:51
960046.aviss.av shawnli  bg       Foxa1       15908   1   1     -- 100:0 R 42:49
960047.aviss.av shawnli  bg       Foxa2       19881   1   1     -- 100:0 R 42:49
960048.aviss.av shawnli  bg       Foxd3       19073   1   1     -- 100:0 R 42:49
960050.aviss.av shawnli  bg       Gsc         20886   1   1     -- 100:0 R 42:04
960215.aviss.av shawnli  bg       Foxa1mamma  18296   1   1     -- 100:0 R 35:23
960216.aviss.av shawnli  bg       Foxa2mamma  14926   1   1     -- 100:0 R 34:43
960217.aviss.av shawnli  bg       Foxd3mamma  15016   1   1     -- 100:0 R 34:43
960218.aviss.av shawnli  bg       Gata4mamma   7421   1   1     -- 100:0 R 33:11
960220.aviss.av shawnli  bg       Glimammal    7525   1   1     -- 100:0 R 33:11
960221.aviss.av shawnli  bg       Gscmammal   16626   1   1     -- 100:0 R 33:03
960222.aviss.av shawnli  bg       Hairy2bmam  16760   1   1     -- 100:0 R 33:03
960224.aviss.av shawnli  bg       Hoxd1mamma  32101   1   1     -- 100:0 R 33:01
960225.aviss.av shawnli  bg       Mixermamma  27958   1   1     -- 100:0 R 32:09
960279.aviss.av dkberry  mdgrape  run13_07m    5570   1   1     -- 36:00 R 17:04
960283.aviss.av dbaronia iq       batch.sh    23862   3   6     -- 24:00 R 22:41
960426.aviss.av cwillenb bg       CWOA_005    18980   1   1     -- 100:0 R 04:52
960428.aviss.av cwillenb bg       CWOA_006a    1941   1   1     -- 100:0 R 04:52
960429.aviss.av cwillenb bg       CWOA_007       --   1   1     -- 100:0 Q    --
960430.aviss.av cwillenb bg       CWOA_008       --   1   1     -- 100:0 Q    --
960431.aviss.av cwillenb bg       CWOA_009       --   1   1     -- 100:0 Q    --
960432.aviss.av cwillenb bg       CWOA_010       --   1   1     -- 100:0 Q    --
960433.aviss.av cwillenb bg       CWOA_011       --   1   1     -- 100:0 Q    --
960434.aviss.av cwillenb bg       CWOA_012       --   1   1     -- 100:0 Q    --
963115.aviss.av xsong    bg       par.241        --   8  16     -- 24:00 Q    --
963116.aviss.av xsong    bg       par.242        --   8  16     -- 24:00 Q    --
963121.aviss.av xsong    bg       par.53.7       --   8  16     -- 02:00 Q    --
963122.aviss.av xsong    bg       par.53.8       --  16  32     -- 02:00 Q    --
963133.aviss.av honfan   iq       HF_MJ370    23299   3   6     -- 120:0 R 07:13
963167.aviss.av whpitcoc iq       WP_C572_L0  30829   1   2     -- 24:00 R 01:11
963171.aviss.av whpitcoc iq       WP_C572_L0  17995   1   2     -- 24:00 R 01:11
963186.aviss.av whpitcoc iq       WP_C572_TS   5235   1   2     -- 24:00 R 00:08
963187.aviss.av whpitcoc iq       WP_C572_TS  25746   1   2     -- 24:00 R 00:09
963188.aviss.av whpitcoc iq       WP_C572_TS  13846   1   2     -- 24:00 R 00:09
963189.aviss.av whpitcoc iq       WP_C572_TS  26613   1   2     -- 24:00 R 00:08
Anatomy of a Parameter Sweep
    for i in range(rank, n, size):
        if process: load_image(i)
        elif stats: query_image(i)

        for j in [1, 2, 4, 8]:
            if process: time(i, j)

            for k in ['motion', 'gaussian']:
                if process: process_image(i,j,k)
                elif stats: image_stats(i,j,k)
                else:
                    print 'ssh n%d run %d %d' % (i, j, k)

                if process: clear_process(k)
                elif bgi: clear_temp(k)

        if process: unload_image(i)
Parameters and Enumeration Order
* Resource distribution is handled by the execution environment, e.g., mpirun.
Anatomy of a Parameter Sweep
Tasks and Experiments
    for i in range(rank, n, size):
        if process: load_image(i)
        elif stats: query_image(i)

        for j in [1, 2, 4, 8]:
            if process: time(i, j)

            for k in ['motion', 'gaussian']:
                if process: process_image(i,j,k)
                elif stats: image_stats(i,j,k)
                else:
                    print 'ssh n%d run %d %d' % (i, j, k)

                if process: clear_process(k)
                elif bgi: clear_temp(k)

        if process: unload_image(i)
Anatomy of a Parameter Sweep
Artifacts and Errors
    for i in range(rank, n, size):
        if process: load_image(i)
        elif stats: query_image(i)

        for j in [1, 2, 4, 8]:
            if process: time(i, j)

            for k in ['motion', 'gaussian']:
                if process: process_image(i,j,k)
                elif stats: image_stats(i,j,k)
                else:
                    print 'ssh n%d run %d %d' % (i, j, k)

                if process: clear_process(k)
                elif bgi: clear_temp(k)

        if process: unload_image(i)
User’s View
Experiments
  - process: load_image() / unload_image(), time(), process_image(), clear_process()
  - stats: query_image(), image_stats()
  - script gen: print …
Parameters: [i, j, k]
  - i: [0, n]
  - j: [.01, .1, 1.0]
  - k: [10, 12, 14]
  - Enumerated states: … (0, 0.01, 10), (0, 0.01, 12), (0, 0.01, 14), (0, 0.1, 10), (0, 0.1, 12) …
Resources
The PSWEEP Pattern
Abstracting the Loops
Parameter. A Parameter is an iterator or container that supplies the values for a variable in the experiment.
Enumerator. The Enumerator takes an ordered list of parameters and enumerates every combination of their values in lexicographic order.
State. The state contains the current value of each parameter, in order.
    i = ['house.jpg', 'lena.jpg']
    j = [1, 2, 4, 8]
    k = ['motion', 'gaussian']

    params = [i, j, k]
    e = enumerator(params)

    for state in e: process_image(state)
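The enumeration above can be sketched with the standard library. This is a minimal illustration, not the PSWEEP implementation: it assumes that `enumerator` only needs to produce the states, and it uses `itertools.product`, which yields combinations in lexicographic order with the last parameter varying fastest, just like the innermost loop of a nested-loop sweep.

```python
from itertools import product


def enumerator(params):
    """Yield each state as a list holding one value per parameter.

    Assumption for this sketch: parameters are finite iterables.
    itertools.product varies the last parameter fastest, which matches
    the order of the equivalent nested loops.
    """
    for combination in product(*params):
        yield list(combination)


i = ['house.jpg', 'lena.jpg']
j = [1, 2, 4, 8]
k = ['motion', 'gaussian']

states = list(enumerator([i, j, k]))
print(len(states))   # 2 * 4 * 2 = 16 states
print(states[0])     # ['house.jpg', 1, 'motion']
print(states[1])     # ['house.jpg', 1, 'gaussian']
```

Because the state is an ordinary list, the body of the sweep can receive it as a single argument, as `process_image(state)` does on the slide.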
Abstracting the Experiments
Task. A Task is any unit of work performed when a parameter value changes. A Task is subdivided into setup and cleanup operations, corresponding to the work done at the beginning and end of a block of code in a loop, respectively.
Experiment. An Experiment is a collection of tasks.
    def PrepareImage(state, img):
        # Setup
        db_load(img, './current.jpg')
        yield  # suspend the function
        # Cleanup
        delete('./current.jpg')

    def ProcessImage(state, alg):
        data = load('./current.jpg')
        img = process(data, alg(value))
        save(img, str(state) + '.jpg')

        return  # no cleanup
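One way to drive such tasks is to treat the code before `yield` as setup and the code after it as cleanup. The sketch below is an assumption about how this could work, not the PSWEEP API: the `Task` wrapper and the logging stand-ins `prepare_image` and `process_image` are hypothetical names introduced for illustration.

```python
import inspect


class Task:
    """Hypothetical wrapper: split a task function into setup and cleanup."""

    def __init__(self, func):
        self.func = func
        self._pending = None  # suspended generator awaiting cleanup

    def setup(self, state, value):
        result = self.func(state, value)
        if inspect.isgenerator(result):
            next(result, None)       # run the body up to the first yield
            self._pending = result
        # A plain function has already run to completion: setup only.

    def cleanup(self):
        gen, self._pending = self._pending, None
        if gen is not None:
            next(gen, None)          # resume after the yield and finish


log = []


def prepare_image(state, img):
    log.append('load ' + img)        # setup
    yield                            # suspend until cleanup
    log.append('delete ' + img)      # cleanup


def process_image(state, alg):
    log.append('process ' + alg)     # no cleanup


prepare = Task(prepare_image)
process = Task(process_image)

prepare.setup(None, 'house.jpg')
process.setup(None, 'motion')
process.cleanup()                    # no-op: nothing is suspended
prepare.cleanup()
print(log)
```

Passing a default to `next` keeps the `StopIteration` raised by a finished generator from escaping into the caller.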
Binding Experiments to State
Bound Task Semantics. Tasks must execute in the same order as they would if the parameter sweep were expanded into nested loops.
    for img in images:
        PrepareImage.setup(img)
        for alg in algs:
            ProcessImage.setup(alg)
        PrepareImage.cleanup(img)

    e = enumerator([images, algs])
    e.bind(images, PrepareImage)
    e.bind(algs, ProcessImage)

    for state in e: pass
These examples are equivalent.
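The equivalence can be checked with a small sketch. It rests on several assumptions: the `Enumerator` class, its `bind(param, task)` method, and `LoggingTask` are hypothetical stand-ins written for this illustration, parameters are assumed to be sequences, and every task is given both a setup and a cleanup so the two orderings can be compared. When the value at some loop depth changes, the enumerator cleans up from the innermost task out to that depth, then sets up from that depth back in.

```python
from itertools import product


class LoggingTask:
    """Hypothetical task that records its setup and cleanup calls."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def setup(self, value):
        self.log.append((self.name, 'setup', value))

    def cleanup(self, value):
        self.log.append((self.name, 'cleanup', value))


class Enumerator:
    """Sketch of an enumerator that honors bound task semantics."""

    def __init__(self, params):
        self.params = list(params)
        self.tasks = {}                      # loop depth -> bound task

    def bind(self, param, task):
        for depth, candidate in enumerate(self.params):
            if candidate is param:
                self.tasks[depth] = task
                return
        raise ValueError('parameter is not part of this sweep')

    def _cleanup(self, index, outermost):
        # Innermost task first, as at the end of nested loop bodies.
        for depth in range(len(self.params) - 1, outermost - 1, -1):
            if depth in self.tasks:
                self.tasks[depth].cleanup(self.params[depth][index[depth]])

    def __iter__(self):
        n, prev = len(self.params), None
        for index in product(*(range(len(p)) for p in self.params)):
            changed = 0
            if prev is not None:
                changed = next(d for d in range(n) if index[d] != prev[d])
                self._cleanup(prev, changed)
            for depth in range(changed, n):
                if depth in self.tasks:
                    self.tasks[depth].setup(self.params[depth][index[depth]])
            prev = index
            yield [self.params[d][index[d]] for d in range(n)]
        if prev is not None:
            self._cleanup(prev, 0)


images = ['house.jpg', 'lena.jpg']
algs = ['motion', 'gaussian']


def nested_loops():
    log = []
    prepare, process = LoggingTask('prepare', log), LoggingTask('process', log)
    for img in images:
        prepare.setup(img)
        for alg in algs:
            process.setup(alg)
            process.cleanup(alg)
        prepare.cleanup(img)
    return log


def sweep():
    log = []
    e = Enumerator([images, algs])
    e.bind(images, LoggingTask('prepare', log))
    e.bind(algs, LoggingTask('process', log))
    for state in e:
        pass
    return log


print(nested_loops() == sweep())
```

Tracking index tuples rather than values keeps the change detection correct even if a parameter contains repeated values.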
Distributing the Workload
DistributedEnumerator. DistributedEnumerator is an Enumerator that distributes the state to multiple instances across multiple computing resources.
    e = RoundRobin(params)
    for state in e: pass

States:

    p1: [house.jpg, 1, motion]
    p2: [house.jpg, 1, gaussian]
    p1: [house.jpg, 2, motion]
    p2: [house.jpg, 2, gaussian]
    p1: [house.jpg, 4, motion]
    p2: [house.jpg, 4, gaussian]
    p1: [lena.jpg, 1, motion]
    p2: [lena.jpg, 1, gaussian]
    p1: [lena.jpg, 2, motion]
    p2: [lena.jpg, 2, gaussian]
    p1: [lena.jpg, 4, motion]
    p2: [lena.jpg, 4, gaussian]
    e = Domain(params, images)
    for state in e: pass

States:

    p1: [house.jpg, 1, motion]
        [house.jpg, 1, gaussian]
        [house.jpg, 2, motion]
        [house.jpg, 2, gaussian]
        [house.jpg, 4, motion]
        [house.jpg, 4, gaussian]
    p2: [lena.jpg, 1, motion]
        [lena.jpg, 1, gaussian]
        [lena.jpg, 2, motion]
        [lena.jpg, 2, gaussian]
        [lena.jpg, 4, motion]
        [lena.jpg, 4, gaussian]
    e = MasterWorker(params)
    for state in e: pass

States:

    p1: [house.jpg, 1, motion]
    p2: [house.jpg, 1, gaussian]
        [house.jpg, 2, motion]
        [house.jpg, 2, gaussian]
        [house.jpg, 4, motion]
        [house.jpg, 4, gaussian]
        [lena.jpg, 1, motion]
        [lena.jpg, 1, gaussian]
        [lena.jpg, 2, motion]
        [lena.jpg, 2, gaussian]
        [lena.jpg, 4, motion]
        [lena.jpg, 4, gaussian]
Every DistributedEnumerator must ensure that bound task semantics are satisfied.
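The round-robin and domain strategies can be sketched without MPI by passing the process rank and the process count explicitly. This is an illustration under stated assumptions, not the pyMPI-based implementation: `round_robin` and `domain` are hypothetical functions, `rank` and `size` stand in for values the execution environment would supply, and the domain split is assumed to deal values to processes by stride. Task binding is left out, so the sketch does not show how bound task semantics are enforced.

```python
from itertools import islice, product


def round_robin(params, rank, size):
    """States rank, rank + size, rank + 2 * size, ... of the full sweep."""
    return islice(product(*params), rank, None, size)


def domain(params, split, rank, size):
    """Partition only the `split` parameter; sweep the others in full."""
    local = [p[rank::size] if p is split else p for p in params]
    return product(*local)


i = ['house.jpg', 'lena.jpg']
j = [1, 2, 4]
k = ['motion', 'gaussian']
params = [i, j, k]

for rank in range(2):
    print('p%d round robin:' % (rank + 1), list(round_robin(params, rank, 2)))
    print('p%d domain:     ' % (rank + 1), list(domain(params, i, rank, 2)))
```

With two processes, round robin alternates states between them, while the domain split gives each process every state for its own image. The domain split keeps all work for one image on one process, so a per-image setup task would run once per image rather than once per process.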
Implementations
Python
  - Designed around iterators and generators
  - DistributedEnumerator based on pyMPI
  - Ideal for managing experiments on clusters
C++
  - Template metaprogramming techniques remove abstraction penalties
  - Ideal for applications with many nested loops
C++ Example
Generate HTML tables for days of the week, with hours for the rows and minutes for the columns.

Task Classes

    struct table_task {
      void setup(State& state) {
        std::cout << "<table title=\"";
        print_last_param()(state);
        std::cout << "\">\n";
      }

      void cleanup(State&) {
        std::cout << "</table>\n";
      }
    };

    struct table_row_task {
      // As above with <tr>
    };

    struct table_data_task {
      // As above with <td>
    };

Parameter Sweep

    int main()
    {
      using boost::make_tuple;

      sweep(make_tuple("Sat", "Sun"
            make_tuple(range(24)
            make_tuple(range(0,60,10))))
            empty_state().
              bind<0>(table_task()).
              bind<1>(table_row_task()).
              bind<2>(table_data_task()),
            print_last_param());

      return 0;
    }
Conclusions
PSWEEP cleanly separates concerns
  - Parameters
  - Tasks
  - Resources
Modern languages enable flexible and high-performance implementations
Reference
http://www.osl.iu.edu/~chemuell/new/psweep.php
Christopher Mueller, Douglas Gregor, and Andrew Lumsdaine. A Lightweight Pattern for Managing Distributed Computational Experiments. Submitted to HPDC 2006.
Questions?