
Hybrid Parallelization of Standard Full Tableau Simplex Method with

MPI and OpenMP

 Basilis Mamalis1, Marios Perlitis2

(1)Technological Educational Institute of Athens, (2)Democritus University of Thrace

October 2014

Technological Educational Institute of Athens, Department of Informatics



Background – LP & Simplex

Linear programming is known as one of the most important and well studied optimization problems.

The simplex method has been successfully used for solving linear programming problems for many years.

Parallel approaches have also been studied extensively due to the intensive computations required.

Most research (with regard to sequential simplex method) has been focused on the revised simplex method since it takes advantage of the sparsity that is inherent in most linear programming applications.

The standard simplex method is mostly suitable for dense linear problems.


Background – LP & Simplex

The mathematical formulation of the linear programming problem, in standard form:

Minimize z = cᵀx
s.t. Ax = b
     x ≥ 0

where A is an m×n matrix, x is the n-dimensional variable vector, c is the price vector, and b is the m-dimensional right-hand-side vector of the constraints.

(*) Dense and sparse linear problems

(*) The standard method and the revised method


Background – The Standard Simplex Method

The standard full tableau simplex method:

A tableau is an (m+1)x(m+n+1) matrix of the form:

         x1    x2   ...   xn    xn+1  ...  xn+m    z    RHS
  z     -c1   -c2   ...  -cn     0    ...    0     1     0
  xn+1   a11   a12  ...   a1n    1    ...    0     0     b1
  xn+2   a21   a22  ...   a2n    0    ...    0     0     b2
  ...    ...   ...  ...   ...   ...   ...   ...   ...   ...
  xn+m   am1   am2  ...   amn    0    ...    1     0     bm

(*) more efficient for dense linear problems

(*) it can easily be converted to a distributed version


Related Work (in simplex parallelization)

Parallel approaches have extensively been studied due to the intensive computations required (especially for large problems). Many of the known parallel (distributed memory) machines have been used in the past (Intel iPSC, MassPar, CM-2 etc.)

Several attempts have been made to parallelize the (sequentially much faster) revised method; however, high speed-up values cannot be achieved and most of them do not scale well when a distributed environment is involved.

On the contrary, the standard method can be parallelized more easily, so naturally many attempts have been made in this direction, with very satisfactory speed-up and high scalability.

In the last decade, many attempts have naturally focused on the use of tightly-coupled or shared memory hardware structures as well as on cluster computer systems, achieving very good speedup and efficiency values and high scalability.


Related Work (in simplex parallelization)

There are many typical and generally efficient implementations for low/medium-cost clusters, which usually differ in the data distribution scheme (column-based vs. row-based), e.g. Hall et al., Badr et al., Yarmish et al.

Distinguished: The work of Yarmish, G., and Slyke, R.V. 2009. A Distributed Scaleable Simplex Method. Journal of Supercomputing, Springer, 49(3), 373-381.

Some alternative very promising efforts have been made on distributed-memory many-core implementations, based on the block angular structure (or decomposition) of the initially transformed problems – e.g. Hall et al. 2013 to solve large-scale stochastic LP problems, and K.K. Sivaramakrishnan 2010.

The question remains: is it worth using distributed (shared-nothing) hardware platforms to parallelize simplex? (since the revised method cannot scale well and the standard method is much slower, any possible speed-up may not be enough)


Background (MPI & OpenMP)

MPI (Message Passing Interface) is the dominant approach for the distributed-memory (message-passing) model.

OpenMP emerges as the dominant high-level approach for shared memory with threads.

Recently, the hybrid model has begun to attract more attention, for at least two reasons:

It is relatively easy to pick a language/library instantiation of the hybrid model.

Scalable parallel computers now strongly encourage this model. The fastest machines these days almost all consist of multi-core nodes connected by a high speed network.

The idea of using OpenMP threads to exploit the multiple cores per node while using MPI to communicate among the nodes is the best-known hybrid approach.


Background (MPI & OpenMP)

In the last two years, however, another strong alternative has emerged: the MPI 3.0 shared memory support mechanism, which significantly improves the previously existing Remote Memory Access (RMA) utilities of MPI toward optimized operation inside a multicore node.

As analyzed in the literature, both of the above hybrid models (MPI+OpenMP, MPI+MPI 3.0 Shared Memory) have their pros and cons, and it is not guaranteed that they outperform pure MPI implementations in all cases.

Also, although OpenMP is still regarded as the most efficient hybrid approach, its superiority over the MPI 3.0 Shared Memory approach (especially when the processors/cores need to access shared data only for reading, i.e. without synchronization needs) is not guaranteed either.


In Our Work …

We focus on the standard full tableau simplex method, and we present relevant implementations for all hybrid schemes:

(a) MPI + OpenMP,

(b) MPI + MPI 3.0 Shared Memory, and

(c) pure MPI parallelization (one MPI process on each core).

We experimentally evaluate our approaches (over a hybrid environment of up to 4 processors / 16 cores)

(a) against each other, and

(b) against the approach presented in [23] (Yarmish et al.)

In all cases the hybrid OpenMP-based parallelization scheme performs considerably better than the other two schemes.

All schemes lead to high speed-up and efficiency values; the corresponding values for the OpenMP-based scheme are better than the ones achieved by Yarmish et al. [23].


The Standard method – Main steps

Step 0: Initialization: start with a feasible basic solution and construct the corresponding simplex tableau.

Step 1: Choice of entering variable: find the winning column (the one having the largest negative coefficient in the objective function row – entering variable).

Step 2: Choice of leaving variable: find the winning row (apply the minimum ratio test to the elements of the winning column – choose the row number with the minimum ratio – leaving variable).

Step 3: Pivoting (involves the most calculations): construct the next simplex tableau by performing pivoting in the previous tableau rows based on the new pivot row found in the previous step.

Step 4: Repeat the above steps until the optimal solution is found or the problem is found to be unbounded.
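To make the steps concrete, the following is a minimal sequential sketch in C of one iteration over a dense tableau. It is illustrative only (not the paper's code): the row-major storage layout, the names T, m, n_total and the EPS tolerance are assumptions, and the z-column of the tableau is omitted for brevity.

```c
#include <float.h>

#define EPS 1e-9

/* One iteration of the standard full-tableau method (steps 1-3).
 * T is a dense (m+1) x (n_total+1) array, row-major; row 0 is the
 * objective row and the last column is the right-hand side.
 * Returns 0 after a normal pivot, 1 if optimal, -1 if unbounded.      */
int simplex_iteration(double *T, int m, int n_total)
{
    int cols = n_total + 1;                  /* +1 for the RHS column */

    /* Step 1: entering variable = most negative objective coefficient */
    int pc = -1; double best = -EPS;
    for (int j = 0; j < n_total; ++j)
        if (T[j] < best) { best = T[j]; pc = j; }
    if (pc < 0) return 1;                    /* no negative entry: optimal */

    /* Step 2: leaving variable = minimum ratio test on column pc */
    int pr = -1; double best_ratio = DBL_MAX;
    for (int i = 1; i <= m; ++i) {
        double a = T[i*cols + pc];
        if (a > EPS) {
            double ratio = T[i*cols + n_total] / a;
            if (ratio < best_ratio) { best_ratio = ratio; pr = i; }
        }
    }
    if (pr < 0) return -1;                   /* column non-positive: unbounded */

    /* Step 3: pivoting - scale the pivot row, then eliminate the column */
    double piv = T[pr*cols + pc];
    for (int j = 0; j < cols; ++j) T[pr*cols + j] /= piv;
    for (int i = 0; i <= m; ++i) {
        if (i == pr) continue;
        double f = T[i*cols + pc];
        for (int j = 0; j < cols; ++j)
            T[i*cols + j] -= f * T[pr*cols + j];
    }
    return 0;
}
```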


Simplex Parallelization – Alternatives

1. Column-based distribution

sharing the simplex table to all processors by columns

the most popular and widely used – relatively direct parallelization

all the computation parts except step 2 of the basic algorithm are fully parallelized (even the first one, which involves less computation)

theoretically regarded as the most effective one in the general case

2. Row-based distribution

sharing the simplex table to all processors by rows

1. it tries to take advantage of some probable inherent parallelism in the computations of step 2 (though this is not always an advantage)

2. an extra broadcast of a row is needed, instead of the extra broadcast of a column required in the column distribution scheme

3. the parallelization of step 1 becomes more problematic, and it is usually preferable to execute the corresponding task on P0


Column vs. Row-based distribution experimentally

Results drawn from a previous work of ours: Mamalis, B., Pantziou, G., Dimitropoulos, G., and Kremmydas, D. 2013. Highly Scalable Parallelization of Standard Simplex Method on a Myrinet Connected Cluster Platform. ACTA Intl. Journal of Computers and Applications, 35(4), 152-161.

For large problems the speed-up given by the column scheme is clearly better than the one of the row scheme (except the case when the # of rows is much bigger than the # of columns)

For medium-sized problems the behaviour of the two schemes is similar; however, the differences are considerably smaller.

For small problems the speed-up given by the row scheme is generally slightly better than that of the column scheme.

In the general (most likely) case the column-based scheme should be the most preferable.


The Parallel Column-based algorithm

1. The simplex table is shared among the processors by columns. Also, the right-hand-side constraints vector is broadcast to all processors.

2. Each processor searches in its local part and chooses the locally best candidate column (as entering variable) – the one with the largest negative coefficient in the objective function row.

3. The local results are gathered in parallel and the winning processor (the one with the largest negative coefficient) is determined and made globally known.

4. The processor with the winning column (entering variable) computes the leaving variable (winning row) using the minimum ratio test over all the winning column’s elements.

5. The same (winning) processor then broadcasts the winning column as well as the winning row’s id to all processors.

6. Each processor performs (in parallel) on its own part (columns) of the table all the calculations required for the global row pivoting, based on the pivot data received during the previous step.

7. The above steps are repeated until the optimal solution is found or the problem is found to be unbounded. (A code sketch of one iteration is given below.)
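The sketch below illustrates steps 2-6 of the column-based algorithm with MPI. It assumes a column-major local block per rank (m+1 doubles per column, row 0 being the objective row), a broadcast RHS vector, and two hypothetical helpers (ratio_test, pivot_local_columns); it shows the communication pattern (one all-reduce with MINLOC plus one broadcast per iteration), not the authors' actual code.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define EPS 1e-9

/* hypothetical helpers, sketched further below */
int  ratio_test(const double *col, const double *rhs, int m);
void pivot_local_columns(double *Tloc, int ncols_loc, int m,
                         const double *pivcol, int pr);

/* One iteration of the column-distributed algorithm (steps 2-6).
 * Tloc: this rank's block of columns (column-major), rhs: full RHS vector.
 * Returns 0 after a pivot, 1 when the global optimum has been reached.
 * (The RHS update after pivoting is omitted for brevity.)               */
int parallel_iteration(double *Tloc, int ncols_loc, double *rhs,
                       int m, int myrank)
{
    /* step 2: locally best entering column (most negative objective entry) */
    struct { double val; int rank; } loc = { 0.0, myrank }, glob;
    int loc_pc = -1;
    for (int j = 0; j < ncols_loc; ++j)
        if (Tloc[j*(m+1)] < loc.val) { loc.val = Tloc[j*(m+1)]; loc_pc = j; }

    /* step 3: global winner via an all-reduce with MINLOC (most negative wins) */
    MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
    if (glob.val >= -EPS) return 1;                       /* optimal */

    /* steps 4-5: the winning rank runs the minimum-ratio test on its column,
     * then broadcasts the pivot column plus the pivot-row index              */
    double *pivcol = malloc((m + 2) * sizeof(double));
    if (myrank == glob.rank) {
        int pr = ratio_test(&Tloc[loc_pc*(m+1)], rhs, m);
        memcpy(pivcol, &Tloc[loc_pc*(m+1)], (m + 1) * sizeof(double));
        pivcol[m+1] = (double)pr;
    }
    MPI_Bcast(pivcol, m + 2, MPI_DOUBLE, glob.rank, MPI_COMM_WORLD);

    /* step 6: every rank pivots its own columns using pivcol[] and the row id */
    pivot_local_columns(Tloc, ncols_loc, m, pivcol, (int)pivcol[m+1]);

    free(pivcol);
    return 0;
}
```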


A. Pure MPI implementation

One MPI process on each core (without any MPI-2/MPI-3 RMA or SM support): the well-known MPI collective communication functions

MPI_Bcast, MPI_Scatter/Scatterv, MPI_Gather/Gatherv, MPI_Reduce/Allreduce (with or without the MAXLOC/MINLOC operators)

were appropriately used for the efficient implementation of the data communication required by steps 1, 3 and 5 of the parallel algorithm.
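As an illustration of how MPI_Scatterv could realize the initial column-wise distribution (step 1), the sketch below assumes the root stores the full tableau column-major so that each column's (m+1) entries are contiguous; all names (Tfull, Tloc, rhs, totcols) are illustrative, not taken from the paper.

```c
#include <mpi.h>
#include <stdlib.h>

/* Distribute whole columns of the tableau as evenly as possible,
 * then broadcast the right-hand-side vector to all ranks.          */
void distribute_columns(double *Tfull, double **Tloc, double *rhs,
                        int m, int totcols, int myrank, int nprocs)
{
    int rows = m + 1;
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));

    for (int p = 0, off = 0; p < nprocs; ++p) {
        int c = totcols / nprocs + (p < totcols % nprocs ? 1 : 0);
        counts[p] = c * rows;             /* doubles per process (whole columns) */
        displs[p] = off;
        off += counts[p];
    }

    *Tloc = malloc(counts[myrank] * sizeof(double));
    MPI_Scatterv(Tfull, counts, displs, MPI_DOUBLE,
                 *Tloc, counts[myrank], MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(rhs, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(counts); free(displs);
}
```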


B. Hybrid MPI+OpenMP implementation

Appropriately built parallel for constructs were used for the efficient thread-based parallelization of the loops implied by steps 2, 4 and 6.

With regard to the parallelization of steps 2 (in cooperation with 3) and 4 (which both involve a reduction operation), we used the newly added min/max reduction operators of the OpenMP API specification for C/C++.

(*) The min/max operators were not supported in the OpenMP API specification for C/C++ until version 3.0 (they were supported only for Fortran); they were added in version 3.1 (in our work we have used version 4.0).
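A small sketch of how the min reduction operator could be applied to the ratio test (step 4): the minimum ratio is reduced first and the corresponding row index is recovered in a cheap second pass. This is an illustrative realization of the ratio_test helper assumed in the MPI sketch above, not the paper's code.

```c
#include <float.h>
#include <omp.h>

/* Minimum-ratio test over a pivot column.  col has m+1 entries (row 0 is the
 * objective row), rhs has m entries.  Returns the pivot-row index (1..m),
 * or -1 if the problem is unbounded for this column.                         */
int ratio_test(const double *col, const double *rhs, int m)
{
    double best = DBL_MAX;

    /* OpenMP 3.1+ min reduction for C/C++ */
    #pragma omp parallel for reduction(min : best)
    for (int i = 0; i < m; ++i)
        if (col[i + 1] > 1e-9) {
            double r = rhs[i] / col[i + 1];
            if (r < best) best = r;
        }

    if (best == DBL_MAX) return -1;          /* unbounded */

    /* recover the winning row index; the exact comparison is safe because
       the very same expression is recomputed here                           */
    for (int i = 0; i < m; ++i)
        if (col[i + 1] > 1e-9 && rhs[i] / col[i + 1] == best)
            return i + 1;
    return -1;
}
```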


B. Hybrid MPI+OpenMP implementation [continued]

With regard to the parallelization of step 6, in order to achieve an even distribution of computations to the working threads (given that the computational costs of the main loop iterations cannot be regarded as a-priori equivalent), we have used collapse-based nested parallelism with a dynamic scheduling policy (a sketch is given below).

Beyond the OpenMP-based parallelization inside each node, the collective communication functions of MPI (MPI_Scatter, MPI_Gather, MPI_Bcast, MPI_Reduce) were also used here for the communication between the network-connected nodes, as in the pure MPI implementation.
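A possible shape of the step-6 pivoting loop with collapse(2) and dynamic scheduling, under the same column-major layout assumed in the earlier sketches; the chunk size 64 and all identifiers are illustrative, not the authors' implementation.

```c
#include <omp.h>

/* Pivoting on the locally owned columns.  pivcol[] holds the broadcast pivot
 * column (m+1 entries) and r is the pivot-row index.                          */
void pivot_local_columns(double *Tloc, int ncols_loc, int m,
                         const double *pivcol, int r)
{
    double piv = pivcol[r];

    /* the two loops are perfectly nested, so collapse(2) exposes
       ncols_loc*(m+1) iterations to the dynamic scheduler               */
    #pragma omp parallel for collapse(2) schedule(dynamic, 64)
    for (int j = 0; j < ncols_loc; ++j)
        for (int i = 0; i < m + 1; ++i)
            if (i != r)
                Tloc[j*(m+1) + i] -= pivcol[i] * Tloc[j*(m+1) + r] / piv;

    /* scaling the pivot row in a separate loop avoids a race on Tloc[., r] */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < ncols_loc; ++j)
        Tloc[j*(m+1) + r] /= piv;
}
```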


C. Hybrid MPI+MPI Shared Memory implementation

The shared memory support functions of MPI 3.0

mainly: MPI_Comm_split_type, MPI_Win_allocate_shared, MPI_Win_shared_query, MPI_Get/Put

and the synchronization primitives MPI_Win_fence, MPI_Win_lock/unlock and MPI_Accumulate

were used for the efficient implementation of all the data communication required by the steps of the parallel algorithm over the multiple cores of each node.

Beyond the MPI 3.0 SM-based parallelization inside each node, the collective communication functions of MPI (MPI_Bcast, MPI_Scatter, etc.) were also used here for the communication between the network-connected nodes, as in the pure MPI implementation.
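The sketch below shows how a per-node shared tableau window could be set up with the MPI 3.0 functions listed above; the sizes, names (tab_bytes, tab) and the fence-based access epochs are assumptions made for illustration, not the authors' implementation.

```c
#include <mpi.h>

/* Allocate a shared-memory window on each node: only node rank 0 contributes
 * memory, and every rank on the node queries a direct pointer into it.        */
void setup_shared_tableau(MPI_Aint tab_bytes, double **tab,
                          MPI_Win *win, MPI_Comm *nodecomm)
{
    int node_rank;

    /* split MPI_COMM_WORLD into one communicator per shared-memory node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, nodecomm);
    MPI_Comm_rank(*nodecomm, &node_rank);

    /* rank 0 of the node allocates the window; the others contribute 0 bytes */
    double *base;
    MPI_Win_allocate_shared(node_rank == 0 ? tab_bytes : 0, sizeof(double),
                            MPI_INFO_NULL, *nodecomm, &base, win);

    /* every rank obtains a pointer to rank 0's segment */
    MPI_Aint qsize; int disp_unit;
    MPI_Win_shared_query(*win, 0, &qsize, &disp_unit, tab);

    /* access epochs would then bracket each pivoting step, e.g.:
       MPI_Win_fence(0, *win);  ... read/write *tab ...  MPI_Win_fence(0, *win); */
}
```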


Experimental Results

The three different parallelization schemes outlined above have been implemented using the MPI 3.0 message-passing library and the OpenMP 4.0 API, and they have been extensively tested over a powerful hybrid parallel environment (distributed memory, multicore nodes).

Our test environment consists of up to 4 quad-core processors (making a total of 16 cores) with 4GB RAM each, connected via a network interface with up to 1Gbps communication speed.

The computing components of the above test environment were mainly available and accessed through the Okeanos Cyclades cloud computing services and local infrastructure in T.E.I. of Athens.


Experimental Results

We’ve performed three kinds of experiments:

experiments that compare our three different schemes to the corresponding one of Yarmish et al [23]

experiments that further compare the two hybrid schemes (MPI+OpenMP vs. MPI+MPI 3.0 Shared Memory), over more realistic (NETLIB) problems

experiments regarding the scalability of the MPI+OpenMP scheme (over larger NETLIB problems)

Test problems were either taken from NETLIB or randomly generated.

(*) We measure response times (T), speed-up (Sp) and efficiency (Ep) values for a varying number of processors/cores (1…16).

(*) Speed-up measure for P processors: Sp = T1 / TP

(*) Efficiency measure for P processors: Ep = Sp / P


Comparing to [23] (Yarmish et al.)

Table 2. Comparing to the implementation of Yarmish et al.

  P        Yarmish et al                       Hybrid (MPI + OpenMP)
  #cores   Time/iter  Sp=T1/TP  Ep=Sp/P        Time/iter  Sp=T1/TP  Ep=Sp/P
  1        0.6133     1.00      100.0%         0.2279     1.00      100.0%
  2        0.3115     1.97      98.4%          0.1149     1.98      99.2%
  3        0.2172     2.82      94.1%          0.0766     2.97      99.1%
  4        0.1550     3.96      98.9%          0.0575     3.96      99.0%
  5        0.1311     4.68      93.5%          0.0462     4.93      98.7%
  6        0.1066     5.75      95.9%          0.0386     5.91      98.5%
  7        0.0913     6.72      96.0%          0.0332     6.87      98.2%
  8        –          –         –              0.0290     7.85      98.1%

Table 3. Comparing the two MPI-based implementations

  P        Hybrid (MPI + MPI 3.0 SM)           Pure MPI
  #cores   Time/iter  Sp=T1/TP  Ep=Sp/P        Time/iter  Sp=T1/TP  Ep=Sp/P
  1        0.2279     1.00      100.0%         0.2279     1.00      100.0%
  2        0.1178     1.93      96.7%          0.1257     1.81      90.7%
  3        0.0789     2.89      96.3%          0.0846     2.70      89.8%
  4        0.0592     3.85      96.2%          0.0630     3.62      90.4%
  5        0.0478     4.77      95.4%          0.0513     4.45      88.9%
  6        0.0400     5.70      95.0%          0.0425     5.36      89.4%
  7        0.0347     6.57      93.9%          0.0363     6.28      89.7%
  8        0.0305     7.48      93.5%          0.0317     7.20      90.0%

Page 21: Hybrid Parallelization of Standard Full Tableau Simplex Method with MPI and OpenMP Basilis Mamalis 1, Marios Perlitis 2 (1) Technological Educational Institute

Comparing to [23] (Yarmish et al.)

[Figure: Speed-up vs. number of processors (p = 2 … 8) – hybrid MPI+OpenMP vs. Yarmish et al.]


Comparing the three schemes

[Figure: Speed-up vs. number of processors (p = 2 … 8) – MPI+OpenMP vs. MPI+MPI 3.0 SM vs. pure MPI.]


Discussion…

The achieved speed-up values of our hybrid MPI+OpenMP implementation are better than the ones in [23] in all cases.

To be more precise, they are slightly better for 2 and 4 processors/cores (powers of two) and considerably better for 3, 5, 6 and 7 processors/cores.

Furthermore, observing the efficiency values in the last column, one can easily notice the high scalability (higher and smoother than in [23]) achieved by our implementation.

Note also that the achieved speedup remains very high (close to the maximum / speedup = 7.85, efficiency = 98.1%) even for 8 processors/cores.

(*) The implementation of [23] has been compared to MINOS, a well-known implementation of the revised simplex method, and it has been shown to be highly competitive, even for very low density problems.


Discussion… – [continued]

The corresponding speed-up and efficiency values of our hybrid MPI+MPI 3.0 Shared Memory and pure MPI implementations are very satisfactory in absolute values.

However, compared to [23] and to the MPI+OpenMP implementation, the measurements for the hybrid MPI+MPI 3.0 Shared Memory approach are slightly worse, whereas the measurements for the pure MPI approach are considerably worse in all cases.

The main disadvantage of the pure MPI approach is that, since it follows the pure process model (using IPC mechanisms for interaction), it naturally cannot match shared memory access speed, especially for large-scale data sharing needs.

The main disadvantage of the MPI+MPI 3.0 SM approach is that its shared window allocation mechanism still cannot scale equally well for large data sharing needs and numbers of cores (as shown more clearly later). However, it obviously offers a substantial improvement over pure MPI.


Comparing the two hybrid schemes [NETLIB problems]

  Linear Problems          MPI+OpenMP                        MPI+MPI 3.0 SM
                           2x4=8 cores     4x4=16 cores      2x4=8 cores     4x4=16 cores
                           Sp     Ep       Sp      Ep        Sp     Ep       Sp      Ep
  SC50A (50x48)            4.89   61.1%    6.50    40.6%     4.85   60.6%    6.40    40.0%
  SHARE2B (96x79)          5.59   69.8%    8.24    51.5%     5.49   68.6%    8.00    50.0%
  SC105 (105x103)          5.81   72.6%    8.78    54.9%     5.69   71.1%    8.50    53.1%
  BRANDY (220x249)         7.00   87.5%    12.17   76.1%     6.60   82.5%    10.63   66.5%
  AGG (488x163)            6.76   84.5%    12.11   75.7%     6.38   79.8%    11.27   70.4%
  AGG2 (516x302)           7.00   87.5%    12.89   80.5%     6.58   82.3%    11.99   74.9%
  BANDM (305x472)          7.42   92.8%    13.61   85.0%     6.82   85.3%    12.03   75.2%
  SCFXM3 (990x1371)        7.60   95.0%    14.38   89.9%     7.16   89.5%    13.05   81.5%


Comparing the two hybrid schemes [NETLIB problems]

The achieved speed-up values of MPI+OpenMP are better than the ones of the hybrid MPI+MPI 3.0 SM implementation, in all cases.

Moreover, the larger the size of the linear problem the larger the difference in favour of the MPI+OpenMP implementation.

For linear problems of small size (e.g. the first two problems) the corresponding measurements are almost the same (slightly better for the MPI+OpenMP approach), whereas for problems of larger size (e.g. the last two problems) the difference is quite clear.


Comparing the two hybrid schemes [NETLIB problems]

Overall, we can say that the shared window allocation mechanism of MPI 3.0 offers a very good alternative (with almost equivalent results to the MPI+OpenMP approach) for shared memory parallelization when the shared data are of relatively small/medium scale.

However, it cannot scale equally well (for large windows and large numbers of cores) due to internal protocol limitations and management costs, especially when some kind of synchronization is required.

As also analyzed in [17], the performance of MPI 3.0 Shared Memory can be very close to MPI+OpenMP, even for large-scale shared data, if it is used for simple sharing (e.g. mainly for reading).


Scalability of MPI+OpenMP [large NETLIB problems]

  Linear Problems               Speed-up & Efficiency / MPI+OpenMP
                                2x1 cores       2x2 cores       2x4 cores       4x4 cores
                                Sp      Ep      Sp      Ep      Sp      Ep      Sp      Ep
  FIT2P (3000x13525)            1.977   98.9%   3.94    98.5%   7.80    97.5%   15.24   95.3%
  80BAU3B (2263x9799)           1.969   98.5%   3.91    97.8%   7.72    96.5%   14.92   93.3%
  QAP15 (6330x22275)            1.963   98.2%   3.89    97.3%   7.62    95.3%   14.47   90.5%
  MAROS-R7 (3136x9408)          1.957   97.9%   3.87    96.8%   7.54    94.3%   14.12   88.3%
  QAP12 (3192x8856)             1.953   97.7%   3.86    96.5%   7.50    93.8%   13.97   87.3%
  DFL001 (6071x12230)           1.945   97.3%   3.85    96.3%   7.50    93.8%   14.04   87.8%
  GREENBEA (2392x5405)          1.949   97.5%   3.84    96.0%   7.40    92.5%   13.59   84.9%
  STOCFOR3 (16675x15695)        1.925   96.3%   3.79    94.8%   7.23    90.4%   12.80   80.0%


Discussion…

The efficiency values decrease with the increase of the # of processors. However, this decrease is quite slow, and both the speedup and efficiency remain high (no less than 80%) even for 16 processors/cores, in all cases.

Moreover, particularly high efficiency values are achieved for all the high aspect ratio problems (e.g. see the values for the first three problems where the efficiency even for 16 processors/cores is over 90%).

In the case of 16 cores (4 network-connected quad-core nodes) the communication overhead starts to become quite significant; however, as shown in [11], the higher the aspect ratio of the linear problem, the better the performance of the column distribution scheme with regard to the total communication overhead.


Conclusion…

MPI+OpenMP: High scalability, highly appropriate for modern supercomputers and large-scale cluster architectures

MPI 3.0 Shared Memory performance very close to OpenMP for simple parallelization needs

Decreased performance (however quite satisfactory) for large-scale parallelization and increased synchronization needs

A highly scalable parallel implementation framework of the standard full tableau simplex method on a hybrid (modern hardware) platform has been presented and evaluated throughout the paper, in terms of typical performance measures.


Thank you for your attention!

Any questions?

End of Time…