
Original Article

The International Journal of High Performance Computing Applications 1–13. © The Author(s) 2015. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1094342015580864. hpc.sagepub.com

Study of parallel programming models on computer clusters with Intel MIC coprocessors

Miaoqing Huang1, Chenggang Lai1, Xuan Shi1, Zhijun Hao2 and Haihang You3

Abstract
Coprocessors based on the Intel Many Integrated Core (MIC) Architecture have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of the MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) The native MPI programming model on the MIC processors is typically better than the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance of parallel applications on computer clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these MPI processes to as few MIC processors as possible to reduce the cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed to both MIC cores and CPU cores, can outperform the native MPI programming model.

Keywords
parallel programming model, Intel MIC processor, MPI, OpenMP, performance evaluation

1. Introduction

Emerging computer architectures and advanced computing technologies, such as Intel's Many Integrated Core (MIC) Architecture1 and graphics processing units (GPUs) (Kirk and Hwu, 2012), provide a promising way to employ parallelism for achieving high performance, scalability and low power consumption. For example, the NSF-sponsored Beacon supercomputer2 contains 192 MIC-based Intel Xeon Phi 5110P coprocessors. It is ranked number 397 on the Top 500 list3 and number 1 on the Green 500 list as of June 2013.4 The Stampede supercomputer5 at the Texas Advanced Computing Center contains 6880 Intel Xeon Phi coprocessors. It can provide a computing performance of nearly 10 petaflops, the majority of which comes from the MIC coprocessors.

The current Intel MIC architecture (i.e. Knights Corner) contains up to 61 lightweight processing cores, as shown in Figure 1. These cores are connected through a high-speed ring bus. Each core can run four threads in parallel. Because each core alone is a classic processor, traditional parallel programming models, such as MPI and OpenMP, are supported on each core. The MIC processors typically co-exist with multicore CPUs, such as the Intel Xeon E5, in a hybrid computer node as coprocessors/accelerators. In the remainder of this paper, a single MIC card or device will be called a MIC processor or MIC coprocessor. A constituent processing core on a MIC card will be called a MIC core.

1University of Arkansas, Fayetteville, AR, USA
2Fudan University, Shanghai, P. R. China
3Chinese Academy of Sciences, Beijing, P. R. China

Corresponding author:
Miaoqing Huang, University of Arkansas, JBHT-CSCE 526, 1 University of Arkansas, Fayetteville, AR 72701, USA.
Email: [email protected]

In this work, we conduct a detailed study of the performance and scalability of five execution modes on Intel MIC processors. In the first mode, one MPI process runs directly on each MIC core. In the second mode, we try to take advantage of the internal processing parallelism of each MIC core. Therefore, we launch

four threads in each MPI process using OpenMP. Each MPI process is still run on a MIC core. In the third mode, only one MPI process is issued onto each MIC processor. Then OpenMP is used to launch threads onto the MIC cores. In the fourth mode, the MPI processes run on the CPUs. The data processing is offloaded to the MIC processors using OpenMP. Only one thread is scheduled to each MIC core. The fifth mode is a variant of the fourth one: four threads are scheduled to each MIC core. We use two geospatial applications, i.e. Kriging interpolation and cellular automata (CA), to test the performance and scalability of a single MIC processor and of a computer cluster with hybrid nodes.

Through this study, we have the following findings. (1) The native MPI programming model on the MIC processors is typically better than the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance of parallel applications on computer clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these MPI processes to as few MIC processors as possible to reduce the cross-processor communication overhead, provided the capacity of the on-board memory is not a limiting factor. (4) We also evaluate a hybrid MPI programming model, which is not officially supported by the Intel MPI compiler. In this hybrid model, the data processing is distributed to both the MIC cores and the CPU cores. The benchmarking results show that the hybrid model outperforms the native model.

The remainder of this paper is organized as follows. The Intel MIC architecture and the two major programming models are discussed in Section 2. We discuss the details of the benchmarks and the experiment platform in Section 3. In Section 4, we show the experiment results on a single MIC device. We also compare the performance of a single MIC device with a single Xeon CPU and the latest GPUs. Then we expand the experiments on the two geospatial benchmarks to the Beacon cluster using many compute nodes in Section 5. We discuss some related work in Section 6. Finally, we give the concluding remarks in Section 7.

2. Intel MIC architecture and programming models

The first commercially available Intel coprocessor based on the MIC architecture is the Xeon Phi, as shown in Figure 1. The Xeon Phi contains up to 61 scalar processing cores with vector processing units. These cores are connected through a high-speed bi-directional, 1024-bit-wide ring bus (512 bits in each direction). In addition to the scalar unit inside each core, there is a vector processing unit to support wide vector operations. Further, each core can execute four threads in parallel. Communication between the cores can be realized through shared-memory programming models, e.g. OpenMP. In addition, each core can run MPI to realize communication. Direct communication between MIC processors across different nodes is also supported through MPI.

Figure 2 shows two approaches to parallelizing applications on computer clusters equipped with MIC processors. The first approach is the native model, as shown in Figure 2(a). In this model, the MPI processes run directly on the MIC processors. There are two variants under this model. (1) Let each MIC core directly host one MPI process. In this way, the 60 cores on the Xeon Phi 5110P, which is used in this work, are treated as 60 independent processors while sharing the 8 GB of on-board memory. (2) Issue only one MPI process on each MIC card. This single MPI process then spawns threads running on many cores using OpenMP. The second approach is to treat the MIC processors as clients of the host CPUs. As shown in Figure 2(b), the MPI processes are hosted by the CPUs, which offload the computation to the MIC processors. Multithreading programming models such as OpenMP can be used to allocate many MIC cores for data processing in the offload model.

Figure 1. The architecture of Intel Xeon Phi coprocessor (MIC).


In this work there are five different parallel implementations across these two models, as follows.

• Native model: In this model, MPI processes execute directly on the MIC processors. There are three implementations.

– Native-1 (N-1): Issue one MPI process onto each MIC core. If n MIC cores are allocated, then n MPI processes are issued. Each MPI process contains only one thread.

– Native-2 (N-2): Issue one MPI process onto each MIC core. Each MPI process contains four threads.

– Native-3 (N-3): Issue only one MPI process onto each MIC card. Then allocate many MIC cores using OpenMP. On each MIC core, issue four threads.

• Offload model: In this model, the CPU offloads the work to the MIC processor using OpenMP (a minimal code sketch follows this list). There are two implementations.

– Offload-1 (O-1): Issue one thread onto each MIC core. If n MIC cores are allocated, then n threads are issued.

– Offload-2 (O-2): Issue four threads onto each MIC core. If n MIC cores are allocated, then 4 × n threads are issued.
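As an illustration of the offload model, the sketch below (not the authors' code) shows how an O-1/O-2 style kernel could be written, assuming the Intel compiler's offload pragma: the buffer is shipped to the coprocessor, OpenMP spreads the loop over the MIC cores, and the thread count selects between one thread per core (60) and four threads per core (240). The loop body is only a stand-in for the real per-cell work.

    /* Minimal offload-model sketch (assumes the Intel compiler's offload pragma). */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int ncells = 1440 * 720;
        int threads = 240;                 /* 60 for Offload-1, 240 for Offload-2 */
        float *grid = (float *)malloc(ncells * sizeof(float));

        /* Ship the buffer to MIC card 0, run the parallel loop there, copy it back. */
        #pragma offload target(mic:0) inout(grid : length(ncells))
        {
            omp_set_num_threads(threads);
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < ncells; i++)
                grid[i] = (float)i * 0.5f; /* stand-in for the real per-cell work */
        }

        printf("grid[0] = %f\n", grid[0]);
        free(grid);
        return 0;
    }

In the native model, by contrast, the same loop would simply appear inside an ordinary MPI (or MPI plus OpenMP) program that is cross-compiled for the coprocessor and launched directly on the MIC cores.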

3. Experiment setup

3.1. Benchmarks

Two geospatial applications are chosen to represent two types of benchmarks in high-performance computing: the embarrassingly parallel case and the intense communication case.

3.1.1. Embarrassingly parallel case: Kriging interpolation. Kriging is a geostatistical estimator that infers the value of a random field at an unobserved location (Jensen, 2004). Kriging is based on the idea that the value at an unknown point should be the average of the known values of its neighbors.

Kriging can be viewed as a point interpolation that reads input point data and returns a raster grid with calculated estimations for each cell. Each input point is of the form (x_i, y_i, Z_i), where x_i and y_i are the coordinates and Z_i is the value. The estimated values in the output raster grid are calculated as a weighted sum of the input point values as in (1).

Z(x, y) = \sum_{i=1}^{k} w_i Z_i    (1)

where w_i is the weight of the ith input point. Theoretically, the estimate can be calculated by summing over all input points. In general, users can specify a number k so that the summation is over the k nearest neighbors of the estimated point in terms of distance. This reduction of computation is justified by the fact that the farther a sampled point is from the estimated point, the less impact it has on the summation. For example, the commercial software ArcGIS6 uses the 12 nearest points (i.e. k = 12) in the Kriging calculation by default. In this benchmark, embarrassing parallelism can be exploited since the interpolation calculation for each cell has no dependency on the others.
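A sketch of the per-cell estimate in equation (1) is given below: scan all sample points, keep the k nearest, and form a weighted sum of their values. The paper does not spell out its weight formula, so normalized inverse-distance weights are used here purely as a stand-in; the structure (a full scan plus k-nearest selection) is what matters for the cost discussion that follows.

    #include <math.h>

    typedef struct { double x, y, z; } Sample;

    /* Estimate the value at (x, y) from the k nearest of n sample points. */
    double estimate_cell(double x, double y, const Sample *s, int n, int k)
    {
        double best_d[16];            /* squared distances of the k nearest (k <= 16) */
        int    best_i[16];
        for (int j = 0; j < k; j++) { best_d[j] = 1e300; best_i[j] = -1; }

        /* Maintain a small sorted list of the k closest sample points. */
        for (int i = 0; i < n; i++) {
            double dx = s[i].x - x, dy = s[i].y - y, d = dx * dx + dy * dy;
            int j = k - 1;
            if (d >= best_d[j]) continue;
            while (j > 0 && best_d[j - 1] > d) {
                best_d[j] = best_d[j - 1]; best_i[j] = best_i[j - 1]; j--;
            }
            best_d[j] = d; best_i[j] = i;
        }

        /* Weighted sum of the k nearest values (inverse-distance weights as a stand-in). */
        double num = 0.0, den = 0.0;
        for (int j = 0; j < k; j++) {
            if (best_i[j] < 0) continue;
            double w = 1.0 / (sqrt(best_d[j]) + 1e-12);
            num += w * s[best_i[j]].z;
            den += w;
        }
        return den > 0.0 ? num / den : 0.0;
    }

Because every output cell performs this scan independently, the loop over cells can be split across MPI processes or OpenMP threads without any communication.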

Figure 2. Two different parallel programming models on MIC-based computer clusters. (a) In the native model, MPI processes run directly on MIC cores. (b) In the offload model, MPI processes run on CPU cores, which offload computation to MIC cores.


In the Kriging interpolation benchmark, the problem space shown in Figure 3(a) is evenly partitioned among all MPI processes as shown in Figure 3(b), in which we use four processes as an example. The computation in each MPI process is purely local, i.e. there is no cross-process communication.

The input size of this benchmark is 171 MB, consisting of 4 data sets with respective sizes of 29, 37, 48, and 57 MB. The data sets have 2191, 4596, 6941, and 9817 sample points, respectively. The output raster grid for each data set has a consistent dimension of 1440 × 720. In other words, each data set will generate a 1440 × 720 grid. The value of each point in the output grid needs to be estimated using the sample points in the corresponding input data set. In our experiments, the value of an unsampled point is estimated using the values of the 10 closest sample points, i.e. k = 10. These four data sets are processed in sequence. For each data set, the generation of its corresponding output grid is evenly distributed among all MPI processes. In order to generate the value of a point in the output grid, all of the sampled points in the data set need to be scanned to find the 10 closest sample points. The pseudocode of the Kriging interpolation is illustrated in Figure 4.

3.1.2. Intense communication case: cellular automata. CA are the foundation for geospatial modeling and simulation. Game of Life (GOL) (Gardner, 1970), invented by British mathematician John Conway, is a well-known generic CA. It consists of a collection of cells that can live, die or multiply based on a few mathematical rules.

The universe of the GOL is a two-dimensional square grid of cells, each of which is in one of two possible states, alive ('1') or dead ('0'). Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur.

• Any live cell with fewer than two live neighbors dies, as if caused by under-population.

• Any live cell with two or three live neighbors lives on to the next generation.

• Any live cell with more than three live neighbors dies, as if by overcrowding.

• Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.

In this benchmark, the status of each cell in the grid is updated for 100 iterations. In each iteration, the statuses of all cells are updated simultaneously. The pseudocode is illustrated in Figure 5. In order to parallelize the updating process, the cells in the square grid are partitioned into stripes along the row-wise order. Each stripe is handled by one MPI process. At the beginning of each iteration, each MPI process needs to send the statuses of the cells along the boundaries of its stripe to its neighbor MPI processes and receive the statuses of the cells in the two adjacent rows, as shown in Figure 3(c).
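This boundary exchange is the only communication in each GOL iteration. A minimal sketch of that exchange is shown below, assuming a row-striped decomposition with one ghost row above and below each stripe; the names are illustrative, not taken from the authors' code.

    #include <mpi.h>

    /* stripe holds (local_rows + 2) rows of width N; rows 0 and local_rows + 1
       are ghost rows filled from the neighboring ranks before each update. */
    void exchange_halos(unsigned char *stripe, int local_rows, int N,
                        int rank, int nprocs, MPI_Comm comm)
    {
        int up   = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int down = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send my first real row up, receive the lower ghost row from below. */
        MPI_Sendrecv(&stripe[1 * N],                N, MPI_UNSIGNED_CHAR, up,   0,
                     &stripe[(local_rows + 1) * N], N, MPI_UNSIGNED_CHAR, down, 0,
                     comm, MPI_STATUS_IGNORE);

        /* Send my last real row down, receive the upper ghost row from above. */
        MPI_Sendrecv(&stripe[local_rows * N],       N, MPI_UNSIGNED_CHAR, down, 1,
                     &stripe[0],                    N, MPI_UNSIGNED_CHAR, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }

As the grid is split across more processes, each stripe shrinks while these two row exchanges stay the same size, which is why the communication cost eventually dominates in the results reported later.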

Figure 3. (a) Data partition and (b), (c) communication in the two benchmarks: (b) in Kriging interpolation there is no communication among the MPI processes (i.e. P(i) in the figure) during computation; (c) in Game of Life, the MPI processes need to communicate with each other during the computation.

Figure 4. Pseudocode of Kriging interpolation. The inner for loop can be parallelized, while the four data sets in the outer for loop are processed in sequence.

    for (each data set in the 4 data sets) {
        /* The following for loop can be parallelized */
        for (each point in the 1440 x 720 output grid) {
            Scan the whole data set to find the 10 closest sampled points;
            Use Equation (1) to estimate the value of the unsampled point;
        }
    }

Figure 5. Pseudocode of Game of Life.

    for (iteration = 0; iteration < 100; iteration++) {
        /* The following for loop can be parallelized */
        for (all cells in the universe) {
            Update the status of cell[i,j] based on the
            statuses of cell[i,j] and its 8 neighbors;
        }
    }

3.2. Experiment platform

We conduct our experiments on the NSF-sponsored Beacon supercomputer2 hosted at the National Institute for Computational Sciences (NICS), University of Tennessee.

The Beacon system (a Cray CS300-AC Cluster Supercomputer) offers access to 48 compute nodes and 6 I/O nodes joined by an FDR InfiniBand interconnect, which provides a bi-directional bandwidth of 56 Gb/s. Each compute node is equipped with 2 Intel Xeon E5-2670 8-core 2.6 GHz processors, 4 Intel Xeon Phi (MIC) 5110P coprocessors, 256 GB of RAM, and

960 GB of SSD storage. Each I/O node provides access to an additional 4.8 TB of SSD storage. Each Xeon Phi 5110P coprocessor contains 60 1.053 GHz MIC cores and 8 GB of GDDR5 on-board memory. Altogether Beacon contains 768 conventional cores and 11,520 accelerator cores that provide over 210 TFLOP/s of combined computational performance, 12 TB of system memory, 1.5 TB of coprocessor memory, and over 73 TB of SSD storage.

The compiler used in this work is Intel 64 Compiler XE, Version 14.0.0.080 Build 20130728, which supports OpenMP. The MPI library is Intel MPI 4.1.0.024.

4. Experiments and results on a single device

Since a single Intel Xeon Phi 5110P is a 60-core processor, it is worthwhile to investigate the performance and scalability of a single MIC processor alone.

4.1. Scalability on a single MIC processor

When the MPI programming model is used to implement the Kriging interpolation application, the workload is evenly distributed among the MPI processes. In this benchmark, there are four data sets. For each data set, the output is a 1440 × 720 raster grid. In the MPI implementation, we increase the number of MPI processes from 10 to 60 with a stride of 10 processes. The computation of the 720 columns of the output grid is evenly distributed. The 50-process configuration is skipped because 720 columns cannot be distributed among 50 processes equally. For the offload programming model, we use OpenMP to parallelize the for loops in the program. The OpenMP runtime automatically distributes the workload evenly across the MIC cores.
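The even column split used for the native runs amounts to a simple block decomposition, sketched below under the stated divisibility assumption (which is exactly why the 50-process case is skipped); rank and nprocs would come from MPI_Comm_rank and MPI_Comm_size.

    /* Contiguous block of output columns owned by one MPI rank. */
    void my_columns(int rank, int nprocs, int *col_begin, int *col_end)
    {
        const int cols_total = 720;
        int cols_per_rank = cols_total / nprocs;   /* assumes nprocs divides 720 */
        *col_begin = rank * cols_per_rank;
        *col_end   = *col_begin + cols_per_rank;   /* exclusive upper bound */
    }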

Table 1. Performance of Kriging interpolation on a single MIC processor (unit: second).

Execution mode: Native-1
                      Number of MIC cores
                10        20        30       40       50    60
Read            0.65      0.60      0.66     0.72     NA*   0.79
Interpolation   2734.45   1353.48   921.76   664.74   NA*   455.34
Write           9.44      9.21      11.04    8.04     NA*   7.95
Total           2744.54   1363.30   933.46   673.50   NA*   464.09

Execution mode: Offload-1
                      Number of MIC cores
                10        20        30        40       50       60
Read            0.04      0.05      0.04      0.04     0.04     0.04
Interpolation   2758.22   1570.75   1040.44   784.30   632.65   548.15
Write           1.77      1.99      1.65      1.44     1.45     1.57
Total           2760.03   1572.78   1042.12   785.78   634.14   549.75

*The workload could not be distributed evenly among 50 cores.

Figure 6. Performance of Kriging interpolation on a single MIC processor. For both implementations, only one thread runs on each MIC core. The native implementation outperforms the offload one by a small margin.

The detailed execution times of the Kriging interpolation benchmark under both programming models, with each MIC core hosting only one thread, are listed in Table 1. By looking at the time curves in Figure 6, we can see that both models show good strong scalability for this application. Their performance in terms of interpolation time is very close too. The reason we do not include the write time in Figure 6 is that the write time may become dramatically longer when the number of MPI processes increases. In the Kriging interpolation application, each output raster grid is written into a file. When many MPI processes try to write to


the same file, their writes need to be serialized. Further, the arbitration takes a lot of time. This effect is not very significant when one MIC processor is used. Later, we will find that the write time can become extremely significant when many MIC processors are allocated.

For GOL, three different grid sizes are tested, i.e. 8192 × 8192, 16,384 × 16,384, and 32,768 × 32,768. However, we encounter either an out-of-memory error or a runtime error for the 32,768 × 32,768 case when only one MIC processor is used. From the results in Table 2, it can be found that the native model consistently outperforms the offload model for this intense communication case. By looking at the performance curves in Figure 7, we can see that both programming models show strong scalability when the number of cores increases from 10 to 20. Beyond that, both models lose strong scalability, although the total computation time still decreases. For both problem sizes, the reduction of workload is gradually offset by the increase of communication overhead when the number of cores increases. Further, when more cores are allocated, the memory access demand increases as well. Eventually, the communication and memory bandwidth become the limiting factors for performance.

Table 2. Performance of Game of Life on a single MIC processor (unit: second).

Execution mode: Native-1
                         Number of MIC cores
Problem size       10       20       30       40       50      60
8192 × 8192        82.85    42.27    32.56    24.91    21.37   23.15
16,384 × 16,384    338.57   173.57   131.10   103.30   94.41   56.31

Execution mode: Offload-1
                         Number of MIC cores
Problem size       10       20       30       40       50       60
8192 × 8192        152.06   71.90    51.23    38.88    29.10    31.33
16,384 × 16,384    627.94   313.88   223.54   171.33   131.14   131.72

Figure 7. Performance of Game of Life on a single MIC processor: (a) 8192 × 8192; (b) 16,384 × 16,384. The native model outperforms the offload model by a big margin when only a few cores are used. The performance gap decreases as more cores are allocated.

4.2. Performance comparison of single devices

As an emerging technology, it is worthwhile to compare the performance of the Intel MIC processor with the other popular accelerator, i.e. the GPU. Furthermore, it is routine to include very powerful multicore CPUs in supercomputers. Therefore, we conduct a comparison among these three technologies at the full capacity of a single device. For the Intel Xeon Phi 5110P, we use all 60 cores under the two programming models for the 5 different implementations. For the 8-core Xeon E5-2670 CPU on the Beacon cluster, we use OpenMP to issue either 8 or 16 threads. For the GPUs, we test two devices, the Nvidia Tesla C2075 based on


the Fermi architecture (NVIDIA Corporation, 2009) and the Tesla K20 based on the Kepler architecture (NVIDIA Corporation, 2012). The CUDA version is 5.5.

The execution times of Kriging interpolation on the various devices are listed in Table 3. It can be found that the performance of the MIC processor and that of the CPU are of the same order of magnitude. When running at full capacity, the performance of the Intel Xeon Phi 5110P is equivalent to that of the Xeon E5-2670. By increasing the number of threads in each MPI process to four, the Native-2 implementation is able to improve the performance by three times compared with the Native-1 implementation. However, the Native-3 implementation, i.e. one MPI process with 240 threads, has much worse performance. We varied the number of threads in the MPI process and found that the performance did not change significantly. We speculate that the OpenMP library does not work well with the Kriging interpolation under the Native-3 programming model. For the Xeon CPU, the 16-thread implementation is almost two times faster than the 8-thread implementation because each CPU core can execute two threads simultaneously. Both GPUs are able to improve the performance by one order of magnitude. Further, the K20 is more than two times faster than the C2075, as shown in Figure 8.

The performance results of GOL on the three different types of processors are listed in Table 4. The performance of both models on the MIC processor is of the same order of magnitude as the implementations on the CPU and the C2075. All five implementations work quite well on the MIC, and the native model is typically better than the offload model. The Native-3 implementation has the best performance among the five implementations on the MIC. Overall, the K20 implementation is generally one order of magnitude better in terms of performance than the other implementations.

Table 3. Performance of Kriging interpolation on single devices (unit: second).

                MIC (60 cores)                                  CPU (Xeon E5-2670)       Nvidia GPU
                N-1      N-2      N-3       O-1      O-2        8-thread   16-thread     C2075   K20
Read            0.79     1.03     0.45      0.04     0.42       0.01       0.01          0.01    0.01
Interpolation   455.34   173.89   5147.95   548.15   225.93     330.11     182.60        23.87   10.90
Write           7.95     8.57     16.71     1.57     1.38       9.85       10.27         1.68    1.68
Total           464.09   183.49   5165.11   549.75   227.72     339.96     192.86        25.55   11.77

Figure 8. Performance of Kriging interpolation on single devices (excluding the Native-3 implementation on the Intel MIC device).

Table 4. Performance of Game of Life on single devices (unit: second).

                   MIC (60 cores)                              CPU (Xeon E5-2670)       Nvidia GPU
Problem size       N-1     N-2      N-3      O-1      O-2      8-thread   16-thread     C2075    K20
8192 × 8192        23.15   18.22    11.23    31.33    19.53    12.03      8.13          15.36    3.25
16,384 × 16,384    56.31   82.66    41.12    131.72   79.93    48.22      32.65         58.44    12.58
32,768 × 32,768    NA      NA       NA       NA       NA       217.33     114.98        274.03   46.99

5. Experiments and results using multiple MIC processors

We also conduct experiments using multiple MIC processors to demonstrate the scalability of the parallel implementations of the two geospatial applications. For both benchmarks we have five parallel implementations on the Beacon computer cluster using multiple nodes.

We want to show the strong scalability of the parallel implementations, as in the single-device case. Therefore, the problem size is fixed for each benchmark


while the number of participating MPI processes is increased.

5.1. Comparison among five execution modes

5.1.1. Kriging interpolation. We allocate 2, 4, 8, and 16 MIC processors for 4 different implementation cases. For the Native-1 and Native-2 implementations, m × 60 MPI processes are created if m MIC processors are used. For the Native-3, Offload-1, and Offload-2 execution modes, m MPI processes are created if m MIC processors are used. As mentioned before, for each output raster grid, the generation of the 720 columns is evenly distributed among the MPI processes. Therefore, only 360 or 720 MPI processes, which execute on 360 or 720 MIC cores, are created when 8 or 16 MIC processors are allocated, respectively, for both the Native-1 and Native-2 cases.

The detailed results of the five execution modes for Kriging interpolation are listed in Table 5. It is noticed that the system does not return results when more than 2 MIC processors are used for the Offload-2 execution mode. We can see that the write time grows dramatically when more MIC processors are used for both the Native-1 and Native-2 execution modes. As mentioned before, the serialization of the writes and the arbitration among the numerous MPI processes contribute to the lengthy write process. Therefore, we only include the interpolation time, which covers both the time spent on data processing and the time spent on cross-processor communication, when comparing the performance of the four execution modes in Figure 9. We do not include Native-3 in Figure 9 because its interpolation time is significantly larger than that of the other execution modes, although it still exhibits strong scalability. It can be found that the Native-1 and Offload-1 execution modes have very similar performance for this benchmark. When multithreading is applied in each MPI process under the native MPI programming model, the performance is improved by roughly three times. This case shows that it is not enough to only parallelize the application across all the cores of the MIC processors. It is equally important to increase the parallelism on each MIC core to further improve the performance.

Table 5. Performance of Kriging interpolation under various execution modes on multiple MIC processors (unit: second).

Number of    Native-1                                    Native-2                                    Native-3
processors   Read   Interpolation*  Write    Total       Read   Interpolation*  Write    Total       Read   Interpolation*  Write   Total
2            1.24   232.43          12.24    245.90      0.57   60.43           8.82     69.82       0.33   2563.07         12.53   2575.97
4            1.27   116.34          16.44    134.05      0.51   36.54           122.53   159.59      0.33   1284.93         10.04   1305.35
8            1.23   61.48†          54.43    117.14      0.50   20.43†          240.33   261.26      0.33   730.58          9.37    740.29
16           1.31   36.74†          300.23   338.28      0.52   12.33†          210.45   223.30      0.34   377.95          9.10    387.39

Number of    Offload-1                                   Offload-2
processors   Read   Interpolation*  Write    Total       Read   Interpolation*  Write   Total
2            0.18   280.83          1.60     282.61      0.39   91.65           1.88    95.79
4            0.04   141.03          1.27     142.33      System does not return a result.
8            0.04   74.30           1.19     75.53
16           0.04   38.54           5.94     44.51

*The interpolation time includes both the time spent on data processing and the time spent on communication.
†Only 360 or 720 MIC cores are used in the computation with 8 or 16 processors, respectively.

5.1.2. Conway's Game of Life. For GOL on multiple MIC processors, three different grid sizes are tested, i.e. 8192 × 8192, 16,384 × 16,384, and 32,768 × 32,768. By observing the performance results in Table 6 and Figure 10, it can be found that the behavior is quite different from that of Kriging interpolation. First, strong scalability does not hold for all five execution modes. Although the offload execution modes are still able to halve the computation time when moving from the 2-processor implementation to the 4-processor implementation, the performance


plateaus afterwards. For the Native-1 and Native-2 execution modes, scaling almost stops when more processors are allocated. Apparently, for this communication-intensive application, there is not much performance gain when increasing the number of MIC processors from 4 to 8 and 16. When the grid is partitioned into m × 60 MPI processes on m MIC processors, the performance gain from the reduced workload on each MIC core is easily offset by the increased communication cost among the cores. Therefore, it is critical to keep a balance between computation and communication to achieve the best performance. For the Native-3 execution mode, there is a big increase in computation time from the one-processor implementation to the multiple-processor implementations. The Native-3 execution mode is not officially mentioned in the programming guide for the Beacon computer cluster. Therefore, we speculate that the library support for the Native-3 execution mode on multiple devices is premature at this moment.

Table 6. Performance of Game of Life under various execution modes on multiple MIC processors (unit: second).

Number of    8192 × 8192                                16,384 × 16,384                               32,768 × 32,768
processors   N-1     N-2    N-3     O-1     O-2         N-1     N-2     N-3      O-1     O-2          N-1      N-2      N-3      O-1      O-2
2            14.56   7.99   92.94   20.40   13.66       48.39   33.11   275.87   78.71   48.59        194.15   149.43   964.62   308.01   184.72
4            11.63   8.04   44.41   11.57   8.58        46.31   24.06   172.04   42.65   26.31        169.54   104.14   544.44   155.99   96.75
8            7.84    9.28   23.26   12.32   8.08        39.78   22.98   108.01   42.08   28.86        157.73   106.24   317.24   154.56   99.26
16           7.18    8.74   21.46   13.52   9.39        35.30   23.60   107.01   47.91   34.19        128.40   110.99   300.68   176.73   105.82

Figure 9. Performance of Kriging interpolation under various execution modes on multiple MIC processors (excluding the Native-3 execution mode).

5.2. Experiments on the MPI@MIC_Core+OpenMP execution mode

For the implementations using the Native-2 execution mode described in Section 2, the number of threads running on each MIC core is four, which is the number of threads a MIC core can physically execute in parallel. We also want to check the potential for performance improvement from running more threads on a single core. Therefore, in addition to the case of four threads, we double the number of threads to eight for the GOL benchmark. The results are listed in Table 7. It can be found that the benefit of adding more threads to the MIC cores is very marginal. For small problem sizes, e.g. 8192 × 8192, the 8-thread OpenMP implementation actually performs worse than the 4-thread implementation in most cases. For this communication-intensive benchmark, partitioning the computation into more


threads introduces more cross-thread communication overhead. For large problem sizes, it is still possible to achieve some performance benefit if each MPI process is given a relatively large amount of data, e.g. the 32,768 × 32,768 grid partitioned among 120 MIC cores.

5.3. Experiments on the Offload-1 execution mode

For the implementations using the Offload-1 execution mode described in Section 2, the number of OpenMP threads offloaded to a MIC processor by an MPI process, which runs on the CPU, is 60, i.e. one OpenMP thread per MIC core. In this experiment, we change the number of threads offloaded to each MIC processor by its MPI process from 10 to 60, as shown in Table 8 and Figure 11. In each case, when the number of threads increases from 10 to 30, the scalability holds. When more threads are scheduled, the computation time still decreases, but at a much smaller rate. In most cases, the computation time actually grows when the number of offloaded threads is increased from 50 to 60. This performance degradation may be due to the increased inter-thread communication overhead.

We can also find that the performance is almost the same for implementations using more than two MIC processors. Apparently, when four or more MIC processors are allocated, the cross-processor communication overhead becomes dominant in the computation process, so that adding more processors will not increase the overall performance.

Figure 10. Performance of Game of Life under various execution modes on multiple MIC processors: (a) 8192 × 8192; (b) 16,384 × 16,384; (c) 32,768 × 32,768.

Table 7. Performance of Game of Life using the MPI@MIC_Core+OpenMP execution mode (unit: second).

Number of    8192 × 8192             16,384 × 16,384         32,768 × 32,768
processors   4 threads   8 threads   4 threads   8 threads   4 threads   8 threads
2            7.99        10.94       33.11       32.92       149.43      110.37
4            8.04        9.03        24.06       27.94       104.14      109.79
8            9.28        8.39        22.98       25.69       106.24      100.79
16           8.74        10.77       23.60       27.11       110.99      110.67


5.4. Experiments on the distribution of MPI processes

When an MPI parallel application runs on a computer cluster whose nodes contain manycore processors such as the Xeon Phi, the distribution of MPI processes is not uniform. Some MPI processes are scheduled to cores on the same processor, while others are scheduled to different processors. Two MPI processes on the same processor are physically close to each other; two MPI processes on two separate processors are distant. This difference in distance between MPI processes causes a disparity in the inter-process communication time.

We design a simple benchmark consisting of only 2 MPI processes using the native programming model. In this benchmark, MPI process_A sends 500 MB of data to MPI process_B. Then MPI process_B returns the 500 MB of data back to MPI process_A. We have two options for running the benchmark. In Implementation_1, both MPI processes are scheduled to the same MIC processor. In Implementation_2, the two MPI processes are scheduled to two separate MIC processors. It turns out that Implementation_1 and Implementation_2 take 1.59 seconds and 2.81 seconds, respectively. Apparently, the longer distance between the two MPI processes in Implementation_2 leads to more time spent on communication.
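A minimal sketch of such a round-trip test is shown below; the 500 MB message size comes from the description above, while the harness itself (tags, buffer handling) is illustrative. Whether the two ranks land on the same card or on different cards is decided entirely by how the job is launched.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 500 * 1024 * 1024;     /* 500 MB payload */
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = (char *)malloc(nbytes);

        if (rank == 0) {                          /* process_A */
            double t0 = MPI_Wtime();
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("round trip: %.2f s\n", MPI_Wtime() - t0);
        } else if (rank == 1) {                   /* process_B */
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }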

The location difference of MPI processes can result in a performance disparity for an application when it is executed under different MPI configurations with the same total number of MPI processes. Figure 12 illustrates the different performances of the GOL benchmark using various configurations in which 120 MPI processes run on 120 MIC cores. Each MPI process contains only one thread. In the 2 × 60 configuration, 2 MIC processors are allocated, each of which hosts 60 MPI processes. When the number of MIC processors doubles, the number of MPI processes on a processor is halved. The more processors are allocated, the more cross-processor communication there is, which brings down the performance. Therefore, when the capacity of the on-board memory is not a limiting factor, it is typically a good strategy to schedule as many MPI processes to a single MIC processor as possible to minimize the cross-board communication overhead.

Table 8. Performance of Game of Life (32,768 × 32,768) using the Offload-1 execution mode (unit: second).

Number of    Number of OpenMP threads offloaded to each MIC processor
processors   10        20       30       40       50       60
2            1375.37   730.96   478.81   382.15   317.84   308.01
4            709.70    382.15   258.39   196.05   158.40   155.99
8            687.71    351.86   240.56   184.40   149.45   154.56
16           689.78    367.11   244.26   193.79   160.14   176.73

Figure 11. Performance of Game of Life (32,768 × 32,768) using the Offload-1 execution mode. The number of threads on a MIC processor is increased from 10 to 60.

Figure 12. Performance of Game of Life (32,768 × 32,768) under different MPI configurations using the Native-1 execution mode. Given 120 MPI processes, 2 × 60 means that the 120 processes are distributed over 2 MIC cards, each of which hosts 60 processes. The fewer MIC cards are used, the better the performance.

5.5. Hybrid MPI versus native MPI

Another programming/execution model that is not officially supported on the Beacon computer cluster is Native MPI@Hybrid CPU/MIC, i.e. the MPI processes run on both the CPUs and the MIC processors. The results in Section 4.2 already demonstrate the impressive performance of the latest multicore CPUs.


Therefore, it is necessary to use both processors in the applications. We first implement the Kriging interpolation on the 57 MB data set using 16 MPI processes on a single Xeon E5-2670 CPU, which supports 16 parallel threads. The total execution time is 46.02 seconds. Then we implement the same application using a 16 + 14 hybrid MPI model, i.e. 16 MPI processes on a single Xeon CPU and 14 MPI processes on 14 MIC cores of a single card.7 The total execution time is 24.75 seconds, an almost 2× speedup. Again, each MPI process contains only one thread in this sub-study.

We also evaluate the hybrid MPI programming model on a separate workstation, which contains one Xeon E5-2620 CPU and two Xeon Phi 5110P cards. On this platform, we use GOL (16,384 × 16,384) as the benchmark. The native MPI implementation with 120 MPI processes on the two MIC cards takes 30 seconds. The 12 + 120 hybrid MPI implementation, in which 12 additional MPI processes run on the single CPU, takes 27.42 seconds. The 1.1× speedup aligns with the ratio of the number of MPI processes between the hybrid model and the native model.
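In such a hybrid (symmetric) run the same MPI program executes on both the host CPUs and the MIC cards. A small sketch of how a rank might detect where it landed is shown below, assuming that the coprocessor hostnames contain the substring "mic" (common on MIC-equipped clusters); this could be used, for example, to assign differently sized work blocks to CPU and MIC ranks.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);

        /* Assumption: coprocessor hostnames look like "node001-mic0". */
        int on_mic = (strstr(name, "mic") != NULL);
        printf("rank %d runs on %s (%s)\n", rank, name, on_mic ? "MIC" : "CPU");

        MPI_Finalize();
        return 0;
    }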

6. Related work

MPI and OpenMP are two popular parallel programming APIs and libraries. MPI is primarily used for inter-node programming on computer clusters, whereas OpenMP is mainly used for parallelizing a program on a single device. Krawezik compared MPI and three OpenMP programming styles on shared-memory multiprocessors using a subset of the NAS benchmarks (CG, MG, FT, LU) (Krawezik, 2003). The experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions; however, the price of this performance is a strong programming effort on data set adaptation and inter-thread communication. Numerous benchmarks have been used to evaluate the performance of supercomputers. For example, the HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) were used to compare and evaluate the combined performance of the processor, memory subsystem and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8 (Saini et al., 2006, 2008). The portability and performance efficiency of radio astronomy algorithms are discussed in Malladi et al. (2012). Derivative calculations for radial basis functions were accelerated on one Intel MIC card (Erlebacher et al., 2014). We use two representative geospatial applications with different communication patterns for benchmarking purposes. Although they are both domain-specific applications, many applications in other domains share the same internal communication patterns as these two cases.

Schmidl et al. (2013) compared a Xeon-based two-socket compute node with a stand-alone Xeon Phi in terms of scalability and performance using OpenMP codes. Their results show significant differences in absolute application performance and scalability. The work in Saini et al. (2013) evaluated the single-node performance of an SGI Rackable computer that has Intel Xeon Phi coprocessors. The NAS parallel benchmarks and CFD applications are used for testing four programming models, i.e. offload, processor native, coprocessor native and symmetric (processor plus coprocessor). They also measured the latency and memory bandwidth of the L1 and L2 caches and the main memory of the Phi; measured the performance of intra-node MPI functions (point-to-point, one-to-many, many-to-one, and all-to-all); and measured and compared the overhead of OpenMP constructs. Compared with Saini et al. (2013), our work presents results on a single MIC device as well as on multiple MIC cards. Further, we discuss multiple variants of the native and offload models.

7. Conclusions

In this work, we conduct a detailed study of the performance and scalability of Intel MIC processors under different parallel programming models. Between the two programming models, i.e. native MPI on the MIC processors and offload to the MIC processors, the native MPI programming model typically outperforms the offload model. It is very important to further improve the parallelism inside each MPI process running on a MIC core for better performance. For embarrassingly parallel benchmarks such as Kriging interpolation, multithreading inside each MPI process can achieve a three-times speedup compared with the single-thread MPI implementation. Because the physical distance between two MPI processes may differ under various MPI distributions, it is typically a good strategy to schedule the MPI processes to as few MIC processors as possible to reduce the cross-processor communication overhead, given the same number of MPI processes. Finally, we evaluate the hybrid MPI programming model, which is not officially supported by the Intel MPI compiler. Through benchmarking, it is found that the hybrid MPI programming model, in which both the CPU and the MIC are used for processing, is able to outperform the native MPI programming model.

Acknowledgements

This research used resources of the Beacon supercomputer, which is a Cray CS300-AC Cluster Supercomputer. The Beacon Project is supported by the National Science Foundation and the State of Tennessee. The authors would also like to thank Nvidia Corporation for the GPU donations.


Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Notes

1. Intel Many Integrated Core Architecture, see http://www.intel.com.
2. See http://www.jics.tennessee.edu/aace/beacon
3. Top 500 Supercomputer Sites, see http://www.top500.org.
4. The Green 500, see http://www.green500.org
5. See http://www.tacc.utexas.edu/resources/hpc/stampede
6. See http://www.esri.com/software/arcgis/
7. When we allocate more than 14 MPI processes on the MIC card, the result is incorrect.

References

Erlebacher G, Saule E, Flyer N and Bollig E (2014) Acceleration of derivative calculations with application to radial basis function: finite-differences on the Intel MIC architecture. In: Proceedings of the 28th ACM International Conference on Supercomputing (ICS'14), pp. 263–272.

Gardner M (1970) Mathematical games – the fantastic combinations of John Conway's new solitaire game "life". Scientific American 223: 120–123.

Jensen JR (2004) Introductory Digital Image Processing: A Remote Sensing Perspective, 3rd edn. Upper Saddle River, NJ: Prentice Hall.

Kirk DB and Hwu WW (2012) Programming Massively Parallel Processors: A Hands-on Approach, 2nd edn. Burlington, MA: Morgan Kaufmann.

Krawezik G (2003) Performance comparison of MPI and three OpenMP programming styles on shared memory multiprocessors. In: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'03), pp. 118–127.

Malladi R, Dodson R and Kitaeff V (2012) Intel Many Integrated Core (MIC) architecture: portability and performance efficiency study of radio astronomy algorithms. In: Proceedings of the 2012 Workshop on High-Performance Computing for Astronomy Date (Astro-HPC'12), pp. 5–6.

NVIDIA Corporation (2009) NVIDIA's next generation CUDA compute architecture: Fermi. White paper V1.1. Available at: http://www.nvidia.com.

NVIDIA Corporation (2012) NVIDIA's next generation CUDA compute architecture: Kepler GK110. White paper V1.0. Available at: http://www.nvidia.com.

Saini S, Ciotti R, Gunney B, et al. (2006) Performance evaluation of supercomputers using HPCC and IMB benchmarks. In: Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS).

Saini S, Ciotti R, Gunney BTN, et al. (2008) Performance evaluation of supercomputers using HPCC and IMB benchmarks. Journal of Computer and System Sciences 74: 965–982.

Saini S, Jin H, Jespersen D, et al. (2013) An early performance evaluation of many integrated core architecture based SGI Rackable computing system. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'13), pp. 94:1–94:12.

Schmidl D, Cramer T, Wienke S, Terboven C and Müller MS (2013) Assessing the performance of OpenMP programs on the Intel Xeon Phi. In: Proceedings of Euro-Par 2013 Parallel Processing (Lecture Notes in Computer Science, vol. 8097). New York: Springer, pp. 547–558.

Author biographies

Miaoqing Huang has been an Assistant Professor in the Department of Computer Science and Computer Engineering, University of Arkansas, since 2010. His research interests include manycore computer architecture, high-performance computing, and hardware-oriented security. He earned his bachelor's degree in electrical engineering from Fudan University in 1998 and his doctoral degree in computer engineering from The George Washington University in 2009.

Chenggang Lai received the BS degree in electrical engineering from Shandong University, China, in 2012. He is currently pursuing a PhD degree in computer engineering at the University of Arkansas. His research interests include high-performance computing and algorithm design for big data.

Xuan Shi is an Assistant Professor in the Department of Geosciences at the University of Arkansas with expertise in distributed GIS, GIS interoperability and semantic Web services, and high-performance geocomputation, covering topics such as vector geometric calculation, spatial modeling over raster grids, and processing and analytics on satellite imagery and aerial photos. His research has been supported by NSF, DOE and NIH and awarded XSEDE allocations on the supercomputers Kraken, Keeneland and Beacon.

Zhijun Hao received his BS degree in electrical engineering from Hangzhou Dianzi University in 2006 and his PhD degree in computer science from Fudan University in 2014. His research interests include operating systems and system virtualization, focusing on reusing device drivers across different operating systems and ISAs.

Haihang You is a Professor at the Institute of Computing Technology, Chinese Academy of Sciences. Prior to joining ICT, Dr You was a research scientist at the National Institute for Computational Sciences at Oak Ridge National Laboratory and at the Innovative Computing Laboratory at the University of Tennessee. Dr You's research interests are in the field of high-performance computing, specifically parallel algorithms, numerical algorithms, workload analysis, and performance optimization and autotuning.
