Authors: Barry, William and Vacharasintopchai, Thiti
Issue Date: 5-Dec-2001
Type: Article
Series/Report no.: Proc. 8th East Asia-Pacific Conference on Structural Engineering and Construction (EASEC-8), Singapore, December 5-7, 2001
URI: http://dspace.siu.ac.th/handle/1532/132


A PARALLEL IMPLEMENTATION OF THE ELEMENT-FREE GALERKIN METHOD

W. Barry¹ and T. Vacharasintopchai²

ABSTRACT: This work focuses on the application of parallel processing to element-free Galerkin method analyses, particularly in the formulation of the stiffness matrix, the assembly of the system of discrete equations, and the solution for nodal unknowns. The objective is to significantly reduce the analysis time while retaining high efficiency and accuracy. Several relatively low-cost Intel Pentium-based personal computers are joined together to form a parallel computer. The processors communicate via a local high-speed network using the Message Passing Interface. Load balancing is achieved through the use of a dynamic queue server that assigns tasks to available processors. Benchmark problems in 3D structural mechanics are analyzed to demonstrate that the parallelized computer program can provide substantially shorter run time than its serial counterpart, without loss of solution accuracy.

KEYWORDS: meshless method, parallel processing, element-free Galerkin method, EFGM, queue server, Beowulf, solid mechanics

1. INTRODUCTION

In performing the finite element analysis of structural components, meshing, the process of discretizing the problem domain into small sub-regions or elements with specific nodal connectivities, can be a tedious and time-consuming task. Although relatively simple geometric configurations may be meshed automatically, complex geometric configurations require manual preparation of the mesh. The element-free Galerkin method (EFGM), one of the recently developed meshless methods, avoids the need for meshing by employing a moving least-squares (MLS) approximation for the field quantities of interest. With the EFGM, the discrete model of the problem domain is completely described by nodes and a description of the problem domain boundary. This is a particular advantage for problems involving propagating cracks or large deformations, since no remeshing is required at each step of the analysis.
Detailed formulations of the MLS approximation functions and the application of the EFGM to problems in solid mechanics may be found in [1]. However, the advantage of avoiding the requirement of a mesh does not come cheaply, as the EFGM is much more computationally expensive than the finite element method (FEM). The increased computational cost is especially evident in three-dimensional and non-linear applications of the EFGM, because the MLS shape functions are formulated by a least-squares procedure at each integration point. This computational cost is the predominant drawback of the EFGM.

Parallel processing has long been an available technique for improving the performance of scientific computing programs. Typically, a parallel computer program employs the 'divide and conquer' paradigm [2], which involves partitioning a large task into several smaller tasks that are then assigned to available computer processors. Efficient load balancing ensures that all processors are busy working on assigned tasks as long as unfinished tasks remain. The most common approach taken in computational mechanics is domain decomposition [3], a method of static load balancing in which the tasks are identified prior to the analysis and assigned to each processor, along with any data that may be required. Due to the complex nodal connectivities that arise in the EFGM, domain decomposition may not be the most efficient approach, and thus a dynamic class of load balancing based on the concept of a queue server is employed in this work.

¹ Asian Institute of Technology, Thailand, Assistant Professor
² Asian Institute of Technology, Thailand, Graduate Student

Paper No. 1068, Proc. 8th East Asia-Pacific Conference on Structural Engineering and Construction (EASEC-8), Singapore, December 5-7, 2001.

2. THE AIT BEOWULF

The effort to deliver low-cost, high-performance computing platforms to scientific communities has been ongoing for many years. A network of personal computers is attractive for this purpose since it has the same architecture as a distributed-memory multicomputer system [4]. Many research groups have assembled commodity off-the-shelf PCs and fast LAN connections to build parallel computers. Parallel computers of this type, termed Beowulf computers after the NASA project of the same name [5], are suitable for coarse-grained applications that are not communication intensive, because of the high communication start-up time and the limited bandwidth of the underlying network architectures [6]. The AIT Beowulf, a four-node Beowulf-class parallel computer, was assembled based on the guidelines in [5] and [7]. Red Hat Linux 6.0, including both the server and workstation operating system packages, was installed on each node. The AIT Beowulf is a message-passing multiple-instruction, multiple-data (MIMD) machine, and thus a message-passing infrastructure is needed.
The mpich library [8], the most widely used free implementation of the Message Passing Interface, was chosen for the AIT Beowulf. Meschach, a powerful matrix computation library [9], is employed for the serial matrix operations performed on each processor.

3. THE QUEUE SERVER

Load balancing plays a crucial role in the performance of parallel software. If unbalanced workloads are assigned to the processors, some may finish their work and be forced to wait for the other processors to finish, leading to reduced efficiency and increased run times. In this work, a dynamic load-balancing agent named Qserv is developed within the framework of the EFGM. Qserv balances the computational load among the processors of the AIT Beowulf at run time by acting as a clerk that directs queued tasks to available processors. When a processor finishes a task, it requests another task from Qserv, which continues assigning tasks to processors until no unfinished tasks remain. Figure 1 presents a flowchart of the queue server designed and implemented in the current work.

To separate dynamic workload allocation from normal operations, communication between Qserv and the processors is carried out through the UNIX socket mechanism developed at the University of California at Berkeley [4]. When the Qserv process is initiated, it creates a socket to which the processors can connect simultaneously. Initially, the total number of unprocessed subtasks known to Qserv is zero, and one processor, usually the master, must inform Qserv of the actual value. This number is stored in the max_num variable and can be altered by processors through the SET_MAX_NUM request. A processor can ask Qserv, through the GET_NUM request, for a subtask to work on; it is then assigned the numerical identifier of an unprocessed subtask, ranging from zero to max_num. When the unprocessed subtasks are exhausted, an ALL_DONE signal is sent to the requesting processor. During the execution of Qserv, a process can also reset the subtask identifier counter with the RESET_COUNTER request. Qserv continues serving tasks to processors until the TERMINATE signal is received.


[Figure 1: Flowchart of the Queue Server. After initializing its socket, Qserv sets runstate = READY, count = 0, and max_num = 0, then loops: it accepts client connection requests (updating max_fd) and, for each client descriptor fd <= max_fd, receives request_msg (closing the connection and updating max_fd on a receive error) and processes the request. TERMINATE sets runstate = TERMINATE and count = 0; RESET_COUNTER sets count = 0; SET_MAX_NUM reads the new max_num from the client; GET_NUM sends 'count' to the client and increments it while count <= max_num, otherwise sends ALL_DONE. The loop ends and the socket is closed when runstate = TERMINATE. Legend: fd = current client identifier; max_fd = number of client connections maintained; runstate = run state of the server program; count = current counter value; max_num = maximum counter value; request_msg = current client's request message.]

4. SOFTWARE IMPLEMENTATION

When a parallel program is run, each parallel processor holds one copy of the executable program, termed a process. One process is designated the master process, while the remaining processes are worker processes; the MPI default process identifier of the master is 0. In addition to performing the basic tasks of a worker process, the master process coordinates the tasks among all the workers. The master process is therefore assigned to run on the server node, which is the most powerful processor in the AIT Beowulf in terms of both processor speed and core memory. A flowchart of the main process computer code for both the master node and the worker nodes is presented in Figure 2. The analysis procedure can be grouped into five phases: the pre-processing phase, the stiffness matrix formulation phase, the force vector formulation phase, the solution phase, and the post-processing phase. A custom parallel Gaussian elimination equation solver, developed from the algorithm presented in [10], is employed in the solution phase, since the available public-domain parallel equation solvers are typically efficient only for banded, sparse matrices and are thus poorly suited to the dense EFGM global stiffness matrix.

[Figure 2: Flowcharts of the Master and Worker Modules. Master process: START; dd_input (process the input file); broadcast the processed input data; connect to the queue server; ddefg_stiff (form the stiffness matrix); form the concentrated load vector; ddforce (form the distributed load vector); assemble the global force vectors; master_ddsolve (apply B.C.'s then solve equations); ddpost (post-process for desired displacements and stresses); write the post-processed results to the output file; disconnect from the queue server; END. Worker processes: START; receive the processed input data; connect to the queue server; ddefg_stiff; ddforce; worker_ddsolve (solve equations); ddpost; write nodal displacements to the output file; disconnect from the queue server; END. Communication: the input data are broadcast from the master to the workers; the stiffness and force contributions are gathered; the master and workers collaborate in the solution phase; the post-processing results are gathered.]

5. NUMERICAL RESULTS

Several 3D elastostatic examples are solved to illustrate the performance and to verify the validity of the parallel EFGM analysis code. The results obtained for each analysis closely matched the analytical solutions [11], as shown in previous serial EFGM works [1]. Thus, the main focus of these numerical examples is to investigate the run time and efficiency of the parallel implementation of the EFGM. Four test cases, with increasing numbers of degrees of freedom, are analyzed using parallel processor counts ranging from one to four. The specific test cases are: 1) linear displacement patch test (336 d.o.f.); 2) cantilever beam with end loading (825 d.o.f.); 3) pure bending of a thick arch (975 d.o.f.); and 4) perforated tension strip (2850 d.o.f.). The speedup of the overall solution process, of the computation and assembly of the global stiffness matrix, and of the solution of the discrete system of equations are shown in Figures 3 to 5, respectively. When the number of degrees of freedom is less than 1,000, Figure 4 shows that the speedup of the stiffness matrix formulation phase gradually approaches the theoretical limit, which equals the number of processors used in the analysis. However, the speedup begins to decrease when the number of degrees of freedom exceeds 1,000, apparently due to the initiation of memory page file swapping on each processor. This may occur since the current implementation requires full storage of the global stiffness matrix on each processor. Figure 5 shows that the optimal points, in terms of speedup, for the parallel Gaussian elimination solver are near 350, 550, and 600 equations for two, three, and four processors, respectively. When the number of equations is greater than 1,000, the speedup of the solver also begins to decrease, likely for the same reason as in the stiffness matrix formulation phase: memory page file swapping commences. Hence, it can be concluded that the current implementation is scalable up to about 1,000 degrees of freedom.

[Figure 4: Speedup of the Stiffness Computation Module. Stiffness speedup (0 to 4.5) versus degrees of freedom (0 to 3000) for one to four processors (NP1 to NP4).]

[Figure 5: Speedup of the Gaussian Elimination Solver. Solver speedup (0 to 2.5) versus degrees of freedom (0 to 3000) for one to four processors (NP1 to NP4).]

[Figure 3: Overall Speedup of the EFGM Analysis Code. Overall speedup (0 to 4.5) versus degrees of freedom (0 to 3000) for one to four processors (NP1 to NP4).]

6. CONCLUSION

AIT Beowulf, a high-performance yet low-cost parallel computer assembled from a network of commodity personal computers, was established. A parallel implementation of the element-free Galerkin method was developed on this platform. Four desired properties of parallel software, namely concurrency, scalability, locality, and modularity, were taken into account during the design of the parallel version of the element-free Galerkin method. A dynamic load-balancing algorithm was utilized for the computation of the structural stiffness matrix and the external force vector, and a parallel Gaussian elimination algorithm was employed in the solution for the nodal unknowns (displacements). Several numerical examples showed that the displacements and stresses obtained from the parallel implementation closely matched the analytical solutions and exactly matched the solutions obtained by the sequential element-free Galerkin method software. With Qserv, a dynamic load-balancing agent, high scalability was obtained for three-dimensional structural mechanics problems up to approximately 1,000 degrees of freedom. However, scalability was not achieved for larger problems, because the full stiffness matrix had to be stored on each processor while only 64 megabytes of memory were available on each worker node. The parallel Gaussian elimination equation solver took less time to solve the system of equations than its sequential counterpart. With larger systems of equations, the efficiency of the parallel equation solver tended to increase because of the increased computation-to-communication ratio. Nevertheless, in the current implementation of the parallel EFGM analysis code, high efficiency was not obtained when the number of equations exceeded 1,000. Refinement of the memory management algorithms is recommended so that the parallel EFGM analysis code may scale to problem sizes much larger than 1,000 degrees of freedom.

7. REFERENCES

[1] T. Belytschko, Y. Krongauz, D. Organ, M. Fleming, and P. Krysl, "Meshless methods: An overview and recent developments", Computer Methods in Applied Mechanics and Engineering, Vol. 139, No. 1-4, pp. 3-47, 1996.

[2] H. Adeli and O. Kamal, Parallel Processing in Structural Engineering, Elsevier Science Publishers Ltd., U.K., 1993.

[3] K. T. Danielson, S. Hao, W. K. Liu, A. Uras, and S. Li, "Parallel computation of meshless methods for explicit dynamic analysis", accepted for publication in International Journal for Numerical Methods in Engineering, 1999.

[4] C. Brown, UNIX Distributed Programming, Prentice Hall International (UK) Limited, U.K., 1994.

[5] P. Merkey, "Beowulf: Introduction & overview", Center of Excellence in Space Data and Information Sciences, University Space Research Association, Goddard Space Flight Center, Maryland, USA, September 1998, URL: http://www.beowulf.org/intro.html.

[6] M. Baker and R. Buyya, "Cluster computing: The commodity supercomputer", Software: Practice and Experience, Vol. 29, No. 6, pp. 551-576, 1999.

[7] J. Radajewski and D. Eadline, "Beowulf HOWTO", November 1998, URL: http://www.linux.org/help/ldp/howto/Beowulf-HOWTO.html.

[8] W. Gropp and E. Lusk, User's Guide for mpich, a Portable Implementation of MPI, Technical Report ANL-96/6, Argonne National Laboratory, USA, 1996.

[9] D. E. Stewart and Z. Leyk, Meschach: Matrix Computations in C, Proceedings of the Centre for Mathematics and Its Applications, Vol. 32, Australian National University, 1994.

[10] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Company, Inc., USA, 1994.

[11] S. P. Timoshenko and J. N. Goodier, Theory of Elasticity, 3rd ed., McGraw-Hill, 1970.