
Computers & Structures Vol. 45, No. 5/6, pp. 817-831, 1992. 0045-7949/92 $5.00 + 0.00. Printed in Great Britain. © 1992 Pergamon Press Ltd

PARALLEL FEM ALGORITHMS BASED ON RECURSIVE SPATIAL DECOMPOSITION-I. AUTOMATIC

MESH GENERATION

M. SAXENA† and R. PERUCCHIO‡

†Solid Mechanics Laboratory, GE Corporate Research and Development, Schenectady, NY 12301, U.S.A.

‡Department of Mechanical Engineering, University of Rochester, Rochester, NY 14627, U.S.A.

(Received 5 November 1991)

Abstract-This paper discusses an automatic meshing scheme that is suitable for parallel processing. Meshes derived from solid models through recursive spatial decompositions inherit the hierarchical organization and the spatial addressability of the underlying grid. These two properties are exploited to design a meshing algorithm capable of operating in parallel (concurrent) processing environments. The concept of a meshing operator for parallel processing is defined and algorithms for various stages of the automatic meshing scheme are presented. A systematic simulation of fine- and coarse-grain parallel configurations is used to evaluate the performance of the meshing scheme. A companion paper focuses on parallel processing for the analysis of these automatically derived meshes via hierarchical substructuring.

1. INTRODUCTION

The importance of parallel processing in computational mechanics and, in particular, in the analysis stage of the FEM is well documented; see, for example, the collection of papers edited by Noor [1]. However, the problem of automatic mesh generation in a parallel processing environment has not received much attention. Research on parallel algorithms for the FEM has concentrated mainly on the development of solution algorithms for multi-processor architectures [2-7]. The problem of parallel automatic mesh generation has been addressed for the first time in [8] and more recently in [9]. In both cases, however, the domain to be meshed is a regular quadrilateral network, that is, a collection of four-sided convex polygons with the additional requirement that each vertex belongs at most to four distinct polygons. Several algorithms for automatic mesh generation for complex three-dimensional domains described in a solid modeling system are known (see references in [10] and, for more recent work, [11, 12]), but, with the exception of a preliminary study [13], work on parallelizing automatic meshing for solid models has never been reported.

Current research in FEM automation at the University of Rochester addresses the problem of parallel processing for automatic mesh generation, analysis, and integration of meshing and analysis into a self-adaptive analysis system. Central to our approach is the notion of recursive spatial decomposition (RSD): the domain to be analyzed, described in a solid modeling system, is recursively decomposed via RSD into a collection of regular cells which are conveniently stored in a tree data structure. The RSD and the underlying tree structure form the basis of the automatic meshing algorithm described in [14, 15] and of the automatic FEM substructuring scheme, hereafter denoted as hierarchical substructuring, described in [16]. Self-adaptive analysis systems for two- and three-dimensional problems based on RSD meshing and hierarchical substructuring are described in [16] and [17], respectively.

In the present paper we discuss the parallel implementation of the RSD meshing algorithm, while in a companion paper we turn our attention to the parallel implementation of hierarchical substructuring. The two algorithms in question derive their parallelism from the same underlying RSD structure. There is, however, a fundamental difference in the degrees of parallelism achievable in meshing and analysis which justifies, at least at this stage of our work, addressing the two problems separately. As shown in this paper, the RSD meshing algorithm can be recast as a finite set of totally disjoint tasks, i.e., no communication or synchronization between tasks is required. This property, however, does not hold true for the analysis algorithm. As for any other FEM substructuring scheme, only a subset of the operations involved in hierarchical substructuring can be performed as totally disjoint tasks, while other operations require both communication and synchronization.

We begin by introducing a strict definition of algorithmic parallelism in automatic meshing. We then show that the RSD meshing algorithm developed at Rochester [14, 15] satisfies the requirements for parallelism. Next, we perform a simulation study


of three alternative parallel configurations of this meshing algorithm, including both fine- and coarse-grain architectures. For each configuration, theoretical expressions for the speed-up factor and efficiency are derived in terms of machine parameters (numbers and allocation of processors) and problem-related parameters (average CPU time for each key operation). These expressions are then evaluated for several test problems using average CPU times extracted from a sequential implementation of the meshing algorithm running on a DEC MicroVAX II under VMS. Results indicate that fine-grain architectures produce high speed-ups but low efficiency, while coarse-grain configurations give reasonable speed-ups and high efficiency.

This paper is organized as follows: Sec. 2 presents a brief overview of the classification of computer architectures and of the key issues in the design of algorithms for parallel processing environments. Section 3 introduces a framework for defining paral- lelism in mesh generation algorithms. In Sec. 4 the RSD meshing algorithm is reviewed and evaluated with respect to parallelism. Alternative parallel configurations for the meshing algorithm are ana- lyzed in Sec. 5. Finally, Sec. 6 discusses the advan- tages, limitations, and open issues related to the meshing algorithm and its parallel implementations.

2. CLASSIFICATION OF COMPUTER ARCHITECTURES

Flynn's classification [18, 19] is the most widely used scheme for categorizing digital computers. This categorization is based on the concurrency between the instruction sets and the data streams. Four categories are identified:

1. Single instruction stream-single data stream (SISD): each instruction operates sequentially on a single data element. This category includes the conventional scalar machines, e.g. the VAX 11/780, IBM 370, etc.

2. Single instruction stream-multiple data stream (SIMD): instructions operate sequentially on multiple data elements. These machines are classified as parallel or, more appropriately, vector machines because each instruction operates on all the elements of a data vector. Common examples are the CRAY 1 and Convex C120.

3. Multiple instruction stream-single data stream (MISD): multiple instructions execute concurrently on a single data stream. Although a physical realization of such a concurrent scheme is absurd, this category is usually included for the sake of completeness.

4. Multiple instruction stream-multiple data stream (MIMD): multiple instruction sets operate concurrently on multiple data streams.† In the most general realization of the MIMD architecture the instructions operate on both scalar as well as vector data elements. Examples of vector concurrent machines are the CRAY 2, ALLIANT FX/8, Convex C240, and the Hypercube machine.

† It is assumed that identical data streams are not executed by multiple instructions concurrently.

In this study we consider MIMD architectures only, but we generalize the definition of MIMD systems to include networks of engineering workstations. Hereafter, the terms 'concurrent' and 'parallel' are used interchangeably. MIMD machines can be loosely coupled or tightly coupled depending upon the degree of interaction between processors and memory:

• In loosely coupled MIMD architectures the interaction between the processors and the memory is explicitly controlled by passing messages (or data packets) through the interconnecting communication channels. A host processor controls the flow of messages. For this reason these machines are also referred to as message passing machines. A common example of such an architecture is the Hypercube machine.

• In tightly coupled MIMD machines all the processors are intimately connected through mutual sharing of a global memory, e.g., the ALLIANT FX/8.

The multi-processor machines are also classified on the basis of (i) processor granularity, (ii) memory organization, and (iii) connection topology [20]:

• Processor granularity describes the number of processors available on the MIMD machine and their maximum throughput.

• Memory organization classifies MIMD machines based on how the memory is organized to provide each processor with fast access to the requested data. Each processor may have a small independent memory or may share a global memory. The memory is either (i) hierarchically organized to bridge the bandwidth gap between the fast processor and the slow memory by using a high speed cache, or (ii) organized in banks that are usually interleaved and provide access to the different processors through explicit switches.

• Connection topology refers to the connection of the processors. Common topologies are (i) the bus connection: each processor communicates with the others through a common bus; (ii) the ring topology: each processor communicates with its two immediate neighbors; (iii) the mesh topology: each processor communicates with a different number of processors depending upon its location in the mesh; (iv) the star-shaped connection: all the processors share a common global memory; and (v) the hypercube topology: each of the 2^n processors is located at a vertex of the n-dimensional hypercube and communicates with different processors through the communication channels.
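As a concrete illustration of the hypercube topology just mentioned, the following minimal Python sketch (ours, not part of the original paper) lists the communication partners of a processor: in an n-dimensional hypercube each of the 2^n processors is labeled by an n-bit number and is linked to the n processors whose labels differ in exactly one bit.

def hypercube_neighbors(node: int, n: int) -> list[int]:
    # Neighbors of `node` in an n-cube: flip each of the n label bits once.
    return [node ^ (1 << bit) for bit in range(n)]

# Example: in a 3-cube (8 processors), node 5 (binary 101) is connected
# to nodes 4 (100), 7 (111), and 1 (001).
print(hypercube_neighbors(5, 3))   # [4, 7, 1]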


2.1. Key issues in the design of algorithms for MIMD machines

Assume that, in the limit, an MIMD machine with n processors will reduce the computational time by a factor of n. In this case, if the execution time on a serial machine is T_s, the multi-processor machine solves the same problem in time T_m = T_s/n. This leads to the definition of speed-up and efficiency in the context of parallel processing. The speed-up ρ is defined as

ρ = T_s / T_m*, (1)

where T_m* is the actual execution time on the parallel machine. The efficiency η is defined as

η = ρ/n. (2)

Clearly, a maximum efficiency of 100% is obtained when the concurrent implementation results in ρ = n, i.e., T_m* = T_m. Theoretically, maximum efficiency can be obtained only with balanced processor loads, that is, when the computational load is equally distributed over all n processors. For unbalanced loads, some processors will execute in time greater than T_m, while other processors are idle, and thus the speed-up will be less than n.
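The two definitions above reduce to a pair of one-line formulas; the short Python sketch below (our notation, with t_serial standing for T_s and t_parallel for T_m*) makes the arithmetic explicit.

def speedup(t_serial: float, t_parallel: float) -> float:
    # rho = T_s / T_m*
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, n_proc: int) -> float:
    # eta = rho / n
    return speedup(t_serial, t_parallel) / n_proc

# A perfectly balanced 8-processor run of a 400 s serial job that finishes
# in 50 s gives rho = 8.0 and eta = 1.0 (100%).
print(speedup(400.0, 50.0), efficiency(400.0, 50.0, 8))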

However, even with balanced processor loads, ρ is usually less than n because of the overhead costs associated with the parallel architecture. These overhead costs are usually attributed to one or both of the following:

1. Synchronization: In most of the shared memory MIMD architectures conflicts arise when several processors try to gain access to the same variables (i.e., access the same memory location) simultaneously. Such a conflict is usually resolved by using synchronization mechanisms which allow only one processor to access the variable while every other processor must wait for the memory location to become free for access. If several processors share a global memory, the delays due to synchronization can be very high.

2. Communication: In the message passing MIMD machines data is shared between processors by explicitly communicating it over the channels. Such communication is either (a) synchronous: two processors exchange the data when both are ready to do so, or (b) asynchronous: a processor communicates the data to the other processor irrespective of whether the receiving processor is ready to receive it or

† Hereafter, we follow the notion that solids are r-sets, that is, subsets of E³ that are bounded, closed, regular, and semianalytic. R-sets are algebraically closed under regularized set intersection, union, and difference, denoted ∩*, ∪*, −* [21]. In a solid modeling system, r-sets are described either in a constructive solid geometry (CSG) representation, or in a boundary representation (B-Rep) scheme, or both [22].

‡ Two solids are quasi-disjoint if their interiors are disjoint.

not. Hardware considerations make asynchronous communication impractical because no prior estimates can be made of the amount of data that will be transmitted over the communication channels. Synchronous communication involves idle time for the processors as the transmitting processor has to wait for the receiving processor to be ready for the data.

A common strategy for reducing the synchronization and communication delays is to use a small number of processors with small communicating distances. An alternative approach is to design software such that these overheads are minimized. Thus, the following key issues must be addressed for the design of efficient parallel algorithms:

1. The distribution of the work load across differ- ent processors must be well balanced.

2. Inter-procedural data dependency between tasks assigned to separate processors must be avoided.

3. If (2) is not possible, the assignment of the various tasks to different processors must be such that the communication distance is minimized.

3. A FRAMEWORK FOR DEFINING PARALLELISM IN MESHING

Let f represent an automatic meshing operator whose domain S is a class of solids describable in a solid modeling system.† Then M_i = f(S_i) represents the finite element decomposition induced by f in the solid S_i ∈ S. Note that M_i is a topological complex consisting of the assembly of a finite set of quasi-disjoint‡ cells called elements. Thus

M_i = Σ_j E_j, (3)

where E_j denotes the generic element and Σ the assembly operator.

Definition

The meshing operator f lends itself to parallel processing if the following two conditions are satisfied:

1. S_i can be decomposed into a finite set of disjoint solids (s_1, s_2, ..., s_n) such that

S_i = ∪*_j s_j, (4)

where ∪* denotes the regularized union operator.

2. f can be applied to each element of the set (s_1, s_2, ..., s_n) to produce a set of meshes (m_1, m_2, ..., m_n) such that

Σ_j m_j = M_i, (5)

where, as before, Σ denotes the assembly operator.

Although, in a broader sense, conditions (1) and (2)

are necessary and sufficient for parallelism, it is

Page 4: Parallel fem algorithms based on recursive spatial decomposition—I. Automatic mesh generation

820 M. SAXENA and R. PERUCCHIO

appropriate to narrow down the above definition by adding the following:

3. ∀ s_j ∈ (s_1, s_2, ..., s_n), m_j is strictly a function of s_j only. More precisely, no exchange of data is needed between s_j and s_i, for i ≠ j, to ensure that m_j = f(s_j) exists and eqn (5) is satisfied.

Clearly, condition (3) has the important effect of eliminating the overhead costs due to inter-procedural data dependency mentioned in Sec. 2.1. Condition (2) should actually be expanded into

(2.1) f(s_j) exists ∀ s_j ∈ (s_1, s_2, ..., s_n), (6)

and

(2.2) Σ_j f(s_j) = f(S_i). (7)

Condition (2.1) is not as trivial as it may appear. In fact it is quite possible that operator f may not be applicable to one or more elements of (s_1, s_2, ..., s_n) even though condition (1) is satisfied. Let, for example, S¹ and S² denote the classes of solids bounded by manifolds and non-manifolds, respectively. Let also f¹ indicate a meshing function whose domain coincides with S¹ only. It is easy to prove that ∀ S_i ∈ S¹ there exists a finite set of disjoint elements (s_1, s_2, ..., s_n) such that condition (1) is satisfied and one or more elements of the set belong to S² (the remaining elements belong to S¹). Clearly, since f¹ does not apply to elements of S², condition (2.1) is not satisfied.

The simplest way to ensure satisfaction of (2.1) is to require that all elements of (s_1, s_2, ..., s_n) belong to exactly the same domain S of which S_i is a member. If this is not possible, then f must be applicable to all other domains which may contain elements of (s_1, s_2, ..., s_n).

Condition (2.2) requires that the interfaces between the m_j meshes have identical topologies. Let s_1 and s_2 be two adjacent solids sharing an interface R_12 given by

R_12 = bs_1 ∩ bs_2, (8)

where b denotes the boundary operator and ∩ is the standard intersection operator between sets. The meshing operator f transforms s_1 and s_2 into the discretized FEM models m_1 and m_2 and, in general, modifies R_12 to produce the new topologies T¹(R_12) on s_1 and T²(R_12) on s_2. To ensure compatibility, the induced topologies must be identical, that is

T¹(R_12) = T²(R_12). (9)

In short, condition (2.2) requires that shared interfaces between m_j meshes must always have identical vertices and edges.

† Similar, but less efficient, classification procedures can be derived for solids described in a B-Rep scheme [14].

If f satisfies conditions (1)-(3) and n independent processors are available, then a parallel meshing procedure based on f can be organized as follows: (i) S_i is subdivided into n disjoint subdomains s_j and each subdomain is assigned to an individual processor; (ii) operating in parallel on all subdomains, f produces n local meshes m(s_j); and (iii) the local meshes are assembled together to yield the global mesh f(S_i).
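The three-step organization just described can be sketched in a few lines of Python; the sketch below uses a standard process pool as a stand-in for the n independent processors, and mesh_subdomain and assemble are hypothetical placeholders for the meshing operator f and the assembly operator (they are not part of the paper).

from concurrent.futures import ProcessPoolExecutor

def mesh_subdomain(subdomain):
    # Placeholder: apply the meshing operator f to one disjoint subdomain
    # s_j and return the local mesh m(s_j).
    return {"subdomain": subdomain, "elements": []}

def assemble(local_meshes):
    # Placeholder: merge the local meshes into the global mesh f(S_i);
    # by condition (2.2) the shared interfaces already match.
    return [e for m in local_meshes for e in m["elements"]]

def parallel_mesh(subdomains, n_processors):
    # (i) each subdomain is assigned to a processor, (ii) f is applied in
    # parallel with no inter-task communication, and (iii) the local
    # meshes are assembled into the global mesh.
    with ProcessPoolExecutor(max_workers=n_processors) as pool:
        local_meshes = list(pool.map(mesh_subdomain, subdomains))
    return assemble(local_meshes)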

4. AUTOMATIC RSD MESHING IN A PARALLEL PROCESSING ENVIRONMENT

In this section we outline the RSD meshing procedure developed at the University of Rochester and then prove its applicability to parallel processing. For a detailed description of the various operations performed in this procedure see [14, 15].

Stage 1: The solid S, described in a CSG representation, is enclosed in a suitable 'box' and the box is recursively decomposed into octal cells which are classified as being wholly 'IN' S, wholly 'OUT' of S, or undetermined ('?'). This procedure uses the cell classifier developed by Lee and Requicha [23]. Undetermined cells are further subdivided and classified until a pre-specified level of subdivision (the 'resolution' level) is reached. IN cells are subdivided to resolution level without further classification and OUT cells are discarded.† Stage 1 ends with special operations that reclassify resolution level '?' cells as IN, OUT, or NIO (Neither-In-nor-Out). The hierarchical collection of cells produced by this kind of recursive spatial decomposition can be represented by logical trees whose nodes have eight sons, hence the popular name 'octree'. The root node represents the enclosing box while the terminal nodes are associated with the cells at resolution level. Figure 1 shows a 2-D example. In this case the RSD proceeds by quadrants and the hierarchy of cells is represented by a quadtree. From the definition of RSD it follows that [14]

S ⊂ ((∪* IN Cells) ∪* (∪* NIO Cells)). (10)

This property ensures that a valid finite element mesh can be induced in S by operating only on IN and NIO cells.
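A minimal Python sketch of the stage 1 recursion is given below. The cell classifier (returning 'IN', 'OUT', or '?') stands in for the Lee-Requicha classifier [23] and is assumed to be supplied; representing a cell by its centre, half-size, and octant path is our own simplification of the octree node record.

from itertools import product

def octant_children(centre, half_size, path):
    cx, cy, cz = centre
    h = half_size / 2.0
    return [((cx + sx * h, cy + sy * h, cz + sz * h), h, path + (i,))
            for i, (sx, sy, sz) in enumerate(product((-1, 1), repeat=3))]

def decompose(centre, half_size, classify, level, resolution, path=(), tag=None):
    """Return the resolution-level cells as (path, tag) pairs."""
    if tag is None:
        tag = classify(centre, half_size)
    if tag == 'OUT':                      # OUT cells are discarded
        return []
    if level == resolution:               # leaf cell: '?' leaves are later
        return [(path, tag)]              # reclassified as IN, OUT, or NIO
    cells = []
    for c, h, p in octant_children(centre, half_size, path):
        # IN cells are subdivided to resolution level without further
        # classification; '?' cells are classified again at every level.
        child_tag = 'IN' if tag == 'IN' else None
        cells.extend(decompose(c, h, classify, level + 1, resolution, p, child_tag))
    return cells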

Stage 2: At the end of octree decomposition all the IN cells are directly mapped onto hexahedral elements (or superelements) and the grid points or the vertices of the cells become the nodes of the FEM mesh. Thus at the end of the first stage of the meshing procedure we have an interior mesh consisting of highly regular brick elements with an underlying hierarchical structure.

The interior mesh is extended to the boundary of the domain bS by visiting all the NIO cells and decomposing them into combinations of tetrahedral, pyramid, and wedge elements. All such cells are (i) intersected with the solid S to yield the polyhedral sub-domain R = Cell ∩* S and (ii) a finite element


Fig. 1. Recursive spatial decomposition and related logical tree structure.

topology is inserted in R such that the nodal points are defined on the exact intersection points of the cell edges with S or on the corner vertices of the cell.

Step (ii) follows a dual approach. The cell is classified as being Simple or Complex depending on its topological description. In the case of Simple NIO (SNIO) cells the finite element topology is induced by mapping R to a set of predefined templates, whereas for Complex NIO (CNIO) cells R is decomposed by the recursive application of element extractors [15].

This high-level algorithm is illustrated in Fig. 2. It is also possible to generate all tetrahedral meshes by decomposing all non-tetrahedral elements into tetrahedra. For details of the algorithm see [13].

Hereafter cell is a generic resolution-level cell in the octree, index (same as cell_index in Fig. 2) identifies the location of cell in the octree, and classification_tag denotes the IN/SNIO/CNIO classification of cell.

With reference to the framework established in Sec. 3, the satisfaction of condition (1) is examined separately from (2) and (3).

Procedure f(S: solid; L: resolution level);
Begin
(1)   Stage 1: Generate Octree with level = L;
      Stage 2: For all resolution level Cells do
      begin
        Case cell of
(2)       IN   : Hexahedral Element(Cell_index);
(3)       NIO  : Compute R and Classify_Cell;
(4)       SNIO : Template(R, Cell_index);
(5)       CNIO : Element Extractors(R, Cell_index);
        end {do-loop}
(6)   Assemble localized meshes.
end; {procedure}

Fig. 2. The meshing procedure f(S, L).


Procedure Classify_cell(bR, index);
Begin
  Identify the set of cell face planes from index;
  For all faces of bR do
    If (face lies on a cell plane)
      then Λ = Λ + 1
      else Γ = Γ + 1;
  If (Γ = 1) then simple else complex;
end; {procedure}

Fig. 3. Procedure for simple/complex cell classification.

4.1. Satisfaction of condition (1)

Intuitively, the collection of IN cells and R solids (R = NIO Cell ∩* S) corresponds to S. The proof is simple. Let W denote the union of all IN and NIO cells,

W = (∪* IN Cells) ∪* (∪* NIO Cells). (11)

Let also R_i denote Cell_i ∩* S for both IN and NIO cells. To prove the satisfaction of condition (1) one needs to show that

∪*_i R_i = S. (12)

Introducing the definition of R_i and using the distributive property of set operations, the left-hand side of eqn (12) becomes

∪*_i R_i = ∪*_i (Cell_i ∩* S) = (∪*_i Cell_i) ∩* S. (13)

Note that in this case (∪*_i Cell_i) corresponds to W. Recalling that, from eqn (10), S ⊂ W, eqn (13) yields

∪*_i R_i = W ∩* S = S. (14)

4.2. Satisfaction of conditions (2) and (3)

The three cases encountered in stage 2 of meshing (IN, SNIO, CNIO) are considered separately. For each case, it is shown that index and R suffice for the satisfaction of conditions (2) and (3). The coordinates of the cell centroid as well as the cell size (and therefore the coordinates of the vertices) can be determined directly from the cell index, using the spatial addressability of the octree structure [15]. Spatial addressability is an important property of the hierarchical structure that allows a direct calculation of the size and spatial position of any octant without searching the hierarchical data structure for the pertinent geometrical data.
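The following short Python sketch (ours; the digit-to-octant convention is an assumption, not the paper's) illustrates spatial addressability: the centroid and half-size of any cell follow directly from its index, taken here as the sequence of octant digits from the root, with no search of the tree.

_SIGNS = [(sx, sy, sz) for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

def cell_from_index(root_centre, root_half_size, index):
    """Return (centre, half_size) of the cell addressed by `index`."""
    cx, cy, cz = root_centre
    h = root_half_size
    for digit in index:                  # one octant digit per tree level
        h /= 2.0
        sx, sy, sz = _SIGNS[digit]
        cx, cy, cz = cx + sx * h, cy + sy * h, cz + sz * h
    return (cx, cy, cz), h

# Example: the cell reached by octant digits (0, 7) in a unit box centred
# at the origin has centre (-0.25, -0.25, -0.25) and half-size 0.25.
print(cell_from_index((0.0, 0.0, 0.0), 1.0, (0, 7)))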

IN cell. As explained earlier, IN cells are always meshed by mapping. In this case, since f(s_i) always exists and is uniquely defined by the vertices of the cell (and, therefore, by index), condition (2.1) is satisfied. Also, since the interface topology is always a square (a face of a cell) with nodes at the vertices and the meshes inserted in SNIO and CNIO cells are constrained to preserve identical topologies on their interfaces with IN cells, condition (2.2) is satisfied. Finally, since only index is required for performing f(s_i), condition (3) is satisfied.

SNIO/CNIO cell classification. The cell classification is based on the topological description of the boundary of R, bR. Let bS and bCell be represented by the following sets of faces

bS = (F_1, F_2, ..., F_n) (15)

bCell = (H_1, H_2, ..., H_6). (16)

In general bR can be represented as

bR = (Λ_1, Λ_2, ..., Λ_s, Γ_1, Γ_2, ..., Γ_k), (17)

where Λ_i denotes a face of bR embedded on, or identical to, a face of bCell and Γ_i a face of bR embedded on, or identical to, a face of bS. In formal notation,

∀ Λ_i ∈ bR, ∃ only one H_j ∈ bCell such that Λ_i ⊆ H_j, (18)

and

∀ Γ_i ∈ bR, ∃ only one F_j ∈ bS such that Γ_i ⊆ F_j. (19)

The ambiguous case of a face that can be classified as both a Λ and a Γ face is resolved by classifying the face as Λ. The simple/complex cell classification is based on the number of Γ faces present in bR and can be simply described as:

if (k = 1) then simple else complex. (20)

Thus an NIO cell is classified as being simple or complex by counting the number of Γ faces in bR (if only one Γ face is present, then the cell is SNIO, else it is CNIO). As shown in Fig. 3, the counting is done by comparing each face of bR to the set of planes associated with the cell.

Fig. 4. A quadtree cell and related R set.

Fig. 5. Operators for element extraction.

Since this set of planes is uniquely defined in terms of the cell centroid and half-size (see Fig. 4 for a 2-D example), the classification is a function of R and index only and, therefore, satisfies condition (3).

SNIO cell decomposition. By definition an SNIO cell is intersected by one cutting surface only. The number of templates to be defined to mesh any possible SNIO cell configuration is restricted to seven and depends upon the number of cell vertices that are shaved off by the cutting surface. In the design of the templates and in the process of mapping from the template space into R_s (= SNIO Cell ∩* S), special care is taken to ensure that the mesh is compatible with those embedded in neighboring cells. This requirement imposes both geometrical and topological constraints on the element triangulation induced on the Λ faces of bR_s. The issue is resolved by adopting the following conventions in the placement of nodes on bR_s and in the decomposition of Λ faces into quadrilateral and triangular polygons:

1. Each Λ face is triangulated with a fixed topology based on the number of edges that bound the face. From the definition of R_s it follows that Λ faces are planar polygons with either 3, 4, or 5 edges that represent the interface between R_s and the neighboring cells. Thus a triangular face of R_s is preserved as a triangular face; a four-sided face of R_s is mapped to a four-sided face of either a hexahedral, wedge, or pyramid element; and a five-sided face is mapped onto three triangular faces.

2. Nodes are located only at the vertices of bR_s. However, there is one exception to this rule. An extra node is inserted in R_s for the topological case when the cutting plane shaves off three vertices of the octant. This is done to ensure condition (1) stated above.

Thus, for SNIO cells f(s_i) is a mapping which is a function of the vertices of bR and of index only. Also, by definition f(s_i) exists for any cell classified as SNIO. Thus conditions (2.1) and (3) of Sec. 3 are satisfied. As described above, f(s_i) can induce only three types of topologies on the interface of bR. Since adjacent IN, SNIO, and CNIO cells are constrained to have identical topologies on the interfaces, condition (2.2) is also satisfied.

CNIO cell decomposition. The polyhedral domain R_c, defined as

R_c = CNIO Cell ∩* S, (21)

is decomposed into tetrahedral and pyramid elements by the recursive application of a set of operators that use topological and geometrical information to extract elements. Specifically, operators OP_i developed by Wördenweber [24] are used to remove tetrahedral elements while a special operator OP* [25] is used for pyramid elements (Fig. 5). The decomposition of the polyhedral domain R_c proceeds in three stages:

1. The domain R_c is converted to its planar equivalent R'_c by replacing all curved surfaces with collections of planar triangles.

Fig. 6. Node sequencing for Delaunay triangulation.

2. Domain R'_c is reduced to R''_c by recursive pyramid extractions via operator OP*.

3. Operators OP_i, i = 0, 1, 2, 3, are used to reduce R''_c to a single tetrahedron.

Operator OP* is used to extract pyramid elements by introducing multiple cuts in the domain R'_c. For the extraction to be valid, the new faces of the pyramid must not interfere with any of the pre-existing topological entities in the polyhedral domain. At the end of each extraction the boundary of the polyhedral domain is updated. The operator is recursively applied until all quadrilateral faces are removed from the domain and R'_c is reduced to R''_c.

R''_c is reduced to a tetrahedron by the recursive application of OP_i, i = 0, 1, 2, 3, which extract a tetrahedral element by introducing i cuts in the domain. The validity of each extraction is confirmed by checking for interference between the faces of the candidate element and the existing faces in bR''_c.

In stage 1 above, to enforce compatibility with the meshes embedded in adjacent cells, nodes are introduced only at the vertices of bR_c. Λ faces that correspond to three-, four-, or five-sided polygons are triangulated according to the same convention adopted for similar Λ faces on SNIO templates. All other Λ faces are decomposed by a constrained 2-D Delaunay triangulation procedure [10] that operates on the vertices and edges of Λ. A vertex sequencing scheme, which uses geometrical information derived from the octal cell index, ensures that the triangulation procedure will produce identical meshes on the Λ face under consideration as well as on its counterpart in the adjacent cell.†

The uniqueness of the triangulation is guaranteed by inserting vertices into the triangulation starting with the one closest to the origin and following a specific vertex loop. Let n be the outward normal to

† Vertex sequencing resolves the degeneracy problem, associated with Delaunay triangulation, described by Field and Frey [26].

the face and t be the tangent to the vertex loop; then, for the order to be preserved, the algorithm selects t such that the vector product n × t is directed towards the interior of the face when n is directed along one of the principal axes, and -n × t is directed towards the interior of the face when n is directed opposite to one of the principal axes (Fig. 6). This ensures that nodes will be inserted in the correct order each time the face is triangulated. Note that the direction of n is determined from the cell index. Figure 7 shows an example of this type of interface, in this case a six-sided polygon. In the figure, corresponding vertices of two neighboring R solids are identified by identical labels. Although each R is handled separately, the 2-D triangulation procedure ensures that identical edges will be inserted on the two faces.

Finally, each Γ face is triangulated using the following three-step procedure: (1) the vertices and the edges of the face are projected onto a plane, (2) the resulting planar polygon is decomposed into triangles by the same 2-D Delaunay procedure used for Λ faces, and (3) the vertices of the triangles are projected back onto the surface.
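The project/triangulate/map-back structure of this three-step procedure can be sketched as follows. The sketch is ours and is deliberately simplified: it uses an unconstrained 2-D Delaunay triangulation from SciPy and a crude choice of projection plane, whereas the paper's procedure is constrained and uses the vertex sequencing scheme described above.

import numpy as np
from scipy.spatial import Delaunay

def triangulate_face(vertices_3d):
    pts = np.asarray(vertices_3d, dtype=float)
    # Step 1: project onto a plane by dropping the coordinate with the
    # smallest spread (a rough proxy for the face normal direction).
    drop = int(np.argmin(pts.max(axis=0) - pts.min(axis=0)))
    keep = [i for i in range(3) if i != drop]
    # Step 2: 2-D Delaunay triangulation of the projected vertices.
    tri = Delaunay(pts[:, keep])
    # Step 3: the triangle vertex indices refer back to the original 3-D
    # vertices, which play the role of points projected back to the surface.
    return tri.simplices

# Example: a nearly planar five-sided face.
face = [(0, 0, 0), (2, 0, 0), (2, 1, 0.01), (1, 2, 0), (0, 1, 0)]
print(triangulate_face(face))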

Thus, for CNIO cells f(s_i) is the element extraction procedure, which operates only on planar manifolds. Satisfaction of condition (2.1) hinges on R being (a) a manifold and (b) mappable to a planar polyhedron. In general both (a) and (b) are true [25] and thus (2.1) is satisfied. The curve-to-planar mapping is based entirely on R, and the embedding of interface topologies on R is also based on the vertices of R and the cell index. Thus condition (3) is satisfied.

Fig. 7. Interface between CNIO cells (a) and triangulation of corresponding polygons (b).

Fig. 8. Meshes for block (a), housing (b), cylinder-cylinder intersection (c), bracket (d), and object-X (e).

The unique triangulations for the shared interfaces, derived by the algorithm described above, result in the satisfaction of condition (2.2) in Sec. 3.

4.3. Implementation and examples

The meshing procedure described above has been implemented in an experimental program called X-MESH3D. The procedure operates on solids defined in the PADL-2 solid modeling system. Figure 8 shows five meshes, generated using X-MESH3D, which are used as benchmark problems for evaluating the RSD-based meshing scheme with respect to the different parallel configurations described in the following section.

The first two examples, denoted as block and housing, include only IN and SNIO cells, while the remaining examples, denoted as cyl_cyl_int, bracket, and object-X, also include CNIO cells. The relevant data for these examples are summarized in Tables 1-3. Table 1 indicates the number of IN, SNIO, and CNIO cells at resolution level as well as the number of nodes and elements for each example. Table 2 gives the CPU times for the various operations included in stage 2, while the total times for stages 1 and 2 are given in Table 3. The following general conclusion can be drawn from the examples considered: stage 1 takes about 1% of the total CPU time for meshing; this is due to the efficiency of the Lee-Requicha algorithm [23] used to derive the RSD of the original solid S. Therefore, only stage 2 is evaluated for concurrent implementation.


Table 1. Composition of the meshes in Fig. 8

Problem        IN cells   '?' cells   SNIO cells   CNIO cells   Nodes   Elements
Block             16          36          36           -         125       84
Housing           30          56          56           -         198      118
Cyl_cyl_int        -          71          63           8          88      104
Bracket           30         129         117          12         336      297
Object-X          28         108          93          15         308      288

Table 2. CPU times, in seconds, for stage 2 operations for examples in Fig. 8

Problem        T_b (Cell ∩* S and overheads)   T_s (SNIO decomposition)   T_c (CNIO decomposition)   Total time for stage 2
Block                    122.32                        11.48                        -                       133.80
Housing                  277.24                        16.86                        -                       294.10
Cyl_cyl_int              263.11                        21.17                      81.44                     365.72
Bracket                  313.86                        46.84                     501.84                     862.54
Object-X                 213.62                        38.88                     524.05                     776.55

5. PARALLEL CONFIGURATIONS FOR STAGE 2 OF RSD MESHING

Three alternative configurations for the parallel implementation of stage 2 of RSD meshing are considered. The key design parameters are the number of processors and the actual stage 2 operations performed in each processor. Each configuration is simulated by running X-MESH3D on a MicroVAX II under the VMS operating system. The following notation applies: N_p denotes the total number of processors; n_snio and n_cnio denote the total numbers of SNIO and CNIO cells, respectively; n_t indicates the total number of NIO cells (n_t = n_snio + n_cnio).

5.1. Configuration 1: N_p = n_t and R computed in parallel

Consider a fine-grain parallel architecture with N_p = n_t, and assume that each processor computes R = Cell ∩* S independently. Assume also that all processors communicate with each other through a central host processor and a global shared memory, resulting in a star-shaped connection topology.

Let t_b^av denote the average time required to compute R = Cell ∩* S, and t_in^av, t_s^av, t_c^av the average times required to mesh an IN, SNIO, and CNIO cell, respectively. Because of the type of operations involved, t_b^av and t_c^av are problem related, while t_in^av and t_s^av are approximately constant. It is appropriate to assume

t_c^av ≥ t_b^av > t_s^av ≫ t_in^av (22)

and, thus, ignore t_in^av in the following calculations. For a serial implementation of the meshing scheme the time required for stage 2 can be approximated as

T_ser = t_b^av · n_t + t_s^av · n_snio + t_c^av · n_cnio. (23)

The execution time for the proposed parallel architecture is given by

T_par ≈ t_b^av + t_c^av, (24)

where, according to eqn (22), t_c^av is used to derive an upper bound for T_par. (For domains with SNIO cells only, t_s^av replaces t_c^av.) For these estimates the speed-up is

ρ = T_ser/T_par (25)

= α · n_snio + n_cnio, (26)

where α = (t_b^av + t_s^av)/(t_b^av + t_c^av).

The efficiency of parallel implementation is given by

η = ρ/N_p = α · R_s + R_c, (27)

where R_s = n_snio/n_t and R_c = n_cnio/n_t. From eqn (27) it follows that if all the NIO cells are of the same type, SNIO or CNIO, the theoretical efficiency of this configuration is 100%. For all other cases α is problem related and 0 ≤ α < 1 (α < 1 follows from eqn (22), and α = 0 corresponds to the limit case t_c^av ≫ t_b^av).

Configuration 1 is tested on the problems described in Sec. 4.3 and the results are reported in Table 4.

Table 3. CPU times, in seconds, for the two stages of meshing for examples in Fig. 8

Problem        Time for stage 1   Time for stage 2   Total time for meshing
Block                1.47              133.80               135.27
Housing              2.53              294.10               296.63
Cyl_cyl_int          2.13              365.72               367.85
Bracket              3.33              862.54               865.87
Object-X             3.08              776.55               779.63


Table 4. Results of tests for parallel configuration 1 (times given in seconds)

Object         n_snio   n_cnio   t_b^av   t_s^av   t_c^av     α      R_c      ρ      η (%)
Block             36        0     3.40     0.32      -       1.0      -     36.0    100.0
Housing           56        0     4.95     0.30      -       1.0      -     56.0    100.0
Cyl_cyl_int       63        8     3.71     0.34    10.18     0.29    0.11   26.4     37.1
Bracket          117       12     2.43     0.40    41.82     0.06    0.09   19.6     15.2
Object-X          93       15     1.98     0.42    34.94     0.07    0.14   21.5     19.5

Block, housing, cyl_cyl_int, bracket, and object-X denote the meshes in Fig. 8. For each problem, t_b^av, t_s^av, and t_c^av are computed from the values in Table 2. Specifically, t_b^av = T_b/n_t, t_s^av = T_s/n_snio, and t_c^av = T_c/n_cnio. For the two problems of realistic complexity, bracket and object-X, α is small enough for ρ to be directly related to n_cnio. In this case, since the gain in speed is almost totally due to the parallel processing of CNIO cells, the efficiency is directly related to R_c (i.e. the n_snio processors assigned to work on SNIO cells are left idle for a considerable amount of time).
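The bracket row of Table 4 can be reproduced directly from eqns (23)-(27) with the cell counts of Table 1 and the average times above; the short script below (our variable names) shows the arithmetic.

n_snio, n_cnio = 117, 12                   # Table 1, bracket
n_t = n_snio + n_cnio                      # 129 NIO cells
t_b, t_s, t_c = 2.43, 0.40, 41.82          # average times in seconds (Table 4)

t_ser = t_b * n_t + t_s * n_snio + t_c * n_cnio    # eqn (23)
t_par = t_b + t_c                                  # eqn (24)

alpha = (t_b + t_s) / (t_b + t_c)
rho = t_ser / t_par                                # eqn (25)
eta = rho / n_t                                    # eqn (27)

print(round(alpha, 2), round(rho, 1), round(eta * 100, 1))
# 0.06 19.5 15.1 - agrees with the bracket row of Table 4 up to rounding.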

5.2. Configuration 2: N_p = n_t and R computed sequentially

Consider the same configuration as before but now assume that each R is computed sequentially within the host processor and passed to the appropriate processor as soon as computed. Note that, in this case, the ith processor begins to operate only after

R_1, R_2, ..., R_i have been computed by the host. This means that, at any given time, only a few processors (in the worst case one processor) are active, as shown in the activity chart in Fig. 9. Therefore, the efficiency is lower than for the previous configuration.

Using the same notation as in Sec. 5.1, the time requirement for this parallel configuration can be approximated as

T_par = t_b^av · n_t + t_c^av. (28)

The estimate in eqn (28) is conservative in that it assumes that the last cell is CNIO. The speed-up is

ρ = (n_t + K_s · n_snio + K_c · n_cnio)/(n_t + K_c), (29)

Fig. 9. Processor activity chart for configuration 2.

where K_s = t_s^av/t_b^av and K_c = t_c^av/t_b^av. The efficiency of the parallel implementation is given by

η = ρ/N_p = (1 + K_s · R_s + K_c · R_c)/(n_t + K_c), (30)

where R_s and R_c are the same as in the previous section. From eqn (30) it follows that with an increase in n_t there is a drop in the efficiency of the parallel implementation. Each processor is active for only a small fraction of the total processing time (t_s^av/T_par or t_c^av/T_par) and this ratio decreases with increasing n_t.

Table 5 gives speed-ups and efficiencies for the test problems. The values of t_b^av, t_s^av, t_c^av are derived as for the tests on configuration 1. Intuitively, the speed-up is related to the ratio β = t_c^av/t_b^av (or β = t_s^av/t_b^av, if only SNIO cells are present). Thus the speed-up is highest for object-X. In contrast, ρ is marginal for block and housing as these problems do not contain complex cells. Note that, as expected, the speed-ups in Table 5 are considerably lower than those produced by configuration 1.
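Similarly, the cyl_cyl_int row of Table 5 follows from eqns (29) and (30); a short check (our variable names):

n_snio, n_cnio = 63, 8                     # Table 1, cyl_cyl_int
n_t = n_snio + n_cnio                      # 71 NIO cells
t_b, t_s, t_c = 3.71, 0.34, 10.18          # average times in seconds (Table 5)

k_s, k_c = t_s / t_b, t_c / t_b            # K_s and K_c
rho = (n_t + k_s * n_snio + k_c * n_cnio) / (n_t + k_c)              # eqn (29)
eta = (1 + k_s * n_snio / n_t + k_c * n_cnio / n_t) / (n_t + k_c)    # eqn (30)

print(round(rho, 2), round(eta * 100, 2))
# 1.34 1.89 - agrees with the cyl_cyl_int row of Table 5.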

5.3. Configuration 3: N_p = 8

Consider now a coarse-grain parallel configuration with N_p = 8. To take advantage of the inherent parallelism of octree decomposition, each of the eight first-level octants, i.e. the eight cells in which the

Table 5. Results of tests for parallel configuration 2 (times given in seconds)

Object         n_snio   n_cnio   t_b^av   t_s^av   t_c^av     β       ρ      η (%)
Block             36        0     3.40     0.32      -       0.09    1.10    3.03
Housing           56        0     4.95     0.30      -       0.06    1.06    1.89
Cyl_cyl_int       63        8     3.71     0.34    10.18     2.74    1.34    1.89
Bracket          117       12     2.43     0.40    41.82    17.21    2.69    2.09
Object-X          93       15     1.98     0.42    34.94    17.65    3.12    2.89


Fig. 10. Decomposition of a solid star resulting in a fully balanced octree.

enclosing box is decomposed, is mapped to an individual processor. Thus the ith processor performs both stages 1 and 2 of meshing on S using Cell_i^1 as the root node of the decomposition, where Cell_i^1 denotes the first-level octant assigned to the ith processor.

Let T_si denote the time necessary to execute the meshing procedure on the ith processor. Thus, a serial implementation requires

T_total = Σ_{i=1}^{8} T_si (31)

and the parallel configuration requires

T_par = max_i {T_si}, (32)

resulting in a speed-up of

ρ = Σ_{i=1}^{8} T_si / max_i {T_si}. (33)

The efficiency of this configuration is

η = ρ/8 = Σ_{i=1}^{8} T_si / (8 · max_i {T_si}). (34)

The upper and lower bounds for ρ and η are derived by considering the two limiting cases for the octree balance:

(a) The octree is fully balanced: Cell_i^1 ∩* S is virtually the same for all processors. In this case the


Table 6. Meshing with N_p = 8: CPU times for decomposing various subdomains and unbalance factor

              Block     Housing    Cyl_cyl_int   Bracket      Star
T_s1          20.832     89.700       91.720     211.722     58.700
T_s2          20.808     67.632       92.180     137.568     58.709
T_s3          21.864     22.566        0.000     143.082     58.706
T_s4          14.270    100.602        0.000      94.356     58.700
T_s5           8.112       ...         91.814      21.936     58.700
T_s6           8.190       ...         91.905       2.508     58.700
T_s7          21.786      2.061        0.000      16.860     58.700
T_s8          14.208      2.050        0.000       2.466     58.700
T_total      130.070    286.651      367.620     630.498    469.600
γ               1.34       2.81         2.00        2.68       1.00

Table 7. Results of tests for parallel configuration 3

Object         T_smax (s)   T_total (s)   T_serial (s)     ρ      η (%)     ρ'
Block              21.79       130.07        135.27       5.97     74.6    6.21
Housing           100.60       286.65        296.63       2.85     35.6    2.95
Cyl_cyl_int        92.18       367.62        367.87       3.99     50.0    3.99
Bracket           211.72       630.50        865.87       2.98     37.3    4.09
Star               58.70       469.60        568.42       8.0     100.0    9.68

maximum speed-up (ρ = 8) and efficiency (η = 100%) are obtained.

(b) The octree is fully unbalanced in the sense that S is fully contained in one of the octants, say Cell_1^1 (thus Cell_i^1 ∩* S ≠ ∅ only for i = 1). In this case there is no difference in speed between the serial and parallel configurations (ρ = 1) and the efficiency reaches its minimum value of η = 1/8. Thus

1 ≤ ρ ≤ 8 and 0.13 ≤ η ≤ 1.0. (35)

The present configuration is tested on four unbalanced problems (block, housing, cyl_cyl_int, and bracket) and a fully balanced one, star, shown in Fig. 10.† To simulate the parallel configuration as closely as possible, X-MESH3D is slightly modified so that it operates only on a single first-level octant at a time. The program is executed eight times on each problem and a different octant is specified each time. The results are reported in Table 6, where T_si indicates the time for running X-MESH3D on the ith octant, T_total = Σ_i T_si, and γ is given by

γ = max{T_si}/avg{T_si} = max{T_si}/(T_total/8). (36)

Factor γ measures the unbalance status of the octree (γ = 1 for a fully balanced tree and γ = 8 for a fully unbalanced tree).

Speed-up and efficiency, ρ and η computed according to eqns (33) and (34), respectively, are reported in Table 7. Notice the direct correlation

† For the star problem n_snio = 224, n_cnio = 0, t_b^av = 2.24 sec, t_s^av = 0.39 sec.

between speed and efficiency. The speed-up factor is also computed according to

ρ' = T_serial / max_i {T_si}, (37)

where T_serial is the actual serial time for each problem taken from Table 3. Notice that the difference between T_serial and T_total, given by eqn (31), is substantial for large problems. This indicates that overhead costs, which are responsible for the above difference, can be greatly reduced by executing smaller problems in a parallel configuration. With the effect of overheads taken into account, speed-up factors greater than eight are possible, as shown for the star problem.
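As a check on eqns (33), (34), (36), and (37), the bracket columns of Tables 6 and 7 can be recomputed from the per-octant times; the short script below (our variable names) shows the arithmetic.

t_octants = [211.722, 137.568, 143.082, 94.356,
             21.936, 2.508, 16.860, 2.466]         # Table 6, bracket
t_serial = 865.87                                  # Table 7, bracket

t_total = sum(t_octants)               # eqn (31): ~630.5 s
t_max = max(t_octants)                 # eqn (32): 211.722 s

rho = t_total / t_max                  # eqn (33)
eta = rho / 8                          # eqn (34)
gamma = t_max / (t_total / 8)          # eqn (36)
rho_prime = t_serial / t_max           # eqn (37)

print(round(rho, 2), round(eta * 100, 1), round(gamma, 2), round(rho_prime, 2))
# 2.98 37.2 2.69 4.09 - Table 7 lists 37.3%, obtained from the rounded speed-up.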

6. DISCUSSION

The RSD automatic meshing procedure introduced in the present work satisfies all the conditions for parallel processing (Sec. 4) and is directly applicable to both fine-grain and coarse-grain configurations (Sec. 5). As such, this procedure is ideally suited for parallel implementation. It is important to realize that none of the other algorithms for automatic meshing [11, 24, 27-29] satisfies the stringent conditions for parallel processing defined in Sec. 3.

Algorithms based on global Delaunay triangulation [28] or global element extraction [24, 27] must be applied to the whole domain in order to produce a valid mesh and, therefore, their immediate parallelization is not possible. Also, attempts to modify these algorithms to introduce a degree of parallelism have never been reported. The 3-D algorithm by Shephard and co-workers [11] does not appear to include provisions for satisfying eqn (9) for general


cases. Moreover, its octree decomposition scheme, equivalent to stage 1 in the present approach, is inherently sequential. As shown below, parallelism for stage 1 is crucial for an efficient implementation of octree-based automatic meshing.

Of the configurations examined in Sec. 5, only 1 and 3 should be considered for implementation since 2 offers only marginal efficiencies. An implementation of a fine-grain configuration, such as configuration 1, must take into account that the required number of processors N_p is dictated by the highest level of octree decomposition. For the example problems, n = 3 and N_p = 512. Since ρ depends only on the processors working on NIO cells, implementing configuration 1 on 512 processors will decrease η and leave ρ unchanged (for the two worst cases, bracket and object-X in Table 4, η will be reduced to 3% and 4%, respectively).

Besides the number of processors, memory requirements must also be considered. X-MESH3D in its current experimental status requires approximately 32 MB of memory (about 95% of which is for the PADL-2 solid modeler). MIMD machines with 512 processors, each having 32 MB of memory, do not exist at present but are within the reach of current technology.

Before concluding the discussion on configuration 1, it is important to realize that its validity depends very much on stage 1. The results in Sec. 5.1 are based on the assumption that only stage 2 is done in parallel. Clearly, this is justified if the cost of stage 1 is much smaller than the cost of stage 2, which is the case for X-MESH3D (see Table 3), or if stage 1 can be done in parallel, which is also the case for X-MESH3D [15]. If neither of the above conditions is satisfied, the speed-up and efficiency of configuration 1 will decrease substantially.

The coarse-grain configuration (configuration 3 in Sec. 5.3), although resulting in a lower speed-up factor, is particularly appealing for the following important reasons:

• It can be implemented directly on a relatively small network of engineering workstations, easily available in both research and production environments.

• Its average efficiency is much higher than that of a fine-grain implementation.

• It can be used to tightly couple meshing and analysis within the same parallel configuration [30].

Notice also that for a configuration based on standard workstations, the actual speed-ups will be higher than the theoretical values, due to the reduction of overheads.

6.1. Open issues

As indicated in Sec. 4, the present meshing procedure is designed to be algorithmically robust, computationally efficient, and rigorously automatic, and it lends itself to parallel processing. A performance

evaluation of the meshing procedure for various parallel configurations, described in Sec. 5, suggests the following avenues to be explored for further research:

• Algorithms for octree balancing. The degree of balance of the octree affects the general efficiency of parallel meshing configurations. The balance can be improved by controlling the size and orientation of the enclosing box in which the RSD takes place.

• Integration and optimization of automatic meshing and analysis of such automatically derived meshes on a network of workstations as well as on a vector processing MIMD machine (such as the ALLIANT FX/8 and the Hypercube).

• Algorithms for parallel boundary evaluation. Based on our preliminary experience, the considerable amount of computational time required for extracting the boundary information of the boundary octants is a major stumbling block in the way of an efficient parallel configuration. Algorithms that can perform localized boundary evaluation in a concurrent environment will also resolve the issue of large memory requirements for individual processors, as the solid modeling system, residing in a global shared memory, can be accessed by independent processors simultaneously.

In conclusion, we believe that the meshing procedure described here represents an important contribution to the problem of automatic mesh generation from solid models in a parallel processing environment. As shown in the companion paper [30], the hierarchical substructuring scheme for the analysis of such RSD-based meshes is also ideally suited for parallel processing. Therefore, we expect this meshing-analysis system to play a crucial role in the research and development of future parallel FEM systems.

Acknowledgements - The Industrial Associates of the Production Automation Project of the University of Rochester, the Gleason Memorial Foundation, and the University of Rochester provided sustaining support for this work. The findings and opinions expressed here do not necessarily reflect the views of the sponsors.

REFERENCES

1. A. K. Noor (Ed.), Parallel Computations and Their Impact on Mechanics, AMD-Vol. 86. American Society of Mechanical Engineers (1987).
2. B. Nour-Omid and K. C. Park, Solving structural mechanics problems on the Caltech hypercube machine. Comput. Meth. Appl. Mech. Engng 61, 161-176 (1987).
3. J. G. Malone, Automated mesh decomposition and concurrent finite element analysis for hypercube multiprocessor computers. Comput. Meth. Appl. Mech. Engng 70, 27-58 (1988).
4. D. Zois, Parallel processing techniques for FE analysis: stiffness, loads and stresses evaluation. Comput. Struct. 28, 247-260 (1988).
5. D. Zois, Parallel processing techniques for FE analysis: system solution. Comput. Struct. 28, 261-274 (1988).
6. M. Al-Nasra and D. T. Nguyen, An algorithm for domain decomposition in finite element analysis. Comput. Struct. 39, 277-289 (1991).
7. Wei-Ping Zhang and E. M. Lui, A parallel frontal solver on the Alliant FX/80. Comput. Struct. 38, 203-215 (1991).
8. P. Zave and W. C. Rheinboldt, Design of an adaptive, parallel finite-element system. ACM Trans. Math. Software 5, 1-17 (1979).
9. F. Cheng, J. W. Jarmoczyk, J. Lin, S. Chang and J. Lu, A parallel mesh generation algorithm based on vertex label assignment scheme. Int. J. Numer. Meth. Engng 28, 1429-1448 (1989).
10. N. Sapidis and R. Perucchio, Advanced techniques for automatic finite element meshing from solid models. Computer-Aided Design 21, 248-253 (1989).
11. W. J. Schroeder and M. S. Shephard, Geometry-based fully automatic mesh generation and the Delaunay triangulation. Int. J. Numer. Meth. Engng 26, 2503-2515 (1988).
12. N. Sapidis, Domain Delaunay tetrahedrization of arbitrarily shaped curved polyhedra defined in a solid modeling system. Ph.D. dissertation, Department of Mechanical Engineering, University of Rochester, Rochester, NY (1991).
13. M. Saxena, Y. Pressburger and R. Perucchio, Automatic mesh generation in a parallel processing environment. Proceedings ASME International Computers in Engineering Conference, pp. 623-631, San Francisco, CA (1988).
14. R. Perucchio, M. Saxena and A. Kela, Automatic mesh generation based on recursive spatial decomposition of solids. Int. J. Numer. Meth. Engng 28, 2469-2501 (1989).
15. M. Saxena, Parallel algorithms for 3-D automatic meshing and hierarchical substructuring. Ph.D. dissertation, Department of Mechanical Engineering, University of Rochester, Rochester, NY (1989).
16. A. Kela, R. Perucchio and H. B. Voelcker, Toward automatic finite element analysis. ASME Comput. Mech. Engng 5, 57-71 (1986).
17. Y. Pressburger, Self-adaptive FEM procedure based on recursive spatial decomposition and multi-grid analysis. Ph.D. dissertation, Department of Mechanical Engineering, University of Rochester, Rochester, NY (1991).
18. M. J. Flynn, Very high speed computing systems. Proc. IEEE 54, 1901-1909 (1966).
19. M. J. Flynn, Some computer organizations and their effectiveness. IEEE Trans. Comput. 21, 948-960 (1972).
20. J. Worlton, Toward a science of parallel computation. In Computational Mechanics - Advances and Trends (Edited by A. K. Noor), AMD-75, pp. 23-35. American Society of Mechanical Engineers, New York (1986).
21. A. A. G. Requicha, Representations for rigid solids: theory, methods and systems. ACM Computing Surveys 12, 437-464 (1980).
22. C. M. Hoffmann, Geometric & Solid Modeling: An Introduction. Morgan Kaufmann Publishers, San Mateo, CA (1989).
23. Y. T. Lee and A. A. G. Requicha, Algorithms for computing the volume and other integral properties of solids: Part I - Known methods and open issues, pp. 635-641, and Part II - A family of algorithms based on representation conversion and cellular approximation, pp. 642-650. Commun. ACM 25, No. 9 (1982).
24. B. Wördenweber, Finite-element analysis for the naive user. In Solid Modeling by Computers (Edited by M. S. Pickett and J. W. Boyse), pp. 81-102. Plenum Press, New York (1984).
25. M. Saxena and R. Perucchio, Element extraction for automatic meshing based on recursive spatial decompositions. Comput. Struct. 36, 513-529 (1990).
26. D. A. Field and W. H. Frey, Automation of tetrahedral mesh generation. Technical report GMR-4967, General Motors Research Labs, Warren, MI (1985).
27. T. C. Woo and T. Thomasma, An algorithm for generating solid elements in objects with holes. Comput. Struct. 18, 333-342 (1984).
28. J. C. Cavendish, D. A. Field and W. H. Frey, An approach to automatic three-dimensional finite element mesh generation. Int. J. Numer. Meth. Engng 21, 329-347 (1985).
29. M. S. Shephard and M. K. Georges, Automatic three-dimensional mesh generation by the finite octree technique. Int. J. Numer. Meth. Engng 32, 709-749 (1991).
30. M. Saxena and R. Perucchio, Parallel FEM algorithms based on recursive spatial decomposition - II. Automatic analysis via hierarchical substructuring. Comput. Struct., to be published.