
NSF GRANT # 0700191
NSF PROGRAM NAME: CMMI

Simulation of Multibody Dynamics Leveraging New Numerical Methods and Multiprocessor Capabilities

Dan Negrut
Simulation Based Engineering Lab, Department of Mechanical Engineering

University of Wisconsin, Madison, WI, 53706

Laurent Jay
Department of Mathematics

University of Iowa, Iowa City, IA, 52242

Alessandro Tasora
Department of Industrial Engineering

University of Parma, V.G. Usberti 181/A, 43100, Parma, Italy

Mihai Anitescu
Mathematics and Computer Science Division

Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439

Hammad Mazhar, Toby Heyn, Arman Pazouki
Simulation Based Engineering Lab, Department of Mechanical Engineering

University of Wisconsin, Madison, WI, 53706

Abstract  This paper describes an approach for the dynamic simulation of computer-aided engineering models in which large collections of rigid bodies interact through millions of frictional contacts and bilateral mechanical constraints. Thanks to the massive parallelism available on modern GPUs, we are able to simulate sand, granular materials, and other complex physical scenarios with a one-order-of-magnitude speedup over a sequential CPU-based implementation of the discussed algorithms.

1. Introduction, Problem Statement, and Context  The ability to efficiently and accurately simulate the dynamics of rigid multibody systems is relevant in computer-aided engineering design, virtual reality, video games, and computer graphics. Devices composed of rigid bodies interacting through frictional contacts and mechanical joints pose numerical solution challenges because of the discontinuous nature of the motion. Consequently, even relatively small systems composed of a few hundred parts and constraints may require significant computational effort. More complex scenarios such as vehicles running on pebbles and sand as in Fig. 1 and Fig. 2, soil and rock dynamics, and the flow and packing of granular materials are particularly challenging and prone to very long simulation times. Results reported in [3] indicate that the most popular rigid body software for engineering simulation, which uses an approach based on the so-called Discrete Element Method, runs into significant difficulties when handling problems involving thousands of contact events.

Until recently, the high cost of parallel computing limited the analysis of such large systems to a small number of research groups. This is rapidly changing, owing in large part to general-purpose computing on the GPU. Another example of commercially available rigid body dynamics software is NVIDIA's PhysX [4], which is commonly used in real-time applications where performance (rather than accuracy) is the primary goal. The goal of our effort was to de-emphasize somewhat the efficiency attribute and instead implement an open-source, general-purpose, physics-based GPU solver for multibody dynamics backed by convergence results that guarantee the accuracy of the numerical solution.

Unlike the so-called penalty or regularization methods, where the frictional interaction can be represented by a collection of stiff springs combined with damping elements that act at the interface of the two bodies [5], the approach embraced here draws on time-stepping procedures producing weak solutions of the differential variational inequality (DVI) problem, which describes the time evolution of rigid bodies with impact, contact, friction, and bilateral constraints.

Figure 1: Chrono::Engine [1] simulation of a complex, rigid multibody mechanism with contacts and joints.

Figure 2: Chrono::Engine simulation of a tracked vehicle on granular soil. The GPU was used for both dynamics and collision detection between tracks, sprockets, and pebbles [2].

Early numerical methods based on DVI formulations can be traced back to the early 1980s and 1990s [6, 7, 8]. Recent approaches based on time-stepping schemes have included both acceleration-force linear complementarity problem (LCP) approaches [9, 10] and velocity-impulse, LCP-based time-stepping methods [11, 12, 13]. The LCPs, obtained as a result of the introduction of inequalities accounting for nonpenetration conditions in time-stepping schemes, coupled with a polyhedral approximation of the friction cone, must be solved at each time step in order to determine the system state configuration as well as the Lagrange multipliers representing the reaction forces [7, 11]. If the simulation entails a large number of contacts and rigid bodies, as is the case for granular materials, the computational burden of classical LCP solvers can become significant. Indeed, a well-known class of numerical methods for LCPs based on simplex methods, also known as direct or pivoting methods [14], may exhibit exponential worst-case complexity [15]. Moreover, the three-dimensional Coulomb friction case leads to a nonlinear complementarity problem (NCP). The use of a polyhedral approximation to transform the NCP into an LCP introduces unwanted anisotropy in the friction cones and significantly augments the size of the numerical problem [11, 12].

In order to circumvent the limitations imposed by the use of classical LCP solvers and the limited accuracy associated with polyhedral approximations of the friction cone, a parallel fixed-point iteration method with projection on a convex set has been developed [16]. The method is based on a time-stepping formulation that solves at every step a cone-constrained quadratic optimization problem [17]. The time-stepping scheme has been proved to converge in a measure differential inclusion sense to the solution of the original continuous-time DVI. This paper illustrates how this problem can be solved in parallel by exploiting the parallel computational resources available on NVIDIA's GPU cards.

2. Core Method  The formulation of the equations of motion, that is, the equations that govern the time evolution of a multibody system, is based on the so-called absolute, or Cartesian, representation of the attitude of each rigid body in the system. The state of the system is denoted by the generalized positions $q = [r_1^T, \epsilon_1^T, \ldots, r_{n_b}^T, \epsilon_{n_b}^T]^T \in \mathbb{R}^{7n_b}$ and their time derivatives $\dot{q} = [\dot{r}_1^T, \dot{\epsilon}_1^T, \ldots, \dot{r}_{n_b}^T, \dot{\epsilon}_{n_b}^T]^T \in \mathbb{R}^{7n_b}$, where $n_b$ is the number of bodies, $r_j$ is the absolute position of the center of mass of the $j$th body, and the quaternions (Euler parameters) $\epsilon_j$ are used to represent rotation and to avoid singularities. Instead of using quaternion derivatives in $\dot{q}$, it is more advantageous to work with angular velocities expressed in the local (body-attached) reference frames; in other words, the method described will use the vector of generalized velocities $v = [\dot{r}_1^T, \bar{\omega}_1^T, \ldots, \dot{r}_{n_b}^T, \bar{\omega}_{n_b}^T]^T \in \mathbb{R}^{6n_b}$. Note that the generalized velocity can be easily obtained as $\dot{q} = L(q)v$, where $L$ is a linear mapping that transforms each $\bar{\omega}_i$ into the corresponding quaternion derivative $\dot{\epsilon}_i$ by means of the linear algebra formula $\dot{\epsilon}_i = \frac{1}{2} G^T(q) \bar{\omega}_i$, with the $3 \times 4$ matrix $G(q)$ as defined in [18]. We denote by $f_A(t, q, v)$ the set of applied, or external, generalized forces.
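For concreteness, a common convention for the matrix $G$ (quoted here for the reader's convenience following the Euler-parameter algebra of [18]; the exact form is a convention choice, not something mandated by the method above) is, for one body with quaternion $\epsilon = [e_0, e_1, e_2, e_3]^T$:

\[
G = \begin{bmatrix} -e_1 & e_0 & e_3 & -e_2 \\ -e_2 & -e_3 & e_0 & e_1 \\ -e_3 & e_2 & -e_1 & e_0 \end{bmatrix},
\]

for which $G G^T = I_3$ on the unit-quaternion manifold, so that $\bar{\omega} = 2 G \dot{\epsilon}$ inverts to $\dot{\epsilon} = \frac{1}{2} G^T \bar{\omega}$.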

2.1. Bilateral constraints  Bilateral constraints represent kinematic pairs, for example spherical, prismatic, or revolute joints, and can be expressed as algebraic equations constraining the relative position of two bodies. Assuming a set $\mathcal{B}$ of constraints is present in the system, they lead to the scalar equations $\Psi_i(q, t) = 0$, $i \in \mathcal{B}$. Assuming smoothness of the constraint manifold, $\Psi_i(q, t)$ can be differentiated to obtain the Jacobian $\nabla_q \Psi_i = [\partial \Psi_i / \partial q]^T$. The notation $\nabla \Psi_i^T = \nabla_q \Psi_i^T \, L(q)$ will be used in what follows.

2.2. Contacts with friction  Given a large number of rigid bodies with different shapes, modern collision-detection algorithms are able to efficiently find a set of contact points, that is, points where a gap function $\Phi(q)$ can be defined for each pair of near-enough shape features. Where defined, such a gap function must satisfy the nonpenetration condition $\Phi(q) \geq 0$ for all contact points.

When a contact $i$ is active, that is, $\Phi_i(q) = 0$, a normal force and a tangential friction force act on each of the two bodies at the contact point. In terms of notation, $\mathcal{A}$ will denote the set of all active contacts for a given configuration $q$ of the system at time $t_l$. In fact, $\mathcal{A}(q, \epsilon)$ includes even potential contacts between bodies that are at $t_l$ within a distance $\epsilon$ of each other and might collide during the time step from $t_l$ to $t_{l+1}$. If no collision occurs, the algorithm will lead to zero normal/tangential forces for inactive collisions that were conservatively added to $\mathcal{A}(q, \epsilon)$.

We use the classical Coulomb friction model to define these forces [12]. If the contact is not active, that is, $\Phi_i(q) > 0$, no contact or friction forces exist. This implies that the mathematical description of the model leads to a complementarity problem [11]. Consider two bodies $A$ and $B$ in contact, as shown in Fig. 3. Let $n_i$ be the normal at the contact pointing toward the exterior of the body of lower index, which by convention is considered to be body $A$. Let $u_i$ and $w_i$ be two vectors in the contact plane such that $n_i, u_i, w_i \in \mathbb{R}^3$ are mutually orthonormal vectors. The frictional contact force is impressed on the system by means of multipliers $\gamma_{i,n} \geq 0$, $\gamma_{i,u}$, and $\gamma_{i,w}$, which lead to the normal component of the force $F_{i,N} = \gamma_{i,n} n_i$ and the tangential component of the force $F_{i,T} = \gamma_{i,u} u_i + \gamma_{i,w} w_i$. The Coulomb model is expressed by using the maximum dissipation principle:

\[
(\gamma_{i,u}, \gamma_{i,w}) = \operatorname*{argmin}_{\sqrt{\gamma_{i,u}^2 + \gamma_{i,w}^2} \,\leq\, \mu_i \gamma_{i,n}} \; v_{i,T}^T \left( \gamma_{i,u} u_i + \gamma_{i,w} w_i \right). \tag{1}
\]

2.3. The complete model  The time evolution of the dynamical system is governed by the following differential variational inequality [16]:

\[
\begin{aligned}
\dot{q} &= L(q)\, v \\
M \dot{v} &= f(t, q, v) + \sum_{i \in \mathcal{B}} \gamma_{i,b} \nabla \Psi_i + \sum_{i \in \mathcal{A}} \left( \gamma_{i,n} D_{i,n} + \gamma_{i,u} D_{i,u} + \gamma_{i,w} D_{i,w} \right) \\
i \in \mathcal{B}: &\quad \Psi_i(q, t) = 0 \\
i \in \mathcal{A}: &\quad \gamma_{i,n} \geq 0 \;\perp\; \Phi_i(q) \geq 0, \;\; \text{and} \\
&\quad (\gamma_{i,u}, \gamma_{i,w}) = \operatorname*{argmin}_{\mu_i \gamma_{i,n} \,\geq\, \sqrt{\gamma_{i,u}^2 + \gamma_{i,w}^2}} \; v^T \left( \gamma_{i,u} D_{i,u} + \gamma_{i,w} D_{i,w} \right).
\end{aligned} \tag{2}
\]

The tangent space generators $D_i = [D_{i,n}, D_{i,u}, D_{i,w}] \in \mathbb{R}^{6n_b \times 3}$ are sparse and are defined, given a pair of contacting bodies $A$ and $B$, as

\[
D_i^T = \left[ \; 0 \;\cdots\; -A_{i,p}^T \;\;\; A_{i,p}^T A_A \tilde{s}_{i,A} \;\cdots\; 0 \;\cdots\; A_{i,p}^T \;\; -A_{i,p}^T A_B \tilde{s}_{i,B} \;\cdots\; 0 \; \right], \tag{3}
\]

with the two nonzero block pairs located at the columns corresponding to bodies $A$ and $B$, respectively. Here, using the notation in Fig. 3, $A_A$ is the orientation matrix associated with body $A$, $A_{i,p} = [n_i, u_i, w_i]$ is the $\mathbb{R}^{3 \times 3}$ matrix of the local coordinates of the $i$th contact, and the vectors $s_{i,A}$ and $s_{i,B}$ are the contact point positions in body coordinates. A tilde $\tilde{x}$ over a vector $x \in \mathbb{R}^3$ represents the skew-symmetric matrix associated with the cross product [18].

3. The time-stepping scheme  Given a position $q^{(l)}$ and velocity $v^{(l)}$ at the time step $t^{(l)}$, the numerical solution is found at the new time step $t^{(l+1)} = t^{(l)} + h$ by solving the following optimization problem with equilibrium constraints [19]:

\[
M (v^{(l+1)} - v^{(l)}) = h f(t^{(l)}, q^{(l)}, v^{(l)}) + \sum_{i \in \mathcal{B}} \gamma_{i,b} \nabla \Psi_i + \sum_{i \in \mathcal{A}} \left( \gamma_{i,n} D_{i,n} + \gamma_{i,u} D_{i,u} + \gamma_{i,w} D_{i,w} \right), \tag{4}
\]
\[
i \in \mathcal{B}: \quad \tfrac{1}{h} \Psi_i(q^{(l)}, t) + \nabla \Psi_i^T v^{(l+1)} + \tfrac{\partial \Psi_i}{\partial t} = 0, \tag{5}
\]
\[
i \in \mathcal{A}: \quad 0 \leq \tfrac{1}{h} \Phi_i(q^{(l)}) + D_{i,n}^T v^{(l+1)} \;\perp\; \gamma_{i,n} \geq 0, \tag{6}
\]
\[
(\gamma_{i,u}, \gamma_{i,w}) = \operatorname*{argmin}_{\mu_i \gamma_{i,n} \,\geq\, \sqrt{\gamma_{i,u}^2 + \gamma_{i,w}^2}} \; v^{(l+1),T} \left( \gamma_{i,u} D_{i,u} + \gamma_{i,w} D_{i,w} \right), \tag{7}
\]
\[
q^{(l+1)} = q^{(l)} + h L(q^{(l)}) v^{(l+1)}. \tag{8}
\]

Here, $\gamma_s$ represents the constraint impulse of a contact constraint, that is, the corresponding reaction force scaled by the step size $h$, for $s = n, u, w$. The $\frac{1}{h} \Phi_i(q^{(l)})$ term achieves constraint stabilization; its effect is discussed in [20]. Similarly, the term $\frac{1}{h} \Psi_i(q^{(l)})$ achieves stabilization for the bilateral constraints. The scheme converges to the solution of a measure differential inclusion [17] as the step size $h \to 0$.

The proposed approach casts the problem as a monotone optimization problem through a relaxation over the complementarity constraints, replacing Eq. (6) with

\[
i \in \mathcal{A}: \quad 0 \leq \tfrac{1}{h} \Phi_i(q^{(l)}) + D_{i,n}^T v^{(l+1)} - \mu_i \sqrt{ (v^T D_{i,u})^2 + (v^T D_{i,w})^2 } \;\perp\; \gamma_{i,n} \geq 0.
\]


Figure 3: Contact $i$ between two bodies $A, B \in \{1, 2, \ldots, n_b\}$. The contact frame is spanned by the mutually orthonormal vectors $n_i$, $u_i$, $w_i$; $r_A$ and $r_B$ are the body positions, and $s_{i,A}$ and $s_{i,B}$ are the contact points in body coordinates.

The solution of the modified time-stepping scheme approaches, for $h \to 0$, the solution of the same measure differential inclusion as the original scheme [17]; yet in some situations, for large $h$, $\mu$, or relative velocity $v^{(l+1)}$, i.e., when not in an asymptotic regime, this relaxation can introduce motion oscillations. It was shown in [16] that the modified scheme is a cone complementarity problem (CCP), which can be solved efficiently by an iterative numerical method that relies on projected contractive maps. Omitting for brevity some of the details discussed in [16, 21], we note that the algorithm makes use of the following vectors:

\[
k \equiv M v^{(l)} + h f(t^{(l)}, q^{(l)}, v^{(l)}), \tag{9}
\]
\[
b_i \equiv \left\{ \tfrac{1}{h} \Phi_i(q^{(l)}),\, 0,\, 0 \right\}^T, \quad i \in \mathcal{A}, \tag{10}
\]
\[
b_i \equiv \tfrac{1}{h} \Psi_i(q^{(l)}, t) + \tfrac{\partial \Psi_i}{\partial t}, \quad i \in \mathcal{B}. \tag{11}
\]

The solution, in terms of the dual variables of the CCP (the multipliers), is obtained by iterating the following contraction maps until convergence [19]:

\[
\forall i \in \mathcal{A}: \quad \gamma_i^{r+1} = \Pi_{\Upsilon_i} \left[ \gamma_i^r - \omega \eta_i \left( D_i^T v^r + b_i \right) \right], \tag{12}
\]
\[
\forall i \in \mathcal{B}: \quad \gamma_i^{r+1} = \Pi_{\Upsilon_i} \left[ \gamma_i^r - \omega \eta_i \left( \nabla \Psi_i^T v^r + b_i \right) \right]. \tag{13}
\]

At each iteration $r$, before repeating (12) and (13), the primal variables (the velocities) are also updated as

\[
v^{r+1} = M^{-1} \left( \sum_{z \in \mathcal{A}} D_z \gamma_z^{r+1} + \sum_{z \in \mathcal{B}} \nabla \Psi_z \gamma_z^{r+1} + k \right). \tag{14}
\]

Note that the superscript $(l+1)$ was omitted. Interested readers are referred to [16] for a proof of the convergence of this method.
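The projection $\Pi_{\Upsilon_i}$ onto the friction cone admits a simple closed form, which is what makes iteration (12) cheap. The sketch below is our illustration of the three cases (point inside the cone, point inside the polar cone, projection onto the cone surface); variable names are hypothetical, and this is not quoted from the solver's source:

    #include <math.h>

    // Project the multiplier triplet (gn, gu, gw) of one contact onto the
    // Coulomb cone sqrt(gu^2 + gw^2) <= mu*gn (minimal sketch, not library code).
    __host__ __device__ inline
    void project_onto_cone(float mu, float& gn, float& gu, float& gw)
    {
        float t = sqrtf(gu * gu + gw * gw);   // magnitude of the tangential part
        if (t <= mu * gn)
            return;                           // already inside the cone: unchanged
        if (mu * t <= -gn) {                  // inside the polar cone: map to apex
            gn = gu = gw = 0.0f;
            return;
        }
        // remaining case: orthogonal projection onto the cone surface
        float gn_proj = (mu * t + gn) / (mu * mu + 1.0f);
        float scale   = mu * gn_proj / t;     // t > 0 is guaranteed in this branch
        gn = gn_proj;
        gu *= scale;
        gw *= scale;
    }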

4. Algorithms, Implementations, and Evaluations  A detailed analysis of the computational bottlenecks in the proposed multibody dynamics analysis method reveals that the CCP solution and the prerequisite collision detection represent, in this order, the most compute-intensive tasks of the numerical solution at each integration (simulation) time step. This section concentrates on two approaches that expose a level of fine-grained parallelism that allows an efficient implementation of these two tasks on the GPU.

4.1. Parallel multibody dynamics GPU solver

Buffers for data structures  The data structures on the GPU are implemented as large arrays (buffers) to match the execution model associated with NVIDIA's CUDA. Four main buffers are used: the contacts buffer, the constraints buffer, the reduction buffer, and the bodies buffer. The data structure for the contacts has been mapped into columns of four floats, as shown in Fig. 4. Each contact references its two touching bodies through the two pointers $B_A$ and $B_B$, in the fourth and seventh rows of the contact data structure. There is no need to store the entire $D_i$ matrix for the $i$th contact because most of its entries are zero, except for the two $12 \times 3$ blocks corresponding to the coordinates of the two bodies in contact. In fact, once the velocities of the two bodies, $\dot{r}_{A_i}, \bar{\omega}_{A_i}$ and $\dot{r}_{B_i}, \bar{\omega}_{B_i}$, have been fetched, the product $D_i^T v^r$ in Eq. (12) can be performed as

\[
D_i^T v^r = D_{i,vA}^T \dot{r}_{A_i} + D_{i,\omega A}^T \bar{\omega}_{A_i} + D_{i,vB}^T \dot{r}_{B_i} + D_{i,\omega B}^T \bar{\omega}_{B_i}, \tag{15}
\]

with the adoption of the following $3 \times 3$ matrices:

\[
D_{i,vA}^T = -A_{i,p}^T, \quad D_{i,\omega A}^T = A_{i,p}^T A_A \tilde{s}_{i,A}, \quad D_{i,vB}^T = A_{i,p}^T, \quad D_{i,\omega B}^T = -A_{i,p}^T A_B \tilde{s}_{i,B}. \tag{16}
\]

Since $D_{i,vA}^T = -D_{i,vB}^T$, there is no need to store both matrices. Therefore, in each contact data structure only a matrix $D_{i,vAB}^T$ is stored, which is then used with opposite signs for each of the two bodies.
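As an illustration of Eq. (15) and Eq. (16), the per-contact product $D_i^T v^r$ reduces to a handful of small matrix-vector products. The sketch below (our code, with hypothetical minimal types) also shows how the shared block serves both bodies with opposite signs:

    struct vec3  { float x, y, z; };
    struct mat33 { float m[3][3]; };

    // 3x3 matrix times 3-vector
    vec3 mul(const mat33& M, const vec3& v) {
        return { M.m[0][0]*v.x + M.m[0][1]*v.y + M.m[0][2]*v.z,
                 M.m[1][0]*v.x + M.m[1][1]*v.y + M.m[1][2]*v.z,
                 M.m[2][0]*v.x + M.m[2][1]*v.y + M.m[2][2]*v.z };
    }
    vec3 add(vec3 a, vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
    vec3 sub(vec3 a, vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

    // Eq. (15) with the stored sparse blocks. Taking the stored block to be
    // D_vAB^T = D_vB^T = -D_vA^T, the two linear-velocity terms collapse:
    //   D_vA^T vA + D_vB^T vB = D_vAB^T (vB - vA).
    vec3 contact_DT_times_v(const mat33& D_vAB, const mat33& D_wA, const mat33& D_wB,
                            vec3 vA, vec3 wA, vec3 vB, vec3 wB) {
        return add(add(mul(D_vAB, sub(vB, vA)), mul(D_wA, wA)), mul(D_wB, wB));
    }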

The velocity update vector $\Delta v_i$, needed for the sum in Eq. (14), is also sparse: it can be decomposed into small subvectors. Specifically, given the masses and the inertia tensors of the two bodies, $m_{A_i}, m_{B_i}$ and $J_{A_i}, J_{B_i}$, the term $\Delta v_i$ is computed and stored in four parts as follows:

\[
\Delta \dot{r}_{A_i} = m_{A_i}^{-1} D_{i,vA} \Delta \gamma_i^{r+1}, \quad \Delta \bar{\omega}_{A_i} = J_{A_i}^{-1} D_{i,\omega A} \Delta \gamma_i^{r+1},
\]
\[
\Delta \dot{r}_{B_i} = m_{B_i}^{-1} D_{i,vB} \Delta \gamma_i^{r+1}, \quad \Delta \bar{\omega}_{B_i} = J_{B_i}^{-1} D_{i,\omega B} \Delta \gamma_i^{r+1}. \tag{17}
\]

Note that these four parts of the $\Delta v_i$ terms are not stored in the $i$th contact data structure or in the data structures of the two referenced bodies (because multiple contacts may refer to the same body, they would overwrite the same memory position). These velocity updates are instead stored in the reduction buffer, which is used to efficiently perform the summation in Eq. (14), as discussed shortly.

The constraints buffer, shown in Fig. 5, is based on a similar concept. The Jacobians $\nabla \Psi_i$ of all scalar constraints are stored in a sparse format, each corresponding to the four rows $\nabla \Psi_{i,vA}$, $\nabla \Psi_{i,\omega A}$, $\nabla \Psi_{i,vB}$, $\nabla \Psi_{i,\omega B}$. Therefore the product $\nabla \Psi_i^T v^r$ in Eq. (13) can be performed as the scalar value

\[
\nabla \Psi_i^T v^r = \nabla \Psi_{i,vA}^T \dot{r}_{A_i} + \nabla \Psi_{i,\omega A}^T \bar{\omega}_{A_i} + \nabla \Psi_{i,vB}^T \dot{r}_{B_i} + \nabla \Psi_{i,\omega B}^T \bar{\omega}_{B_i}.
\]

Also, the four parts of the sparse vector $\Delta v_i$ can be computed and stored as

\[
\Delta \dot{r}_{A_i} = m_{A_i}^{-1} \nabla \Psi_{i,vA} \Delta \gamma_i^{r+1}, \quad \Delta \bar{\omega}_{A_i} = J_{A_i}^{-1} \nabla \Psi_{i,\omega A} \Delta \gamma_i^{r+1},
\]
\[
\Delta \dot{r}_{B_i} = m_{B_i}^{-1} \nabla \Psi_{i,vB} \Delta \gamma_i^{r+1}, \quad \Delta \bar{\omega}_{B_i} = J_{B_i}^{-1} \nabla \Psi_{i,\omega B} \Delta \gamma_i^{r+1}. \tag{18}
\]

Figure 6 shows that each body is represented by a data structure containing the state (velocity and position), the mass and mass moments of inertia, and the external applied force $F_j$ and torque $C_j$. Note that to speed up the iteration, it is advantageous to store the inverses of the mass and inertia rather than their original values, because the operation $M^{-1} D_i \Delta \gamma_i^{r+1}$ must be performed multiple times.
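One plausible C++ mirror of the per-body record of Fig. 6 follows; the packing below is our guess at a layout consistent with the description, and the field names are hypothetical:

    #include <vector_types.h>   // CUDA float4

    // Seven float4 rows per body, as sketched in Fig. 6.
    struct BodyData {
        float4 vel;      // linear velocity v_j (x, y, z; w unused)
        float4 omega;    // angular velocity in the body frame
        float4 pos;      // center-of-mass position x_j
        float4 quat;     // rotation quaternion (Euler parameters) rho_j
        float4 invMass;  // inverse inertia diagonal J_j^-1 and inverse mass m_j^-1
        float4 force;    // accumulated external force F_j
        float4 torque;   // accumulated external torque C_j (w could hold index R_j)
    };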

The parallel algorithm  A parallelization of the computations in Eq. (12) and Eq. (13) is easily implemented by simply assigning one contact per thread (and, similarly, one constraint per thread). In fact, the results of these computations do not overlap in memory, so two parallel threads never need to write to the same memory location at the same time. These are the two most numerically intensive steps of the CCP solver, called the CCP contact iteration kernel and the CCP constraint iteration kernel.

However, the sums in Eq. (14) cannot be performed with an embarrassingly parallel implementation: two or more contacts may need to add their velocity updates to the same rigid body. A possible approach to overcome this problem, in a similar setting, is presented in [22]. We adopted an alternative, more general method based on the parallel segmented scan algorithm [23], operating on an intermediate reduction buffer (Fig. 7); this method sums the values in the buffer using a binary-tree approach that keeps the computational load well balanced among the many thread processors. In the example of Fig. 7, the first constraint refers to bodies 0 and 1, the second to bodies 0 and 2; the multiple updates to body 0 are then accumulated with a parallel segmented reduction, as sketched below.
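If the per-update body indices are kept sorted so that updates targeting the same body sit contiguously, the same accumulation can also be phrased with a stock primitive. The sketch below uses Thrust's reduce_by_key instead of the segmented scan of [23]; it is our illustration, not the implementation described above:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    struct Vec3 { float x, y, z; };
    struct Vec3Add {
        __host__ __device__ Vec3 operator()(const Vec3& a, const Vec3& b) const {
            return { a.x + b.x, a.y + b.y, a.z + b.z };
        }
    };

    // Sum all velocity updates that target the same body.
    // body_ids: one body index per update, sorted ascending; dv: matching updates.
    void accumulate_updates(const thrust::device_vector<int>& body_ids,
                            const thrust::device_vector<Vec3>& dv,
                            thrust::device_vector<int>& unique_ids,   // output
                            thrust::device_vector<Vec3>& summed_dv) { // output
        thrust::reduce_by_key(body_ids.begin(), body_ids.end(), dv.begin(),
                              unique_ids.begin(), summed_dv.begin(),
                              thrust::equal_to<int>(), Vec3Add());
    }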

The following pseudocode shows the sequence of the main computational phases at each time step, for the most part executed as parallel kernels on the GPU.

Algorithm 1: Time Stepping Using GPU

1. (GPU; see Section 4.2) Collision detection. Perform collision detection between bodies, obtaining $n_\mathcal{A}$ possible contact points within a distance $\delta$, as contact positions $s_{i,A}$, $s_{i,B}$ on the two touching surfaces, and normals $n_i$.

2. (GPU, body-parallel) Force kernel. For each body, compute the forces $f(t^{(l)}, q^{(l)}, v^{(l)})$, if any (for example, gravity). Store these forces and torques in $F_j$ and $C_j$.

3. (GPU, contact-parallel) Contact preprocessing kernel. For each contact, given the contact normal and position, compute in place the matrices $D_{i,vA}^T$, $D_{i,\omega A}^T$, and $D_{i,\omega B}^T$. Then compute $\eta_i$ and the contact residual $b_i = \{\frac{1}{h} \Phi_i(q), 0, 0\}^T$.

4. (GPU, body-parallel) CCP force kernel. For each body $j$, initialize the body velocities: $\dot{r}_j^{(l+1)} = h\, m_j^{-1} F_j$ and $\bar{\omega}_j^{(l+1)} = h\, J_j^{-1} C_j$.

Figure 4: Grid of data structures for frictional contacts, in GPU memory.

Figure 5: Grid of data structures for scalar constraints, in GPU memory.

Figure 6: Grid of data structures for rigid bodies, in GPU memory.

Figure 7: The reduction buffer avoids race conditions in parallel updates of the same body state.

5. (GPU, contact-parallel) CCP contact iteration kernel. For each contact $i$, compute

$\gamma_i^{r+1} = \lambda \, \Pi_{\Upsilon_i} \left( \gamma_i^r - \omega \eta_i (D_i^T v^r + b_i) \right) + (1 - \lambda) \gamma_i^r$.

Note that $D_i^T v^r$ is evaluated with sparse data, using Eq. (15). Store $\Delta \gamma_i^{r+1} = \gamma_i^{r+1} - \gamma_i^r$ in the contacts buffer. Compute the sparse updates to the velocities of the two connected bodies $A$ and $B$, and store them in the $R_{i,A}$ and $R_{i,B}$ slots of the reduction buffer.

6. (GPU, constraint-parallel) CCP constraint iteration kernel. For each constraint $i$, compute

$\gamma_i^{r+1} = \lambda \left( \gamma_i^r - \omega \eta_i (\nabla \Psi_i^T v^r + b_i) \right) + (1 - \lambda) \gamma_i^r$.

Store $\Delta \gamma_i^{r+1} = \gamma_i^{r+1} - \gamma_i^r$ in the constraints buffer. Compute the sparse updates to the velocities of the two connected bodies $A$ and $B$, and store them in the $R_{i,A}$ and $R_{i,B}$ slots of the reduction buffer.

7. (GPU, reduction-slot-parallel) Segmented reduction kernel. Sum all the $\Delta \dot{r}_i$, $\Delta \bar{\omega}_i$ terms belonging to the same body in the reduction buffer.

8. (GPU, body-parallel) Body velocity updates kernel. For each body $j$, add the cumulative velocity updates fetched from the reduction buffer, using the index $R_j$.

9. Repeat from step 5 until convergence, or until the number of CCP steps exceeds $r_{max}$.

10. (GPU, body-parallel) Time integration kernel. For each body $j$, perform the time integration $q_j^{(l+1)} = q_j^{(l)} + h L(q_j^{(l)}) v_j^{(l+1)}$.

11. (Host, serial) If needed, copy the body, contact, and constraint data structures from the GPU to host memory.

4.2. Parallel collision detection algorithm  The 3D collision detection algorithm implemented performs a two-level spatial subdivision using axis-aligned bounding boxes. The first partitioning occurs at the CPU level and yields a relatively small number of large boxes. The second partitioning of each of these boxes occurs at the GPU level, leading to a large number of small bins. The GPU 3D collision detection, which handles spheres, ellipsoids, and planes, occurs in parallel at the bin level. Any other geometry is represented as a collection of these primitives using a padding (decomposition) process presented in detail in [2]. Several kernel calls build on each other to eventually enable, in a one-thread-per-bin GPU parallel fashion, an exhaustive collision detection process in which thread $i$ checks for collisions between all the bodies that happen to intersect the associated bin $i$. This requires $O(b_i^2)$ computational effort, where $b_i$ represents the number of bodies touching bin $i$. The value of $b_i$ is controlled by an appropriate selection of the bin size. Figure 8 illustrates a typical collision detection scenario and is used in what follows to outline the nine stages of the proposed approach.

Stage 1. The process begins by identifying all object-to-bin intersections. As Figure 9 shows, an object (body) can intersect, or touch, more than one bin; there is no limit to how many such intersections take place. The minimum and maximum bounding points of each object are determined and placed in their respective bins. For example, Fig. 9 shows that object 4's minimum point lies in B4 and its maximum point in A5. The entire object must fit between the minimum and maximum points; therefore the number of bins that the object intersects can be determined quickly by counting the number of bins between the two points along each axis and multiplying these counts. In this case the number is 4. The number of bins touched by each body is saved into an array $T$ (see Fig. 10), of size equal to the number of objects $N$.
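In code, the Stage 1 count for one object follows directly from its two bounding points; a minimal sketch (our illustration, assuming an axis-aligned grid with bin edge length h and grid-local, non-negative coordinates):

    struct vec3f { float x, y, z; };

    // Number of bins an object's axis-aligned bounding box touches.
    int bins_touched(vec3f bb_min, vec3f bb_max, float h)
    {
        int nx = (int)(bb_max.x / h) - (int)(bb_min.x / h) + 1;
        int ny = (int)(bb_max.y / h) - (int)(bb_min.y / h) + 1;
        int nz = (int)(bb_max.z / h) - (int)(bb_min.z / h) + 1;
        return nx * ny * nz;   // in the 2D example of Fig. 9: 2 x 2 = 4 bins
    }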

Figure 8: Two-dimensional example used to introduce the nine stages of the collision detection process. The grid is aligned with a global Cartesian reference frame.

Figure 9: Minimum and maximum bounds of an object, based on the spatial subdivision in Fig. 8.

Stage 2. Next, we perform an inclusive parallel prefix sum on $T$. The CUDA-based Thrust library implementation [24] of the scan algorithm operates on $T$ to return in $S$ (see Fig. 11) the memory offset information.
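With Thrust, this stage is essentially a one-liner; for example:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // S[i] = T[0] + ... + T[i]; the last element of S is then the total
    // number of object-bin intersections.
    void compute_offsets(const thrust::device_vector<unsigned int>& T,
                         thrust::device_vector<unsigned int>& S) {
        thrust::inclusive_scan(T.begin(), T.end(), S.begin());
    }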

Stage 3. An array $B$ (see Fig. 12) is first allocated, of size equal to the value of the last element in $S$. This value is equal to the total number of object-bin intersections. Each element in $B$ is set to a key-value pair of two unsigned integers: the key is the bin id and the value is the object id. As Fig. 18 shows, objects not fully contained within the outer edge of the grid are restricted so that their maximum bound cannot exceed the bounds of the uniform grid. The per-body parallel process used to determine the object-bin (i.e., key-value) pairs is essentially the same as in Stage 1, with the caveat that this information is now saved in $B$ rather than just being counted in $T$. In this stage, the memory offsets contained in $S$ are used so that the thread associated with each body can write its data to the correct location in $B$.

Stage 4. The key-value array $B$ is sorted in this stage by key, that is, by bin id. This effectively inverts the body-to-bin mapping to a bin-to-body mapping by grouping together all bodies in a given bin for further processing. The stage draws on the GPU-based radix sort from the Thrust library [24].
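Keeping the keys (bin ids) and values (object ids) in two parallel arrays, the Stage 4 sort maps directly onto Thrust's sort_by_key; a sketch:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    // Sort object ids by bin id so that all objects sharing a bin end up contiguous.
    void sort_object_bin_pairs(thrust::device_vector<unsigned int>& bin_ids,
                               thrust::device_vector<unsigned int>& object_ids) {
        thrust::sort_by_key(bin_ids.begin(), bin_ids.end(), object_ids.begin());
    }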

Stage 5. Next, we identify in parallel the start of each bin in the sorted array $B$ by using the pseudocode in Fig. 16. The number of threads used to this end equals the number of elements in $B$, i.e., the number of object-bin intersections. Each thread reads the current and previous bin values; if these values differ, the start of a bin has been detected. The first thread reads only the first element and records it as the initial value. The starting positions for each bin are written into an array $C$ of key-value pairs, of size equal to the number of bins in the 3D grid. When the start of a bin is found in array $B$, the thread and bin id are saved as the key and value, respectively. This pair is written to the element in $C$ indexed by the bin id. Note that not all bins are active. Inactive bins (i.e., bins touched by zero or one bodies) are set to 0xffffffff, the largest value representable as a 32-bit unsigned integer. Figure 14 shows the outcome of this stage.

Figure 10: Array $T$ with $N$ entries, based on the spatial subdivision in Fig. 8.

Figure 11: Result of the prefix sum operation on $T$, based on the spatial subdivision in Fig. 8. Each entry represents an object's offset based on the number of bins it touches.

Figure 12: Array $B$, based on the spatial subdivision in Fig. 8.

Figure 14: Array $C$, based on the spatial subdivision in Fig. 8.

Figure 15: Sorted array $C$, based on the spatial subdivision in Fig. 8.

Figure 16: Pseudocode: Bin starting index computation.
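A CUDA rendering of the pseudocode in Fig. 16 might look as follows. This is our sketch and it simplifies the bookkeeping: it records, per bin, the index in $B$ where that bin's entries start, with $C$ assumed pre-filled with 0xffffffff on the host:

    // bin_ids: the keys of the sorted array B (one per object-bin intersection).
    __global__ void find_bin_starts(const unsigned int* bin_ids,
                                    unsigned int* C,      // one slot per bin
                                    int n_intersections)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n_intersections) return;
        unsigned int bin = bin_ids[t];
        // a bin starts wherever the key differs from its predecessor
        if (t == 0 || bin != bin_ids[t - 1])
            C[bin] = t;
    }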

Stage 6. The array $C$ is next radix-sorted [24] by key. Consequently, the inactive bins (identified by the 0xffffffff entries, represented for brevity as 0xfff in Fig. 15) "migrate" to the end of the array.

Stage 7. The total number of active bins is determined next by finding the index in the sorted array $C$ of the first occurrence of 0xffffffff. Determining this index allows memory and threads to be budgeted exactly, so no threads are wasted on inactive bins. One GPU thread is assigned in this stage to each active bin to perform an exhaustive, brute-force, bin-parallel collision detection for the purpose of only counting the collision events. By carefully selecting the bin size, the number of objects being tested for collisions is expected to be small, i.e., on average in the range of 3 to 4 objects per bin. After counting the total number of collisions in its bin, the thread writes that tally into an unsigned integer array $D$ of size equal to the number of active bins.

The more involved algorithm for counting and subsequently computing ellipsoid collision information is described in detail in [25]. For spheres, the algorithm checks for collisions by calculating the distance between the objects: a contact can occur only when the distance between the spheres' centers is less than or equal to the sum of their radii. Because one object can be contained within more than one bin, checks were implemented to prevent double counting. Since the midpoint of a collision volume can be contained in only one bin, only one thread (the one associated with that bin) will register/count a given collision event. For example, in order to determine the midpoint of the collision volume we use the vector from the centroid of object 4 to the centroid of object 7; see Fig. 17. The points where this vector intersects each object define a segment; the location of the middle of this segment is used to decide the unique bin that claims ownership of the contact. If one object is completely inside the other, the midpoint of the collision volume is taken as the centroid of the smaller object. Using this process, the number of collisions is counted for each bin and written to $D$. This stage is outlined in the pseudocode in Fig. 19.
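For the sphere-sphere case, the contact test plus the single-owner rule can be sketched as below (our illustration; row-major bin indexing and grid-local, non-negative coordinates are assumed, and the full-containment branch discussed above is omitted):

    #include <math.h>

    struct Sphere { float x, y, z, r; };

    // Returns 1 if spheres a and b touch AND the midpoint of their collision
    // volume falls into bin `my_bin`, so exactly one thread counts the contact.
    int owns_contact(Sphere a, Sphere b, float h, int nbins_x, int nbins_y, int my_bin)
    {
        float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
        float d  = sqrtf(dx * dx + dy * dy + dz * dz);
        if (d > a.r + b.r) return 0;           // centers too far apart: no contact
        if (d <= fabsf(a.r - b.r)) return 0;   // full containment: handled separately
        // midpoint of the segment cut out of the center-to-center line
        float s  = 0.5f * (d + a.r - b.r);     // distance from a's center to midpoint
        float mx = a.x + dx / d * s;
        float my = a.y + dy / d * s;
        float mz = a.z + dz / d * s;
        int bin = (int)(mx / h)
                + nbins_x * ((int)(my / h) + nbins_y * (int)(mz / h));
        return bin == my_bin;                  // only the owning bin's thread counts it
    }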


Figure 13: Sorted array $B$, based on the spatial subdivision in Fig. 8.

Figure 17: Center of the collision volume, based on the spatial subdivision in Fig. 8.

Figure 18: Max bound is constrained to bin A5.

Figure 19: Pseudocode: Determine number of collisions.

Figure 20: Pseudocode: Computing collision data.


Stage 8. We next perform an inclusive parallel prefix scan operation [24] on $D$. This returns an array $E$ whose last element is the total number of collisions in the uniform grid, a value that allows the exact amount of memory to be allocated in the next stage.

Stage 9. The final stage of the collision detection algorithm computes the actual contact information. To this end, an array of contact information structures $F$ is allocated, with a size equal to the value of the last element in $E$. The collision pairs are then found by using the algorithm outlined in Stage 7. Instead of simply counting the number of collisions, the actual contact information is computed and written to its respective place in $F$; see the pseudocode in Fig. 20.

5. Final Evaluation  The GPU iterative solver and the GPU collision detection outlined herein have been embedded in our C++ simulation software Chrono::Engine. We tested the GPU-based parallel method on benchmark problems and compared it with the serial implementation in terms of efficiency.

For the results in Table 1, we simulated densely packed spheres flowing from a silo. The CPU was an Intel Xeon 2.66 GHz; the GPU was an NVIDIA Tesla C1060. The simulation time increases linearly with the number of bodies in the model, and the GPU algorithm is at least one order of magnitude faster than the serial algorithm.

The test of Fig. 21 simulates 1 million rigid bodies inside a tank being shaken horizontally. This represents, to date, the largest multibody dynamics problem solved on one GPU card. The track system shown in Fig. 2 was exercised on granular terrain made up of more than 480,000 bodies.

6. Validation against and comparison with state-of-the-art sequential collision detection  A first set of experiments was carried out to validate the implementation of the algorithm using various collections of spheres that display a wide spectrum of collision scenarios: disjoint spheres, spheres fully containing other spheres, spheres barely touching each other, and spheres in contact but without full containment. The first column of Table 2 reports the number of objects for five scenarios. For each scenario, the table reports the difference between the reference algorithm and the implemented algorithm in the total number of contacts identified, together with the average error and standard deviation of the contact distance, contact unit normal, and point of contact. The reference algorithm used for validation was the sequential (nonparallel) collision detection implementation available in the open source, state-of-the-art Bullet Physics Engine [26].

These results demonstrate that the error in the proposed algorithm, when compared with the CPU implementation, is of the order of single-precision round-off error. This is traced back to the fact that the CPU-based algorithm performs computations in double precision, while the GPU algorithm uses single-precision arithmetic. For all scenarios, the number of contacts was the same in both the CPU and GPU analyses.

A second set of numerical experiments was carried out to gauge the efficiency of the parallel collision detection algorithm developed. The reference used was the same sequential collision detection implementation from the Bullet Physics Engine. The CPU used in this experiment (relevant for the Bullet implementation) was an AMD Phenom II Black X4 940, a quad-core 3.0 GHz processor that drew on 16 GB of RAM. The GPU used was NVIDIA's Tesla C1060. The operating system was the 64-bit version of Windows 7. Two scenarios were considered. The first scenario determined how many contacts a single GPU could resolve before running out of memory. As Fig. 23 shows, approximately 22 million contacts were determined in less than 4 seconds. The second scenario gauged the relative speedup gained with respect to a serial implementation. This test stopped when dealing with about 6 million contacts (see the horizontal axis of Fig. 24), when Bullet ran into memory management issues. While providing a sufficient level of accuracy, the single-precision GPU algorithm, tailored to deal with sphere-to-sphere contact only, led to a relative speedup of up to 180. We emphasize that the factor of 180 does not reflect a GPU vs. CPU issue: the Bullet engine, the solution used prior to the proposed method, is a much more versatile collision detection engine, which in retrospect was unnecessarily run in double precision.

Figure 22: Collision time as the number of contacts increases.

6.1. Scaling analysis  Our second set of experiments was designed to illustrate the scaling of the parallel numerical solution and collision detection. The vehicle used to this end was the simulation of a cylindrical tank that had a constant height, with the radius varying with the number of spheres added to the tank.

Table 1: Benchmark test of the GPU CCP solver and GPU collision detection.

Number of Bodies   CPU CCP [s]   GPU CCP [s]   Speedup CCP   Speedup CD
16,000               7.11          0.57          12.59          4.67
32,000              16.01          1.00          16.07          6.14
64,000              34.60          1.97          17.58         10.35
128,000             76.82          4.55          16.90         21.71

Figure 21: Light ball floating on 1 million rigid bodies moving around in a tank while interacting through friction and contact.

Table 2: Errors computed by taking the Euclidean norm of the difference between the collision data from Bullet and the collision detection algorithm discussed. AE stands for Average Error; SD stands for Standard Deviation.

Spheres    Contacts    Contact Dist. Error [m]     Contact Normal Error [m]     Contact Point Error [m]
[x10^6]                AE [x10^-7]   SD [x10^-4]   AE [x10^-10]   SD [x10^-7]   AE [x10^-6]   SD [x10^-3]
1            462,108   1.46          2.48          0.82           2.21          2.73          2.98
2          1,015,556   0.74          2.91          1.91           2.15          2.37          3.35
3          1,379,397   1.69          3.52          2.75           2.26          3.58          4.09
4          1,530,309   5.49          4.14          2.33           2.24          1.94          4.78
5          1,995,548   6.35          4.38          1.09           2.23          3.10          5.09


Specifically, the number of spheres in the tank was increased with each simulation without increasing the fill depth of the tank. Instead, the radius of the cylinder was increased for each simulation based on the number of spheres and their packing factor. Each test was run on an NVIDIA Tesla C1060 until the number of collisions and the compute time per solution time step reached steady state. The results presented in Table 3 and graphed in Fig. 22 indicate that the overall algorithm scales linearly. Furthermore, the results suggest that the bulk of the computation at each time step was taken by the GPU dynamics solver, with a small amount of time taken up by the collision detection. These collision detection times are longer than the raw times presented earlier because of the pre- and post-processing required by the physics engine as it organizes data on the GPU for use between the solver and the collision detection.

Table 3: Total time taken per time step at steady state and the number of contacts associated with it.

Objects [x10^6]   Total Time [sec]   GPU Collision Detection [sec]   GPU Solver [sec]   Contacts
0.2               12.1190            1.0758                          10.5881              718,377
0.4               23.2806            1.9746                          20.4606            1,403,784
0.6               35.0433            2.9785                          30.7971            2,124,639
0.8               46.9516            4.0234                          41.2297            2,838,832
1.0               58.1518            4.9473                          51.1686            3,548,594

Figure 23: Collision time vs. contacts detected. This graph shows that when the algorithm is executed on a single GPU it scales linearly.

Figure 24: Overall speedup of the GPU algorithm relative to the CPU algorithm. The maximum speedup achieved was approximately 180 times.

7. Future Directions  The GPU dynamics engine proposed is more than one order of magnitude faster than a previously developed sequential implementation. The largest GPU simulation run to date had approximately 1.1 million bodies. Two barriers prevented the simulation of larger systems: first, we exhausted the GPU memory; second, we noticed convergence stalling in the Gauss-Jacobi algorithm for CCP problems with more than 15 million variables. In order to address these aspects, we are developing a distributed computing framework that leverages multiple GPUs, and we are investigating a minimal-residual-type Krylov method for the CCP solution. For the latter, GPU sparse preconditioning remains an open question.

Acknowledgments  We would like to thank Richard Tonge for the substantial feedback and assistance he provided in generating this manuscript. Financial support for D. Negrut was provided in part by National Science Foundation Awards CMMI-0700191 and CMMI-0840442. Financial support for A. Tasora was provided in part by the Italian Ministry of Education under the PRIN grant 2007Z7K4ZB. Mihai Anitescu was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357. We thank NVIDIA and Microsoft for sponsoring our research programs in the area of high-performance computing.

References

[1] A. Tasora. Chrono::Engine, an open-source physics-based dynamics simulation engine. Available online at www.deltaknowledge.com/chronoengine, 2006.

[2] T. Heyn. Simulation of Tracked Vehicles on Granular Terrain Leveraging GPU Computing. M.S. thesis, Department of Mechanical Engineering, University of Wisconsin-Madison, http://sbel.wisc.edu/documents/TobyHeynThesisfinal.pdf, 2009.

[3] J. Madsen, N. Pechdimaljian, and D. Negrut. Penalty versus complementarity-based frictional contact of rigid bodies: A CPU time comparison. Technical Report TR-2007-06, Simulation-Based Engineering Lab, University of Wisconsin, Madison, 2007.

[4] PhysX. NVIDIA PhysX for Developers. Available online at http://developer.nvidia.com/object/physx.html, 2010.

[5] Peng Song, Jong-Shi Pang, and Vijay Kumar. A semi-implicit time-stepping model for frictional compliant contact problems. International Journal for Numerical Methods in Engineering, 60(13):267-279, 2004.

[6] Jean J. Moreau. Standard inelastic shocks and the dynamics of unilateral constraints. In G. Del Piero and F. Maceri, editors, Unilateral Problems in Structural Analysis, pages 173-221, New York, 1983. CISM Courses and Lectures no. 288, Springer-Verlag.

[7] P. Lotstedt. Mechanical systems of rigid bodies subject to unilateral constraints. SIAM Journal on Applied Mathematics, 42(2):281-296, 1982.

[8] M. D. P. Monteiro Marques. Differential Inclusions in Nonsmooth Mechanical Problems: Shocks and Dry Friction, volume 9 of Progress in Nonlinear Differential Equations and Their Applications. Birkhauser Verlag, Basel, 1993.

[9] David Baraff. Issues in computing contact forces for non-penetrating rigid bodies. Algorithmica, 10:292-352, 1993.

[10] Jong-Shi Pang and Jeffrey C. Trinkle. Complementarity formulations and existence of solutions of dynamic multi-rigid-body contact problems with Coulomb friction. Mathematical Programming, 73(2):199-226, 1996.

[11] David E. Stewart and Jeffrey C. Trinkle. An implicit time-stepping scheme for rigid-body dynamics with inelastic collisions and Coulomb friction. International Journal for Numerical Methods in Engineering, 39:2673-2691, 1996.

[12] Mihai Anitescu and Florian A. Potra. Formulating dynamic multi-rigid-body contact problems with friction as solvable linear complementarity problems. Nonlinear Dynamics, 14:231-247, 1997.

[13] David E. Stewart. Rigid-body dynamics with friction and impact. SIAM Review, 42(1):3-39, 2000.

[14] Richard W. Cottle and George B. Dantzig. Complementary pivot theory of mathematical programming. Linear Algebra and Its Applications, 1:103-125, 1968.

[15] David Baraff. Fast contact force computation for nonpenetrating rigid bodies. In Computer Graphics (Proceedings of SIGGRAPH), pages 23-34, 1994.

[16] M. Anitescu and A. Tasora. An iterative approach for cone complementarity problems for nonsmooth dynamics. Computational Optimization and Applications, 47(2):207-235, 2010.

[17] Mihai Anitescu. Optimization-based simulation of nonsmooth rigid multibody dynamics. Mathematical Programming, 105(1):113-143, 2006.

[18] E. J. Haug. Computer-Aided Kinematics and Dynamics of Mechanical Systems, Volume I. Prentice-Hall, Englewood Cliffs, New Jersey, 1989.

[19] A. Tasora. A fast NCP solver for large rigid-body problems with contacts. In C. L. Bottasso, editor, Multibody Dynamics: Computational Methods and Applications, pages 45-55. Springer, 2008.

[20] Mihai Anitescu and Gary D. Hart. A constraint-stabilized time-stepping approach for rigid multibody dynamics with joints, contact and friction. International Journal for Numerical Methods in Engineering, 60(14):2335-2371, 2004.

[21] A. Tasora, D. Negrut, and M. Anitescu. Large-scale parallel multi-body dynamics with frictional contact on the graphical processing unit. Journal of Multi-body Dynamics, 222(4):315-326, 2008.

[22] T. Harada. Real-time rigid body simulation on GPUs. GPU Gems, 3:611-632, 2007.

[23] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, page 106. Eurographics Association, 2007.

[24] J. Hoberock and N. Bell. Thrust: A parallel template library. Available online at http://code.google.com/p/thrust/, 2009.

[25] A. Pazouki, H. Mazhar, and D. Negrut. Parallel ellipsoid collision detection with application in contact dynamics, DETC2010-29073. In Shuichi Fukuda and John G. Michopoulos, editors, Proceedings of the 30th Computers and Information in Engineering Conference. ASME International Design Engineering Technical Conferences (IDETC) and Computers and Information in Engineering Conference (CIE), 2010.

[26] Physics Simulation Forum. Bullet Physics Library. Available online at http://www.bulletphysics.com/Bullet/wordpress/bullet, 2008.

Proceedings of the 2011 NSF Engineering Research and Innovation Conference, Atlanta, Georgia. Grant #0700191.