
Large-scale powder mixer simulations using massively parallel GPU architectures

Charles A. Radeke a, Benjamin J. Glasser b, Johannes G. Khinast a,c,*

a Research Center Pharmaceutical Engineering GmbH, A-8010 Graz, Austria
b Rutgers, The State University of New Jersey, Department of Chemical and Biochemical Engineering, Piscataway, NJ 08854, USA
c Institute for Process- and Particle Engineering, Graz University of Technology, A-8010 Graz, Austria

Article info

Article history: Received 17 June 2010; received in revised form 10 September 2010; accepted 23 September 2010; available online 1 October 2010.

Keywords: DEM; GPU; Mixing; Parallel processing; Pharmaceuticals; Powder blending

Chemical Engineering Science 65 (2010) 6435–6442
doi:10.1016/j.ces.2010.09.035
0009-2509/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail address: [email protected] (J.G. Khinast). URL: http://ippt.tugraz.at

Abstract

Granular flows are extremely important for the pharmaceutical and chemical industries, as well as for other scientific areas. Thus, the understanding of the impact of particle size and related effects on the mean as well as on the fluctuating flow field in granular flows is critical for the design and optimization of powder processing operations.

We use a specialized simulation tool written in C and CUDA (Compute Unified Device Architecture), a massive parallelization technique which runs on the Graphics Processing Unit (GPU). We focus both on a new implementation approach using CUDA/GPU and on the flow fields and mixing properties obtained in the million-particle range.

We show that, using CUDA and GPUs, we are able to simulate granular flows involving several million particles significantly faster than with currently available software. Our simulation results are intended as a basis for enhanced DEM simulations in which fluid spraying, wetting and fluid spreading inside the powder bed are considered.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction and motivation

1.1. DEM—some considerations

In the last few years, discrete element model (DEM) simulations have been increasingly used to study and analyze flows of particulate systems, with the main emphasis on granular flows where the interaction between particles and a fluid phase may be neglected and only particle–particle interactions are considered. DEM was introduced by Cundall and Strack (1979). However, due to limited computational resources, it was then a challenge to run even small-system simulations of granular assemblies of a few hundred disks in two dimensions. Currently, three-dimensional simulations in the range of 10,000 to 200,000 particles are standard and can be achieved on PC hardware, workstations or clusters, allowing for simulation times of only a few minutes in the case of highly parallelized clusters. Nevertheless, users still have to choose between acceptable simulation times and the number of particles included in a DEM simulation. This is especially critical since in realistic systems the number of particles often surpasses 10⁹, which is far beyond current computational capacities.


Another substantial restriction of current methods is the detailed geometrical modeling of the particles. Most DEM codes approximate arbitrary particle shapes by the common sphere approach. However, the modeling of real particle shapes is critical for realistically simulating complex interaction phenomena in granular assemblies. For example, interlocking of particles cannot be reproduced well by adjusted friction or rolling resistance. Clustered-sphere particles can be used to incorporate the interlocking effect, but their kinetics differ from those of real particles, and a detailed description of particle–particle interactions suffers from the raspberry-like surface. Furthermore, the computational overhead is huge, and the number of simulated particles has to be divided by the number of sub-particles, typically in the range of 5–100. Some interesting approaches for true particle shape modeling are listed in Ketterhagen et al. (2009), Lee et al. (2009), Song et al. (2006), and Muth and Eberhard (2005).

Large-scale industrial flow simulations, which require specialized CPU-cluster and heterogeneous CPU/GPU-cluster hardware, can be found in Cleary (2009) and Chen et al. (2009). Xu et al. (2009) reported hybrid simulations of 60 million particles, where 480 GPUs were utilized in parallel. In contrast, in this study we present simulations of more than 7 million particles running on a single GPU.

This paper is intended to be a conceptual feasibility study that illustrates a new performance level of the Discrete Element Method.


1.2. Powder flows in pharmaceutical processing

Once the synthesis of an active pharmaceutical ingredient (API) is completed, crystallization and drying yield a powder that is mixed with excipients to formulate a drug product in the so-called secondary manufacturing process. Powder processing steps dominate during secondary manufacturing and include blending, granulation, drying, milling, roller compaction, tableting and other steps. In all these examples, the understanding of powder flows is critical to the design, optimization and scale-up of the process, with a focus on controlling particle size and shape, yielding acceptable blend quality and reducing segregation, as well as unwanted agglomeration or attrition. Mixing of granular materials (powder blending) is of great importance for pharmaceutical manufacturing, and is also an important processing step in areas as diverse as food technology, biotechnology, mineral processing, nuclear engineering, detergent manufacturing and the coal, steel and agrochemical industries.

In the pharmaceutical industry, efficient mixing is of critical importance to achieve a uniform concentration of the active ingredient, eliminating the risk of super-potent tablets or blanks. Current manufacturing protocols require mixing of a batch for a predefined amount of time. Process understanding, i.e., the ability to predict mixedness or the mixing endpoint as a function of revolutions or time, might drastically reduce processing time in many operations.

The acceptance of simulation tools depends on their capability to reflect realistic problems. Usually the grain size of powders in the pharmaceutical industry is between 10 and a few hundred micrometers. This means that there are billions of particles in only one liter of powder. Thus, it is important to develop computational technology that is capable of simulating realistic systems. This is the goal of our current study.

1.3. The GRPD code

In order to be able to simulate real-life granular systems, a new technology was implemented and a high-performance DEM code was developed, based on the newly available 'Compute Unified Device Architecture' (CUDA) technology (NVIDIA, 2008a, 2008b). It is called 'Graz-Rutgers Particle Dynamics' (GRPD). Code development started in 2008; the code is still under development and is intended to serve as a custom-made tool.

Currently, GRPD can simulate millions of particles on readily available hardware. The simulations can be directly visualized by a GUI using OpenGL display features. Based on the computational efficiency of our method, the development of large-scale simulations is highly convenient: in a first step, only O(10⁴) particles are used; subsequently, once initial tests indicate a successful setup of the problem, large-scale simulations can be started, where the maximum number of particles depends on the memory size of the installed graphics card. The simulations presented in this paper are executed on NVIDIA GT200-based graphics boards. At the present development stage of GRPD, it is possible to simulate about two million particles per gigabyte of graphics memory.

2. Implementation

2.1. Numerical model

In this work a 'soft' particle approach is used, where spherical particles exert forces due to their mass m and geometrical overlap δ at the contact points c. The total force F and torque M acting on a particle,

$F = \sum_c f_c + m g$   (1)

$M = \sum_c r \times f_c$   (2)

are the summation of the contact forces f_c with other particles or boundaries and of the gravitational force. Considering Newton's second law, at time t we obtain an acceleration a(t), which can be used for an explicit time-step integration with time step dt in order to obtain new velocities v(t+dt) and positions x(t+dt) for each particle in the system. Here, the classical Verlet integration scheme is employed for the positions (Verlet, 1967):

$a(t) = F/m$   (3)

$x(t+dt) = 2x(t) - x(t-dt) + a(t)\,dt^2$   (4)

Velocities are obtained by a central difference approximation:

$v(t+dt) = \frac{1}{2\,dt}\left(x(t+dt) - x(t-dt)\right) + a(t)\,dt$   (5)

The DEM accounts for translational as well as rotational degrees of freedom. Both are handled separately, by application of the Verlet algorithm (3)-(5) in the former case and of an Euler integration (7) in the latter. The total torque M and the moment of inertia I are used to obtain the angular acceleration,

$\dot{\omega}(t) = M/I$   (6)

and the new angular velocity at time t+dt,

$\omega(t+dt) = \omega(t) + \dot{\omega}(t)\,dt$   (7)

In order to obtain the new particle orientation q(t+dt), an integration of the angular velocity is performed. Here, q(t+dt) is a quaternion representation of the particle orientation; details can be found in Radeke (2006). The implemented force laws are linear in the normal and tangential directions and follow the implementation suggested in Luding (1998):

$f_n = k_n \delta_n - D_n \dot{\delta}_n - f_{ch}$   (8)

$f_t = -\min\left(k_t \delta_t - D_t \dot{\delta}_t,\ \mu |f_n|\right)\,\mathrm{sgn}(\delta_t)$   (9)

where μ is the static Coulomb friction coefficient, k_n and k_t are spring constants, and D_n and D_t account for viscous damping in the normal and tangential direction, respectively. The cohesive force f_ch is only active if cohesive forces are considered, e.g., in the case of liquid bridges. The tangential overlap

$\delta_t = \delta_t(t) + \delta_t(t-dt)$   (10)

allows for restoring forces if the contact already existed in the previous time step. This means that the evolution of the tangential overlap (or spring) δ_t can also be written as

$\delta_t(t) = \int_{t_0}^{t} \dot{\delta}_t(t)\,dt$   (11)

In analogy to sliding friction, a rolling resistance is implemented. Collisions between particles and walls are handled in analogy to the case of particle–particle collisions (Radeke, 2006). At this time, the implemented algorithm does not include frictional resistance due to tangential in-plane spin about the normal vector.
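Since each particle's update in Eqs. (1)-(9) depends only on its own contacts, the time integration maps naturally onto one GPU thread per particle. The following is a minimal CUDA sketch of such a position-Verlet kernel (Eqs. (3)-(4)); the Particle struct, names and global-memory layout are our illustration, not GRPD's actual (unpublished) implementation.

```c
// Minimal sketch: one CUDA thread integrates one particle with the
// position-Verlet scheme of Eqs. (3)-(4). Layout and names are
// illustrative only; a production code would prefer structure-of-arrays.
#include <cuda_runtime.h>

struct Particle {
    float3 x;       // position x(t)
    float3 x_prev;  // position x(t - dt)
    float3 f;       // total force of Eq. (1): contact forces + m*g
    float  m;       // particle mass
};

__global__ void verletIntegrate(Particle *p, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Particle pi = p[i];
    // Eq. (3): a(t) = F / m
    float3 a = make_float3(pi.f.x / pi.m, pi.f.y / pi.m, pi.f.z / pi.m);

    // Eq. (4): x(t+dt) = 2 x(t) - x(t-dt) + a(t) dt^2
    float3 xn = make_float3(2.0f * pi.x.x - pi.x_prev.x + a.x * dt * dt,
                            2.0f * pi.x.y - pi.x_prev.y + a.y * dt * dt,
                            2.0f * pi.x.z - pi.x_prev.z + a.z * dt * dt);

    p[i].x_prev = pi.x;  // shift the position history for the next step
    p[i].x      = xn;
}
```

Note that the stored previous position replaces an explicit velocity state, exactly as in Eq. (4); the velocity of Eq. (5) can be reconstructed on demand from the two stored positions.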

Fig. 1. Performance measurement of the GRPD DEM code. All jobs are executed on single GT200-chip-based cards. The largest simulation runs on a single Tesla C1060 device; because of indirect visualization in this case, the performance drastically slows down for the largest simulation. Up to one million particles, the scaling is almost linear in time.


2.2. Parallelization

Most physical problems can be implemented for serial execution on a microprocessor, i.e., all steps and loops of the program are executed sequentially. If parts of such a problem can be divided into independent tasks (independent of the order of execution), these tasks can be distributed over multiple microprocessors, where they can be executed simultaneously. This property is called the granularity of a numerical problem. In our case, the calculation of the inter-particle forces, as well as the time-step integration, falls into the category of independent tasks.

Current multi-core microprocessors consist of up to 6 processor cores, each of them being a full CPU, coupled on-chip and connected to the memory. Each of these cores is designed to speed up the execution of sequential programs, and each core can handle two execution threads. In contrast, many-core processors, such as GPUs, focus on the throughput of parallel applications. The NVIDIA GT200 series is equipped with 240 cores, each of which is heavily multithreaded. Many-core processors have been out-performing multi-core CPUs since 2003. As of 2009, the ratio of GPU to multi-core CPU peak calculation throughput is about 10:1 (1 TFLOPS versus 100 GFLOPS). Programming of NVIDIA GPUs has become possible due to the availability of the CUDA software environment since 2007, which is a package of drivers, toolkits and a software development kit, in which some extensions to the C language enable massively parallel programming (Kirk and Hwu, 2010).

The GRPD code uses the common cell-space logic, a regular subdivision of the simulation volume into cubic cells (Allen and Tildesley, 1987). This allows for fast detection of the nearest neighbors, because the search for particles is restricted to the volume within a cell. While the search time for selecting two entities from a set of N entities is of O(N²), the cell-space logic requires only O(N) operations. This is essential for DEM simulations involving a large number of particles. The cell-space implementation is flexible, such that particles with a moderate size ratio of up to 1:5, or arbitrary size distributions within this limit, can be simulated. By using hierarchical search patterns, the cell space is designed to support long-range potentials, as well as highly non-spherical particles, for future extensions. A minimal sketch of the cell indexing is given below.
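The core of the cell-space logic is the mapping from a particle position to a linear cell index: with a cell edge of at least one particle diameter, contact partners can only reside in the cell itself and its 26 neighbors. The device function below is an illustrative sketch under that assumption, not GRPD's code; names and the flat grid layout are ours.

```c
// Illustrative sketch of the standard cell-space indexing
// (Allen & Tildesley, 1987): map a position to a cubic cell so that
// the neighbor search visits only the 27 surrounding cells, O(N) overall.
__device__ int cellIndex(float3 x, float3 origin, float cellSize, int3 dims)
{
    int cx = (int)((x.x - origin.x) / cellSize);
    int cy = (int)((x.y - origin.y) / cellSize);
    int cz = (int)((x.z - origin.z) / cellSize);
    // clamp to the grid so particles at the domain boundary stay inside
    cx = min(max(cx, 0), dims.x - 1);
    cy = min(max(cy, 0), dims.y - 1);
    cz = min(max(cz, 0), dims.z - 1);
    return (cz * dims.y + cy) * dims.x + cx;  // linearized cell id
}
```

In a typical GPU implementation, particles are then sorted by this cell id once per step, so each thread can iterate over the contact candidates of its particle's 27 neighboring cells.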

In contrast to MPI- or OpenMP-based parallelization approaches, where large subsets of cells are distributed to (CPU) processor cores, here the parallelization is implemented on the particle level using the CUDA programming technique (Gropp et al., 2007; Chapman et al., 2007). Thus, Eqs. (1)-(9) are executed simultaneously for every single particle. One CUDA thread is assigned to exactly one particle.

Graphics Processing Units (GPUs) combine a few hundred so-called SIMT (Single Instruction Multiple Threads) or stream processors, which execute these threads as scheduled by an on-chip firmware. This coexistence of threads is used by CUDA to simulate, massively in parallel, the trajectories of a large number of particles, while CUDA hides the complexity of explicit multi-thread handling from the programmer. In order to utilize the SIMT processors, the software developer must provide a huge number of independent tasks which can be assigned to the threads. The extremely high granularity of real-life DEM simulations fits this demand in an excellent way. The CUDA architecture currently supports about 32,000 active threads; the more threads, the higher the occupancy of the GPU.
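On the host side, the one-thread-per-particle mapping amounts to a simple launch configuration. The sketch below rounds the grid up so that every particle is covered; the block size of 256 is a typical occupancy-friendly choice of ours, not a value quoted by the paper.

```c
// Illustrative host-side launch: one CUDA thread per particle.
// d_particles, n and dt are assumed to be set up elsewhere.
int threads = 256;                          // warps-per-block tradeoff
int blocks  = (n + threads - 1) / threads;  // round up to cover all n
verletIntegrate<<<blocks, threads>>>(d_particles, n, dt);
```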

Due to the still ongoing development of the GRPD code, no performance comparisons in terms of speedup against CPU codes are provided, and preliminary measurements vary. As a guideline, the latest version of our code (March 2010) is able to simulate one minute of real time with 100,000 particles in about four hours of wall-clock time. GRPD needs about four days to perform a simulation of one million particles for the same real time.

Performance data of GRPD for a broad range of particle numbers are shown in Fig. 1. With increasing problem size, the wall-clock time scales linearly up to one million particles; in this range, the performance decreases only slightly and seems nearly independent of N. Up to four million particles, the performance drops by a factor of 30. When GRPD executes a simulation of nearly eight million particles, the performance drops by a factor of 100 compared to one million particles. Because of memory demands, the largest case was executed on the NVIDIA Tesla card with 4 GB RAM. In this case, indirect visualization is performed by another graphics card, which drastically slows down the performance (at this time GRPD cannot run without OpenGL display graphics, see Section 2.4). The smaller problems ran on regular graphics cards with less memory but with direct graphics output.

2.3. Numerical precision

High-performance GPUs originated from the field of computer graphics and gaming and became increasingly powerful to satisfy the demands of 3D games. In these games, performance is expressed in terms of frame rates and smooth interactive handling while a game is running. In order to display the result (the rendered frames) quickly, the rendered images are efficiently generated by processing the data with single-precision (SP) numerics on SP hardware. Using this hardware for numerical simulations leads to problems in the implementation of the algorithms, caused by the lower number of valid digits in the floating-point representation (Goldberg, 1991). Currently implemented DEM algorithms fail when a time step of less than 10⁻⁴ s is used. As an intermediate solution, the geometry of our system is scaled up internally according to a fixed particle size for each of the four cases a-d (see Table 2).

Compared to the particle sizes in Table 2, this scaling increases particle size as well as particle mass. As a consequence, circumferential speed and centripetal forces in the blending processes would increase, while gravity remains constant. In order to keep the ratio of gravity to radial acceleration equal to that of the un-scaled problem, the revolution rate is adapted using Froude theory according to Eq. (12) (Pohlman et al., 2006). That is, the geometrical scaling accounts for the single-precision issue, and the Froude scaling corrects the revolution rate.


It should be kept in mind that the Verlet algorithm is widely used in molecular dynamics software packages, which run at single precision with accuracy comparable to double precision (DP). There, different length and time scales apply (i.e., the nanometer and femtosecond scales), which are tackled by unit transformations, dimensionless time scaling or improved numerical techniques (Bowers et al., 2006).

In order to validate GRPD against experimental data, we simulated a segregation process which was also carried out in the laboratory (data not shown here). A closed horizontal rotating cylinder, initially filled with left-right separated particles of sizes of 4 and 6 mm, operates until the larger particles have segregated to the plane walls of the cylinder. In the experiment, this process lasts for about 350 revolutions until a stable pattern is reached (larger particles can be found at both wall surfaces of the cylinder, i.e., the surfaces normal to the rotation axis, while small particles move to the center). Our simulations show excellent qualitative and quantitative consistency with the experiments, i.e., after 350 (computational) revolutions a steady state was reached.

The generations of NVIDIA GPUs utilized during code development (i.e., G80, G92, GT200) are still single-precision hardware. In contrast, the recently available Fermi GPUs offer DP support, as well as other high-performance computing (HPC) features such as memory error correction (ECC) and transparent host-memory-addressing capabilities. These GPUs will be used in the future.

Fig. 2. Mixing simulation of case d after one revolution (left) and schematic of the pitched-blade mixer geometry (right).

Table 1. Dimensions of the reference device.

Parameter        Symbol  Value  Unit
Blade width      BW      25     mm
Blade diameter   BD      95     mm
Blade angle      BA      45     deg
Hub diameter     HD      22     mm
Hub height       HH      20     mm
Shaft diameter   SD      12     mm
Stator diameter  DS      100    mm

2.4. Visualization

DEM simulations in the million-particle range are a challenge. Similarly, visual analysis of a large number of particles during and after the simulation is highly complex. Due to the dynamic nature of the investigated flow problems, a movie-like, three-dimensional, interactive visualization of the simulation time steps is desired. A few options to perform this task are commonly employed. One of them is to use 3D graphics post-processing software like VMD or MolMol, where the user has to translate the simulation results into an appropriate file format for further visualization. However, these methods fail when processing time and/or problem size increase beyond standard dimensions. Thus, in the million-particle range the visualization technique itself becomes a challenge and has to be adapted to the requirements of huge data sets.

However, it is beneficial if simulations are executed directly on GPUs. Interoperable visualization and simulation concepts that link CUDA and OpenGL allow for simultaneous access to data residing in the graphics memory. GRPD, as a GPU code, consequently utilizes these features in its visualization algorithms. As an example, the rotational orientation of each particle can be visualized at run-time. During the whole simulation, at each single time step the particle assembly can be displayed in the GUI and inspected visually with respect to physical properties of interest (e.g., velocity, fluid content, residence time, etc.) without significant loss of performance (about 10%, and only when enabled). This also helps in the preparation of simulations, in software development, and for testing purposes. The selection of individual particle sets or other entities, clip-plane intersections, as well as transparent-property representation can be achieved within a second.

The GRPD GUI has been prepared and tested for these operations to remain at an interactive level for up to 40 million particles. Visualization performance is limited only by the GPU memory. Fortunately, the use of separate GPUs for visualization and simulation is possible in CUDA. Running both on the same GPU results in faster execution speeds, since routing data over the comparably slower PCI-Express bus, which connects CPUs and GPUs, is avoided. Nevertheless, CUDA and OpenGL can share the memory holding the visualized information, while double-buffer and depth-buffer data are kept separate. Because of these restrictions, very large data sets in the range of 20 million particles cannot be executed and visualized on the same GPU.
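A typical way to realize such CUDA/OpenGL interoperation is to register the OpenGL vertex buffer that holds the particle positions with CUDA and map it each time step, so that the simulation kernels write straight into the buffer the renderer draws. The sketch below uses the standard CUDA runtime interop calls; the buffer name and the float4 position layout are our assumptions, not details taken from GRPD.

```c
// Illustrative CUDA/OpenGL interop: simulation kernels write directly
// into the position VBO used for rendering, avoiding a PCIe round trip.
#include <GL/gl.h>
#include <cuda_gl_interop.h>

GLuint vbo;                      // OpenGL buffer with float4 positions
cudaGraphicsResource *res;       // CUDA handle for that buffer

void registerVbo(void)
{
    // register once; CUDA discards old contents when mapping for write
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsMapFlagsWriteDiscard);
}

void stepAndDraw(int n, float dt)
{
    float4 *d_pos; size_t bytes;
    cudaGraphicsMapResources(1, &res, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&d_pos, &bytes, res);
    // ... run DEM kernels that write the new positions into d_pos ...
    cudaGraphicsUnmapResources(1, &res, 0);  // OpenGL may now draw the VBO
}
```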

It should be noted that GRPD is theoretically limited to 2.2 billion particles; merely visualizing such an ultra-large data set would require 6 GB of DRAM on the OpenGL device alone.

3. Application to four-bladed mixer

3.1. Experimental device

The device used for our simulations is a bladed mixer, as sketched in Fig. 2, which has been used extensively in our recent experimental and simulation studies (Koller et al., in press). Bladed mixers, in two-bladed and four-bladed configurations, have also been extensively investigated, both numerically and experimentally, by Chandratilleke et al. (2009), Zhou et al. (2004), Remy et al. (2010), and Radl et al. (2010). In our work, we focus on a four-bladed mixer. As a convective mixer, it consists of a stationary vessel and a rotating impeller. These types of mixers (as they can expose the powder to significant shear) are used for materials that tend to segregate or agglomerate and hence cannot be mixed in tumbling blenders. In addition, these systems are extensively used in high-shear wet granulation.

The inner diameter of this lab-size mixer is 2R = 0.1 m; the working area of the impeller covers a volume of about 0.2 L (see Table 1 for details).

3.2. Simulation setup

The simulations were carried out in order to study the influence of particle resolution on the mixing statistics. Four simulations, referred to in the text as cases a, b, c and d, ranging from a few thousand up to 7.68 million particles, were carried out, as shown in Fig. 3.

The parameters reflecting the four different cases a-d are presented in Tables 2-4.

Initially, after the particles are 'created' inside the blade area, they are allowed to settle under gravity. When the assembly is at rest, the particles are marked as grey or red according to their position in the mixer (left or right of the central vertical plane). Each particle keeps its initial marker during the whole simulation. By taking frequent snapshots of the positions, we are able to inspect how mixing evolves. All simulations were carried out for 32 revolutions at a constant reference rotational speed of 30 rpm.
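As a minimal illustration of this marking step, the kernel below tags each particle once by the side of the central vertical plane on which it settled. The names and the plane location x = 0 are assumptions for the sketch, not GRPD's conventions.

```c
// Illustrative one-shot marker assignment after settling: 0 = grey
// (left of the central vertical plane), 1 = red (right). The marker is
// written once and never changed during the simulation.
__global__ void assignMarker(const float4 *x, unsigned char *marker, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) marker[i] = (x[i].x < 0.0f) ? 0 : 1;
}
```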

Because of the single-precision issues described in Section 2.3, the geometry was internally rescaled with respect to the scale separation ratio. That is, in all four cases a-d, a constant simulation particle size of r = 0.04 m was chosen, and the mixer size was rescaled accordingly (i.e., to obtain the same fill level). For all simulations we kept the Froude number of the reference device geometry and of the scaled geometry in our simulation constant. The resulting revolution speed Ω_sim is obtained by

$\Omega_{sim} = \Omega_{ref} \Big/ \sqrt{2R_{sim}/2R_{ref}}$   (12)

Here, R_sim and R_ref are the radii used in the simulation and of the reference device, respectively, and Ω_ref = 30 rpm is the reference impeller speed of the mixer. The Froude scaling was chosen in order to keep the ratio of gravitational to radial acceleration identical, although it was shown by Remy et al. (2009) that at low speed the rotational rate does not impact mixing. Based on our results and the study of Remy et al. (2010), further work is needed to investigate the effect of rotational speed on mixing kinetics when a flow regime transition occurs. For the simulation of 32 revolutions, the simulation time was extended accordingly, see Table 2.
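The sketch below transcribes Eq. (12) into a small helper function; the function name and the use of rpm units are our choices. It is a sketch of the scaling rule itself, since the per-case rates in Table 2 also reflect each case's specific geometric rescaling.

```c
#include <math.h>

// Illustrative helper for Eq. (12): rescale the reference impeller speed
// when the geometry is enlarged by the diameter ratio 2R_sim / 2R_ref, so
// that the Froude number (radial vs. gravitational acceleration) is kept.
static float froudeScaledRpm(float omegaRefRpm, float diameterRatio)
{
    return omegaRefRpm / sqrtf(diameterRatio);  // Omega_sim of Eq. (12)
}
```

For example, froudeScaledRpm(30.0f, ratio) returns the adapted impeller speed for a geometry enlarged by the given diameter ratio.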

Fig. 3. Four different cases from coarse to fine: (a) N = 7680, (b) N = 76,800, (c) N = 768,000, and (d) N = 7,680,000 particles. View from the top, a short time after simulation start.

Table 2. Four different particle assemblies, referred to in the text as cases a, b, c and d.

Case  Number of particles  Particle size ⌀ (mm)  Scale separation ratio  Simulation time (s)  Froude rotat. rate (rpm)
a     7680                 3.0                   33.3                    350.00               5.6695
b     76,800               2.1                   47.6                    545.00               3.7500
c     768,000              0.98                  102.0                   805.64               2.5355
d     7,680,000            0.45                  222.2                   1355.00              1.7039


In all cases, the simulations were completed in an acceptable amount of time, which is much less than the time needed using currently available algorithms and codes. The largest simulation of nearly 8 million particles was finished after only one month of wall-clock time. Simulations with fewer particles, i.e., cases a, b and c, are much faster (less than two days for case c, and hours for cases b and a).

4. Results

4.1. Sampling technique

In order to analyze the mixing kinetics and the mixing quality of the powders, numerical sampling is performed. In contrast to real experiments (e.g., thief probes), no disturbance of the particle assembly occurs, regardless of the frequency or location of sampling. We sampled small cylinders parallel to the impeller axis at locations evenly distributed on concentric circles covering the entire powder bed, as shown in Fig. 4. The mass of sampled particles was kept constant for each of the different cases, as shown in Table 3.

Mixing quantities of the powder blend are characterized after each revolution. Both the Lacey index and the relative standard deviation (RSD) are calculated from the samples according to Eqs. (13)-(15), where m_S is the number of samples (Lacey, 1954). The quality of the statistical analysis depends on the number of particles in a sample. Pharmaceutical regulations prescribe sampling between one and three times the final dosage of the final product (FDA, 2003).

Table 3. Sampling parameters used or obtained from the four different simulation cases.

Case                  a       b       c       d
Number of samples     14      44      204     889
Particles per sample  14      39      384     3840
Mass per sample (g)   0.459   0.472   0.473   0.458
Mean x̄                0.4992  0.5058  0.4995  0.4995

Table 4. DEM simulation parameters.

Parameter                         Symbol  Value     Unit
Density                           ρ       2500      kg/m³
Normal stiffness                  k_n     1.05×10⁵  N/m
Tangential stiffness              k_t     8.25×10⁴  N/m
Restitution coefficient (normal)  e       0.78      –
Static friction (PP)              μ       0.5       –
Static friction (PW)              μ       0.2       –

Fig. 4. Schematic of sampling locations for case b. Samples are taken at equidistant positions on concentric circles.

Fig. 5. From coarse to fine, cases a, b, c and d. Top-view snapshots of the blends after five revolutions of mixing.

Fig. 6. From coarse to fine, cases a, b, c and d. Top-view snapshots of the blends after 10 revolutions of mixing.


4.2. Statistical analysis

The blend of powder considered in the simulations may serve as the API/excipient mixture for a capsule filled, for example, with 200 mg of powder, corresponding to a small capsule. Thus, each sample should contain 0.2-0.6 g of particles. Expressed in numbers, samples should contain between 6 and 18, 17 and 51, 163 and 489, and 1677 and 5031 grains for cases a, b, c and d, respectively. This shows that we cannot expect good statistical results for cases a and b, where only a small number of particles is contained in each sample. However, increasing the sample size would lead to a less accurate reflection of the local variability and, moreover, would violate pharmaceutical regulations.

In Figs. 5 and 6, snapshots of the powder bed are shown after five and 10 revolutions, respectively, for the four considered cases. What is surprising is that good mixing is obtained already after five revolutions, and after 10 revolutions an almost homogeneous mixture is obtained.

Thus, granular mixing in a four-bladed mixer is very fast, which may be due to a chaotic mixing process augmented by significant dispersive mixing. As can be seen in Fig. 6, the differences between cases c and d are marginal. Thus, even studies with fewer particles may be sufficient for process design:

$M_L = \frac{\sigma_0^2 - \sigma^2}{\sigma_0^2 - \sigma_r^2}$   (13)

$\sigma = \sqrt{\frac{1}{m_S - 1}\sum_{j=1}^{m_S}\left(x_j - \bar{x}\right)^2}$   (14)

$\mathrm{RSD} = \frac{\sigma}{\bar{x}}$   (15)

In order to compare the Lacey index M_L and the RSD, in the figures we plot 1−RSD. The Lacey index yields information on the average local variability of the samples. The global state of mixing is better reflected by the RSD, which is also the measure used in industrial settings. The expectation value x̄ is nearly 0.5 for each of the cases a-d; because of the stochastic process of bed densification described above, not exactly 50% of the particles are colored red or grey, see Table 3.
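As an illustration of Eqs. (13)-(15), the sketch below computes the sample variance, RSD and Lacey index from per-sample red-particle fractions. The fully segregated and fully random limits σ₀² = p(1−p) and σ_r² = p(1−p)/n_p are the standard Lacey assumptions; the paper does not spell out which forms its post-processing uses, and all names here are illustrative.

```c
#include <math.h>

// Illustrative host-side evaluation of Eqs. (13)-(15): x[j] is the
// red-particle fraction of sample j, mS the number of samples, p the
// overall mean concentration (here ~0.5) and nps the particles per sample.
typedef struct { double lacey; double rsd; } MixStats;

static MixStats mixingStatistics(const double *x, int mS, double p, int nps)
{
    double xbar = 0.0, var = 0.0;
    for (int j = 0; j < mS; ++j) xbar += x[j];
    xbar /= mS;
    for (int j = 0; j < mS; ++j) var += (x[j] - xbar) * (x[j] - xbar);
    var /= (mS - 1);                        // sample variance, Eq. (14)

    double s0sq = p * (1.0 - p);            // fully segregated limit (assumed)
    double srsq = p * (1.0 - p) / nps;      // fully random limit (assumed)
    MixStats out;
    out.lacey = (s0sq - var) / (s0sq - srsq);  // Lacey index, Eq. (13)
    out.rsd   = sqrt(var) / xbar;              // RSD, Eq. (15)
    return out;
}
```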

In Figs. 7 and 8, comparisons of the mixing statistics, i.e., the Lacey index, between the four cases are depicted.

Fig. 7. Lacey index comparison of the four different simulations of cases (a) N = 7680, (b) N = 76,800, (c) N = 768,000 and (d) N = 7,680,000, for 32 revolutions in the bladed mixer.

Fig. 8. Same as Fig. 7 in log representation.

Fig. 9. RSD comparison of the four different simulations of cases a-d (as in Fig. 7), for 32 revolutions in the bladed mixer.

Fig. 10. Same as Fig. 9 in log representation.


Clearly, with increasing problem size the quality of the statistical analysis increases, as can be seen from the increasingly smooth signal for larger granular assemblies. For the smallest case, the quality of the statistics is not sufficient and cannot be used for the determination of the mixing end point or for following the mixing kinetics. Only for the largest case d does a smooth curve result, owing to the highly resolved powder data. According to Fig. 7, after 10 revolutions the blend has achieved an acceptable state. Nevertheless, visual inspection of the snapshots in Fig. 6 clearly shows regions where mixing has not led to a uniform concentration; the cross-shaped pattern of the blade can still be observed in Fig. 6(d).

In Figs. 9 and 10, the evolution of the RSD is shown for the four cases. It can be observed that global mixing still evolves after 10 revolutions. The endpoint of mixing is reached when the curves flatten out into a plateau. As the properties of the red and grey particles are identical, segregation cannot occur in this case. Thus, a lower RSD is exclusively due to the dynamics of the particle mixing and the statistics of the system. It is well known that for a smaller number of particles, the final (stochastic) mixture has a lower value of 1−RSD. Thus, as observed in our case, the asymptote reached is lower for the system with fewer particles.

Interestingly, the mixing kinetics of cases b-d are very similar to each other, while mixing in case a is much slower. In general, mixing is based on two effects: (1) convective mixing and (2) diffusive mixing. Both contributions depend to a large extent on the interaction between the walls and the particles. Remy et al. (2009) showed that these effects are strong functions of the d/D ratio, i.e., the ratio of particle-to-mixer diameter. Specifically, they showed that above a critical d/D ratio, convective and diffusive mixing patterns remain unchanged. Only below this critical ratio does the convective flow change (i.e., smaller convection cells are formed) and the diffusive contribution decrease (represented by a higher Peclet number). Thus, mixing is impeded by a small d/D ratio, as observed in the current study.


5. Summary and outlook

We have investigated the influence of particle size on mixing statistics, with particle assemblies studied across four orders of magnitude. Our approach of using the massively parallel CUDA technique on GPUs for the implementation of DEM algorithms enables simulations of more than two million particles per gigabyte of memory, executed on reasonable hardware. This allows for a much more realistic resolution of powders used in the pharmaceutical industry, where regulatory guideline requirements have to be fulfilled. Future development work on the GRPD code will address liquid bridges, in order to achieve wet-mixing capabilities, and non-spherical particles, in order to capture more detailed particle shapes.

Nomenclature

F, f    force (N)
M       torque (N m)
t       time (s)
dt      time step (s)
x       position (m)
r, R    radius (m)
ρ       density (kg/m³)
v       velocity (m/s)
a       acceleration (m/s²)
q       quaternion 'angle'
ω, Ω    angular velocity (rad/s)
ω̇       angular acceleration (rad/s²)
δ       particle overlap (m)
I       moment of inertia (kg m²)
m       mass (kg)
μ       friction (–)
k       stiffness of spring (N/m)
D       damping coefficient (kg/s²)
M_L     Lacey index (–)
σ       standard deviation (–)
x̄       mean value
m_S     number of samples

Subscripts

n       normal component
t       tangential component
c       property at contact
ch      cohesion property
sim     simulation
ref     reference

References

Allen, M.P., Tildesley, D.J., 1987. Computer Simulation of Liquids. Oxford University Press, Oxford.

Bowers, K.J., Chow, E., Xu, H., Dror, R.O., Eastwood, M.P., Gregersen, B.A., Klepeis, J.L., Kolossvary, I., Moraes, M.A., Sacerdoti, F.D., Salmon, J.K., Shan, Y., Shaw, D.E., 2006. Scalable algorithms for molecular dynamics simulations on commodity clusters. In: SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM, New York, NY, USA, p. 84. doi:10.1145/1188455.1188544.

Chandratilleke, G., Yu, A., Stewart, R., Bridgwater, J., 2009. Effects of blade rake angle and gap on particle mixing in a cylindrical mixer. Powder Technology 193 (3), 303–311.

Chapman, B., Jost, G., van der Pas, R., 2007. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press. ISBN 0-262-53302-2.

Chen, F., Ge, W., Yuan, X., et al., 2009. Multi-scale HPC system for multi-scale discrete simulation: development and application of a supercomputer with 1 petaflops peak performance in single precision. Particuology 7 (4), 332–335. doi:10.1016/j.partic.2009.06.002.

Cleary, P.W., 2009. Industrial particle flow modelling using discrete element method. Engineering Computations 26 (6), 698–743.

Cundall, P.A., Strack, O.D.L., 1979. A discrete numerical model for granular assemblies. Geotechnique 29 (1), 47–65.

FDA, 2003. Guidance for industry: powder blends and finished dosage units. Stratified in-process dosage unit sampling and assessment. Draft guidance, Pharmaceutical cGMPs.

Goldberg, D., 1991. What every computer scientist should know about floating-point arithmetic. Computing Surveys 23 (1), 5–48.

Gropp, W., Lusk, E., Skjellum, A., 2007. MPI: Eine Einführung. Portable parallele Programmierung mit dem Message-Passing Interface. Oldenbourg, München. ISBN 978-3-486-58068-6.

Ketterhagen, W.R., am Ende, M.T., Hancock, B.C., 2009. Process modeling in the pharmaceutical industry using the discrete element method. Journal of Pharmaceutical Sciences 98 (2), 442–470. doi:10.1002/jps.21466.

Kirk, D.B., Hwu, W.W., 2010. Programming Massively Parallel Processors. Morgan Kaufmann (an imprint of Elsevier), Burlington, MA, USA. ISBN 978-0-12-381472-2.

Koller, D.M., Posch, A., Hoerl, G., Voura, C., Urbanetz, N., Fraser, S.D., Tritthart, W., Reiter, F., Schlingmann, M., Khinast, J.G., in press. Continuous quantitative monitoring of powder mixing dynamics by near-infrared spectroscopy. Powder Technology. doi:10.1016/j.powtec.2010.08.070.

Lacey, P.M.C., 1954. Developments in the theory of particle mixing. Journal of Applied Chemistry 4, 257.

Lee, Y., Fang, C., Tsou, Y.R., Lu, L.S., Yang, C.T., 2009. A packing algorithm for three-dimensional convex particles. Granular Matter 11 (5), 307–315. doi:10.1007/s10035-009-0133-7.

Luding, S., 1998. Collisions & contacts between two particles. In: Herrmann, H.J., Hovi, J.P., Luding, S. (Eds.), Physics of Dry Granular Media, NATO ASI Series E350. Kluwer Academic Publishers, Dordrecht, p. 285.

Muth, B., Eberhard, P., 2005. Investigation of large systems consisting of many spatial polyhedral bodies. In: Proceedings of the ENOC-2005 Fifth EUROMECH Nonlinear Dynamics Conference, Eindhoven, pp. 1644–1650.

NVIDIA, 2008a. On accelerating financial applications using CUDA GPU technology. HP and NVIDIA Seminar at the Security Industry Technology Show (SIFMA), Hilton New York, USA.

NVIDIA, 2008b. CUDA Compute Unified Device Architecture, Programming Guide Version 1.0.

Pohlman, N.A., Meier, S.W., Lueptow, R.M., Ottino, J.M., 2006. Surface velocity in three-dimensional granular tumblers. Journal of Fluid Mechanics 560, 355–368. doi:10.1017/S0022112006000437.

Radeke, C.A., 2006. Statistische und mechanische Analyse der Kräfte und Bruchfestigkeit von dicht gepackten granularen Medien unter mechanischer Belastung (Statistical and mechanical analysis of the forces and breaking strength of densely packed granular media under mechanical load). Dissertation.

Radl, S., Kalvoda, E., Glasser, B., Khinast, J., 2010. Mixing characteristics of wet granular matter in a bladed mixer. Powder Technology 200 (3), 171–189.

Remy, B., Khinast, J.G., Glasser, B.J., 2009. Discrete element simulation of free flowing grains in a four-bladed mixer. AIChE Journal 55 (8), 2035–2048.

Remy, B., Glasser, B.J., Khinast, J.G., 2010. The effect of mixer properties and fill level on granular flow in a bladed mixer. AIChE Journal 56 (2), 336–353.

Song, Y., Turton, R., Kayihan, F., 2006. Contact detection algorithms for DEM simulations of tablet-shaped particles. Powder Technology 161 (1), 32–40. doi:10.1016/j.powtec.2005.07.004.

Verlet, L., 1967. Computer experiments on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Physical Review 159 (1), 98–103.

Xu, M., Ge, W., Chen, F., Li, J., 2009. Simulation of the heterogeneous structure in particle–fluid systems with DPM by multi-level parallel computation. In: CD Proceedings of the CSIRO Multi-Scale Modelling Symposium.

Zhou, Y., Yu, A., Stewart, R., Bridgwater, J., 2004. Microdynamic analysis of the particle flow in a cylindrical bladed mixer. Chemical Engineering Science 59 (6), 1343–1364. doi:10.1016/j.ces.2003.12.023.