

Abelian-Higgs Cosmic String Evolution with CUDA

J. R. C. C. C. Correia (a,b,c), C. J. A. P. Martins (a,b,*)

(a) Centro de Astrofísica da Universidade do Porto, Rua das Estrelas, 4150-762 Porto, Portugal
(b) Instituto de Astrofísica e Ciências do Espaço, CAUP, Rua das Estrelas, 4150-762 Porto, Portugal
(c) Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre 687, 4169-007 Porto, Portugal

Abstract

Topological defects form at cosmological phase transitions by the Kibble mechanism, with cosmic strings (one-dimensional defects) being the most studied example. A rigorous analysis of their astrophysical consequences is limited by the availability of accurate numerical simulations, and therefore by hardware resources and computation time. Improving the speed and efficiency of existing codes is therefore important. All current cosmic string simulations were performed on Central Processing Units. In previous work we presented a General Purpose Graphics Processing Unit implementation of the evolution of cosmological domain wall networks. Here we discuss an analogous implementation for local Abelian-Higgs string networks. We discuss the implementation algorithm (including the discretization used and how to calculate network averaged quantities) and then showcase its performance and current bottlenecks. We validate the code by directly comparing our results for the canonical scaling properties of the networks in the radiation and matter eras with those in the literature, finding very good agreement. We finally highlight possible directions for improving the scalability of the code.

Keywords: cosmology: topological defects, field theory simulations, cosmic string networks, methods: numerical, methods: GPU computing

1. Introduction

A generic cosmological prediction of many theories beyond the Standard Model is the formation of objects known as topological defects, by means of the Kibble mechanism (Kibble, 1976). Since the properties of these objects and their astrophysical consequences are intrinsically linked to the symmetry breaking patterns which produce them, one can think of them as fossil relics of the physical conditions in the early Universe.

Recent constraints on these objects using cosmic microwave background and gravitational wave data (Ade et al., 2014; Abbott et al., 2018) are mainly limited by the availability of accurate high-resolution simulations of defect networks with a large dynamic range, as well as of full sky maps of the backgrounds produced by these networks.

* Corresponding author.
Email addresses: [email protected] (J. R. C. C. C. Correia), [email protected] (C. J. A. P. Martins)

Preprint submitted to Astronomy and Computing April 29, 2020

arXiv:1809.00995v2 [physics.comp-ph] 28 Apr 2020


One contemporary example is the search for cosmic strings in cosmic microwave background maps using the Kaiser-Stebbins effect (for which one would ideally like to have thousands of statistically independent template sky maps, generated from a similar number of independent simulations, as opposed to current analyses relying on one or a few simulations), and analogous searches will undoubtedly be done with gravitational wave maps in the near future. The approximations currently being used to mitigate the absence of such data clearly introduce systematic uncertainties that are comparable to the quoted statistical uncertainties. This problem is even more severe for next-generation facilities such as CORE (Finelli et al., 2018) or LISA (Binetruy et al., 2012). On the other hand, analytic studies of realistic defect networks can, at least in principle, include enough degrees of freedom to explicitly model the relevant dynamical properties of these networks, but they will also be bottlenecked by the lack of high-resolution simulations, since these simulations are necessary to quantitatively calibrate the models. Resolving this issue by traditional means would imply unrealistic hardware/compute time needs.

To alleviate this problem, one can attempt to exploit differing hardware architectures, with the onus of optimisation falling to the developers of the tool in question. In the literature there are several examples of defect simulations optimised for Central Processing Units (CPUs), assuming either shared or distributed memory architectures. Several examples of Goto-Nambu cosmic string simulations can be found in the literature (Bennett and Bouchet, 1990; Allen and Shellard, 1990; Martins and Shellard, 2006; Olum and Vanchurin, 2007; Blanco-Pillado et al., 2011), while examples of field theory simulations of the two simplest types of defect are the WALLS code of Martins et al. (2016a) for domain walls (also optimised for Intel Xeon Phi co-processors, as summarised in Briggs et al. (2014)) and the cosmic string evolution codes of Bevis et al. (2007). More recently, field theory defect simulations have also been developed for monopoles (Lopez-Eiguren et al., 2017), global strings (Hindmarsh et al., 2020; Drew and Shellard, 2019), Type I strings (Hindmarsh et al., 2019), semilocal strings (Achucarro et al., 2014), dual-higgsed strings (Bevis and Saffin, 2008; Lizarraga and Urrestilla, 2016) and hybrid defect simulations (Hindmarsh et al., 2017b; McGraw, 1998; Hindmarsh et al., 2018). Last but not least, fully general relativistic treatments have also been recently introduced (Helfer et al., 2019; Drew and Shellard, 2019).

Simulations which use Graphics Processing Units are far more scarce, the only instance reported so far being our own work on domain wall networks (Correia and Martins, 2017). The present paper is a continuation of that previous study, now aiming to simulate local Abelian-Higgs strings. We first introduce the algorithm used to simulate field theory cosmic strings (including the discretization used and how to calculate the relevant network averaged quantities), then showcase the performance of the implementation, and finally validate it by directly comparing with results found in the literature. We conclude by highlighting tentative directions to augment the scalability of this code. Meanwhile, the early results of the cosmological exploitation of this code can be found in Correia and Martins (2019).


2. Discretization scheme

A U(1) local Abelian-Higgs string corresponds to a topological soliton that arises as a solution to the equations of motion of the Lagrangian

\mathcal{L} = |D_\mu \phi|^2 - \frac{1}{4} F_{\mu\nu} F^{\mu\nu} - \frac{\lambda}{4} \left( |\phi|^2 - \sigma^2 \right)^2 \,, \qquad (1)

where φ is a complex scalar field, F_{μν} = ∂_μ A_ν − ∂_ν A_μ is the gauge field strength (A_μ is the gauge field), D_μ = ∂_μ − ieA_μ denotes the covariant derivative, and λ and e are two constants which set the values of the scalar and vector masses, respectively m_φ = √λ σ and m_v = eσ.

We follow the same discretization procedure as Bevis et al. (2007), which requires first writing the discrete Lagrangian,

\mathcal{L} = \frac{1}{2 e^2(\eta)\, a^2(\eta)\, \Delta x^2} \left( \sum_i \left( E_i^{x,\eta-1/2} \right)^2 - \frac{1}{2\Delta x^2} \sum_i \sum_j \left[ 1 - \cos\left(\Xi^x_{ij}\right) \right] \right) + |\Pi^{x,\eta}|^2 - \frac{1}{\Delta x^2} \sum_i \left| e^{-i A_i^{x,\eta-1/2}} \phi^{x+k_i,\eta} - \phi^{x,\eta} \right|^2 - \frac{1}{4}\, a^2(\eta)\, \lambda(\eta) \left( |\phi^{x,\eta}|^2 - \sigma^2 \right)^2 \qquad (2)

where a = a(η) is the conformal factor for a Friedmann-Robertson-Walker space-time, φ and A_i represent the complex scalar field and the spatial components of the gauge field at conformal time-step η, and Π and E_i are the conjugates of the aforementioned fields (of φ and A_i, respectively) at half-steps. Note that the vector fields are rescaled as A^x_i → e∆x A^x_i and E^x_i → e∆η E^x_i. Both scalar fields reside at lattice sites i, j, k → x, and the vector fields at half-sites (for convenience, however, the site A^{x+k_i/2}_i, where k_i is a unit vector, is written A^x_i).

The action of the gauge field on the lattice is then defined in terms of link variables, such as exp(−iA^x_i). These are a standard prescription of lattice gauge theory used to construct gauge-invariant quantities on a lattice (Wilson, 1974). They can then be used to define a plaquette operator

\exp(i A^x_i)\, \exp(i A^{x+k_i}_j)\, \exp(-i A^{x+k_j}_i)\, \exp(-i A^x_j) = \exp(i \Xi_{ij}) \,, \qquad (3)

where Ξ_{ij}/∆x² represents the discrete version of F_{ij}. The real part of the plaquette operator in the small lattice spacing limit, along with the small-angle approximation (1 − cos(x) → x²/2), defines the discretized version of F_{ij}F^{ij}. The gauge covariant derivative is written as exp(−iA^{x,η−1/2}_i) φ^{x+k_i,η} − φ^{x,η}. Finally, σ sets the vacuum expectation value, a_η, e_η and λ_η respectively denote the scale factor and the gauge and scalar couplings at time η (to be defined below), and ∆x and ∆η are the lattice spacing and time-step values.


Through variational principles, much as can be found in Bevis et al. (2007), one can write the equations of motion as a staggered leap-frog scheme

\Pi^{x,\eta+1/2} = \left( \frac{a_{\eta-1/2}}{a_{\eta+1/2}} \right)^2 \Pi^{x,\eta-1/2} - \Delta\eta \left( \frac{a^2_\eta}{a_{\eta+1/2}} \right)^2 \frac{\lambda_\eta}{2} \left( |\phi^{x,\eta}|^2 - \sigma^2 \right) \phi^{x,\eta} + \frac{\Delta\eta}{(\Delta x)^2} \left( \frac{a_\eta}{a_{\eta+1/2}} \right)^2 \sum_j \left( \phi^{x+k_j,\eta} e^{i A^{x,\eta}_j} - 2\phi^{x,\eta} + \phi^{x-k_j,\eta} e^{-i A^{x-k_j,\eta}_j} \right) \qquad (4)

E_i^{x,\eta+1/2} = \left( \frac{e_{\eta+1/2}}{e_{\eta-1/2}} \right)^2 E_i^{x,\eta-1/2} + 2\Delta\eta\, a^2_\eta\, e^2_{\eta+1/2} \,\mathrm{Im}\left[ \left(\phi^{x+k_i,\eta}\right)^* e^{-i A^{x,\eta}_i}\, \phi^{x,\eta} \right] - \frac{\Delta\eta}{(\Delta x)^2} \left( e_{\eta+1/2} \right)^2 \sum_{j \neq i} \left[ \sin\left(\Xi^x_{ij}\right) - \sin\left(\Xi^{x-k_j}_{ij}\right) \right] \qquad (5)

\phi^{x,\eta+1} = \phi^{x,\eta} + \Delta\eta\, \Pi^{x,\eta+1/2} \qquad (6)

A_i^{x,\eta+1} = A_i^{x,\eta} + \Delta\eta\, E_i^{x,\eta+1/2} \,, \qquad (7)

which tells us how to update the field variables at each time-step. The simulations are evolved until half a light-crossing time since, due to the periodic boundary conditions, evolving any further would affect the dynamics (all the defects would effectively be inside the same horizon, which would not adequately mimic a network in an expanding universe).
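For illustration, a minimal CUDA sketch of the simplest of these updates, Eqs. 6-7, is given below. The field layout (one float2 per site for φ and Π, one float4 per site for the three components of A_i and E_i) and all names are assumptions made for this example, not the actual implementation.

// Illustrative sketch of the Eq. (6)-(7) update (the role played by the stepC kernel).
__global__ void stepC_sketch(float2* phi, const float2* pi,
                             float4* A, const float4* E,
                             float deta, int n_sites)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_sites) return;

    // phi^{x,eta+1} = phi^{x,eta} + Delta(eta) * Pi^{x,eta+1/2}   (complex scalar, Eq. 6)
    phi[idx].x += deta * pi[idx].x;
    phi[idx].y += deta * pi[idx].y;

    // A_i^{x,eta+1} = A_i^{x,eta} + Delta(eta) * E_i^{x,eta+1/2}  (three links per site, Eq. 7)
    float4 a = A[idx];
    float4 e = E[idx];
    a.x += deta * e.x;
    a.y += deta * e.y;
    a.z += deta * e.z;
    A[idx] = a;
}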

There is a further subtlety in most field theory defect simulations: since the physical defect width is constant, it will shrink in comoving coordinates. This means that the true equations of motion result in strings that eventually fall through the lattice and can no longer be resolved. A way to bypass this problem is to fix the comoving width, as originally done in Press et al. (1989), or to first apply a comoving core growth period (Bevis et al., 2007), such that by the end of the simulation defects can still be resolved. Since the defect width is inversely proportional to the scalar and vector masses, this means that one must change the way e_η and λ_η behave as

e_\eta = e_0\, a_\eta^{s-1} \,, \qquad \lambda_\eta = \lambda_0\, a_\eta^{2(s-1)} \qquad (8)

where s is a parameter such that s = 0 forces a constant comoving core width and s = 1 recovers the original equations of motion (negative values imply core growth). As long as these modified constants are used in the action (and consequently in the discrete version of the equations of motion above), this parameter can be controlled throughout the simulation to either have defects with fixed comoving width, or with growing width up to some specified timestep and physical width thereafter.
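In code, Eq. 8 amounts to two one-line helpers of the following form (a hedged sketch; the function names are illustrative):

#include <cmath>

// e_eta = e0 * a^(s-1) and lambda_eta = lambda0 * a^(2(s-1)), cf. Eq. (8).
__host__ __device__ inline float e_eta(float e0, float a, float s)
{
    return e0 * powf(a, s - 1.0f);
}

__host__ __device__ inline float lambda_eta(float lambda0, float a, float s)
{
    return lambda0 * powf(a, 2.0f * (s - 1.0f));
}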

For our simulations scalar and vector masses are made equal by choosing e_0 = 1 and λ_0 = 2, and in what follows we will present results both with s = 0 and with s = 1. For the initial conditions, we choose to mimic a field configuration after a phase transition which at the same time obeys the discretized form of Gauss's law. As such, the scalar field φ is set to have a random phase, its norm is set to unity (given that σ = 1), and all other field variables are set to zero. For the random phase, the cuRAND library (NvidiaCorporation, b) is used.
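A hedged sketch of this initialisation, assuming one uniform deviate per lattice site drawn with the cuRAND host API and then mapped to a unit-norm scalar field (the array and function names are illustrative), is:

#include <curand.h>

// Set phi = (cos theta, sin theta) with a random phase theta per site, so |phi| = 1.
__global__ void init_phase(float2* phi, const float* u, int n_sites)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_sites) return;
    float theta = 2.0f * 3.14159265f * u[idx];   // uniform phase in (0, 2*pi]
    phi[idx] = make_float2(cosf(theta), sinf(theta));
}

void set_initial_conditions(float2* d_phi, float* d_uniform, int n_sites, unsigned long long seed)
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, d_uniform, n_sites);   // one uniform number per lattice site
    curandDestroyGenerator(gen);

    int block = 256;
    init_phase<<<(n_sites + block - 1) / block, block>>>(d_phi, d_uniform, n_sites);
}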

In order to validate the simulations, two diagnostics are calculated: a mean string separation and a weighted mean squared velocity (obtained from the local gradient and rate of change of the scalar field),

\xi_L = \sqrt{\frac{-\mu}{\bar{\mathcal{L}}}} \,, \qquad \langle v^2 \rangle_X = \frac{2R}{1+R} \qquad (9)

where

R = \frac{\int |\dot{\phi}|^2\, X\, d^3x}{\int |D\phi|^2\, X\, d^3x} \,, \qquad (10)

and X is a weight function. The first estimator is taken from Bevis et al. (2007), where it was shown that the discrete Lagrangian density peaks negatively at the strings (and therefore its mean, \bar{\mathcal{L}}, can be used to estimate the mean string separation), while the second matches the corrected definition presented in Hindmarsh et al. (2017a). In what follows we will use two different weighting functions, either the potential or the Lagrangian, in order to establish a comparison.

As a cross-check we use a further mean string separation estimator, ξ_W, where the total length of string is computed by finding all lattice plaquettes pierced by a string. The way to do this is to consider a gauge-invariant winding around a plaquette,

W_{ij} = \left( Y^x_i + Y^{x+i}_j - Y^{x+j}_i - Y^x_j \right) / 2\pi \,, \qquad (11)

where Y^x_i = [arg(φ^{x+k_i}) − arg(φ^x) + A^x_i]_π − A^x_i, as introduced in Kajantie et al. (1998). Note that the argument of the scalar field is assumed to be in the interval [−π, π], and the term in [...]_π has multiples of 2π added or subtracted in order to bind the result to the interval [−π, π]. If W_{ij} is different from zero then a string is present, and a length of ∆x is associated with each winding found. Note that we also multiply by a factor of π/6 to account for the Manhattan effect (Scherrer and Vilenkin, 1998).
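A minimal device-side sketch of this winding computation, with the wrapping of phase differences to the (−π, π] interval made explicit, follows; the argument names are illustrative and the four sites are the corners of the (i, j) plaquette at x.

// Hedged sketch of the gauge-invariant winding of Eq. (11) through one plaquette.
__device__ float wrap_to_pi(float a)
{
    // add/subtract multiples of 2*pi so the result lies in (-pi, pi]
    return a - 2.0f * 3.14159265f * rintf(a / (2.0f * 3.14159265f));
}

__device__ float winding_ij(float arg_x, float arg_xi, float arg_xj, float arg_xij,
                            float Ai_x, float Aj_xi, float Ai_xj, float Aj_x)
{
    float Yi_x  = wrap_to_pi(arg_xi  - arg_x  + Ai_x)  - Ai_x;   // Y_i at x
    float Yj_xi = wrap_to_pi(arg_xij - arg_xi + Aj_xi) - Aj_xi;  // Y_j at x + k_i
    float Yi_xj = wrap_to_pi(arg_xij - arg_xj + Ai_xj) - Ai_xj;  // Y_i at x + k_j
    float Yj_x  = wrap_to_pi(arg_xj  - arg_x  + Aj_x)  - Aj_x;   // Y_j at x
    // W_ij = (Y_i(x) + Y_j(x+k_i) - Y_i(x+k_j) - Y_j(x)) / 2*pi, cf. Eq. (11)
    return (Yi_x + Yj_xi - Yi_xj - Yj_x) / (2.0f * 3.14159265f);
}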

A further cross-check can be done through an estimator of Hindmarsh et al. (2017a) for the mean velocity squared, where the mean weighted pressure and density are used to compute the equation of state parameter of the strings and, from it, the velocity,

\langle v^2 \rangle_\omega = \frac{1}{2} \left( 1 + 3 \frac{p_L}{\rho_L} \right) \qquad (12)

Given that in Hindmarsh et al. (2017a) this estimator has been shown to be in better agreement with the velocity of a standing wave string, it serves as a baseline for the comparison between estimators.
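Once the lattice sums are available on the host (see Section 3 for how they are accumulated on the GPU), the diagnostics of Eqs. 9, 10 and 12 reduce to a few lines of post-processing. The sketch below is illustrative only, and its variable names are assumptions.

#include <cmath>

struct Diagnostics { double xi_L; double v2_X; double v2_omega; };

// Combine lattice sums into the diagnostics of Eqs. (9), (10) and (12).
Diagnostics combine_sums(double sum_lagrangian,   // sum of the discrete Lagrangian density
                         double sum_phidot2_X,    // sum of |phi_dot|^2 * X
                         double sum_gradphi2_X,   // sum of |D phi|^2 * X
                         double sum_rho_L,        // Lagrangian-weighted density
                         double sum_p_L,          // Lagrangian-weighted pressure
                         double mu,               // string mass per unit length
                         long long n_sites)
{
    Diagnostics d;
    double mean_L = sum_lagrangian / n_sites;               // \bar{L}, negative near strings
    d.xi_L = std::sqrt(-mu / mean_L);                       // Eq. (9), first estimator
    double R = sum_phidot2_X / sum_gradphi2_X;              // Eq. (10)
    d.v2_X = 2.0 * R / (1.0 + R);                           // Eq. (9), second estimator
    d.v2_omega = 0.5 * (1.0 + 3.0 * sum_p_L / sum_rho_L);   // Eq. (12)
    return d;
}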

3. Implementation and performance

We now describe how our implementation utilises an application programming interface named Compute Unified Device Architecture (CUDA, by NVIDIA Corporation) to evolve a network of Abelian-Higgs cosmic strings. The development and all benchmarks were done on an NVIDIA Quadro P5000, with 2560 CUDA cores, a core clock of 1607 MHz and 16384 MB of memory, clocked at 1126 MHz.


One of the main roles of CUDA is to provide a way to abstract some details of the underlying hardware, while allowing some degree of optimisation. A relevant example in the present case is that even though GPUs are made of Streaming Multiprocessors (each made up of several cores which execute instructions in a 32-way lane, called a warp, in Single Instruction Multiple Data fashion), we will not explicitly distribute threads across different Streaming Multiprocessors, but spawn a number of threads equal to N² in an N³ simulation box (the reason for this will become apparent in two paragraphs), subdivided into groups of threads called thread blocks. Multiple thread blocks will be resident at each Streaming Multiprocessor, assigned automatically without our intervention.

In CUDA, applications are subdivided into data parallel functions named kernels. In our application there are three kernels that evolve the field configurations at every time-step: the first one corresponds to Eq. 4, the second to Eq. 5 and the third one to Eqs. 6-7. These will be denoted stepA, stepB and stepC respectively. There are also kernels associated with computing useful quantities such as the mean string separations or the velocities. Since these kernels essentially implement finite differences (and these are often memory-bound), one must exploit the memory hierarchy of a GPU. The abstract memory model of CUDA describes myriad types of memory; the ones relevant for the next paragraph include global memory (which corresponds to video memory), shared memory (a fast on-chip memory available to groups of threads, known as thread blocks) and registers (per-thread, even faster on-chip memory).

For the first two kernels one loads the relevant field quantities from global memory at zero height (k = 0) into shared memory. We denote these 2D (oriented along X and Y) chunks of shared memory as tiles. One would naively expect these tiles to have a size equal to the number of threads along the x-direction times the number of threads along the y-direction in a given thread block; however, as can be seen from the discretization scheme, there are terms which involve field quantities at positions x + k_j, x − k_j where j = 1, 2, 3. As such, the very frontiers of each tile require values from neighbouring tiles. Given that there is no communication between thread blocks, we must pad our XY-tiles by 2 along each direction and load the appropriate boundary terms into these padding regions (commonly known as ghost cells or halos) prior to any actual computation. Afterwards we simply stream through the z-direction (Zhang and Mueller, 2012; Micikevicius, 2009; Nguyen et al., 2010; Phillips and Fatica, 2010). The main advantage of doing so is to enable software pre-fetching: only the field at the next z-position is loaded (into registers), and when streaming up the z-direction this value is loaded into the current shared memory tile. Similarly, the previously current shared memory tile is loaded into the bottom registers (see Fig. 1 for a schematic representation). In stepB, and in the kernels which calculate average network quantities, instead of temporary variables above and below we use shared memory tiles for the top and bottom (complete with halos). There are two reasons to do so: convenience (some calculations may require values on the top/bottom tile's halos) and reducing register over-use. Note that if the number of necessary registers exceeds what can be provided by on-chip memory, the compiler will 'spill' some of these variables to the slower global memory. This is known as register spilling.
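A minimal sketch of this tiling and z-streaming pattern, in CUDA and with purely illustrative names (stream_z_sketch, TILE_X, TILE_Y and the flattened indexing are assumptions, not the identifiers of the actual code), is the following; the stencil work itself is elided.

// Illustrative sketch of the tile + register z-streaming pattern (not the actual
// stepA kernel): a 2D shared-memory tile with a one-cell halo holds the current
// z-slice, while registers hold the values directly below and above it.
#define TILE_X 32
#define TILE_Y 4

__global__ void stream_z_sketch(const float2* __restrict__ field, float2* out, int N)
{
    __shared__ float2 tile[TILE_Y + 2][TILE_X + 2];

    // N is assumed to be a multiple of TILE_X and TILE_Y, so no bounds checks are needed.
    int i = blockIdx.x * TILE_X + threadIdx.x;        // global x
    int j = blockIdx.y * TILE_Y + threadIdx.y;        // global y
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;   // position inside the padded tile

    // Registers for the slices below and above the current one (periodic in z).
    float2 below   = field[((N - 1) * N + j) * N + i];
    float2 current = field[(0 * N + j) * N + i];
    float2 above;

    for (int k = 0; k < N; ++k) {
        above = field[(((k + 1) % N) * N + j) * N + i];   // software pre-fetch of slice k+1

        tile[ty][tx] = current;
        // Edge threads also fill the halo (ghost) cells, with periodic boundaries.
        if (threadIdx.x == 0)          tile[ty][0]          = field[(k * N + j) * N + (i + N - 1) % N];
        if (threadIdx.x == TILE_X - 1) tile[ty][TILE_X + 1] = field[(k * N + j) * N + (i + 1) % N];
        if (threadIdx.y == 0)          tile[0][tx]          = field[(k * N + (j + N - 1) % N) * N + i];
        if (threadIdx.y == TILE_Y - 1) tile[TILE_Y + 1][tx] = field[(k * N + (j + 1) % N) * N + i];
        __syncthreads();

        // ... 7-point stencil work would go here, using tile[ty][tx +/- 1],
        //     tile[ty +/- 1][tx], below and above ...
        out[(k * N + j) * N + i] = tile[ty][tx];          // placeholder write

        __syncthreads();        // ensure the tile can be safely overwritten next iteration
        below = current;        // slide the z-window up by one slice
        current = above;
    }
}

// Example launch: N^2 threads for an N^3 box, with (32,4) thread blocks as in the text.
void launch_stream_z(const float2* d_field, float2* d_out, int N)
{
    dim3 block(TILE_X, TILE_Y);
    dim3 grid(N / TILE_X, N / TILE_Y);
    stream_z_sketch<<<grid, block>>>(d_field, d_out, N);
}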

There is another advantage to loading field values into shared memory tiles and/or registers. The field variables are stored using the aligned vector types defined in CUDA


Figure 1: Schematic representation of the stepA kernel (which updates the conjugate momentum Π): the tile in the middle represents a 2D shared memory tile where the current values in the z-direction (site k) are loaded together with halos, and register values (blue pinheads) hold field values directly above and below (k − 1 and k + 1).

Table 1: The effective Global Load and Store bandwidth (in units of GB/s), the number of Floating Point Operations per second (in Teraflops) and the achieved occupancy, for a 256³ simulation in the radiation era and for constant comoving width.

Kernel       GLS (GB/s)   TFLOPs   Occupancy (%)
stepA        245.92       1.59     48.0
stepB        271.60       0.80     47.8
stepC        264.83       0.04     90.8
VelRVLag     212.88       1.68     48.3
VelRWLag     217.09       1.67     48.3
VelEoSLag    193.70       1.80     48.3
Winding      133.72       1.26     48.3

(float2 and float4), and while the vector loads ensure coalesced[1] memory reads, some computations which require specific components of each field would cause un-coalesced reads. This bottleneck is circumvented by using shared memory and registers. The third kernel is more straightforward, since software pre-fetching cannot be implemented. It simply reads the fields and their conjugates from global memory and writes the updated field values back.

The three kernels are limited by memory bandwidth, in particular when reading from global memory, as indicated by the NVIDIA Visual Profiler. As such, the most relevant performance metric is the effective bandwidth (bytes loaded and stored from/into global memory per second) and how it compares to the peak bandwidth of the GDDR5X memory present in the test-bench graphics card.

[1] Coalesced memory accesses are defined as accesses where multiple reads or writes are combined into one single transaction. In general this requires that such reads/writes are not sparse, misaligned or non-sequential. In NVIDIA GPUs every successive 128 bytes can be loaded by 32 threads (denoted a warp) in one single transaction (32 × 4 bytes then yields 128 bytes).


The average bandwidth reached by each kernel (together with additional metrics) can be found in Table 1, for a box size of 256³, in the radiation era and for constant comoving width. It is seen that we are close to the peak bandwidth (288.5 GB/s). Additionally, the peak single precision throughput is 8.876 TFLOP/s on this particular card. Both peak throughput and bandwidth are expected theoretical values, reported by the NVIDIA Visual Profiler (i.e. we did not perform measurements of peak bandwidth/throughput using custom kernels).
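For concreteness, an effective bandwidth figure of this kind can be obtained by timing a kernel with CUDA events and dividing an estimate of its global-memory traffic by the elapsed time. The self-contained sketch below uses a stand-in kernel and an assumed byte count, neither of which is taken from the actual code.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* data, int n)   // stand-in for a real evolution kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;                    // one load and one store per site
}

int main()
{
    const int n = 256 * 256 * 256;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes_per_call = 2.0 * n * sizeof(float);              // assumed traffic: load + store
    printf("effective bandwidth: %.1f GB/s\n", bytes_per_call / (ms * 1.0e6));

    cudaFree(d_data);
    return 0;
}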

An important detail is the chosen thread block size, in particular for the first two kernels. In general, kernels which perform finite difference methods (such as those of a 7-point stencil) prefer a larger thread block size in order to mitigate the performance hit from loading tile halos. However, in our case, due to the data-reuse pattern above, one must also consider whether the thread block size will result in register/shared memory over-use. With the help of the online CUDA Occupancy Calculator (Xmartlabs, 2012), the thread block size that seemed to yield the best performance was (32,4) at 256³ box size. The main limiting factor for the occupancy per Streaming Multiprocessor (the ratio of warps being executed on each Streaming Multiprocessor to the theoretical maximum number of warps per Streaming Multiprocessor) seems to be register pressure, as shown by the NVIDIA Visual Profiler. In all cases the occupancy is large enough that increasing it might not yield better performance: as previously stated, the Visual Profiler does not indicate latency as the main performance bottleneck.

The kernels which calculate the mean string separation ξ_L (recall Eq. 9) and each of the three mean velocity squared estimators 〈v²〉_V, 〈v²〉_W or 〈v²〉_ω (hereinafter named VelRVLag, VelRWLag and VelEoSLag, respectively), as well as the kernel which computes the winding (named Winding), operate by using the memory pattern described above to load data into shared memory tiles. They then compute the Lagrangian, and either the numerator and denominator of R, or the ratio between ρ_L and p_L, for each thread (cf. Eqs. 10 and 12, respectively). Each result is stored in a register, and the CUDA Unbound (CUB) library (NvidiaResearch-NVLabs, 2018) is finally used to compute a thread block-wide sum. Since each block computes a partial sum, we then transfer these back to the host and, after summing, write to disk.
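A hedged sketch of this block-wide reduction step, using CUB's BlockReduce (the kernel and array names below are illustrative assumptions, not the ones in the code), is:

#include <cub/cub.cuh>

// Each thread contributes one per-site value; CUB sums it within the thread block
// and thread 0 writes the partial sum, which the host later finishes off.
__global__ void partial_sums_sketch(const float* per_site_value, float* block_sums, int n_sites)
{
    constexpr int BLOCK_THREADS = 128;                  // must match the launch configuration
    typedef cub::BlockReduce<float, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    float value = (idx < n_sites) ? per_site_value[idx] : 0.0f;   // e.g. the local Lagrangian density

    float block_sum = BlockReduce(temp_storage).Sum(value);       // block-wide sum
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = block_sum;             // one partial sum per block, copied to the host
}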

The partial sums are calculated on the GPU in order to avoid becoming IO-bound (PCI-E buses could easily be saturated by transferring the values of each field variable to the host). The three velocity estimator kernels are bottlenecked by both compute and memory requirements, and end up having reasonably similar performances. The memory requirements are in part explained by the excessive register spilling that occurs (this can be avoided by not limiting the maximum number of registers per thread to 64 with the compiler option --maxrregcount, but the side-effect is that it significantly reduces the occupancy, and this heavily impacts performance). The impact of spilling is mitigated by turning on the compiler flag -Xptxas -dlcm=ca, which caches these spills in L1. Improving the compute part, however, is more challenging: many of the compiler flags which attempt utilisation of hardware intrinsics, or reduce the precision of certain operations, often affect the quality of the diagnostics, either changing the asymptotic quantities themselves or increasing the uncertainties.

Still, there is one simple optimisation that reduces runtimes: avoid executing these kernels at every timestep. In other words, we calculate the diagnostic quantities only every n timesteps (hereinafter we take n = 5), and reduce statistical uncertainties by doing multiple runs. This effectively reduces the time spent in the calculation kernels, as can be seen in Table 2 for an example run where all estimators are run.


Table 2: Total elapsed times, in seconds, on the three evolution kernels plus the estimator kernels (which calculate averaged network quantities) of one 256³ and one 512³ run. The total time is computed by taking the average runtime and multiplying by the number of times a kernel is executed in a single run, i.e. ⟨Time⟩ × n_calls. The first three kernels are executed every timestep, while the others are executed only every 5 timesteps.

Kernel       256³    512³
stepA        2.29    36.86
stepB        3.10    50.02
stepC        2.89    46.30
VelRVLag     0.57    8.38
VelRWLag     0.56    8.45
VelEoSLag    0.64    8.88
Winding      0.87    11.92

Note however that in typical production runs one will select only one of the velocity estimators and, optionally, the Winding estimator. The total run time will therefore depend on which diagnostics one chooses to output, and how often this is done.

One final remark about the time spent in Input/Output operations (transferring partial sums, computing the final sum on the host; cf. Table 2) is that we could speed up the simulation further by overlapping compute on the GPU with the aforementioned operations; however, how much can be gained in terms of speed will also depend on how often we choose to calculate useful quantities. For now, given that a reduced number of calls to the estimator kernels diminishes the need for such an optimisation, and that we can instead venture into multi-GPU territory, we keep everything non-overlapped.

The main motivation for writing GPU-optimised applications is the higher theoretical bandwidth and throughput ceilings. Based on typical figures for current high-end multicore CPUs, memory and GPUs, one can stipulate a speed-up of one order of magnitude for bandwidth- or compute-bound applications (NvidiaCorporation, a), assuming of course that both applications are fully multi-threaded and optimised (and thus reach close-to-peak throughput and bandwidth). Of course, this will depend on the underlying hardware where each simulation is executed and on the optimisation applied, so the precise number may vary. In our previous work on domain walls (Correia and Martins, 2017) a speed-up of about two orders of magnitude was found on a typical desktop computer when comparing to a single-thread implementation. In the present case we do not have a CPU code that enables a direct comparison (and the published work of other authors does not provide useful benchmarks), but we can conservatively estimate a speed-up of at least one order of magnitude.

We can also compare the performance, in time to full evolution multiplied by the number of processors per number of sites (gpu-sec/site vs core-sec/site), for the evolution update and winding outputs, with that of the cosmic string simulation of Bevis et al. (2007) and Hindmarsh et al. (2017a) (hereafter referred to as Lattice Abelian-Higgs, LAH), measured on the Monte Rosa supercomputer at 4096³ box size with 32768 cores (Hindmarsh and Daverio). Note that before such a comparison can be made there are some caveats: Hindmarsh and Daverio remarked that their simulation is not too optimised.


Table 3: The performance of each of our kernels, given in gpu-sec/site. Note that in order to compare with the LAH performance, provided by Hindmarsh and Daverio, we present the performance of all three update kernels together (stepA+B+C, computed by summing the time for each update kernel from Table 2 and then dividing by the number of sites). These numbers can be obtained from the times at 512³ in Table 2 by dividing by the number of calls of each kernel in a run (1280 for steps A, B and C, 256 for the estimators) and by dividing by the size of the lattice, 512³.

Kernel       Performance GPU-AH          Performance LAH
             (gpu-sec/site at 512³)      (core-sec/site at 4096³)
stepA+B+C    7.75 × 10⁻¹⁰                8 × 10⁻⁷
VelRVLag     2.43 × 10⁻¹⁰                Not available
VelRWLag     2.46 × 10⁻¹⁰                Not available
VelEoSLag    2.58 × 10⁻¹⁰                Not available
Winding      3.47 × 10⁻¹⁰                1.3 × 10⁻⁶

While we were provided the performance of the winding update, note that most of the time spent on windings is due to output (writing to disk), not due to the computation of windings (in contrast with our case, as we sum the windings and output the mean string separation estimator only). As such, it is not entirely correct to compare the winding performance directly with ours. LAH for the evolution update (that is, the computation analogous to our stepA + stepB + stepC), for 10903 timesteps and a 4096³ box size, and for the computation of windings (for 1300 of the timesteps), has the following performance figures: 8 × 10⁻⁷ core-sec/site and 1.3 × 10⁻⁶ core-sec/site, respectively. Evolving a 512³ lattice from start to finish (stepA + stepB + stepC for 1280 timesteps) reveals a performance of 7.75 × 10⁻¹⁰ gpu-sec/site. Compared with the evolution figures from LAH, our simulation therefore spends about three orders of magnitude less time updating fields on a given lattice site, showing that our estimate in the previous paragraph is indeed conservative. We present the figures for all of the other kernels in Table 3. Note that in general GPU cores are much slower than traditional processor cores, even though this last table seems to suggest they are only 2.5 times slower. We are not sure of the reason for this behaviour, but we may speculatively suggest that it reflects the different levels of optimization of the two codes.

4. Scaling Validation

We have checked that the discretized form of Gauss's law is preserved to machine precision. Additionally, inspecting iso-surfaces of the scalar field provides visual confirmation that a network of strings is formed and evolves as expected; some examples can be seen in Fig. 2.

For the domain walls GPU code (Correia and Martins, 2017) we had a serial version of the simulation which had been previously tested and validated (Press et al., 1989; Martins et al., 2016a,b), and could therefore directly compare outputs. In the present strings case both the serial and parallel versions are completely new to the authors, so we will validate them by evaluating the asymptotic scaling values and comparing them with the results in the literature (which come from CPU codes). We have performed simulations in the two canonical cosmological epochs, the radiation and matter eras.


Figure 2: Isosurfaces of the absolute value of the complex scalar field, with the value 0.5, showing a network of Abelian-Higgs cosmic strings in the radiation and matter eras (left and right side panels respectively). All pictures are from simulations with box size 512³; the top panels correspond to timestep 60, while the bottom panels correspond to timestep 128.


Table 4: Numerical results for the asymptotic scaling quantities: the mean string separation slopes dξ_L/dη and dξ_W/dη (calculated using the Lagrangian or the winding estimator, respectively) and the three velocity estimators, for s = 0 and s = 1 (where applicable), from our simulations and from the literature. All quantities were measured in simulations with box sizes of 512³, except where otherwise noted. The labels ext. and asy. denote values that were, respectively, extrapolated (rather than directly measured from the simulations) and inferred by visual inspection of Fig. 9 of Hindmarsh et al. (2017a); see the main text for further discussion of these.

Epoch       s   dξ_L/dη          dξ_W/dη          〈v²〉_V        〈v²〉_L        〈v²〉_ω               Reference
Radiation   1   0.33 ± 0.02      -                -              -              -                     Bevis et al. (2007)
Radiation   1   -                -                -              -              0.37 ± 0.01 (ext.)    Hindmarsh et al. (2017a) @ 4096³
Radiation   1   -                -                -              -              0.30 ± 0.01 (asy.)    Hindmarsh et al. (2017a) @ 4096³
Radiation   1   0.254 ± 0.005    0.265 ± 0.005    -              -              -                     Daverio et al. (2016) @ 4096³
Radiation   1   0.32 ± 0.01      0.32 ± 0.03      0.34 ± 0.01    0.31 ± 0.01    0.31 ± 0.01           This work
Radiation   0   0.31 ± 0.02      -                -              -              -                     Bevis et al. (2007)
Radiation   0   -                0.26 ± 0.02      -              -              -                     Bevis et al. (2010) @ 1024³
Radiation   0   0.234 ± 0.006    0.244 ± 0.005    -              -              -                     Daverio et al. (2016) @ 4096³
Radiation   0   0.30 ± 0.02      0.32 ± 0.03      0.34 ± 0.01    0.31 ± 0.01    0.32 ± 0.01           This work
Matter      1   -                -                -              -              0.31 ± 0.01 (ext.)    Hindmarsh et al. (2017a) @ 4096³
Matter      1   -                -                -              -              0.26 ± 0.01 (asy.)    Hindmarsh et al. (2017a) @ 4096³
Matter      1   0.261 ± 0.008    0.277 ± 0.008    -              -              -                     Daverio et al. (2016) @ 4096³
Matter      0   0.30 ± 0.01      -                -              -              -                     Bevis et al. (2007)
Matter      0   -                0.28 ± 0.01      -              -              -                     Bevis et al. (2010) @ 1024³
Matter      0   0.235 ± 0.008    0.247 ± 0.008    -              -              -                     Daverio et al. (2016) @ 4096³
Matter      0   0.29 ± 0.01      0.29 ± 0.02      0.26 ± 0.01    0.27 ± 0.01    0.25 ± 0.01           This work

In these eras the scale factor respectively evolves as a(η) ∝ η and a(η) ∝ η². Snapshots of the simulations in the two eras can be seen in Figure 2, and the comparison with the literature is summarised in Table 4. The scaling quantities obtained in the present work are the averages of the velocity and of the slope of the mean string separation, over the dynamic range in which the networks have reached scaling. In each case we average 5 different runs (with random initial conditions) to obtain a statistical error.

Comparing directly our results for the slope of the Lagrangian-based mean string separation ξ_L with the values in Bevis et al. (2007), we find excellent agreement for both matter and radiation era simulations. Our other length estimator, ξ_W, is also in excellent agreement with the results of the first, but in mild disagreement (about 1.5 standard deviations, if one assumes Gaussian errors) with the value found in Bevis et al. (2010). The discrepancy increases if we compare with the even larger simulations of Daverio et al. (2016). As explained in Bevis et al. (2010), this is a consequence of the fact that these works (Bevis et al., 2010; Daverio et al., 2016) include an early period of cooling (introduced by modifying the equations of motion, which effectively changes the initial conditions), which is done with the goal of reaching scaling as quickly as possible. The combination of this choice and the extended dynamic range then leads to a slow drift in the slope dξ/dη (changing it from the 0.3 value to about 0.28 at 1024³ and then to about 0.24 at 4096³). Thus in this particular case the statistical disagreement is at about 3 standard deviations, but this also highlights the fact that these simulations include systematic uncertainties due to the numerical implementation itself, which must be taken into account as improvements in hardware and software gradually reduce the statistical uncertainties.

Figure 3 depicts the evolution of the Lagrangian-based mean string separation ξ_L and the winding-based mean string separation ξ_W for our 512³ runs, in both the radiation and matter eras. Qualitatively, the approach to scaling is clearly visible, and this is confirmed by the quantitative analysis described in the previous paragraph.

We note that in the case of the Lagrangian-based mean string separation some oscillations can be seen, signalling the presence of some radiation in the box. This is well understood from previous work with high-resolution field theory simulations of domain walls (Martins et al., 2016a,b), which shows that the presence of this radiation does not prevent scaling. We note that it would be possible to artificially suppress this radiation by numerically implementing an ad hoc period of cooling, as is sometimes done in the literature. However, we have not done this: it is not necessary for our purposes (i.e., for validating the code through its diagnostics of the scaling properties of the string networks). Indeed, since part of our goal is to demonstrate that scaling is reached, we should not use any evolution period which might artificially facilitate the approach to scaling. Moreover, previous work on domain walls shows that this would erase relevant information for the purpose of modelling network evolution. Indeed, one can numerically separate the energy in defects from that in radiation, and analytically model the evolution of both: for domain walls this has been done in Martins et al. (2016a,b), and for strings early results demonstrating that this is possible can be found in Correia and Martins (2019).

As for the velocity estimators, the comparison has to be more qualitative since there are fewer measurements of velocities reported in previous field theory simulations. The most recent work is Hindmarsh et al. (2017a), which only tabulates values obtained from extrapolating the results of their simulations to infinite string separation, a process whose physical meaning is not entirely clear. We do present these values in Table 4 (denoting them with ext.), but we also note that a more meaningful comparison is likely to be with the asymptotic values (denoted asy. in the table). The reason is simply that these asymptotic values were measured directly from simulations (indeed they can be visually read off from the top and bottom panels of figure 9 of Hindmarsh et al., 2017a), while the others were extrapolated from the simulations with some additional assumptions.

That being said, our analysis shows that all velocity estimators are in reasonable agreement. For radiation the potential-weighted estimator yields a slightly higher value than the others, but this is not statistically significant. Note that in the matter era we cannot evolve the true equations of motion, i.e. the case s = 1 (one needs a larger dynamic range in order to successfully use core growth), though this can be resolved by running larger simulation boxes, as discussed in the next section. As such, we have compared our s = 0 case to the s = 1 value of Hindmarsh et al. (2017a) for matter velocities; previous work, including that of Hindmarsh et al. (2017a) itself, suggests that this is not a significant issue. Plots of all the velocity diagnostics throughout the duration of the simulation, in both the radiation and matter eras (with and without core growth, where applicable), can be found in Figure 4. As expected, there are very large oscillations at early times (as the network relaxes from the choice of numerical initial conditions), but the approach to the constant-velocity scaling solution is clear at late times.

5. Conclusions and outlook

We have implemented field theory cosmic string evolution for the U(1) model using the Compute Unified Device Architecture, such that it uses Graphics Processing Units as accelerators. We summarised the main implementation steps in terms of the performance of each kernel and showcased the achievable performance.


[Figure 3: two panels showing the evolution of ξ_L (left) and ξ_W (right); legend slopes, left panel: 0.30 ± 0.02 (s=0, β=1/2), 0.32 ± 0.01 (s=1, β=1/2), 0.29 ± 0.01 (s=0, β=2/3); right panel: 0.32 ± 0.03 (s=0, β=1/2), 0.32 ± 0.03 (s=1, β=1/2), 0.29 ± 0.02 (s=0, β=2/3).]

Figure 3: The evolution of the mean string separation ξ_L (left panel) and the winding-based mean string separation ξ_W (right panel) for 512³ runs, in the radiation era (β = 1/2, blue lines without core growth and green lines with core growth) and in the matter era (β = 2/3, red lines without core growth). The values of the mean string separation slopes, inferred after the networks have reached scaling, are also given in the figure legends. These slopes are averages over the slopes of 5 different runs.

In addition, we compared the key physical diagnostic parameters for the mean string separation and the mean velocity squared to those previously reported in the literature, finding very good agreement and thus providing a preliminary validation of the code.

Compared to our previous GPGPU application (GPUwalls, see Correia and Martins, 2017), the main bottlenecks in the present one are the evolution kernels, since one can force the calculation of useful quantities to occur only every few timesteps and not at every single timestep. This means that, in contrast to the previous code, we completely evade being compute-bound. The walls code also does not use software pre-fetching as was used here. Implementing these strategies is a task left for subsequent work.

The main challenges regarding the scalability of this code lie not only in it being memory-bound but also in its memory requirements: given that two vector fields (float4's) and two complex scalar fields (float2's) are stored in 48 bytes per lattice site, the largest box one could possibly simulate with one GPU is, at the time of writing, 512³ (the largest GPU memory in a commercial GPU is around 16 GB). This brings us to our next step: to extend this simulation with multi-GPU support. In principle, given the large number of GPUs required, the most natural way to implement multi-node, multi-GPU support would be through the Message Passing Interface. Note that in the multi-GPU case the two main performance bottlenecks will be the presence of communications at every timestep (which can limit weak and strong scaling) or a small box size per GPU (where not enough threads are spawned to hide latency successfully, the expected main limiter of strong scaling). In the future we will report on the performance of such a version and show that near perfect weak scaling to thousands of GPUs can be achieved as long as one hides the communication cost properly, i.e. by overlapping compute with communication at every timestep. For now there is no easy solution to the second bottleneck and, as such, strong scaling is less than ideal.


[Figure 4: three panels showing the evolution of 〈v²〉_V, 〈v²〉_L and 〈v²〉_ω; legend values: 〈v²〉_V = 0.34 ± 0.01 (s=0, β=1/2), 0.34 ± 0.01 (s=1, β=1/2), 0.26 ± 0.01 (s=0, β=2/3); 〈v²〉_L = 0.31 ± 0.01, 0.31 ± 0.01, 0.27 ± 0.01; 〈v²〉_ω = 0.32 ± 0.01, 0.31 ± 0.01, 0.25 ± 0.01 (same ordering).]

Figure 4: The evolution of the mean square velocity, estimated in three different ways: by using the estimator of Eq. 9 weighted by the potential 〈v²〉_V (top left panel) or the Lagrangian 〈v²〉_L (top right panel), or by using the equation of state parameter 〈v²〉_ω (bottom panel). In all cases the results are from 512³ runs, in the radiation era (β = 1/2, blue lines without core growth and green lines with core growth) and in the matter era (β = 2/3, red lines without core growth). The asymptotic values of the velocities, inferred after the networks have reached scaling, are also depicted.


In any case, we currently do not foresee any impediment to performing 8192³ simulations in existing high-performance computing facilities.

Overall, we conclude that there is a tangible performance benefit to using GPUs in field theory defect simulations, enabling the possibility of running thousands or tens of thousands of high-resolution field theory simulations of Abelian-Higgs strings in acceptable amounts of time. This opens several interesting possibilities for the further exploration of the cosmological consequences of these networks. For the future, the continued increase in the bandwidth of the global memory of GPUs (NvidiaCorporation, a) is promising for our application, as it is memory-bound. Even in the case of existing GPUs, there are cards with higher memory bandwidth (such as the Tesla P100, with 732 GB/s).

In the short term, it will be possible to provide a more quantitative calibration of the velocity-dependent one-scale model (Martins and Shellard, 1996, 2002; Martins, 2016), as was recently done for domain walls (Martins et al., 2016a,b); early results can already be found in Correia and Martins (2019). The comparative analysis of the evolution of field theory cosmic string and domain wall networks is itself interesting, since one expects that different energy loss mechanisms (specifically loop or blob production and scalar radiation) play different roles. A calibration with maximum box sizes of 512³ in the relativistic regime has been performed by the authors (Correia and Martins, 2019).

We also note that a long-term open issue in the cosmic strings literature is understanding the different results obtained in Goto-Nambu simulations, for which there are several independent codes (Bennett and Bouchet, 1990; Allen and Shellard, 1990; Martins and Shellard, 2006; Olum and Vanchurin, 2007; Blanco-Pillado et al., 2011), and in field theory simulations, for which all recent results ultimately stem from a single code (Bevis et al., 2007; Hindmarsh et al., 2017a). To the extent that comparisons can already be made, our results are consistent with these. The availability of an improved (better calibrated) velocity-dependent one-scale model can also enable a more detailed comparison between the results of the two types of codes.

In the longer term, an optimised multi-GPU version can be used to yield thousands of accurate full-sky maps of cosmic microwave or gravitational wave backgrounds, which can be used in the data analysis of forthcoming experiments such as CORE or LISA. This will eliminate the current bottleneck in this analysis (so far one can only generate a few full-sky maps, or many maps of small sky patches), thus leading to more robust as well as more stringent constraints. In conclusion, we expect that GPU-based defect codes will in the medium term become the gold standard in the field.

Acknowledgements

This work was financed by FEDER (Fundo Europeu de Desenvolvimento Regional) funds through the COMPETE 2020 Operational Programme for Competitiveness and Internationalisation (POCI), and by Portuguese funds through FCT (Fundação para a Ciência e a Tecnologia) in the framework of the project POCI-01-0145-FEDER-028987. J.R.C. is supported by an FCT fellowship (grant reference SFRH/BD/130445/2017). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for this research.


References

Abbott, B.P., et al. (Virgo, LIGO Scientific), 2018. Constraints on cosmic strings using data from the first Advanced LIGO observing run. Phys. Rev. D97, 102002. doi:10.1103/PhysRevD.97.102002, arXiv:1712.01168.

Achucarro, A., Avgoustidis, A., Leite, A.M.M., Lopez-Eiguren, A., Martins, C.J.A.P., Nunes, A.S., Urrestilla, J., 2014. Evolution of semilocal string networks: Large-scale properties. Phys. Rev. D89, 063503. doi:10.1103/PhysRevD.89.063503, arXiv:1312.2123.

Ade, P.A.R., et al. (Planck), 2014. Planck 2013 results. XXV. Searches for cosmic strings and other topological defects. Astron. Astrophys. 571, A25. doi:10.1051/0004-6361/201321621, arXiv:1303.5085.

Allen, B., Shellard, E.P.S., 1990. Cosmic string evolution: A numerical simulation. Phys. Rev. Lett. 64, 119-122.

Bennett, D.P., Bouchet, F.R., 1990. High resolution simulations of cosmic string evolution. 1. Network evolution. Phys. Rev. D41, 2408.

Bevis, N., Hindmarsh, M., Kunz, M., Urrestilla, J., 2007. CMB power spectrum contribution from cosmic strings using field-evolution simulations of the Abelian Higgs model. Phys. Rev. D75, 065015. doi:10.1103/PhysRevD.75.065015, arXiv:astro-ph/0605018.

Bevis, N., Hindmarsh, M., Kunz, M., Urrestilla, J., 2010. CMB power spectra from cosmic strings: predictions for the Planck satellite and beyond. Phys. Rev. D82, 065004. doi:10.1103/PhysRevD.82.065004, arXiv:1005.2663.

Bevis, N., Saffin, P.M., 2008. Cosmic string Y-junctions: A comparison between field theoretic and Nambu-Goto dynamics. Phys. Rev. D78, 023503. doi:10.1103/PhysRevD.78.023503, arXiv:0804.0200.

Binetruy, P., Bohe, A., Caprini, C., Dufaux, J.F., 2012. Cosmological backgrounds of gravitational waves and eLISA/NGO: Phase transitions, cosmic strings and other sources. JCAP 1206, 027. doi:10.1088/1475-7516/2012/06/027, arXiv:1201.0983.

Blanco-Pillado, J.J., Olum, K.D., Shlaer, B., 2011. Large parallel cosmic string simulations: New results on loop production. Phys. Rev. D83, 083514. doi:10.1103/PhysRevD.83.083514, arXiv:1101.5173.

Briggs, J., Pennycook, S.J., Shellard, E.P.S., Martins, C.J.A.P., Woodacre, M., Feind, K., 2014. Unveiling the Early Universe: Optimizing Cosmology Workloads for Intel Xeon Phi Coprocessors in an SGI UV2000 System. Technical Report. SGI/Intel White Paper.

Correia, J.R.C.C.C., Martins, C.J.A.P., 2017. General purpose graphics-processing-unit implementation of cosmological domain wall network evolution. Phys. Rev. E 96, 043310. URL: https://link.aps.org/doi/10.1103/PhysRevE.96.043310, doi:10.1103/PhysRevE.96.043310.

Correia, J.R.C.C.C., Martins, C.J.A.P., 2019. Extending and calibrating the velocity dependent one-scale model for cosmic strings with one thousand field theory simulations. Phys. Rev. D100, 103517. doi:10.1103/PhysRevD.100.103517, arXiv:1911.03163.

Daverio, D., Hindmarsh, M., Kunz, M., Lizarraga, J., Urrestilla, J., 2016. Energy-momentum correlations for Abelian Higgs cosmic strings. Phys. Rev. D93, 085014. doi:10.1103/PhysRevD.93.085014, arXiv:1510.05006. [Erratum: Phys. Rev. D95, no.4, 049903 (2017), doi:10.1103/PhysRevD.95.049903].

Drew, A., Shellard, E.P.S., 2019. Radiation from global topological strings using adaptive mesh refinement: Methodology and massless modes. arXiv:1910.01718.

Finelli, F., et al. (CORE), 2018. Exploring cosmic origins with CORE: Inflation. JCAP 1804, 016. doi:10.1088/1475-7516/2018/04/016, arXiv:1612.08270.

Helfer, T., Aurrekoetxea, J.C., Lim, E.A., 2019. Cosmic string loop collapse in full general relativity. Phys. Rev. D99, 104028. doi:10.1103/PhysRevD.99.104028, arXiv:1808.06678.

Hindmarsh, M., Daverio, D. Private communication, 20 December 2019.

Hindmarsh, M., Kormu, A., Lopez-Eiguren, A., Weir, D.J., 2018. Scaling in necklaces of monopoles and semipoles. Phys. Rev. D98, 103533. doi:10.1103/PhysRevD.98.103533, arXiv:1809.03384.

Hindmarsh, M., Lizarraga, J., Lopez-Eiguren, A., Urrestilla, J., 2020. The scaling density of axion strings. Phys. Rev. Lett. 124, 021301. doi:10.1103/PhysRevLett.124.021301, arXiv:1908.03522.

Hindmarsh, M., Lizarraga, J., Urrestilla, J., Daverio, D., Kunz, M., 2017a. Scaling from gauge and scalar radiation in Abelian Higgs string networks. Phys. Rev. D96, 023525. doi:10.1103/PhysRevD.96.023525, arXiv:1703.06696.

Hindmarsh, M., Lizarraga, J., Urrestilla, J., Daverio, D., Kunz, M., 2019. Type I Abelian Higgs strings: evolution and Cosmic Microwave Background constraints. Phys. Rev. D99, 083522. doi:10.1103/PhysRevD.99.083522, arXiv:1812.08649.

Hindmarsh, M., Rummukainen, K., Weir, D.J., 2017b. Numerical simulations of necklaces in SU(2) gauge-Higgs field theory. Phys. Rev. D95, 063520. doi:10.1103/PhysRevD.95.063520, arXiv:1611.08456.

Kajantie, K., Karjalainen, M., Laine, M., Peisa, J., Rajantie, A., 1998. Thermodynamics of gauge invariant U(1) vortices from lattice Monte Carlo simulations. Phys. Lett. B428, 334-341. doi:10.1016/S0370-2693(98)00440-7, arXiv:hep-ph/9803367.

Kibble, T.W.B., 1976. Topology of Cosmic Domains and Strings. J. Phys. A9, 1387-1398. doi:10.1088/0305-4470/9/8/029.

Lizarraga, J., Urrestilla, J., 2016. Survival of pq-superstrings in field theory simulations. JCAP 1604, 053. doi:10.1088/1475-7516/2016/04/053, arXiv:1602.08014.

Lopez-Eiguren, A., Urrestilla, J., Achucarro, A., 2017. Measuring Global Monopole Velocities, one by one. JCAP 1701, 020. doi:10.1088/1475-7516/2017/01/020, arXiv:1611.09628.

Martins, C.J.A.P., 2016. Defect Evolution in Cosmology and Condensed Matter: Quantitative Analysis with the Velocity-Dependent One-Scale Model. Springer.

Martins, C.J.A.P., Rybak, I.Y., Avgoustidis, A., Shellard, E.P.S., 2016a. Extending the velocity-dependent one-scale model for domain walls. Phys. Rev. D93, 043534. doi:10.1103/PhysRevD.93.043534, arXiv:1602.01322.

Martins, C.J.A.P., Rybak, I.Y., Avgoustidis, A., Shellard, E.P.S., 2016b. Stretching and Kibble scaling regimes for Hubble-damped defect networks. Phys. Rev. D94, 116017. doi:10.1103/PhysRevD.94.116017, arXiv:1612.08863. [Erratum: Phys. Rev. D95, no.3, 039902 (2017), doi:10.1103/PhysRevD.95.039902].

Martins, C.J.A.P., Shellard, E.P.S., 1996. Quantitative string evolution. Phys. Rev. D54, 2535-2556. arXiv:hep-ph/9602271.

Martins, C.J.A.P., Shellard, E.P.S., 2002. Extending the velocity-dependent one-scale string evolution model. Phys. Rev. D65, 043514. arXiv:hep-ph/0003298.

Martins, C.J.A.P., Shellard, E.P.S., 2006. Fractal properties and small-scale structure of cosmic string networks. Phys. Rev. D73, 043515. arXiv:astro-ph/0511792.

McGraw, P., 1998. Evolution of a nonAbelian cosmic string network. Phys. Rev. D57, 3317-3339. doi:10.1103/PhysRevD.57.3317, arXiv:astro-ph/9706182.

Micikevicius, P., 2009. 3D finite difference computation on GPUs using CUDA, in: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, ACM, New York, NY, USA. pp. 79-84. URL: http://doi.acm.org/10.1145/1513895.1513905, doi:10.1145/1513895.1513905.

Nguyen, A., Satish, N., Chhugani, J., Kim, C., Dubey, P., 2010. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs, in: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. doi:10.1109/SC.2010.2.

NvidiaCorporation, a. CUDA programming guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.

NvidiaCorporation, b. cuRAND. https://developer.nvidia.com/curand.

NvidiaResearch-NVLabs, 2018. CUB - CUDA Unbound v1.8.0. https://nvlabs.github.io/cub/.

Olum, K.D., Vanchurin, V., 2007. Cosmic string loops in the expanding universe. Phys. Rev. D75, 063521. arXiv:astro-ph/0610419.

Phillips, E.H., Fatica, M., 2010. Implementing the Himeno benchmark with CUDA on GPU clusters, in: 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), pp. 1-10. doi:10.1109/IPDPS.2010.5470394.

Press, W.H., Ryden, B.S., Spergel, D.N., 1989. Dynamical Evolution of Domain Walls in an Expanding Universe. Astrophys. J. 347, 590-604. doi:10.1086/168151.

Scherrer, R.J., Vilenkin, A., 1998. 'Lattice-free' simulations of topological defect formation. Phys. Rev. D58, 103501. doi:10.1103/PhysRevD.58.103501, arXiv:hep-ph/9709498.

Wilson, K.G., 1974. Confinement of Quarks. Phys. Rev. D10, 2445-2459. doi:10.1103/PhysRevD.10.2445.

Xmartlabs, 2012. CUDA occupancy calculator. https://github.com/xmartlabs/cuda-calculator.

Zhang, Y., Mueller, F., 2012. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters, in: Proceedings of the Tenth International Symposium on Code Generation and Optimization, ACM, New York, NY, USA. pp. 155-164. URL: http://doi.acm.org/10.1145/2259016.2259037, doi:10.1145/2259016.2259037.
