


Monte Carlo Integration

Two Major Problems in Monte Carlo:

The Two Major Problems in Monte Carlo concern the generation of the "random numbers" that aren't random:

1. Why do pseudorandom numbers appear to be random, and if we understood this, would we be able to know when a given generator would or would not have detectable defects?

2. Is it possible to generate other sequences (quasirandom) that are better than truly random sequences for a well-defined problem like multidimensional integration?

F. James (CERN) Statistics for Physicists, chap. 9 March-June 2007 1 / 37


Monte Carlo Integration Pseudorandom Number Generation

Two gentlemen from Armenia

I received one day a visit from two Armenian physicists who tried to show me that the right way to generate independent sequences was to make use of Kolmogorov mixing.

Representing the state of the generator at the i-th step by the contents of the array X_i, their generator proceeds to the next step by performing the linear transformation

X_{i+1} = A X_i

and they said that to obtain Kolmogorov mixing, the matrix A had to have

- Determinant = 1

- All eigenvalues complex and distinct

- No eigenvalue of modulus 1

I didn't understand any of this, so I thought I would first try to find out if the generator seemed to work.
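These three conditions are at least easy to check numerically for any candidate matrix. A sketch using NumPy; the 4×4 matrix below is an illustrative toy (two scaled rotation blocks with reciprocal radii), not the visitors' actual generator matrix:

```python
import numpy as np

def check_mixing_conditions(A, tol=1e-9):
    """Check the three conditions quoted for Kolmogorov mixing:
    det(A) = 1, all eigenvalues complex and distinct, and no
    eigenvalue of modulus 1."""
    eig = np.linalg.eigvals(A)
    det_is_one = abs(np.linalg.det(A) - 1.0) < tol
    all_complex = bool(np.all(np.abs(eig.imag) > tol))
    all_distinct = all(abs(a - b) > tol
                       for i, a in enumerate(eig) for b in eig[i + 1:])
    none_unit_modulus = bool(np.all(np.abs(np.abs(eig) - 1.0) > tol))
    return det_is_one and all_complex and all_distinct and none_unit_modulus

def scaled_rotation(r, theta):
    """2x2 block with eigenvalues r * exp(+/- i*theta)."""
    c, s = r * np.cos(theta), r * np.sin(theta)
    return np.array([[c, s], [-s, c]])

# Toy example: radii 2 and 1/2, so det = (2^2) * (1/2)^2 = 1,
# eigenvalues 2 e^{+/-0.7i} and 0.5 e^{+/-1.3i}.
A = np.zeros((4, 4))
A[:2, :2] = scaled_rotation(2.0, 0.7)
A[2:, 2:] = scaled_rotation(0.5, 1.3)
print(check_mixing_conditions(A))          # True
print(check_mixing_conditions(np.eye(4)))  # False: real eigenvalues, modulus 1
```

Note that for a real 2×2 matrix a complex eigenvalue pair with determinant 1 necessarily lies on the unit circle, which is why the toy example needs at least four dimensions.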


MIXMAX

So I asked some practical questions.

1. How big should the array be? They didn't think it mattered very much. Maybe 10 or 20.

2. How long was the period? No idea.

3. How fast was it? That wasn’t important.

We did some tests and it was very slow (matrix multiplication is notoriously slow, and this generator was hundreds of times slower than any I had seen).

We gave their generator the name MIXMAX, and I told them they should find a way to make it faster if they wanted anyone to use it.


The Crisis in Ising Model Simulations

Meanwhile a crisis was developing in Condensed Matter Physics, where all the "best" generators were giving wrong results. The failures were being revealed by a new algorithm invented by Ulli Wolff to overcome the critical slowing down that had made it nearly impossible to simulate across phase transitions.

Ulli told me there was a theoretician at DESY, Martin Luscher, who had found a way to make a better generator, but it was coded only for the special-purpose computer APE designed for Lattice Gauge calculations.

So I coded Luscher's algorithm in Fortran (that was 1994) and called it RANLUX. He has since coded it also in C, in both single- and double-precision.

Here is how it works.


Luscher’s algorithm

To my great surprise, he had decided to base his new generator on the RCARRY algorithm. This was surprising because RCARRY was already known to be defective, but Martin knew exactly what he was doing.

If you consider the RCARRY algorithm as a sequence of points in 24-dimensional space, then it is of the form

X_{i+1} = A X_i

where X_i is the array of 24 24-bit numbers that determines (along with the carry bit) the state of the generator.

The algorithm is then defined by the matrix A, which you can readily see is a matrix with mostly zeroes, and ones along the diagonal, with −1 and +1 at the positions corresponding to the values of r and s. The determinant is 1 and none of its eigenvalues are of modulus 1.
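Written out, RCARRY is Marsaglia's subtract-with-borrow recurrence with base b = 2^24 and lags r = 24, s = 10. A minimal sketch of the mechanics (the seeding below is an arbitrary placeholder, purely for illustration; a real implementation seeds much more carefully):

```python
B = 1 << 24    # base: each state word is a 24-bit integer
R, S = 24, 10  # lags of the subtract-with-borrow recurrence

def rcarry(state, n):
    """Generate n uniform deviates in [0,1) by the subtract-with-borrow
    step x_i = (x_{i-S} - x_{i-R} - carry) mod B.
    `state` is a list of the last R values."""
    x = list(state)
    carry = 0
    out = []
    for _ in range(n):
        delta = x[-S] - x[-R] - carry
        if delta < 0:        # borrow: wrap around and set the carry bit
            delta += B
            carry = 1
        else:
            carry = 0
        x.append(delta)
        x.pop(0)             # keep only the last R values
        out.append(delta / B)
    return out

# Arbitrary toy seed, just to demonstrate the step:
state0 = [(69069 * k + 12345) % B for k in range(1, R + 1)]
u = rcarry(state0, 100)
```

The 24 state words plus the carry bit are exactly the state X_i of the linear map above.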


Luscher's algorithm

He had seen that the RCARRY algorithm was in fact the discrete analog of a continuous classical dynamical system with Kolmogorov mixing. And he knew what that meant:

1. Zero-mixing is the same as ergodic motion. It means that the system will asymptotically sweep out all the available states.

2. One-mixing means that the coverage will be uniform, in the sense that asymptotically, the probability of finding the system in any given volume of state space is proportional to the volume.

3. Two-mixing means that the probability of the system being in one volume of state space at one time, and another volume at another time, is proportional to the product of the two volumes.

4. N-mixing means that the probability of finding the system in N different volumes at N different times is proportional to the product of the N volumes.

5. k-mixing is N-mixing for arbitrarily large N.


Luscher’s algorithm: Mixing

Systems with the highest level of mixing are called k-systems. The lowest level contains ergodic systems, which sweep out all possible states, but not necessarily in a totally chaotic manner.

Kolmogorov mixing is exactly what we mean by independence of states.

Furthermore, the hierarchy of mixing is real: it is possible to find systems which exhibit N-mixing but not (N+1)-mixing. Such systems must have mixing for all n < N.

Luscher had seen that the RCARRY algorithm had the required properties to be a perfect random number generator. But unlike Akopov and Saviddi, who started from theory looking for a good practical generator, Luscher started from a generator with good practical properties (speed and long period) and found in it the theoretical properties he was looking for.


Luscher's algorithm: The Lyapunov Exponent

Now the big question: if RCARRY has the right mixing properties, why does it fail the tests of randomness?

Answer: because Kolmogorov mixing is only an asymptotic property of classical dynamical systems. Such systems forget their past history, but only after a certain time.

K-systems diverge exponentially, in the following sense: if a k-system is initiated from two different but arbitrarily close initial states, the two trajectories will diverge exponentially. The exponent describing the speed of divergence is called the Lyapunov exponent.

For the matrix A, there are 4 eigenvalues with maximum absolute value |λ|max = 1.04299..., and the Lyapunov exponent per sweep of 24 numbers is 24 ln |λ|max = 1.01027...

For this system, that means that full mixing (where the deviation between the trajectories exceeds 1/2) is attained only after 16 applications of A.


Luscher’s algorithm: Decimation

This figure shows the "experimental" divergence of nearby trajectories as a function of the number of time steps:

[Figure: average spacing between trajectories (log scale, 10^-6 to 10^-1) vs. time steps (2 to 20)]

After generating 24 numbers, you must reject 15 × 24 numbers before using the next 24, in order to attain complete mixing, or independence.
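The bookkeeping of this decimation is just a thin wrapper around the raw stream: deliver 24 numbers, then discard the next 15 × 24 = 360, and repeat. A sketch, with an integer counter standing in for the raw generator so the skipping is visible:

```python
import itertools

def decimated(raw, batch=24, skip=15 * 24):
    """Wrap an iterator of raw numbers: deliver `batch` values,
    then throw away `skip` values, and repeat."""
    it = iter(raw)
    while True:
        for _ in range(batch):
            yield next(it)
        for _ in range(skip):
            next(it)   # advance the state without using the value

# With skip = 360, outputs 0..23 come from raw positions 0..23,
# outputs 24..47 from raw positions 384..407, and so on.
raw = itertools.count()   # stand-in for a raw generator stream
first48 = list(itertools.islice(decimated(raw), 48))
print(first48[24])        # 384
```

The luxury levels described on the next slide correspond simply to different values of `skip`.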


RANLUX: the Luxury Level

When I coded Luscher's algorithm in Fortran, I called it RANLUX because it offered several different luxury levels:

1. Level Zero, with no decimation, corresponds to RCARRY: fast, and good enough for many applications, but known to have easily detectable defects.

2. Level One, with half the numbers rejected, is almost as fast as RCARRY, but with considerably better randomness.

3. Level Two, with acceptance 1/4, is about half the speed of RCARRY, and much more random, but still has detectable defects.

4. Level Three, with acceptance 1/8, has theoretically detectable defects, but none has yet been detected.

5. The Highest Level, corresponding to rejecting 15 steps, should have no defects detectable by any possible tests. Unfortunately it is about 8 times slower than RCARRY.

The idea was that if you can afford the luxury of a provably good PRNG, you can use RANLUX at the highest luxury level. Otherwise you should use the highest luxury level you can afford.


Continuous and discrete systems

The major theoretical problem with Luscher's algorithm is that the Kolmogorov k-systems are continuous dynamical systems, but we are applying the theory to discrete systems.

An important difference between discrete and continuous systems is that a continuous system can follow an infinitely long trajectory in state space without ever encountering a previously occupied state.

A discrete system, on the other hand, has only a finite number of possible states, so it eats up its state space as it moves through it, in the sense that all those states it has already visited are removed from the space of states it has available for future visits. Otherwise it gets into a loop.

However it is easily seen that, since the period of RANLUX is ≈ 10^171, even if it were used to generate all the random numbers that have ever been generated plus all those that will ever be generated on the earth, it would not eat up more than 10^-130 of its total state space.


RANLUX: The spectral test

There are many advantages to working with a well-known PRNG like RCARRY. For example, it was already known that RCARRY was closely related to a linear congruential generator with an immense but known multiplier. Thus it was possible to apply the spectral test, a full-period test that essentially measures the distance between the hyperplanes discovered by George Marsaglia many years earlier.

It turns out, as we should expect, that RCARRY fails the spectral test rather badly, but RANLUX at higher luxury levels passes.


RANLUX in double precision

Since we often need more than 24 bits of precision in PRNGs, Luscher of course set about making a double-precision version of RANLUX. Unfortunately it was not possible to find an algorithm using mantissas in the range 48-53 bits for which a single trajectory would sweep out a large portion of the total state space.

Therefore he had to settle for a program based on the original algorithm, joining two 24-bit mantissas to form a nearly-double-precision version. It seems that Marsaglia, when he proposed RCARRY, had indeed found a very rare combination of "magic numbers".

Both the 24-bit and 48-bit versions of RANLUX now exist in the C language, distributed in a convenient zipped file that also contains all the relevant documentation. These new versions as distributed by Martin Luscher redefine the luxury levels, so that the lowest two levels of the original Fortran version are no longer proposed.


The speed of RANLUX

Many people consider RANLUX to be too slow. These people don't seem to be worried about getting the right answer to their calculation. Marsaglia himself once stated (in a published paper):

Random numbers are like sex: Even if they are not very good, they are still good.

I don't understand this attitude, since I can think of nothing worse than spending weeks or months on a calculation and getting the wrong answer because of a defective PRNG, and not even discovering it is wrong until much later, if at all. So I published my viewpoint on the matter:

Random numbers are more like wine: It is better to serve one that is too good for a given occasion, than one that is not good enough.

Even if some consider RANLUX too slow, most physicists are happy that they finally have access to a provably good PRNG. Especially now that the Mersenne Twister is known to fail some tests.


Beyond RANLUX

Several questions remain:

1. Is it possible to map some other known generator onto a dynamical system in such a way as to show it to be a k-system? So far, no such system has been identified. In particular, the famous Mersenne Twister does not obviously map onto a k-system, and even if it did, it would not be a good candidate for chaos because it fills only a tiny part of its state space.

2. Is it possible that a quite different theory could explain why some other generator is random? Yes; there is nothing in the theory of dynamical systems that makes it the only way to understand deterministic randomness.

3. Should it not be possible to find another k-system with a larger Lyapunov exponent, requiring less decimation? That is certainly a possible line for future research, but I guess it will not be easy.


Monte Carlo Integration Quasi-Monte Carlo

Quasi-Monte Carlo

If we perform numerical multidimensional integration as in the method of Monte Carlo, but:

- instead of choosing the points randomly (or pseudorandomly),

- we choose the points according to some algorithm designed to give the smallest possible uncertainty in the integral,

then we are doing quasi-Monte Carlo.

The integration points are then called quasirandom points.

Note that quasirandom points are not intended to be random, or even to appear random. They should be better than random points. In particular, they should be more uniformly distributed than truly random points.

Remember that uniformly distributed random points are uniform only in probability. The actual number of points in any given subvolume fluctuates according to the Poisson distribution.


Looking at Random Point Sets

[Figure: 1000 random points generated with RANLUX, plotted on the unit square]


Looking at Quasirandom Point Sets

[Figure: 1000 quasirandom points generated with van der Corput(1,2), plotted on the unit square]


The Koksma-Hlawka Inequality

Quasi-Monte Carlo (QMC) has been studied for a long time, and there are many elegant theorems and other theoretical results which are believed to indicate how QMC integrals converge toward the true value. Probably the most important of these is the Koksma-Hlawka Inequality, which gives an upper bound on the error in QMC integration over the hypercube, in terms of the properties of the function f and the point set X:

QMC:   | (1/N) Σ_{i=1..N} f(X_i) − ∫_{(0,1)^n} f(u) du | < V_n(f) × D*_∞(X)

where V_n(f) is the variation of the function f [to be defined below], and D*_∞(X) is the discrepancy of the point set X [to be defined below].

Notice the difference compared with ordinary Monte Carlo:

MC:   P( | (1/N) Σ_{i=1..N} f(r_i) − ∫_{(0,1)^n} f(u) du | < √(V(f) × 1/N) ) = 0.68
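In the MC statement, V(f) is the variance of f, so the error shrinks like 1/√N. That scaling is easy to verify empirically; in this sketch the function, seed, and sample sizes are arbitrary choices, estimating ∫₀¹ u² du = 1/3:

```python
import random

rng = random.Random(12345)
f = lambda u: u * u          # exact integral over (0, 1) is 1/3

def mc_integral(n):
    """Plain Monte Carlo estimate of the integral with n random points."""
    return sum(f(rng.random()) for _ in range(n)) / n

def mean_abs_error(n, runs=20):
    """Average |estimate - truth| over several independent runs."""
    return sum(abs(mc_integral(n) - 1 / 3) for _ in range(runs)) / runs

e_small = mean_abs_error(100)
e_large = mean_abs_error(10000)
# Increasing N by a factor 100 should shrink the error by roughly
# sqrt(100) = 10, up to statistical fluctuations.
print(e_small > e_large)     # True
```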


The Variation of the Function f

In one dimension, we imagine the domain of integration partitioned into m intervals. Then let D_m be the sum of the absolute differences of the function values at neighbouring points:

D_m = Σ_{i=1..m} |f(x_i) − f(x_{i−1})|

Then the variation of f is the supremum of D_m over all possible ways of partitioning into m regions, and over all possible values of m.

In one dimension, the variation can be calculated rather easily (but not by using the above definition). However, in the multidimensional extension (which is the only case of interest!) the simple difference |f(x_i) − f(x_{i−1})| becomes |f(x_i) − f(x_{i−1}) + f(x_{j−1}) − f(x_j)|, and the different possible ways of partitioning the space become very non-trivial, so the calculation of V_n(f) becomes far more difficult than the calculation of its integral.


The Variation of f in many dimensions

The mathematicians continue to produce theorems about QMC integration using the concept of variation, even though it is essentially incalculable in many dimensions, the only case of interest.

In fact, the mathematicians are really interested only in whether the variation is finite, since this allows them to produce theorems about the asymptotic convergence of QMC.

Unfortunately, for the functions of interest to physicists for simulation, the variation is not even finite, since any function with a discontinuity in more than one dimension will typically have an infinite variation, even though we know we can integrate it with either MC or QMC.

It is not surprising that physicists interested in QMC stopped worrying about the variation, and concentrated instead on the discrepancy.


The Local Discrepancy of a Point Set

Consider a set of N points X in a d-dimensional hypercube. At any position y in the hypercube, we define the counting function K(X; y) as the number of points in X that lie in the hyperrectangle (0, y), that is, the number of points in X all of whose coordinates are less than the corresponding coordinate of the position y.

Then the local discrepancy of X at the position y is

g(X; y) = | K(X; y)/N − Vol(0, y) | = (1/N) | K(X; y) − N ∏_µ y_µ |

Vol(0, y) is the volume of the hyperrectangle with corners at 0 and y. The local discrepancy of the point set is thus seen to be the (normalized) difference between the number of points in a subvolume and the number expected for a uniform density of points.


Global Measures of Discrepancy.

Having defined the local discrepancy of a point set at any position in the hypercube, there are several ways to arrive at a global measure of discrepancy.

The most commonly used are

- The quadratic discrepancy:  D2 = ∫_Ω g²(X; y) dy

- The extreme discrepancy:    D∞ = sup_Ω g(X; y)

where Ω denotes the full hypercube. D∞ has been most widely studied, but D2 is both easier to evaluate and apparently more relevant for our purposes.

Now we will look at the discrepancies of some quasirandom pointgenerators.


Richtmyer sequences

Probably the first quasi-random point generator was that attributed to Richtmyer, defined by the formula for the µ-th coordinate of the k-th point:

x_k^µ = (k S_µ) mod 1,

where the S_µ are constants which should be irrational numbers in order for the period to be infinite. Since neither irrational numbers nor infinite periods exist on real computers, the original suggestion was to take S_µ equal to the square root of the µ-th prime number, and this is the usual choice for these constants.

The Richtmyer generator is the simplest and least-studied QMC generator, but according to our studies, it may be the best.
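A sketch of the generator with the usual choice S_µ = √p_µ (p_µ the µ-th prime); in a serious implementation one would also worry about floating-point round-off of k·S_µ for large k:

```python
import math

def primes(d):
    """First d primes (simple trial division; fine for small d)."""
    ps = []
    n = 2
    while len(ps) < d:
        if all(n % p for p in ps):
            ps.append(n)
        n += 1
    return ps

def richtmyer(k, d):
    """k-th point of the d-dimensional Richtmyer sequence:
    coordinate mu is (k * sqrt(p_mu)) mod 1."""
    return tuple((k * math.sqrt(p)) % 1.0 for p in primes(d))

x = richtmyer(3, 2)   # (3*sqrt(2) mod 1, 3*sqrt(3) mod 1)
```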


Van der Corput (or Halton) sequences

An interesting low-discrepancy sequence can be found as follows. Let us start, in one dimension, by choosing a base, an integer b. Any integer n can then be written in base b:

n = n0 + n1 b + n2 b² + n3 b³ + · · ·

The radical-inverse transform (to base b) is defined by

φb(n) = n0 b⁻¹ + n1 b⁻² + n2 b⁻³ + n3 b⁻⁴ + · · ·

The van der Corput sequence to base b is then simply the following:

xk = φb(k),  k = 1, 2, . . .

Note that, as k increases, the digit n0 changes most rapidly, the digit n1 the next most rapidly, and so on: so that the leading digit in xk changes the most rapidly, the next-to-leading one less rapidly, and so on. It is therefore clear that this sequence manages to fill the unit interval (0, 1) quite efficiently.
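The radical-inverse transform is just digit reversal about the radix point, which can be sketched in a few lines:

```python
def radical_inverse(n, b):
    """phi_b(n): reflect the base-b digits of n about the radix point."""
    phi, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, b)   # peel off the least significant digit
        denom *= b
        phi += digit / denom      # ... and place it after the radix point
    return phi

def van_der_corput(n_points, b=2):
    return [radical_inverse(k, b) for k in range(1, n_points + 1)]

print(van_der_corput(7))
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```

Each successive point lands in the largest remaining gap, which is exactly the efficient filling of (0, 1) described above.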


Van der Corput Point Sets

[Figure: 1000 quasirandom points generated with van der Corput(1,2), plotted on the unit square]


Further Van der Corput Point Sets

[Figure: 1000 quasirandom points generated with van der Corput(7,8), plotted on the unit square]


The Discrepancy of Van der Corput Point Sets

In one dimension, it can be shown that the N-dependence of the extreme discrepancy of the Corput sequences is

D∞ ≤ C_b (log N) / N

For the d-dimensional case, this generalizes to

D∞ ≤ d/N + ((log N)^d / N) ∏_µ [ (b_µ − 1)/(2 log b_µ) + (b_µ + 1)/(2 log N) ]

As a function of d, therefore, the constant factor grows faster than an exponential of d, and in high dimensions the superiority of the van der Corput sequence over a purely random one should appear only for impractically large N.


Sobol and Niederreiter Point Sets

More sophisticated and complicated quasirandom sequences have been proposed by several other researchers. Two of these have been studied in detail and promise discrepancies whose N-dependence is similar to that of van der Corput, but as a function of dimension d they should improve exponentially instead of getting worse.

Since the only theoretical results are asymptotic in N, and appear to hold only for astronomically large values of N, we decided to calculate as many discrepancies as we could.

In all the plots I show here, the discrepancy is D2/N, so that it can easily be compared with the discrepancy of a truly random sequence (for which D2/N would be constant in both N and d).


D2/N for Pseudorandom and Quasirandom Point Sets

[Figure: D2/N in 8 dimensions vs. N up to 100000, for Random, Richtmyer, van der Corput, Niederreiter, and Sobol point sets]


D2/N for Sobol Quasirandom Point Sets

[Figure: D2/N for Sobol point sets in 9-17 dimensions (s = 9, 10, 11, 12, 13, 15, 17), vs. N up to 100000]


D2/N for Richtmyer Quasirandom Point Sets

[Figure: D2/N for Richtmyer point sets in up to 20 dimensions, for N = 50000, 100000, 150000]


D2/N for van der Corput Quasirandom Point Sets

[Figure: D2/N for van der Corput point sets in up to 20 dimensions, for N = 50000, 100000, 150000]


D2/N for Sobol Quasirandom Point Sets

[Figure: D2/N for Sobol point sets in up to 20 dimensions, for N = 50000, 100000, 150000]


D2/N for Niederreiter Quasirandom Point Sets

[Figure: D2/N for Niederreiter point sets in up to 20 dimensions, for N = 50000, 100000, 150000]


Discrepancy and Goodness-of-fit Testing

If discrepancy is not the right measure of non-uniformity of a point set, what is the right measure? Clearly, what we want is exactly the P-value of the goodness-of-fit of the point set to a uniform density. In one dimension, the discrepancy is simply related to the standard GoF test statistics based on the empirical distribution function. If f(x) is the uniform density, and F(x) its integral, then

- The Smirnov-Cramer-von Mises test statistic W² is just the quadratic discrepancy:

W² = ∫_{−∞}^{∞} [S_N(X) − F(X)]² f(X) dX

- The Kolmogorov test statistic D_N is just the extreme discrepancy:

D_N = max_X |S_N(X) − F(X)|

but we know these tests cannot be extended to many dimensions, because points can be ordered only in one dimension.
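In one dimension both statistics are straightforward to evaluate against the uniform case F(x) = x; this sketch does it by brute force on a grid (chosen for clarity rather than speed; closed-form expressions exist), with a deliberately regular sample so the answers are easy to check by hand:

```python
import bisect

def edf_statistics(sample, grid=100000):
    """Kolmogorov D_N and Cramer-von Mises W^2 of a sample against
    Uniform(0,1), by brute-force evaluation on a fine grid."""
    xs = sorted(sample)
    n = len(xs)
    h = 1.0 / grid
    d_n, w2 = 0.0, 0.0
    for j in range(grid + 1):
        x = j * h
        s = bisect.bisect_right(xs, x) / n   # empirical distribution S_N(x)
        dev = abs(s - x)                     # F(x) = x, f(x) = 1 on (0, 1)
        d_n = max(d_n, dev)
        w2 += dev * dev * h                  # Riemann sum for the integral
    return d_n, w2

# A maximally uniform sample of n = 4 points (the midpoints (2i-1)/2n);
# for this sample D_N = 1/(2n) = 0.125 and W^2 = (1/8)^2 / 3.
d_n, w2 = edf_statistics([0.125, 0.375, 0.625, 0.875])
```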


Suspicion about the whole Quasirandom Theory

About this time, physicists like Christoph Schlier (Freiburg) began to suspect the whole theory was irrelevant. The observed error in QMC integration (when it could be measured) seemed to have nothing to do with either the variation of the function or the discrepancy of the point sets.

In dimensions 15 to 30, the empirically observed error was much smaller than that predicted by the theory. It seemed to depend on the variance of the function rather than the variation.

The situation now is:

- There is new hope, since the pessimistic theoretical results seem to be irrelevant.

- There is no more theory, so people like Ronald Kleiss (Nijmegen) are looking for a new one.
