
1

Random Number Generation and Testing

W. W. Tsang (曾衛寰)

Department of Computer Science, The University of Hong Kong

http://www.cs.hku.hk/~tsang/RNGT.ppt

2

Henan and Hong Kong

3

Hong Kong

4

Hong Kong

5

The University of Hong Kong

6

Sun Yat Sen (孫中山)

The University of Hong Kong

7

Mathematics

Computer Science

Statistical Computing, Computational Statistics

Random Number Generation and Testing (RNGT) is an interdisciplinary area.

8

Overview

1. Random numbers and their applications
2. Early random number generators (RNGs) in computers
3. Criteria of good RNGs
4. Good RNGs
5. Goodness-of-fit tests
6. Statistical tests for RNGs
7. Conversions of uniform random integers to variates of other distributions

9

1. Random numbers and their applications

- Entertainment: gambling; lottery, lucky draw; games
- Cryptography: key generation
- Computer simulation
- Software testing: generating testing data
- Randomized algorithms: avoiding worst cases

10

2. Early RNGs in computers

- Reading a large file of random numbers: deterministic. A 10-billion-bit file is available at Diehard Battery of Tests of Randomness v0.2 beta, http://www.csis.hku.hk/~diehard/
- Reading the last few bits of a fast-ticking clock: unpredictable.

11

2. Early RNGs in computers

Mid-square method, 1940s

- Suggested by John von Neumann (1903-1957) in the development of the first atomic bomb
- X_{n+1} = middle_digits(X_n × X_n)
- Example: X = 45086273, X × X = 2032772013030529, new X = 77201303
- Deterministic
- The period depends on the seed and is hard to determine
- Obsolete!
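The mid-square step can be sketched in a few lines. This is a minimal illustration that assumes an 8-digit state, as in the slide's example; the function name `middle_square` and the `digits` parameter are hypothetical, not taken from the slides.

```python
def middle_square(x: int, digits: int = 8) -> int:
    """One step of the mid-square method for a `digits`-digit state."""
    # Pad the square to 2*digits digits so the "middle" is well defined.
    square = str(x * x).zfill(2 * digits)
    start = digits // 2                      # drop digits/2 digits on each side
    return int(square[start:start + digits])

print(middle_square(45086273))  # the slide's example: 77201303
```

Iterating the function from different seeds shows why the method is obsolete: many seeds quickly fall into short cycles or reach 0.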

12

2. Early RNGs in computers

Congruential generator, 1951 (the simplest and most commonly used)

- Suggested by Lehmer
- X_{n+1} = (a X_n + c) mod m
- Example: X = 45086273; (X × 7654321 + 1) mod 10^8 = 345104806235634 mod 10^8; new X = 06235634
- Simple, fastest
- For 32-bit words, the period can reach 2^32
- Insecure: the formula can be worked out from the output
- Fails in many tests
- Sufficiently random for many applications
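A congruential generator is short enough to write out in full. The sketch below reproduces the slide's worked example; the generator name `lcg` and the small parameters a = 5, c = 3, m = 16 in the second part are illustrative assumptions, chosen only because they give a full period that is easy to check.

```python
from itertools import islice

def lcg(seed, a, c, m):
    """Yield X_{n+1} = (a * X_n + c) mod m, starting from X_0 = seed."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

# The slide's example: a = 7654321, c = 1, m = 10^8.
first = next(lcg(45086273, 7654321, 1, 10**8))
print(f"{first:08d}")  # 06235634

# A toy full-period generator: all 16 residues appear before it repeats.
print(sorted(islice(lcg(0, 5, 3, 16), 16)) == list(range(16)))  # True
```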

13

2. Early RNGs in computers

3D points generated using a congruential RNG

[Figure: points from a congruential RNG fall on planes (patterned, not random enough), shown next to ideal random points]

14

2. Early RNGs in computers

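Lagged Fibonacci generator, 1958

- Suggested by Mitchell and Moore
- X_n = (X_{n-24} + X_{n-55}) mod 2^32, n ≥ 55
- The period is 2^31 (2^55 - 1): a long period!
- Fails in the birthday spacing test
- General form: X_n = (X_{n-p} ∘ X_{n-q}) mod m, where the operation ∘ can be, e.g., addition or exclusive-or

[Figure: register of 55 words (positions 0, 1, ..., 31, ..., 55) with + feedback]

Knuth, The Art of Computer Programming, vol 2, 1998.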
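The additive recurrence with lags 24 and 55 can be sketched with a fixed-size queue. The function name and the seeding convention (55 caller-supplied words) are assumptions made for illustration; Knuth's text should be consulted for seeding requirements, for instance that the seed words must not all be even.

```python
from collections import deque

def lagged_fibonacci(seed_words, modulus=2**32):
    """Yield X_n = (X_{n-24} + X_{n-55}) mod modulus from 55 seed words."""
    if len(seed_words) != 55:
        raise ValueError("need exactly 55 seed words")
    # state[0] holds X_{n-55} and state[-1] holds X_{n-1}.
    state = deque(seed_words, maxlen=55)
    while True:
        x = (state[-24] + state[0]) % modulus
        state.append(x)          # the oldest word is dropped automatically
        yield x

gen = lagged_fibonacci(list(range(1, 56)))   # toy seed X_0..X_54 = 1..55
print(next(gen), next(gen))                  # X_55 = 32 + 1, X_56 = 33 + 2
```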

15

3. Criteria of good RNGs

- Fast, especially in simulation
- Well distributed: pass all known statistical tests
- Independent
- Portable and reproducible across different computers (for verifying simulation results)
- Long periods (for deterministic RNGs)
- Unpredictable and irreproducible (for cryptography)
- Security (for cryptography)
- Large seed spaces (for deterministic RNGs): there must be enough seeds to choose from

16

4. Good RNGs

Mersenne Twister, 1998

- Makoto Matsumoto & Takuji Nishimura
- Recurrence: x_{k+624} = x_{k+397} ⊕ ((x_k^u | x_{k+1}^l) A)
- Output: x_{k+624} T
- Period: 2^19937 - 1
- Evenly distributed in high dimensions
- Fast, passes all tests, insecure

[Figure: state array of 624 words (positions 0, 1, ..., 397, ..., 624) with the matrices A and T]

Matsumoto, M., and Nishimura, T., 1998, Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Trans. Model. Comput. Simul. 8, No. 1, 3-30.

http://www.math.sci.hiroshima-u.ac.jp/%7Em-mat/MT/emt.html

17

4. Good RNGs

Combined generators

- Combine the outputs of two or more RNGs, e.g., with a simple binary operation
- More evenly distributed, more independent, longer period, more secure

The universal generators

- Combine 2 generators: a lagged Fibonacci generator, and X_{n+1} = (X_n - k) mod 16777213
- Portable, pass all tests

Marsaglia, G., 1984, A current View of Random Number Generators, Keynote Address, Computer Science and Statistics: 16th Symposium on the Interface, Atlanta.

G. Marsaglia, A. Zaman and W.W. Tsang, Toward a universal random number generator, Letters in Statistics and Probability, 9 (1), 35-39, January 1990.

G. Marsaglia and W.W. Tsang, The 64-bit universal RNG, Letters in Statistics and Probability, Vol. 6, Issue 2, pp. 183-187, January, 2004.

George Marsaglia, 1924 -

18

4. Good RNGs

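Combined generators: the KISS generator (Keep It Simple, Stupid)

- Suggested by George Marsaglia
- Combines three simple generators: a congruential generator, a 3-shift generator, and a multiply-with-carry generator
- Passes all tests, popular
- Period: ~2^124

http://oldmill.uchicago.edu/~wilder/Code/random/Papers/Marsaglia_2003.html

    #include <stdint.h>

    /* Fixed-width types keep the arithmetic at 32 bits even where unsigned long is 64 bits. */
    uint32_t KISS(void) {
        static uint32_t x = 123456789, y = 362436, z = 521288629, c = 7654321;
        uint64_t t, a = 698769069ULL;
        x = 69069 * x + 12345;                          /* congruential generator */
        y ^= (y << 13); y ^= (y >> 17); y ^= (y << 5);  /* 3-shift generator */
        t = a * z + c; c = (uint32_t)(t >> 32);         /* multiply-with-carry generator */
        return x + y + (z = (uint32_t)t);
    }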

19

4. Good RNGs

Andrew Chi-Chih Yao (姚期智), 1946 -, Turing Award winner in 2000

- Contributions: theory of computation, complexity, theory of RNGs

Alan Turing, 1912-1954

If there is no practical way to predict the next bit of an RNG with more than 50% chance, the RNG will pass all statistical tests.

20

4. Good RNGs

Blum-Blum-Shub (BBS) generators, 1986

- First generator that fulfills Andrew Yao's RNG theory
- X_{n+1} = X_n^2 mod m
- m = pq, |p| = |q|, where p and q are distinct primes of the form 4z+3
- m has 1024 to 4096 bits
- Output the last bit of X_{n+1}
- Well distributed: passes all tests in theory
- Secure but very slow
- The period depends on the seed and can only be worked out using an algorithm.
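The squaring recurrence is easy to try on a toy modulus. The sketch below uses the tiny primes p = 11 and q = 23 purely for illustration (a real modulus has 1024 to 4096 bits, as stated above); the function name `bbs_bits` is hypothetical, and the code checks only the 4z+3 form, not primality.

```python
def bbs_bits(p, q, seed, count):
    """Return `count` bits from X_{n+1} = X_n^2 mod m, with m = p * q."""
    if p == q or p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must be distinct and of the form 4z+3")
    m = p * q
    x = seed % m                 # the seed should be coprime to m
    bits = []
    for _ in range(count):
        x = (x * x) % m
        bits.append(x & 1)       # output the last bit of X_{n+1}
    return bits

# States visited: 9, 81, 236, 36, 31, 202
print(bbs_bits(11, 23, 3, 6))    # [1, 1, 0, 0, 1, 0]
```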

21

4. Good RNGs

The HAVEGE generator

- HArdware Volatile Entropy Gathering and Expansion
- André Seznec
- Reads the fast-changing states in the computer in real time, e.g., cache, pipeline states, TLB, etc.
- Hardware dependent
- Unpredictable
- Irreproducible

http://www.irisa.fr/caps/projects/hipsor/HAVEGE1.0.html

22

5. Goodness-of-fit tests

The following shows 176 outcomes of a die. Is the die honest?

Face value   1    2    3    4    5    6
Observed     14   35   28   25   39   35

A goodness-of-fit test measures the discrepancy between the sample distribution and the purported distribution.

23

5. Goodness-of-fit tests

Pearson's chi-square test

Compute a statistic, X^2, that summarizes the difference between the observed and the expected counts.

Face value   1     2     3     4     5     6
Expected     29.3  29.3  29.3  29.3  29.3  29.3
Observed     14    35    28    25    39    35

X^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} = 14.1, \qquad p = 0.985

24

5. Goodness-of-fit tests

The chi-square test

If the samples are distributed as expected, X^2 follows the chi-square distribution with 5 degrees of freedom.

The p-value is the chance that X^2 is smaller than 14.1.

p = Pr[X^2 ≤ 14.1] = 0.985

If the p-value is greater than a pre-determined threshold (e.g., 0.95), the hypothesis that the die is honest is rejected.

[Figure: density plot with the observed value 14.1 marked]
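The dice example can be checked numerically. The sketch below computes X^2 from the observed counts and evaluates the chi-square CDF with a textbook series for the regularized lower incomplete gamma function; the function names are hypothetical, and a statistics library would normally be used for the CDF instead.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson's X^2 = sum of (o_i - e_i)^2 / e_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_cdf(x, dof):
    """Pr[X^2 <= x], via the series for the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a               # n = 0 term of sum z^n / (a (a+1) ... (a+n))
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

observed = [14, 35, 28, 25, 39, 35]
expected = [sum(observed) / 6] * 6        # 29.3 per face
stat = chi_square_statistic(observed, expected)
p = chi_square_cdf(stat, dof=5)
print(round(stat, 1), round(p, 3))        # 14.1 0.985
```

With the slide's convention (reject when p exceeds 0.95), this die is rejected.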

25

5. Goodness-of-fit tests

Let X be a random variable that is uniformly distributed in [0,1). The cumulative distribution function (CDF) of X is the diagonal from (0,0) to (1,1).

Suppose x_(1), x_(2), ..., x_(n) are samples of X, and let x_1, x_2, ..., x_n be the ordered x_(i)'s:

x_1 \le x_2 \le x_3 \le \cdots \le x_n

The empirical distribution

F_n(x) = \frac{\text{number of } x_i\text{'s such that } x_i \le x}{n}

is the staircase.

[Figure: staircase F_n(x) over x_1, ..., x_5, rising in steps of 1/n up to 1]

26

5. Goodness-of-fit tests

If the samples truly follow the uniform distribution, the staircase will be close to the diagonal most of the time.

[Figure: two staircases over x_1, ..., x_5 drawn against the diagonal, one labelled "Good fit" and one labelled "Bad fit"]

27

5. Goodness-of-fit tests

The Kolmogorov-Smirnov (KS) test

- Most commonly used
- Measures the maximum absolute distance: D_n = max | F(x) - F_n(x) |

Andrey Kolmogorov, 1903-1987
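For the uniform case F(x) = x, the maximum distance between the diagonal and the staircase is attained at one of the sample points, so D_n can be computed from the sorted samples alone. A minimal sketch (the function name is an assumption):

```python
def ks_statistic(samples):
    """D_n = max |F(x) - F_n(x)| against the uniform CDF F(x) = x on [0, 1)."""
    xs = sorted(samples)
    n = len(xs)
    # The staircase jumps at each x_i: compare the diagonal with the step
    # height just after the jump (d_plus) and just before it (d_minus).
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

print(ks_statistic([0.7, 0.1, 0.4]))   # about 0.3, from the last step (1 - 0.7)
```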

28

5. Goodness-of-fit tests

The KS test

The p-value is the CDF of D_n. The CDF of D_n is difficult to evaluate.

In 2003, we found the long forgotten matrix formula derived by Durbin. It is computationally stable and efficient except in the extreme right tail. We fixed the problem using an approximation. The resulting program evaluates the CDF with 13-digit accuracy for 2 ≤ n ≤ 16000.

G. Marsaglia, W.W. Tsang and J. Wang, Evaluating Kolmogorov's distribution, Journal of Statistical Software, Vol. 8, Issue 18, Pages 1-4, November, 2003. (available at http://www.jstatsoft.org/ )

W. W. Tsang and J. Wang, Evaluating the CDF of the Kolmogorov statistic for normality testing, Proceedings of the COMPSTAT 2004, 16th Symposium of IASC, Prague, August 23-27, 2004, 1893-1900.

29

5. Goodness-of-fit tests

The Anderson-Darling (AD) test

- Summation of the weighted squares of the vertical differences (a weighted squared area)
- More powerful than the KS test
- The CDF of A_n is harder to evaluate than the CDF of D_n

A_n = n \int_0^1 \frac{\{F_n(x) - x\}^2}{x(1 - x)} \, dx

Mnemonic for the name: Andersen (安徒生), darling (心愛的人).
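For uniform samples the integral has a well-known closed form in terms of the ordered samples, A_n = -n - (1/n) \sum_{i=1}^{n} (2i - 1) [ln x_i + ln(1 - x_{n+1-i})]. The sketch below uses that computing formula; the function name is hypothetical, and the samples are assumed to lie strictly between 0 and 1.

```python
import math

def anderson_darling(samples):
    """A_n for samples tested against the uniform distribution on (0, 1)."""
    xs = sorted(samples)
    n = len(xs)
    total = sum(
        (2 * i + 1) * (math.log(xs[i]) + math.log(1.0 - xs[n - 1 - i]))
        for i in range(n)        # i is 0-based, so 2i + 1 equals 2(i+1) - 1
    )
    return -n - total / n

print(anderson_darling([0.5]))   # 2 ln 2 - 1, about 0.386
```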

30

5. Goodness-of-fit tests

The AD test

In 2004, Marsaglia published a recursive procedure for computing the CDF of A_∞ with 13-digit accuracy. He also gave an approximation formula for evaluating the CDF of A_n with 3-digit accuracy for n > 35.

G. Marsaglia and J. Marsaglia, Evaluating the Anderson-Darling distribution, Journal of Statistical Software, 9(2): 1-5, 2004.

31

6. Statistical tests for RNGs

Statistical tests are used to reject poor RNGs.

The collision test (an example)

Suppose we throw n balls at random into m cells. A collision occurs when a ball falls into a cell that is occupied. The test counts the number of collisions (c). A generator passes this test if it doesn't induce too many or too few collisions.

32

6. Statistical tests for RNGs

The collision test

The probability that c collisions occur is

\frac{m(m-1)\cdots(m-n+c+1)}{m^n} \left\{ {n \atop n-c} \right\}

where \left\{ {n \atop k} \right\} is a Stirling number of the second kind:

\left\{ {n \atop k} \right\} =
\begin{cases}
1, & k = 1 \text{ or } k = n; \\
\left\{ {n-1 \atop k-1} \right\} + k \left\{ {n-1 \atop k} \right\}, & 1 < k < n; \\
0, & \text{otherwise.}
\end{cases}

A p-value is computed from c. If it is greater than a threshold (e.g., 0.99), the generator is rejected.

Knuth, The Art of Computer Programming, vol 2, 1998.

W.W. Tsang, L.C.K. Hui, K.P. Chow, C.F. Chong, and C.W. Tso, Tuning the collision test for power, Conferences in Research and Practice in Information Series, Vol. 26. No. 1, pp. 23-30, 2004. (Proceedings of the 27th Australasian Computer Science Conference, Dunedin, New Zealand, 2004.)
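For small n and m the collision probabilities can be evaluated exactly. The sketch below follows the formula with Stirling numbers of the second kind and uses exact fractions; the function names are hypothetical, and the plain recursion is only suitable for small n (Knuth describes a faster procedure for realistic sizes).

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind, for n >= 1."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

def collision_pmf(n, m):
    """Exact Pr[c collisions] for c = 0..n-1, with n balls and m cells."""
    pmf = []
    for c in range(n):
        occupied = n - c                 # distinct cells hit
        falling = 1
        for j in range(occupied):        # m (m-1) ... (m-n+c+1)
            falling *= m - j
        pmf.append(Fraction(falling * stirling2(n, occupied), m ** n))
    return pmf

print(collision_pmf(3, 2))   # [Fraction(0, 1), Fraction(3, 4), Fraction(1, 4)]
```

A p-value in the slide's sense would be the cumulative sum of these probabilities up to the observed c.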

33

6. Statistical tests for RNGs

Criteria of good tests

- Powerful
- Efficient
- The experiment is similar to certain important applications. E.g., the collision test is similar to insertion into a hash table.

34

6. Statistical tests for RNGs

Knuth’s collection

The most well-known collection of tests for RNGs is the one compiled by Knuth. It comprises 11 tests.

Knuth, The Art of Computer Programming, vol 2, 1998.

Donald Knuth, 1938 -

35

6. Statistical tests for RNGs

The National Institute of Standards and Technology (NIST) of the USA has suggested 16 statistical tests for checking cryptographic RNGs:

- Frequency (Monobit) Test
- Frequency Test within a Block
- Runs Test
- Test for the Longest Run of Ones in a Block
- Binary Matrix Rank Test
- Discrete Fourier Transform (Spectral) Test
- Non-overlapping Template Matching Test
- Overlapping Template Matching Test

36

6. Statistical tests for RNGs

The NIST collection (continued)

- Maurer's "Universal Statistical" Test
- Lempel-Ziv Compression Test
- Linear Complexity Test
- Serial Test
- Approximate Entropy Test
- Cumulative Sums (Cusum) Test
- Random Excursions Test
- Random Excursions Variant Test

Official website: Random number generation and testing, <http://csrc.nist.gov/rng/>.
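As an illustration of the simplest test in the NIST collection, the Frequency (Monobit) Test maps the bits to ±1, sums them, and converts the normalized sum into a p-value with the complementary error function. The sketch below follows that definition; the function name is hypothetical. Note that NIST's p-value is a tail probability, so a small value (e.g., below 0.01) indicates failure, which is the opposite direction from the CDF-style p-values used elsewhere in these slides.

```python
import math

def monobit_p_value(bits):
    """Frequency (Monobit) test: p = erfc(|S_n| / sqrt(2 n))."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)     # +1 for a one, -1 for a zero
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

bits = [int(ch) for ch in "1011010101"]       # six ones, four zeros
print(round(monobit_p_value(bits), 6))        # 0.527089
```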

37

6. Statistical tests for RNGs

Diehard is the most widely used testing package for examining RNGs.

- Developed by George Marsaglia
- Most powerful: an RNG that passes these tests passes all other tests

Tests:

- Birthday Spacings
- GCD
- Gorilla
- Overlapping Permutations
- Binary Rank n×n
- Binary Rank 6×8
- Monkey Tests OPSO, OQSO, DNA
- Count the 1's
- Count the 1's specific

38

6. Statistical tests for RNGs

Diehard (continued)

- Parking Lot
- Minimum Distance
- Random Spheres
- The Squeeze
- Overlapping Sums
- Runs Up and Down
- The Craps

Diehard Battery of Tests of Randomness v0.2 beta http://www.csis.hku.hk/~diehard/

G. Marsaglia and W.W. Tsang, Some difficult-to-pass tests of randomness, Journal of Statistical Software, Vol. 7, Issue 3, Pages 1-8, January, 2002. (available at http://www.jstatsoft.org/ ).

39

7. Conversions of uniform random integers to variates of other distributions

- An RNG outputs random integers that are uniformly distributed, e.g., in [0, 2^32 - 1]
- In applications, we often need random numbers of other distributions, e.g., uniform in [0, 1), normal, exponential, gamma, Poisson, binomial

Fast methods are needed for the conversions

40

7. Conversions

Given I, a random integer uniformly distributed in [0, 2^32 - 1], generate U, a uniform random number in [0,1):

U = I / 2^32

Generate points that are uniformly distributed in a rectangle, using a fresh U for each coordinate:

X = a U
Y = c U

[Figure: rectangle with X ranging from 0 to a and Y ranging from 0 to c]
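Both conversions fit in a few lines. In this sketch, Python's `random.Random.getrandbits(32)` stands in for the underlying 32-bit RNG; the function names are hypothetical.

```python
import random

def to_unit_interval(i: int) -> float:
    """Map a 32-bit integer I in [0, 2^32 - 1] to U = I / 2^32 in [0, 1)."""
    return i / 2**32

def random_point(a, c, rng):
    """A point uniformly distributed in the rectangle [0, a) x [0, c)."""
    x = a * to_unit_interval(rng.getrandbits(32))   # one U for X
    y = c * to_unit_interval(rng.getrandbits(32))   # a fresh U for Y
    return x, y

rng = random.Random(2024)        # any seed; fixed here for reproducibility
print(random_point(3.0, 2.0, rng))
```

Dividing by 2^32 rather than 2^32 - 1 is what keeps U strictly below 1.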

41

7. Conversions

If we generate points that are uniformly distributed under a density function, the x-coordinates of the points follow the distribution with that density.

[Figure: points under a density curve, x-axis from -3.5 to 3.5]

42

7. Conversions

The acceptance-rejection method

To generate X with the density f(x), 0 ≤ x ≤ a, where c is an upper bound of f(x):

1. X = a U
2. Y = c U (using a fresh U)
3. If (Y < f(X)) return X
4. Go to Step 1.

[Figure: density f(x) inside the rectangle from 0 to a wide and 0 to c high]
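The four steps translate directly into a loop. In the sketch below the height of the enclosing rectangle is called `c`, and the triangular density f(x) = 2x on [0, 1] is an assumed example, not one from the slides.

```python
import random

def accept_reject(f, a, c, rng):
    """Sample X with density f on [0, a], where f(x) <= c for all x."""
    while True:
        x = a * rng.random()     # Step 1: X = a U
        y = c * rng.random()     # Step 2: Y = c U, with a fresh U
        if y < f(x):             # Step 3: accept points under the curve
            return x
        # Step 4: otherwise try again

rng = random.Random(1)
samples = [accept_reject(lambda x: 2 * x, 1.0, 2.0, rng) for _ in range(20000)]
print(sum(samples) / len(samples))   # close to the true mean 2/3
```

The expected number of trials per sample is the rectangle's area divided by the area under f, here 2, which is why a tight bound matters for speed.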

43

7. Conversions

The Monty Python method (a patchwork method)

Put a unit rectangle on top of a density (blue area). Flip the cap onto the empty area in the top-right. To generate X with the density f(x):

1. X = b U
2. If (X < a) return X
3. Y = c U (using a fresh U)
4. If (Y < f(X)) return X
5. If (Y > f'(X)) return b - X
6. Sample from the tail

f'(x) is f(x) after flipping over: f'(x) = c - [f(b-x) - c].

[Figure: density f(x) and flipped cap f'(x) in a rectangle with base from 0 to b (a marked on the base) and height c]

44

7. Conversions

The Monty Python method

A tail can be sampled using the acceptance-rejection method. Instead of using a rectangle, use an easy-to-sample density function g(x) that dominates, and is close to, the tail of f(x).

[Figure: tail of f(x) under the dominating curve g(x)]

45

7. Conversions

The Monty Python method can be used to generate variates of various distributions, including normal, exponential, gamma, Student's t, etc.

G. Marsaglia and W.W. Tsang, The Monty Python method for generating random variables, ACM Transactions on Mathematical Software, Vol. 24, No. 3, Pages 341-350, September, 1998.

G. Marsaglia and W.W. Tsang, The Monty Python method for generating gamma variables, Journal of Statistical Software, Vol. 3, Issue 3, Pages 1-8, January 1999. (available at http://www.jstatsoft.org/ )

G. Marsaglia and W.W. Tsang, A simple method for generating gamma variables, ACM Transactions on Mathematical Software, Vol. 26, No. 3, Pages 363-372, September, 2000.

46

7. Conversions

The Ziggurat method is a sophisticated version of the Monty Python method. Instead of using a unit rectangle, it uses a staircase curve that is close to the density being sampled.

The method leads to the fastest way to sample from the normal and exponential distributions. It is used in Matlab and other software: www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html

47

7. Conversions The Ziggurat method

References

G. Marsaglia and W.W. Tsang, The ziggurat method for generating random variables, Journal of Statistical Software, Vol. 5, Issue 8, Pages 1-7, October, 2000. (available at http://www.jstatsoft.org/ )

G. Marsaglia and W.W. Tsang, A fast, easily implemented method for sampling from decreasing or symmetric unimodal density functions, SIAM J. Sci. Stat. Comput., Vol. 5, No. 2, June 1984.

48

7. Conversions

The alias method for generating discrete variates

- Suggested by A. J. Walker in the 1970s
- First convert a histogram into a rectangle.
- Then stack up the bars into a unit rod.

[Figure: histogram of a discrete distribution on 0, 1, 2, 3, 4 (vertical axis from 0 to 1), levelled into a rectangle of height 0.2]

49

7. Conversions

The alias method

Sample from the rod using a single U. First find out which bar U lands on. Then determine whether it lands on the upper or the lower segment.

[Figure: unit rod from 0 to 1 built from bars 0 to 4; e.g., one segment returns 0 and another returns 4]

// First set up V[ ] and K[ ]
u = U;
L = 1 + floor(5*u);
if (u > V[L]) return K[L]; else return L;
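A complete alias sampler needs the table set-up as well as the look-up. The sketch below builds the tables with the standard small/large pairing procedure (often attributed to Vose), which is one common way to do the levelling and is an assumption here rather than the slide's own construction. The names `cut` and `alias` play the roles of V[ ] and K[ ], except that `cut` is stored relative to each bar and values are 0-based.

```python
def build_alias_tables(probs):
    """Level a histogram into n bars of height 1/n, each with at most one alias."""
    n = len(probs)
    scaled = [p * n for p in probs]
    cut = [1.0] * n                    # share of each bar kept by its own value
    alias = list(range(n))             # value returned from the upper segment
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        cut[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]   # l donates the mass that tops up bar s
        (small if scaled[l] < 1.0 else large).append(l)
    return cut, alias

def alias_sample(u, cut, alias):
    """Turn a single uniform u in [0, 1) into a variate in 0..n-1."""
    n = len(cut)
    bar = min(int(n * u), n - 1)       # which bar u lands on
    frac = n * u - bar                 # position within that bar
    return bar if frac < cut[bar] else alias[bar]

probs = [0.345, 0.103, 0.276, 0.050, 0.226]
cut, alias = build_alias_tables(probs)
print(alias_sample(0.5, cut, alias))
```

Each sample costs one multiplication, one comparison and at most two table reads, regardless of the number of outcomes.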

50

7. Conversions

The straightforward table look-up method for generating discrete variates

Let the distribution of Y be

Pr[Y=1] = 0.345
Pr[Y=2] = 0.103
Pr[Y=3] = 0.276
Pr[Y=4] = 0.050
Pr[Y=5] = 0.226

Generation

Y = V[floor(1000*U)]

where V is a table with indices 0 to 999 holding 345 1's, 103 2's, 276 3's, 50 4's and 226 5's:

11.....122...233........344..455........5

51

7. Conversions

The Table Look-Up method

Pr[Y=1] = 0.3 + 0.04 + 0.005
Pr[Y=2] = 0.1 + 0.00 + 0.003
Pr[Y=3] = 0.2 + 0.07 + 0.006
Pr[Y=4] = 0.0 + 0.05 + 0.000
Pr[Y=5] = 0.2 + 0.02 + 0.006
Total     0.8 + 0.18 + 0.020

V1 (indices 0 to 7): 1 1 1 2 3 3 5 5
V2 (indices 0 to 17): 1 1 1 1 3 3 ... 3 4 ... 4 5 ... 5
V3 (indices 0 to 19): 1 1 ... 1 2 2 2 3 3 ... 3 5 5 ... 5

u = U;
If u < 0.8 return( V1[ floor(10*u) ] );
If u < 0.98 return( V2[ floor(100*(u-0.8)) ] );
Otherwise return( V3[ floor(1000*(u-0.98)) ] )

G. Marsaglia, W. W. Tsang and J. Wang, Fast Generation of Discrete Random Variables, Journal of Statistical Software, Volume 11, Issue 3, July, 2004.
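The three small tables can be built mechanically from the decimal digits of the probabilities. The sketch below is an integer variant of the look-up: it assumes the uniform input arrives as an integer r = floor(1000 U) in [0, 1000), which sidesteps floating-point rounding at the 0.8 and 0.98 boundaries. The function names are hypothetical.

```python
def build_digit_tables(weights_per_thousand):
    """Split each probability (in thousandths) into its three decimal digits."""
    if sum(weights_per_thousand.values()) != 1000:
        raise ValueError("weights must sum to 1000")
    v1, v2, v3 = [], [], []
    for value, w in weights_per_thousand.items():
        v1 += [value] * (w // 100)        # tenths digit
        v2 += [value] * (w // 10 % 10)    # hundredths digit
        v3 += [value] * (w % 10)          # thousandths digit
    return v1, v2, v3

def lookup(r, v1, v2, v3):
    """Map an integer r in [0, 1000) to a variate using the three tables."""
    if r < 100 * len(v1):
        return v1[r // 100]
    r -= 100 * len(v1)
    if r < 10 * len(v2):
        return v2[r // 10]
    return v3[r - 10 * len(v2)]

weights = {1: 345, 2: 103, 3: 276, 4: 50, 5: 226}
V1, V2, V3 = build_digit_tables(weights)
print(V1)                                 # [1, 1, 1, 2, 3, 3, 5, 5]
print(len(V1) + len(V2) + len(V3))        # 46 entries instead of 1000
```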

52

8. A summary

- For simulation, software testing, randomization or games, use Mersenne twister or KISS.
- For cryptography, use the recommended RNGs in the standards. To boost security, combine the RNGs with the HAVEGE or BBS generator, or both.
- For goodness-of-fit testing, use the AD test.
- For testing RNGs, use the new version of Diehard.
- For generating continuous variates, use the Monty Python method.
- For generating discrete variates, use the alias method or the table lookup methods.

Times for generating 100M numbers (sec)

RNG                Time (sec)
Mersenne Twister   3.296
KISS               3.235
HAVEGE             2.281

53

John von Neumann 1903 -1957

Alan Turing 1912 -1954

Andrew Yao 1946 -

Donald Knuth 1938 -

George Marsaglia 1924 -

Manuel Blum 1938 -

Andrey Kolmogorov 1903 - 1987

Makoto Matsumoto 1965 -