
Computational Statistics (2006) 21:103-119 DOI 10.1007/s00180-006-0254-y © Physica-Verlag 2006

Generating Multivariate Correlated Samples

Arnab Chakraborty

Stat-Math Unit Indian Statistical Institute 203 B T Road Kolkata, India 700108

Summary

Generating random samples from multivariate distributions is a common requirement in many fields of study. Often the complete joint distribution is not specified to the scientist. This paper addresses the situation where only the marginals and the correlation matrix are specified. We suggest a deterministic algorithm, PERMCORR, to approximately achieve the required correlation structure, which can be used to provide good initial values to standard stochastic algorithms. In many situations the output of PERMCORR is already accurate enough to preempt any need for running an expensive stochastic algorithm. We provide some theoretical justification for our method as well as simulation studies. We also provide a bootstrap technique based on PERMCORR.

Keywords: Correlation, permutation, simulation, bootstrap


1 Introduction

Generating multivariate samples with a given structure is an integral part of modern scientific studies. Monte Carlo methods are used for quantitative risk analysis in many sciences. Through Monte Carlo methods the investigator seeks to simulate the real situation as closely as possible. As Vose (1996) points out, it is the duty of the scientist to ensure that the simulated data closely approximate all known distributional properties of the real data. In many situations involving multivariate data, however, complete information about the joint distribution is not available. Marginals and some low-order joint moments often constitute the sole information, as discussed by Charmpis and Panteli (2004). In this paper we shall deal with one such case. Here is a mathematical description of the specific class of problems that we shall address.

We are given k univariate distributions D_1, ..., D_k. We assume that we already have some way to simulate iid data from each of these. We also have a correlation matrix C_{k×k} = ((c_ij)). We want to simulate a random matrix

Y_{n×k} = [y_1 ... y_k],

where y_i consists of n iid samples from D_i, and corr(y_i, y_j) = c_ij. Here, for two n × 1 vectors x, y, corr(x, y) is the sample correlation defined by

corr(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ).
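In code, the sample correlation used throughout the paper can be computed directly (a minimal sketch; the function name is ours):

```python
import math

def sample_corr(x, y):
    """Sample correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

For perfectly concordant data the value is 1; reversing one vector flips the sign.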

Charmpis and Panteli (2004) have given one heuristic scheme using simulated annealing to tackle this problem. Their approach is based on finding an optimal permutation of the generated data to achieve the specified correlation structure. In the present paper we shall systematically investigate this approach and develop a better method, which we call PERMCORR, to find the best permutation. Our method is based on combinatorial optimisation. We also present some theoretical results about our algorithm. Finally, we show how to use PERMCORR for bootstrapping.

In section 2 we shall review the relevant literature. In section 3 we shall present the simulated annealing method. Section 4 aims to provide some theoretical insight. Our main algorithm appears in section 5. After discussing some simulation results in section 6, a couple of real-life applications are presented in section 7.


2 Brief literature review

The most popular and well-studied case of multivariate random number generation is doubtless the generation of data from the multivariate normal distribution N_k(μ, Σ). See, for instance, Cheng (1985) or Gentle (2003). A commonly employed method proceeds by first generating an iid N(0, 1) sample of size k,

y = (y_1, ..., y_k)^T. Then μ + Σ^{1/2} y has the required multivariate normal distribution. Here Σ^{1/2} denotes the Cholesky square root of Σ. Golub and Van Loan (1983) provide details on the Cholesky decomposition of matrices. Gentle (2003) gives sampling techniques for multivariate random number generation from non-normal distributions.
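As a sketch of this standard construction (using numpy; not code from the paper):

```python
import numpy as np

def mvn_sample(mu, sigma, rng):
    """One draw from N_k(mu, Sigma): mu + L z, where Sigma = L L^T."""
    L = np.linalg.cholesky(sigma)       # lower-triangular Cholesky factor
    z = rng.standard_normal(len(mu))    # k iid N(0, 1) variates
    return mu + L @ z
```

Repeated draws have sample mean and covariance converging to mu and Sigma.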

More relevant to this paper are the methods designed to simulate from incompletely specified multivariate distributions, eg, based on moments and marginals only. Parrish (1990) proposed a general method for generating samples from the multivariate Pearson family of distributions. His method is applicable when the joint product moments up to the fourth order are known. Developing an idea of Fleishman (1978) for generating multivariate non-normal distributions, Vale and Maurelli (1983) have a method that requires prior knowledge of the first four moments. They first generate multinormal data with a specific correlation matrix. This matrix is obtained by solving nonlinear equations involving the supplied moments. Then they transform the intermediate multinormal data to the final sample. Li and Hammond (1975) designed an algorithm based on only the marginal distributions. Their method, like that of Vale and Maurelli (1983), works with intermediate multinormal data. However, it requires numerically evaluating a double integral. Lurie and Goldberg (1998) provided a modified version of this method where the numerical integration has been replaced by nonlinear optimisation routines to iteratively refine the correlation matrix in Cholesky form. Gentle (2003) has a good survey of methods to generate multivariate data from discrete distributions.

Taylor and Thompson (1986) present a situation where we already have some data from an unknown underlying distribution, and we want to generate more data from the same distribution. One method is to estimate the underlying distribution based on the already available data. However, Taylor and Thompson (1986) provide a way to do so without explicitly performing the intermediate estimation step. The method is sensitive to the choice of a smoothing parameter. When the parameter is set to unity, we get the usual bootstrap resample, while a moderately large value of the parameter produces approximately normal data. For values in between, the generated data are a sort of "smoothed bootstrap" and retain some distributional properties of the original sample. Iman and Conover (1982) give a permutation algorithm to approximate rank correlations. Their method requires computation of "van der Waerden scores" using the inverse Gaussian cdf.

Particularly relevant to the present paper is a heuristic simulated annealing method proposed by Charmpis and Panteli (2004). They first generate iid data from the specified marginals and then try to permute the generated data to approximate the required correlation structure. Their paper uses simulated annealing to find a suitable permutation. However, no theoretical basis for the algorithm is given. In this paper we shall present a more systematic method.

3 A simulated annealing method

The simulated annealing method, which is the foundation of our work, is due to Charmpis and Panteli (2004). In this section we give a summary of this approach.

Let y = (y_1, ..., y_n)^T be any n-dimensional vector, and let π be any permutation of {1, ..., n}. We shall denote by π(y) the permuted vector

π(y) = (y_π(1), ..., y_π(n))^T.

Now suppose that we are given p + 1 vectors x_1, ..., x_p and y. Suppose also that the correlation matrix of x_1, ..., x_p is C. Let c_1, ..., c_p be such that

[ C    c ]
[ c^T  1 ]

is a correlation matrix, where c = (c_1, ..., c_p)^T. Charmpis and Panteli (2004) use simulated annealing to find a permutation π such that

corr(x_i, π(y)) ≈ c_i  for all i.

Using this repeatedly they seek to approximate the entire correlation matrix step by step as follows.

They first generate x_1, ..., x_k, where x_i consists of n iid samples from D_i. Then for i = 2, ..., k they compute a permutation π ∈ S_n such that

corr(x_j, π(x_i)) ≈ c_ij  for all j < i.

Their suggested method to find π is as follows. Consider the error function

err(π) = Σ_{i=1}^{p} (corr(x_i, π(y)) − c_i)².


Charmpis and Panteli (2004) actually use its scaled version, following the suggestion of Lurie and Goldberg (1998). The aim is to minimise err(π) with respect to π. They start from π = id and pick a random swap. If this reduces err(π) then the move is accepted; else it is accepted with some probability following a simulated annealing scheme.

Simulations show that this method works satisfactorily in practice. However, Charmpis and Panteli (2004) provide no justification for why the method works, eg, some bound on how close to the target correlations we can get.
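For concreteness, here is a minimal sketch of such an annealing scheme for the p = 1 case (the cooling schedule, step count, and all names are our own assumptions, not the authors' implementation):

```python
import math
import random

def corr(x, y):
    """Sample correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def anneal_permute(x, y, target, steps=20000, t0=0.1, seed=0):
    """Randomly swap entries of y, accepting worse moves with a
    probability that shrinks as the temperature is lowered."""
    rnd = random.Random(seed)
    y = list(y)
    err = (corr(x, y) - target) ** 2
    for s in range(steps):
        i, j = rnd.sample(range(len(y)), 2)
        y[i], y[j] = y[j], y[i]                  # propose a swap
        new = (corr(x, y) - target) ** 2
        t = max(t0 * (1 - s / steps), 1e-12)     # linear cooling
        if new < err or rnd.random() < math.exp(-(new - err) / t):
            err = new                            # accept the swap
        else:
            y[i], y[j] = y[j], y[i]              # reject: undo the swap
    return y
```

Each full correlation evaluation costs O(n), which is exactly the waste that the incremental updates discussed later avoid.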

4 Theoretical Results

The err(π) function is defined on a finite set, S_n. So it can take at most n! distinct values. We are trying to minimise it, ie, looking for the value closest to 0. The all-important question here is: how close to zero can we get? Secondly, what is our chance of reaching that point by pairwise transposition steps?

To this end we first investigate the problem where p = 1. Here we have two vectors x and y and are given a number c ∈ [−1, 1]. We are looking for a permutation π such that

corr(x, π(y)) ≈ c.

Let μ_x = Σ x_i / n and σ_x² = Σ x_i² / n − μ_x². Similarly for μ_y, σ_y².

Lemma 1. corr(x, π(y)) is maximised when x and π(y) are concordant, ie,

for all i, j:  x_i < x_j ⟹ y_π(i) ≤ y_π(j).

Similarly, it is minimised when they are discordant, ie,

for all i, j:  x_i < x_j ⟹ y_π(i) ≥ y_π(j).

Proof: Correlations are translation- and scale-invariant. So we may assume without loss of generality that

μ_x = μ_y = 0 and σ_x = σ_y = 1.

Since there are only finitely many permutations, there must exist a permutation π that maximises corr(x, π(y)). Let, if possible, there be i, j with x_i < x_j and y_π(i) > y_π(j). Let τ be the permutation obtained from π by swapping i, j. Then

corr(x, π(y)) − corr(x, τ(y)) = (x_i − x_j)(y_π(i) − y_π(j)) < 0.


So π cannot be a maximiser, which is a contradiction. Similarly for the other extreme. [QED]

Let us call these extremes c_min and c_max. Obviously, if c ∉ [c_min, c_max] then we can never hope to achieve c. It is easy to compute c_min and c_max. If for a vector v we denote by inc(v) (resp. dec(v)) the same vector sorted in increasing (resp. decreasing) order, then

c_max = corr(inc(x), inc(y)) and c_min = corr(inc(x), dec(y)).
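These extremes are easy to compute by sorting (a sketch; function names are ours):

```python
import math

def corr(x, y):
    """Sample correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def corr_extremes(x, y):
    """(c_min, c_max): correlations of the discordant and concordant
    arrangements, obtained by sorting."""
    xs = sorted(x)
    c_min = corr(xs, sorted(y, reverse=True))   # inc(x) against dec(y)
    c_max = corr(xs, sorted(y))                 # inc(x) against inc(y)
    return c_min, c_max
```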

Let c_π = corr(x, π(y)). We shall call these the achievable values for the correlation. These values define a partition of [−1, 1]. The accuracy with which we can approximate a given c depends on the mesh size, ie, the maximum possible difference between any achievable value and its nearest neighbour. The following lemma puts an upper bound on this quantity.

Let us order x, and let δ_x denote the largest difference between two consecutive ordered x_i's, scaled by σ_x. Thus, if

x_(1) ≥ ... ≥ x_(n)

represent the ordered x_i's, then

δ_x = (1/σ_x) max_{1≤i<n} (x_(i) − x_(i+1)).

Similarly define

δ_y = (1/σ_y) max_{1≤i<n} (y_(i) − y_(i+1)).

Suppose that we are at c_π, and we want to swap i and j. This would change corr(x, π(y)) by (x_i − x_j)(y_π(i) − y_π(j)). We are interested in the distance to the nearest neighbour. So we want

min_{i,j} |(x_i − x_j)(y_π(i) − y_π(j))|.

This is the distance from c_π to the nearest achievable point that we can reach by a single swap. Note that this need not be the distance to the nearest achievable point. However, it is an upper bound for that. This upper bound still depends on π. We want an upper bound that works for all π, ie, we want to find a δ such that from any achievable point we are always guaranteed to find a pairwise transposition that will change the correlation by at most δ. Such a δ is given by

δ = max_π min_{i,j} |(x_i − x_j)(y_π(i) − y_π(j))|.

Note that this is the distance

max_π min_{τ ~ π} |c_π − c_τ|,


where τ ~ π means that τ may be arrived at from π by a single pairwise transposition.

Lemma 2. δ ≤ δ_x δ_y.

Proof: Since correlations are scale- and translation-invariant, we can assume without loss of generality that μ_x = μ_y = 0 and σ_x = σ_y = 1. Let ξ be a permutation such that ξ(x) = inc(x), and let λ = π ∘ ξ, ie, λ(i) = π(ξ(i)) for all i. Note that as π ranges over S_n, so does λ. Now,

δ = max_π min_{i,j} |(x_i − x_j)(y_π(i) − y_π(j))|

  ≤ max_λ min_{1≤i<n} |(x_ξ(i) − x_ξ(i+1))(y_λ(i) − y_λ(i+1))|

  ≤ max_λ min_{1≤i<n} [ δ_x |y_λ(i) − y_λ(i+1)| ]

  = δ_x · max_λ min_{1≤i<n} |y_λ(i) − y_λ(i+1)|

  ≤ δ_x δ_y.

[QED].

Lemma 3. If x_{n×1} consists of n iid samples from an underlying distribution with a bounded interval as its support, then δ_x → 0 in probability as n → ∞.

Proof: Let [a, b] denote the closure of the support. Fix any arbitrary ε > 0. Let P be a partition of [a, b] into equal subintervals, each of length less than ε/2. Consider the two events

A_n = {each subinterval has at least one x_i in it}

and B_n = {δ_x < ε}.

It is easy to see that A_n ⊆ B_n. Also, since [a, b] is the closure of the support, P(A_n) → 1. So it follows that P(B_n) → 1 as well, proving the lemma. [QED]

While discussing our algorithm in the next section, we shall need two directional versions of δ, viz δ+ and δ−, which we define now. δ+ is the maximum possible distance to the nearest right neighbour of any achievable point. Similarly, δ− is the maximum possible distance to the nearest left neighbour of any achievable point. More specifically,

δ+ = max_π min_{i,j} |(x_i − x_j)(y_π(i) − y_π(j))|,


where the minimisation is over only those i, j for which (x_i − x_j)(y_π(i) − y_π(j)) > 0. Similarly, if we restrict the minimisation to only those i, j for which (x_i − x_j)(y_π(i) − y_π(j)) < 0 we get δ−. The following result is not difficult to prove.

Lemma 4. δ ≤ min{δ−, δ+}.

5 Our algorithm: PERMCORR

Charmpis and Panteli (2004) try to find the best π by making random pairwise transpositions and accepting them using a simulated annealing scheme. This often requires long random walks that waste time near other local optima. Based on the theoretical results of the last section, we suggest a deterministic initialisation algorithm, which we call PERMCORR, to come near the global minimum. Then we may start an MCMC algorithm to home in on our target, much as Charmpis and Panteli (2004) have done.

To illustrate the idea, let us first consider the case p = 1. Here is the PERMCORR algorithm in this scenario.

Input: x, y, c
Output: π(y) such that corr(x, π(y)) ≈ c

set π = id
set current = corr(x, y)
loop:
    set (i, j) = argmin over i, j of |current + (x_i − x_j)(y_π(i) − y_π(j)) − c|
    set proposed = current + (x_i − x_j)(y_π(i) − y_π(j))
    if (|proposed − c| < |current − c|)
        set current = proposed
        set π = (i, j) ∘ π   (ie, swap i, j in π)
        goto loop
    else
        return π
    endif

Here is a brief description of the above algorithm in plain language. Consider S_n, the set of all possible permutations, as a graph with n! nodes. Each node corresponds to a permutation. Two nodes are connected by an edge if and only if they differ by a single pairwise transposition. At each node we have a correlation value. At every stage the algorithm is at some node. To go to the next stage we look for an adjacent node whose correlation is closer to the target. This greedy algorithm stops when no such node can be found, which must eventually happen due to the finiteness of S_n.
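A direct sketch of this greedy search in Python (our own rendering of the pseudocode, using the fact that, for vectors standardised to mean 0 and variance 1, swapping positions i and j changes the correlation by (x_i − x_j)(y_j − y_i)/n):

```python
import math

def permcorr1(x, y, c):
    """Greedy PERMCORR sketch for p = 1: return a rearrangement of y
    whose correlation with x is as close as the greedy search gets
    to the target c, together with that achieved correlation."""
    n = len(x)

    def standardise(v):
        m = sum(v) / n
        s = math.sqrt(sum((a - m) ** 2 for a in v) / n)
        return [(a - m) / s for a in v]

    xs = standardise(x)
    ys = standardise(y)
    perm = list(range(n))                       # tracks positions of y
    current = sum(a * b for a, b in zip(xs, ys)) / n
    while True:
        best, bi, bj = abs(current - c), None, None
        for i in range(n):                      # scan all swaps
            for j in range(i + 1, n):
                delta = (xs[i] - xs[j]) * (ys[j] - ys[i]) / n
                if abs(current + delta - c) < best:
                    best, bi, bj = abs(current + delta - c), i, j
        if bi is None:                          # no swap gets closer: stop
            return [y[k] for k in perm], current
        current += (xs[bi] - xs[bj]) * (ys[bj] - ys[bi]) / n
        ys[bi], ys[bj] = ys[bj], ys[bi]
        perm[bi], perm[bj] = perm[bj], perm[bi]
```

On Unif(0, 1) data of moderate size this typically lands within a few thousandths of a target inside (c_min, c_max), consistent with the simulation tables below.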

The lemma below shows how close to c we can get with this algorithm.

Lemma 5. Let π(y) be the output of the above algorithm.

(i) If c ≥ c_max then c_π = c_max.

(ii) If c ≤ c_min then c_π = c_min.

(iii) If c ∈ (c_min, c_max) then

c_π ∈ [c − δ+/2, c + δ−/2].

In particular, in this case

|c_π − c| ≤ (1/2) max{δ+, δ−}.

Proof: (i) Our algorithm is greedy, and always makes a move if it can get closer to the target by a single pairwise transposition. Until π is such that x and π(y) are concordant, there always exists at least one pair i, j with

(x_i − x_j)(y_π(i) − y_π(j)) < 0.

Transposing i, j then takes us closer to c_max. (ii) Similar to (i). (iii) Let, if possible, c_π < c − δ+/2. By definition of δ+ we know that we can always find a pairwise transposition that will increase the correlation by a positive amount of at most δ+. Clearly this new correlation is closer to the target c, and hence our greedy algorithm cannot stop at c_π. Similarly for the other side. [QED]

The algorithm for general p is very similar; we only need to compare among more correlation values. Lurie and Goldberg (1998) and Charmpis and Panteli (2004) suggest using the sum of squared differences. Accordingly we seek to minimise

err(π) = Σ_{i=1}^{p} (corr(x_i, π(y)) − c_i)².

Here, after a pairwise transposition we do not need to recompute the correlations from scratch. Rather, we may update each of them by just adding a term of the form (x_i − x_j)(y_π(i) − y_π(j)), as in the p = 1 case.
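The constant-time update after a swap can be sketched as follows (standardised columns assumed; names are ours):

```python
def swap_update(corrs, x_cols, ys, i, j, n):
    """After swapping positions i and j of the standardised vector ys,
    update each corr(x_k, ys) in O(1) per column instead of O(n).
    x_cols holds the p standardised x-vectors; corrs the current values."""
    updated = [c + (col[i] - col[j]) * (ys[j] - ys[i]) / n
               for c, col in zip(corrs, x_cols)]
    ys[i], ys[j] = ys[j], ys[i]       # perform the swap itself
    return updated
```

This brings the cost of evaluating one candidate swap down from O(np) to O(p).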

The next section will present simulation results to show that the algorithm often brings us quite close to the minimum, effectively reducing the length of the random walk. We shall assume that the algorithm has been coded


Table 1: The case p = 1 and n = 10 (Unif(0, 1) distribution)

  c_min       c_max      c           c_π         time (sec)
  -0.963815   0.942865   -0.213381   -0.212641   0.000000
  -0.931834   0.893190   -0.578464   -0.578150   0.000000
  -0.958086   0.925458    0.617129    0.616872   0.000000
  -0.961242   0.947030    0.931326    0.931412   0.000000
  -0.973110   0.914786    0.807629    0.807378   0.000000

into a program called PERMCORR(X, C), where X_{n×k} is the original data matrix, and C is the target correlation matrix. The program returns a matrix Y_{n×k} such that each column of Y is a permutation of the corresponding column of X, and the correlation matrix for the columns of Y is "close" to C, in the sense mentioned earlier in this paper.

6 Simulation Results

In this section we shall see how the algorithm behaves in practice. First, we work with the case p = 1. For a reasonably large sample size (eg, n = 100) the deterministic initialisation often produces highly accurate results, sometimes obviating the need for any stochastic iteration at all.

6.1 Distribution with bounded support

Lemmas 2 and 3 indicate that we should expect good asymptotic results from our algorithm if the supports of the underlying distributions are bounded. To check this we proceed as follows. We generate data from Unif(0, 1), compute c_min and c_max as outlined in the last section, generate a target correlation c uniformly from [c_min, c_max], and finally run our algorithm to get the output c_π. Tables 1, 2 and 3 show the results for n = 10, 50 and 100, respectively.

6.2 Distribution with unbounded support

Next we work with a situation where the data come from the Exp(1) distribution, which has unbounded support. Lemmas 2 and 3 showed that bounded support is a sufficient condition for convergence. Tables 4, 5 and 6 show


Table 2: p = 1 and n = 50 (Unif(0, 1) distribution)

  c_min       c_max      c           c_π         time (sec)
  -0.986090   0.982294   -0.774877   -0.774872   0.000000
  -0.989970   0.986744    0.535716    0.535711   0.000000
  -0.973255   0.992901    0.568413    0.568410   0.010000
  -0.994812   0.964264    0.638225    0.638223   0.000000
  -0.989681   0.983002   -0.813243   -0.813242   0.000000

Table 3: p = 1 and n = 100 (Unif(0, 1) distribution)

  c_min       c_max      c           c_π         time (sec)
  -0.795122   0.981651   -0.573812   -0.574266   0.000000
  -0.683781   0.932370   -0.153419   -0.153449   0.000000
  -0.749166   0.903410   -0.585691   -0.585699   0.000000
  -0.825801   0.965593    0.545090    0.545403   0.000000
  -0.880129   0.959873    0.136010    0.131549   0.000000

that even though the sufficient condition is violated here, the convergence is still good.

6.3 The case p > 1

In this case we try to minimise the squared-error loss err(π). Simulation results with the Unif(0, 1) distribution are shown below. Observe that the greedy initialisation algorithm is again often enough to guarantee high precision for n ≥ 100.

To test our algorithm in this situation we generate a target correlation matrix

Table 4: p = 1 and n = 10 (Exp(1) distribution)

  c_min       c_max      c           c_π         time (sec)
  -0.609121   0.932132    0.523796    0.523796   0.000000
  -0.736642   0.940886    0.365131    0.365131   0.000000
  -0.803038   0.983633   -0.515975   -0.515975   0.000000
  -0.699106   0.986755   -0.093075   -0.093075   0.000000
  -0.842728   0.966235   -0.310125   -0.310126   0.000000


Table 5: p = 1 and n = 50 (Exp(1) distribution)

  c_min       c_max      c           c_π         time (sec)
  -0.665535   0.970804    0.181599    0.181599   0.000000
  -0.556157   0.981738    0.034583    0.034583   0.000000
  -0.699596   0.993405   -0.448631   -0.448631   0.010000
  -0.700185   0.987629    0.544073    0.544073   0.000000
  -0.697147   0.991110    0.087065    0.087065   0.010000

Table 6: The case p = 1 and n = 100 (Exp(1) distribution)

  c_min       c_max      c           c_π         time (sec)
  -0.995209   0.995579    0.540953    0.540953   0.010000
  -0.995550   0.996569    0.747632    0.747632   0.010000
  -0.983801   0.988954   -0.914379   -0.914379   0.020000
  -0.981186   0.993599    0.054684    0.054684   0.000000
  -0.989049   0.993993    0.140847    0.140847   0.000000

as follows. First we simulate a p × p matrix A with entries iid Unif(0, 1). This matrix is nonsingular with probability 1. Then we compute the positive definite matrix Σ = A^T A. We treat this as a variance-covariance matrix and compute the correlation matrix from it. The correlation matrix is generated independently of the data.
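This construction can be sketched with numpy (function name ours):

```python
import numpy as np

def random_corr(p, rng):
    """Random correlation matrix: Sigma = A^T A with A having iid
    Unif(0,1) entries, rescaled to unit diagonal."""
    A = rng.random((p, p))              # nonsingular with probability 1
    sigma = A.T @ A                     # positive definite
    d = np.sqrt(np.diag(sigma))
    return sigma / np.outer(d, d)       # divide out the scales
```

The result is symmetric, positive definite, and has unit diagonal by construction.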

Here is a typical output for p = 4 and n = 10.

Target matrix (upper triangle of the 5 × 5 correlation matrix):
  0.563 0.754 0.696 0.879
        0.755 0.734 0.634
              0.917 0.713
                    0.780

Output matrix (upper triangle):
  0.564 0.745 0.691 0.873
        0.716 0.729 0.698
              0.918 0.710
                    0.737

The err(π) values at each step are given in Table 7. The time taken was 0.01 sec. We also tried p = 4 and n = 100, with the following target correlation matrix (generated randomly as above).

Target matrix (upper triangle):
  0.831 0.918 0.759 0.834
        0.524 0.798 0.853
              0.587 0.893
                    0.712

Output matrix (upper triangle):
  0.831 0.918 0.759 0.834
        0.524 0.798 0.852
              0.587 0.893
                    0.712


Table 7: err(π) at various steps

  p   err(π)
  1   0.000002
  2   0.000090
  3   0.001580
  4   0.005865

7 Real life applications

In this section we present two applications of our algorithm. The first is in bootstrapping, the second in improving the output of the popular Crystal Ball package for Microsoft Excel.

7.1 Bootstrapping

The bootstrap is a widely used nonparametric technique. Its main idea may be summarised as follows. Any statistical data set contains both signal and noise. Different data sets from the same random experiment are supposed to contain the same signal, while the noise differs. Bootstrapping seeks to simulate fresh data from the same random experiment by reshuffling the noise while retaining the signal. Efron and Tibshirani (1993) is a good reference on the subject. For univariate iid data sets this reshuffling is usually done by simple random sampling with replacement from the original data set. For more general data sets various methods have been suggested in the literature, eg, resampling residuals for (generalised) linear models, or the block bootstrap for time series. Using our algorithm we propose a bootstrapping technique suitable for multivariate data sets consisting of n iid observations of a random vector (X_1, ..., X_p). Two commonly used methods are row resampling and residual resampling. The latter is believed to reshuffle the noise better, though it requires some kind of (generalised) linear model assumption. See Efron and Tibshirani (1993) for details. Here we propose a new bootstrapping scheme based on our algorithm that preserves the correlation structure as well as the marginals of the X_i's. It does not require any model assumption, and hence is more general than residual resampling, yet does more reshuffling than the row-resampling bootstrap. Here is the algorithm.


Input: A matrix X_{n×p} where the ith column stores the n values of X_i, for 1 ≤ i ≤ p.

Output: A resampled version, X*_{n×p}, of X, that preserves the column marginals and the correlation structure.

begin
    set C = correlation matrix of X
    for i = 1, ..., p do
        set y_i = resample of ith column of X
    end for
    set Y = [y_1 y_2 ... y_p]_{n×p}
    set X* = PERMCORR(Y, C)
end

In plain English, we first resample the columns of the original data set independently and then permute them using PERMCORR to approximate the original correlation structure.
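One resampling step can be sketched as follows (a numpy sketch; `permcorr` stands in for an implementation of the paper's algorithm and is passed in as a parameter):

```python
import numpy as np

def permcorr_bootstrap(X, permcorr, rng):
    """One bootstrap resample: draw each column of X independently with
    replacement, then let PERMCORR(Y, C) restore the original
    correlation structure C."""
    n, p = X.shape
    C = np.corrcoef(X, rowvar=False)    # correlations to be preserved
    Y = np.column_stack([rng.choice(X[:, i], size=n, replace=True)
                         for i in range(p)])
    return permcorr(Y, C)
```

Because each column is resampled from itself, the column marginals are preserved exactly; the PERMCORR step only reorders values within columns.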

We shall apply this to a typical image-processing problem. Most image-processing algorithms rely on the local correlation structure of the pixels, as in Geman and Geman (1984). We wish to predict the value, Y, of a pixel based on the values of its 8 neighbouring pixels, X_1, ..., X_8. Suppose that we want to fit a linear regression of the form

Y = β_0 + Σ_{i=1}^{8} β_i X_i + ε,

where ε has some unknown distribution F. We want to get confidence intervals for the β_i's.

For this we used a USB camera to take 50 snapshots at 0.5-second intervals, and picked a 3 × 3 region near the centre. This resulted in a 50 × 9 data matrix. We bootstrapped this using our method. Here is the original correlation matrix:

       X1      X2      X3      X4      X5      X6      X7      X8      Y
X1   1.000   0.848   0.362   0.716   0.224   0.146   0.147   0.080   0.511
X2   0.848   1.000   0.499   0.594   0.321   0.108   0.100   0.083   0.536
X3   0.362   0.499   1.000   0.196   0.595  -0.019  -0.013   0.041   0.266
X4   0.716   0.594   0.196   1.000   0.438   0.614   0.599   0.562   0.857
X5   0.224   0.321   0.595   0.438   1.000   0.453   0.531   0.608   0.681
X6   0.146   0.108  -0.019   0.614   0.453   1.000   0.918   0.815   0.674
X7   0.147   0.100  -0.013   0.599   0.531   0.918   1.000   0.810   0.619
X8   0.080   0.083   0.041   0.562   0.608   0.815   0.810   1.000   0.716
Y    0.511   0.536   0.266   0.857   0.681   0.674   0.619   0.716   1.000


Figure 1: Histogram of 1000 bootstrapped β̂_5's.

Here is the achieved correlation matrix for the first resample.

       X1      X2      X3      X4      X5      X6      X7      X8      Y
X1   1.000   0.848   0.362   0.716   0.219   0.138   0.132   0.062   0.517
X2   0.848   1.000   0.499   0.595   0.325   0.111   0.112   0.102   0.534
X3   0.362   0.499   1.000   0.196   0.593  -0.017  -0.010   0.043   0.266
X4   0.716   0.595   0.196   1.000   0.441   0.623   0.594   0.563   0.852
X5   0.219   0.325   0.593   0.441   1.000   0.452   0.522   0.574   0.672
X6   0.138   0.111  -0.017   0.623   0.452   1.000   0.890   0.797   0.672
X7   0.132   0.112  -0.010   0.594   0.522   0.890   1.000   0.798   0.627
X8   0.062   0.102   0.043   0.563   0.574   0.797   0.798   1.000   0.698
Y    0.517   0.534   0.266   0.852   0.672   0.672   0.627   0.698   1.000

The agreement is pretty close. The histogram in Figure 1 is based on 1000 resamples.

7.2 Postprocessing Crystal Ball output

Crystal Ball (version 7, standard edition) is an add-on package for Microsoft Excel for performing risk analysis using simulation. The software allows the user to simulate data by specifying the marginal as well as a correlation with another existing variable. Their algorithm is different from ours, and often only crudely approximates the specified correlation. We can feed the Crystal Ball output to PERMCORR to achieve a closer approximation to the specified correlation. This postprocessing boosts the performance of Crystal Ball even for sample sizes as small as 50. Here are two typical cases for the target correlation coefficient ρ = 0.8.


Marginals                  n    ρ     Crystal Ball   PERMCORR
Normal and Triangular      50   0.8   0.8282         0.8000
Exponential and Poisson    50   0.8   0.8903         0.8000

8 Conclusion

In this paper we have presented a deterministic combinatorial algorithm, PERMCORR, to approximate a given correlation structure. The algorithm is easy to implement, and often produces surprisingly accurate results even for samples of size less than 100. For any sample size, it produces a correlation matrix in the vicinity of the required one, and hence it is useful for speeding up the convergence of stochastic optimisation algorithms, whose performance is often sensitive to the starting position. We have also proposed a bootstrapping scheme based on PERMCORR that is more general than residual resampling and produces better reshuffling than row resampling.

References

Charmpis, D., Panteli, P., (2004), A heuristic approach for the generation of multivariate random samples with specified marginal distributions and correlation matrix. Computational Statistics 19(2), 283-300.

Cheng, R. C. H., (1985), Generation of multivariate normal samples with given sample mean and covariance matrix. Journal of Statistical Computation and Simulation 21, 39-49.

Efron, B., Tibshirani, R., (1993), An Introduction to the Bootstrap. Chapman & Hall (London; New York).

Fleishman, A. I., (1978), A method for simulating non-normal distributions. Psychometrika 43, 521-532.

Gentle, J., (2003), Random Number Generation and Monte Carlo Methods. Springer-Verlag.

Golub, G. H., Van Loan, C. F., (1983), Matrix Computations. Johns Hopkins University Press.

Iman, R. L., Conover, W. J., (1982), A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics, Part B - Simulation and Computation 11, 311-334.


Li, S. T., Hammond, J. L., (1975), Generation of pseudo-random numbers with specified univariate distributions and correlation coefficients. IEEE Transactions on Systems, Man, and Cybernetics 5, 557-560.

Lurie, P., Goldberg, M., (1998), An approximate method for sampling correlated random variables from partially specified distributions. Management Science 44(2), 203-218.

Parrish, R. S., (1990), Generating random deviates from multivariate Pearson distributions. Computational Statistics and Data Analysis 9, 283-295.

Taylor, M. S., Thompson, J. R., (1986), A data based algorithm for the generation of random vectors. Computational Statistics and Data Analysis 4, 93-101.

Vale, C. D., Maurelli, V. A., (1983), Simulating multivariate nonnormal distributions. Psychometrika 48, 465-471.

Vose, D., (1996), Quantitative Risk Analysis: a Guide to Monte Carlo Simulation Modelling. John Wiley & Sons.