Upload
erick-farmer
View
231
Download
0
Embed Size (px)
DESCRIPTION
Central Limit Theorem If the n mutually independent random variables x 1, x 2, …, x n have the same distribution, and if their mean and their variance 2 exist then …
Citation preview
AN INTRODUCTION TO STATISTICAL ANALYSISOF SIMULATION OUTPUTS
Sample Statistics If x1, x2, …, xn are n observations of the
value of an unknown quantity X, they constitute a sample of size n for the population on which X is defined.
Sample mean
Sample variance
n
iixn
x1
1
n
ii xx
ns
1
22 )(1
1
Central Limit Theorem If the n mutually independent
random variables x1, x2, …, xn have the same distribution, and if their mean and their variance 2 exist then …
Central Limit Theorem The random variable
is distributed according to the standard normal distribution (zero mean and unit variance).
n
xn
n
ii
1
1
Estimating a mean Assume that we have a sample x1, x2, …,
xn consisting of n independent * observations of a given population
The sample mean xbar is an unbiased estimator of the mean of the population
* This is the critical assumptionWithout it, we cannot apply the formula
Large populations For large values of n, the (1-)%
confidence interval for is given by
with for the standard normal distribution
nzx
nzx 2/2/ ,
21)( 2/
zF
Explanations 1 – expressed in percent is the level of
the confidence interval 90% means= 0.10 95% means= 0.05 99% means= 0.01
is the error probability Probability that the true mean falls
outside the confidence interval
95% Confidence Intervals For =.05, z/2 =1.96
Example: If = 35, = 4 and n = 100 The 95% confidence interval for is
35 ± 1.96x4/10 = 35 ± 7.84
x
When is not known We can replace in the preceding formula
by the standard-deviation s of the sample When n < 30, we must read the value
of z/2 from a table of Student's t-distribution with n - 1 degrees of freedom
When n 30, we can use the standard values
Confidence Intervals in CSIM CSIM can automatically compute confidence
intervals for the mean value of any table, qtable and so on. For everything but boxes
xyx->confidence(); For the elapsed times in a box
bd->time_confidence(); For the population of a box
bd->_number_confidence();
We get 90, 95 and 99 percent confidence intervals
Confidence Intervals in CSIM We get 90, 95 and 98 percent confidence
intervals Computed using batch means method
See next section But only for the mean
Estimating a proportion A proportion p represents the
probabilityP(X) for some fixed threshold 97% of our customers have to wait
less than one minute Confidence intervals for proportions are
much easier to compute than confidence intervals for quantiles
Assumptions Assume that we have n independent
observations x1, x2, …, xn of a given population variable X and that this variable has a continuous distribution
Let p represent the proportion we want to estimate, say P(X)
Let k represent the number of observations that are .
Basic property If p represent the proportion we want to
estimate, say P(X), and k represent the number of observations that are The rv k is distributed according to a
binomial distribution with mean np and variance p(1 – p)
The rv k/n is distributed according to a binomial distribution with mean p and variance p(1 – p)/n
The formula When n > 29, we can use the Wilson’s
interval
where z = 1.96 for a 95% C.I.
1
1
4)ˆ1(ˆ
2ˆ
1
4)ˆ1(ˆ
2ˆ
22/
2
22/
2/
22/
22/
2
22/
2/
22/
nz
nz
nqqz
nzq
q
nz
nz
nqqz
nzq
P
Advantages Produces tighter confidence intervals
then the Central Limit Theorem Works when is equal to zeroq̂
10 2
2/
22/
znz
qP
Example Assume we have 400,000 independent
observations without noting any failure The 95% confidence interval for the
probability that the system could fail is
)10,0()96.1000,400
96.1,0( 52
2
AUTOCORRELATION
The problem All statistical analysis techniques we
have discussed assume that sample values are mutually independent
Generally false for quantities such as Waiting times, response times, …
Tend instead to be autocorrelated When the waiting lines are long,
everybody wait a long time
Traditional solution Keep the measurements sufficiently
apart Sample them every T minutes apart Standard solution for collecting
observations on a running system Not practical
Would require very long simulations
Three good solutions Batch means Regenerative method Time series analysis
Batch means We group consecutive observations into
“batches” We compute the means of these batches We observe that autocorrelation among
batch means decreases with size of batches When size increases, each batch
includes more observations that are far apart
Example We collected the following values:
4, 3, 3, 4, 5, 5, 3, 2, 2, 3 We group them into two batches of five
observations: 4, 3, 3, 4, 5 and 5, 3, 2, 2, 3
The batch means are: 3.8 and 3
Batch means in CSIM CSIM uses fixed-size batches
To compute confidence intervals To control the duration of a simulation
(run-length control)
Regenerative method Most systems with queues go through
states that return it to an state identical to its original state The system regenerates itself
Examples: Whenever a disk array is brought back
to its original state Whenever a camper rental agency has
all its campers available
Key idea Define a regeneration interval as an
interval between two consecutive regeneration points: Observations collected during the
same regeneration interval can be correlated
Observations collected during different regeneration intervals are independent
Application We group together observations that
occur within the same regeneration interval
We compute the means of these groups of observations
These group means are independent from each other
Limitations of the approach Not general:
System must go through regeneration points
System must be idle
Leads to complex computations We rarely have exactly the same
number of observations in two different regenerations intervals
Time series analysis Treats consecutive observations as
elements of a time series Estimates autocorrelation among the
elements of a time series Includes this autocorrelation in the
computation of all confidence intervals
RUN LENGTH CONTROL
Objective Accuracy of confidence intervals
increases with duration of simulation The 1/n factor
We would like to be able to stop the simulation once a given accuracy level has been reached for the confidence interval of a specific measurement
The CSIM solution (I) Specify
Quantity of interest Mean value from a table, a qtable, …
A relative accuracy Maximum relative error
A confidence level (say, 0.90 to 0.99) A maximum simulation duration
In seconds of CPU time
The CSIM solution (II) void table::run_length(double accuracy,
double conf_level, double max_time) void qtable::run_length(double accuracy,
double conf_level, double max_time) void meter::run_length(double accuracy,
double conf_level, double max_time) void box::time_run_length(double accuracy,
double conf_level, double max_time) void box::number_run_length(double accuracy,
double conf_level, double max_time)
The CSIM solution (III) Example:
thistable->run_length(.01, .95, 500) Specifies than we want an error less
than 1 percent for 95% confidence interval of mean of thistable
Stops simulation after 500 seconds thisbox->:time_run_length(.01, .99,
500) Same for time average of a box
The CSIM solution (IV) Replace termination test by
converged.wait();
Example (I) The campers We want
A maximum error of 1 percent (0.01) For the 95 percent confidence interval Of average number of rented
campers A maximum simulation time of 100 s
Will use number aspect of agency box
Example (II) Add
agency->number_confidence()after activation of box agency in csim process
Example (III) Put main loop
while(simtime() < DURATION) { hold(exponential(MIART); customer();}
in a separate arrivals process Make it an infinite loop
Example (IV) Add
converged.wait();after call to arrivals process arrivals():
converged.wait(); Best way to let sim process generate
customers and wait for terminationin parallel
Example (V) The new arrivals process
void arrivals() {process(“arrivals”); // REQUIRED for(;;) { // forever loop hold(exponential(MIART)); customer(); } // forever} // arrivals
Warnings Confidence intervals do not take into account
model inaccuracies While the batch means method eliminates
most effects of measurement autocorrelation, it is not always 100% effective
The max_time parameter of the run_length() will not necessarily stop the simulation just after the specified CPU time Like the emergency brake of a train
Objective Partition processes into different classes
Low priority High priority
Obtain separate statistics for each process priority class
Declaring and initializing process classes To declare a dynamic process class:
process_class *c; To initialize a process class before it
can be used in any other statement. c = new process_class("low
priority")
To assign a priority class Add inside the process
c->set_process_class(); Processes that have not been
assigned a process class belong to the “default” process class
Reporting Can use
report() report_classes()
Other options Can change the name of a process class:
c->set_name("high priority"); Can reset statistics associated with a
process c ->reset();
Can do the same for all process classes: reset_process_classes();
Can delete a dynamic process class: delete c;
RANDOM NUMBERS
Background We need to distinguish between
Truly random numbers Obtained through observations of a
physical random process Rolling dices, roulette Atomic decay
Pseudo-random numbers Obtained through arithmetic operations
Example Linear Congruential Generators (LCG)
Easy to implement and fast Defined by the recurrence relation:
rn+1 = (a rn + c ) mod mwhere
r1, r2, … are the random values m is the "modulus“ r0 is the seed
Two realizations GCC family of compilers
m = 232 a = 69069 c = 5 Microsoft Visual/Quick C/C++
m = 232 a = 214013 c = 2531011
Problems with pseudorandom number generators Much shorter periods for some seed
states Lack of uniformity of distribution Correlation of successive values …
Better RNGs Use the Mersenne twister
Period is 219937 - 1 Blum-Blum-Schub
A quote "Any one who considers arithmetical
methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number– there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.“
John von Neumann
CSIM RNGs BY default, CSIM uses a single stream of
random numbers Can reset the seed using
void reseed(stream *s, long n)as in reseed(NIL, 13579)
Continuous distributions supported by CSIM (I) double uniform(double min, double max) double triangular(double min, double max,
double mode) double beta(double min, double max, double
shape1, double shape2) double exponential(double mean) double gamma(double mean, double stddev) double erlang(double mean, double var)
Continuous distributions supported by CSIM (II) double hyperx(double mean, double var) double weibull(double shape, double scale) double normal(double mean, double stddev) double lognormal(double mean, double stddev) double cauchy(double alpha, double beta) double hypoexponential(double mn, double var)
Continuous distributions supported by CSIM (III) double pareto(double a) double zipf(long n) double zipf_sum(long n, double *sum)
Discrete distributions supported by CSIM long uniform_int(long min, long max) long bernoulli(double prob_success) long binomial(double prob_success, long
num_trials) long geometric(double prob_success) long negative_binomial(long success_num,
double prob_success) long poisson(double mean)
Empirical distributions supported by CSIM ???
Using multiple streams In campers example, sequence of RNs
used to generate arrivals is affected by the numbers of campers If agency has less campers
More customers will be lost Lost customers do no generate any
RN Better to have separate random number
streams for arrivals and service times
Declaring and initialing random number streams Can use:
stream *s;s = new stream();
By default, streams are created with seeds that are spaced 100,000 values apart
Reseeding a stream Use:
s->reseed(24680); where the new seed is a long integer
Other functions Inspect the current state of a stream
i = s->state(); Delete a stream
delete s;
Using a specific seed Prefix RNG function with name of seed:
s->uniform (3.0, 7.0)
A CASE STUDY
RAID array revisited Reliability of a RAID array Reliability R(t) of a system is the
probability that will remain operational over a time interval [0. t ] given that it was operational at time t = 0 Not the same as availability Our focus is evaluating the risk of a
data loss during array lifetime
Executing multiple runs Multiple runs provide statistically
independent repetitions of original simulation Useful for
Collecting more accurate results Constructing confidence intervals
Use rerun() function within a loop create("sim") call must be inside that
loop
Overview of sim function extern "C" void sim() { // sim process runcount = 0; while(runcount < NRUNS) { create("sim"); // make it a process // usual contents of sim process
rerun(); runcount++; } // while report_hdr(); // produce statistics report} // sim
Global includes and defines
#include <cpp.h> // CSIM C++ header file#define NDISKS 5 // number of disks in
array#define NYEARS 5 // useful lifetime of array#define MTTF 300000.0 // disk MTTF#define MTTR 24.0 // disk MTTR #define NRUNS 100000 // no of runs
void disk(int i);
Global declarations
int nfailed; // number of failed disksint runcount = 0; // number of runsint ndatalosses = 0; //counterdouble lifetime = NYEARS * 365 * 24;// simulation duration
Sim function (I)extern "C" void sim() { // sim process int i; runcount = 0; while(runcount < NRUNS) { create("sim"); // make this a process dataloss = new event("dataloss"); dataloss->clear(); nfailed = 0;
Sim function (II) // create NDISKS disk processes for (i=0; i < NDISKS; i++){ disk(i); } // for dataloss->timed_wait(lifetime); if (simtime() < lifetime) { ndatalosses++; } // if rerun(); runcount++; } // while
Sim function (III) report_hdr(); // produce statistics report printf("Array lifetime %fd years", NYEARS); printf(“or %f hours\n", lifetime); printf("Sim time %f\n", simtime()); printf("Disk MTTF %f hours\n", MTTF); printf("Disk MTTR %ff hours\n", MTTR); printf(“Completed runs %d\n", runcount); printf(“Data lossese%d\n", ndatalosses); } // sim
Disk process (I)void disk(int i) { create("disk"); while(simtime() < lifetime ||
dataloss) { hold(exponential(MTTF));
// disk failed nfailed++;
Disk process (II) if (nfailed == 2) { dataloss = 1; failtime = simtime(); finish->set(); terminate(); } // if hold(MTTR); // repair process nfailed--; // disk is replaced } // while} // disk
Simulation Outcome C++/CSIM Simulation Report (Version 19.0 for Linux x86) Tue Apr 22 10:13:14 2008
Ending simulation time: 0.000 Elapsed simulation time: 0.000 CPU time used (seconds): 0.000
Array lifetime 5 yearswhich corresponds to 43800 hoursSimulated time 0Disk MTTF 300000 hoursDisk MTTR 24 hoursNumber of runs completed 100000Number in runs ending in data loss 17
Discussion Whole program took less than 98 seconds
on linux03 server Simulated time should be equal to
43,800 hours, not zero. rerun() artifact?
We observe 17 failures out of 100,000 runs Data survival rate is 99.983 percent
Three nines
Confidence intervals Data loss rate and survival rate are
distributed according to a binomial law Since n = 100,000, the distributions of both
proportions are approximately normal Will use
1
1
4)ˆ1(ˆ
2ˆ
1
4)ˆ1(ˆ
2ˆ
22/
2
22/
2/
22/
22/
2
22/
2/
22/
nz
nz
nqqz
nzq
q
nz
nz
nqqz
nzq
P
CI for data survival rate s ŝ =99.983% or 0.99983 95% CI is
0.999849 ± 0.000083 [0.999766, 0.999932]between three and four nines
What are nines? A measure of system reliability/availability
99% is two nines 99.9% is three nines
More formally, we compute-log10 (1.0 – x)
Our confidence interval could then be expressed as
[3.63 nines, 4.17 nines]
CI for data loss rate0.0000189 ± 0.000088[0.0000106, 0.000273]
Analytical approach Will use a Markov model:
Works since failure rates and repair rates are distributed according to exponential laws
Obviously not true for repair rates
The model
State label indicates number of failed drives
Failure state is an absorbing state
Using ergodic hypothesis
Assume that array is returned to original state after each data loss
Failure rate will be 4p1
Model equations 5λp0=(μ+ 4λ)p1
po + p1 = 1 Solution is
Failure rate is
9
5,94
10 pp
9204
2
1pL
Computing five year survival Decay is
For MTTF = 300,000 hours,MTTR = 24 hours and t = 5 years
Data survival rate is 0.999767 or3.632 nines
Barely inside the C.I.
)920exp(
2
te Lt
Why? Markov model assumed repair times
were exponentially distributed Required by the technique
Simulation model assumed deterministic repair times Somewhat more realistic
Difference illustrates a major limitation of stochastic approach
Checking the explanation (I) We repeat the simulation using repair
times that are exponentially distributed We observe 19 data losses out of
100,000 runs qhat is 0.99981 Confidence interval is
(3.588 nines, 4.080 nines)
Checking the explanation (II) We repeat a second time the simulation
using repair times that are exponentially distributed and collecting more data We observe 95 data losses out of
400,000 runs qhat is 0.999763 Confidence interval is
(3.552 nines, 3.734 nines) Markov model predicted 3.632 nines
Comparing both approaches Which is the best approach in terms of
Generality and flexibility?
Provided results?
Discrete simulation and stochastic modeling complement each other
Extensions (I) Other repair time distributions
Exponential ( to compare with results of stochastic analysis)
Ad hoc (80% of repairs within one day, 20% within two days)
Taking accounts of differences between day and night, weekdays and weekends
Will map simtime() into a calendar
Extensions (II) Other disk arrays
Mirrored disks: NDISKS = 2 RAID level 6: tolerate two disk failures Triplicate disks: tolerate failure of two
out of three disks SSPiRAL arrays
See next slide Will require more work
SSPiRAL arrays
A
ABC
B
DAB
C
BCD
D
CDA
Resists all triple failures and most quadruple failures
CONCLUSION
CONCLUSION Statistical analysis of outputs is an
important aspect of simulation Standard statistical techniques require
independent observations w/o autocorrelation Simplest solution is to use batch
means CSIM tools simplify constructions of
confidence intervals for all sorts of means