Probability distribution functions


Normal distribution
Lognormal distribution
Mean, median and mode
Tails
Extreme value distributions

Normal (Gaussian) distribution

Probability density function (PDF)

What does the figure tell us about the cumulative distribution function (CDF)?
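The slide's equation did not survive in this transcript. For reference, the standard forms of the normal PDF and CDF are given below, assuming the usual symbols (mean \mu and standard deviation \sigma), which may differ from the notation on the original slide.

```latex
\begin{align*}
f(x) &= \frac{1}{\sigma\sqrt{2\pi}}
        \exp\!\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right], \\
F(x) &= \int_{-\infty}^{x} f(t)\,dt
      = \frac{1}{2}\left[1+\operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right].
\end{align*}
```

Since the CDF is the running integral of the PDF, it rises most steeply where the PDF peaks and flattens out in the tails.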


More on the normal distribution


Estimating mean and standard deviation

Given a sample from a normally distributed variable, the sample mean is the best linear unbiased estimator (BLUE) of the true mean.

For the variance, the equation gives the best unbiased estimator, but its square root is not an unbiased estimate of the standard deviation.

For example, for a sample of 5 from a standard normal distribution, the standard deviation will be estimated on average as 0.94 (and the estimate itself has a standard deviation of 0.34).
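The numbers quoted above can be checked with a short Monte Carlo experiment. This is only a sketch: the seed, the number of repetitions N and the variable names are assumptions, not taken from the slides.

```matlab
% Monte Carlo check of the bias in the sample standard deviation.
rng(0);              % fixed seed (arbitrary choice) for repeatability
n = 5;               % sample size used in the example above
N = 100000;          % number of repeated samples (assumed value)
x = randn(n, N);     % each column is one sample of size n from N(0,1)
s = std(x);          % standard deviation of each column (n-1 normalization)
v = var(x);          % unbiased variance estimate of each column
fprintf('mean of s   = %.3f\n', mean(s));
fprintf('std of s    = %.3f\n', std(s));
fprintf('mean of s^2 = %.3f\n', mean(v));
```

The first two printed values should land close to the 0.94 and 0.34 quoted above. The mean of the variance estimates should be close to 1, which illustrates that the variance estimate is unbiased even though its square root is not.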


Lognormal distribution


Question

What is your estimate of the mode (that is, the most common income)? Of the median?

Mean, mode and median

Light and heavy tails

Fitting distribution to data

Usually fit the CDF to minimize the maximum distance (Kolmogorov-Smirnov test).
Generated 20 points from N(3, 1²).
Normal fit: N(3.48, 0.93²).
Lognormal fit: lnN(1.21, 0.26²).
Almost the same mean and standard deviation.
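The maximum distance mentioned above can be computed directly from a sorted sample. The following is a minimal sketch, assuming the Statistics Toolbox functions already used in these notes (fitdist, cdf) plus kstest for comparison. The seed and variable names are illustrative, and the random sample is not the one listed in the notes.

```matlab
% K-S distance between an empirical CDF and a candidate fitted CDF.
rng(1);                          % arbitrary seed for repeatability
x  = sort(randn(20,1) + 3);      % 20 points from N(3, 1^2), sorted
n  = numel(x);
pd = fitdist(x, 'normal');       % candidate distribution fitted to the data
Fc = cdf(pd, x);                 % candidate CDF evaluated at the data points
% The empirical CDF jumps from (i-1)/n to i/n at x(i), so the largest
% gap must be checked on both sides of every jump.
i  = (1:n)';
D  = max(max(i/n - Fc), max(Fc - (i-1)/n));
[~, ~, ks] = kstest(x, 'CDF', pd);   % toolbox statistic for comparison
fprintf('K-S distance: %.4f (kstest: %.4f)\n', D, ks);
```

The hand-computed D and the statistic returned by kstest should agree. Only the distance is used here, not the test's p-value, which assumes the candidate distribution was not fitted to the same data.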

Given sampling data, we fit a distribution by finding a CDF that is close to the experimental CDF. Usually, we use the Kolmogorov-Smirnov (K-S) criterion, which is the maximum difference between the two CDFs. Here this is illustrated by first generating a sample of twenty points from N(3, 1²):

3.4263 4.0990 3.6194 2.2412 3.0901 2.5178 3.1540 5.3013 4.0712 5.5182
2.6944 2.9772 3.8018 2.6601 3.1646 3.7553 3.2361 3.5960 2.0353 4.6775

The figure shows the experimental CDF in blue, together with its lower and upper 95% confidence bounds (the ecdf default, also in blue). The normal fit is shown in red and the lognormal fit in green. The normal fit, N(3.48, 0.93²), indicates a 16% error in the mean and a 7% error in the standard deviation compared to the distribution used to generate the data. The lognormal fit has almost the same mean and standard deviation, but it is substantially different in the tail. Surprisingly, by the K-S test it is a better fit to the data than the normal. However, in view of the large uncertainty bounds on the experimental CDF, this is entirely believable.

The Matlab sequence used to generate the fit and plot is

    x=randn(20,1)+3; [ecd,xe,elo,eup]=ecdf(x);
    pd=fitdist(x,'normal')      % output: mu = 3.481862, sigma = 0.927932
    pd=fitdist(x,'lognormal')   % output: mu = 1.21473, sigma = 0.262613
    xd=linspace(1,8,1000); cdfnorm=normcdf(xd,3.4819,0.92793); cdflogn=logncdf(xd,1.2147,0.26261);
    plot(xe,ecd,'b','LineWidth',2); hold on; plot(xd,cdflogn,'g','LineWidth',2)
    plot(xd,cdfnorm,'r','LineWidth',2); xlabel('x'); ylabel('CDF')
    % confidence bounds are drawn before legend so they get no legend entries
    plot(xe,elo,'b','LineWidth',1); plot(xe,eup,'b','LineWidth',1)
    legend('experimental','lognormal','normal','Location','SouthEast')

Extreme value distributions

No matter what distribution you sample from (provided its variance is finite), the mean of the sample tends to be normally distributed as the sample size increases (with what mean and standard deviation?).
Similarly, the minimum (or maximum) of samples follows other distributions.
Even though there is an infinite number of distributions, there are only three extreme value distributions:
Type I (Gumbel): derived from the normal.
Type II (Frechet): e.g. maximum daily rainfall.
Type III (Weibull): weakest-link failure.
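The opening remark on this slide, that sample means tend toward a normal distribution, can be checked numerically. The sketch below uses a uniform parent distribution. The sample size n, the repetition count N and the seed are assumed values chosen only for illustration.

```matlab
% Sample means of a non-normal (uniform) variable.
rng(2);                        % arbitrary seed for repeatability
n = 30;  N = 100000;           % sample size and number of samples (assumed)
x = rand(n, N);                % uniform on (0,1): mean 1/2, std 1/sqrt(12)
m = mean(x);                   % one sample mean per column
fprintf('mean of sample means = %.4f (parent mean %.4f)\n', mean(m), 0.5);
fprintf('std of sample means  = %.4f (parent std/sqrt(n) %.4f)\n', ...
        std(m), 1/sqrt(12*n));
hist(m, 50)                    % histogram of the sample means
```

The sample means keep the parent mean, and their standard deviation is the parent standard deviation divided by the square root of the sample size, which is the same fact the exercises rely on for normal variables.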

Maximum of normal samples

With a normal distribution, the maximum of a sample is more narrowly distributed than the original distribution.

Max of 10 standard normal samples. 1.54 mean, 0.59 standard deviation

Max of 100 standard normal samples. 2.50 mean, 0.43 standard deviation

The normal distribution decays exponentially, that is, it has a light tail. Therefore, when you take the maximum of a set of samples, its distribution is narrower than the original distribution. This is illustrated here for the case of 10 samples and 100 samples drawn from the standard normal distribution. The left histogram and the values of the mean and standard deviation are obtained with the Matlab sequence:

    x=randn(10,100000); maxx=max(x);
    hist(maxx,50)
    mean(maxx)
    std(maxx)

We see that by the time we use 100 samples, the maximum has a standard deviation of only 0.43, compared to 1 for the original distribution.

Gumbel distribution

Mean, median, mode and variance

For a large number of samples, the minimum of normal samples converges to a distribution called the Type I Extreme Value distribution, or the Gumbel distribution. The slide provides its PDF and CDF, and its mean, median and mode.
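The slide's formulas are not preserved in this transcript. For reference, the standard expressions for the Gumbel distribution of a minimum are given below, assuming a location parameter \mu and a scale parameter \sigma, as in Matlab's 'ev' distribution. The original slide may use different symbols.

```latex
\begin{align*}
f(x) &= \frac{1}{\sigma}\exp\!\left(\frac{x-\mu}{\sigma}\right)
        \exp\!\left[-\exp\!\left(\frac{x-\mu}{\sigma}\right)\right], \\
F(x) &= 1-\exp\!\left[-\exp\!\left(\frac{x-\mu}{\sigma}\right)\right], \\
\text{mean} &= \mu-\gamma\sigma, \qquad \gamma\approx 0.5772, \\
\text{median} &= \mu+\sigma\ln(\ln 2), \\
\text{mode} &= \mu, \\
\text{variance} &= \frac{\pi^{2}}{6}\,\sigma^{2}.
\end{align*}
```

As a quick check against the fit reported further on in these notes (mu = -2.30676, sigma = 0.36862), the mean formula gives about -2.52, close to the negative of the 2.50 mean quoted for the maximum of 100 standard normal samples.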

Note that the distribution is defined for the minimum of a sample. If we want the distribution of the maximum of a sample, we work with the minimum of the negated sample. This was done in fitting a distribution to the maximum of samples of size 10 and 100 drawn from the standard normal distribution: 100,000 such samples were drawn, and the negatives of their maxima were fitted to the Gumbel distribution. The left figure shows that the CDF for samples of 10 is markedly different from the Gumbel, but for samples of 100 they agree quite well.

The Matlab sequence for the 100 samples was as follows (the values of mu and sigma were output by fitdist and then entered by hand to define them).

    x=randn(100,100000); maxx=max(x); fitdist(-maxx','ev')
    % output: extreme value distribution, mu = -2.30676, sigma = 0.36862
    mu=-2.30676; sigma=0.36862;
    [F,X]=ecdf(-maxx);
    plot(X,F,'r'); hold on
    xd=linspace(-5.3,-1,1000); evcd=evcdf(xd,mu,sigma);
    plot(xd,evcd); legend('-max100 data','fitted ev1')

Weibull distribution

Probability distribution
Its log has a Gumbel distribution.

Used to describe the distribution of strength or fatigue life in brittle materials.
If it describes time to failure, then k>1 indicates an increasing failure rate.
A third parameter can be added by replacing x by x-c.
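The equation on the slide is missing from this transcript. For reference, the two-parameter Weibull distribution in its common form is sketched below, assuming a scale parameter \lambda and the shape parameter k mentioned above. The failure rate h(x) shows why the value of k controls whether failures become more or less likely over time.

```latex
\begin{align*}
f(x) &= \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}
        \exp\!\left[-\left(\frac{x}{\lambda}\right)^{k}\right], \qquad x \ge 0, \\
F(x) &= 1-\exp\!\left[-\left(\frac{x}{\lambda}\right)^{k}\right], \\
h(x) &= \frac{f(x)}{1-F(x)}
      = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}.
\end{align*}
```

The failure rate h(x) increases with x when k>1, is constant when k=1, and decreases when k<1.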

The Gumbel distribution, being the limiting case for the normal, has a fairly light tail. The Weibull is a heavier-tailed limiting distribution. This can also be seen from the fact that the logarithm of a Weibull variable obeys the Gumbel distribution. The Weibull is also called the Type III Extreme Value distribution. The specific equation for the PDF and the figure are taken from Wikipedia.
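The relation between the Weibull and Gumbel distributions can be checked numerically along the following lines. This is a sketch, not the sequence used for the slide's figure: the Weibull parameters a and b, the sample size and the seed are assumed values.

```matlab
% Logarithm of Weibull samples compared with a fitted Gumbel (Matlab 'ev').
rng(3);                          % arbitrary seed for repeatability
a = 2;  b = 1.5;                 % Weibull scale and shape (assumed values)
x  = wblrnd(a, b, 10000, 1);     % Weibull sample
y  = log(x);                     % logarithm of the sample
pd = fitdist(y, 'ev');           % Type I extreme value fit to log(x)
fprintf('mu    = %.3f (log(a) = %.3f)\n', pd.mu, log(a));
fprintf('sigma = %.3f (1/b    = %.3f)\n', pd.sigma, 1/b);
[F, X] = ecdf(y);
plot(X, F, 'r'); hold on
plot(X, evcdf(X, pd.mu, pd.sigma), 'b');
xlabel('log(x)'); ylabel('CDF')
legend('log of Weibull sample', 'fitted ev', 'Location', 'SouthEast')
```

If the relation holds, the fitted location should be close to log(a) and the fitted scale close to 1/b, and the two CDF curves should nearly coincide.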

The figure on the right shows the variety of shapes of the Weibull PDF. The figure on the left compares the experimental CDF of the logarithm of a sample generated from the Weibull distribution (Matlab wblrnd) with the Gumbel distribution (Matlab ev) fitted to the sample. The excellent agreement confirms the relation between Weibull and Gumbel.

Exercises

Estimate how much rain Gainesville will have in 2014, as well as the aleatory and epistemic uncertainty in your estimate.

Find how many samples of normally distributed numbers you need in order to estimate the mean with an error that will be less than 5% of the true standard deviation 90% of the time. Use the fact that the mean of a sample of a normal variable has the same mean and a standard deviation that is reduced by the square root of the number of samples.

Both the lognormal and Weibull distributions are used to model strength. Fit 100 data points generated from a standard lognormal distribution by both lognormal and Weibull distributions. Repeat with 5 randomly generated samples. In each case measure the distance using the K-S distance, and translate the result to a sentence of the following format: The maximum difference between the two CDFs is at x=2, where the true probability of x
