Probability Distributions and Descriptive Statistics

Learning Objectives

1. Distinguish between discrete and continuous distributions.

2. Use Mathematica to call up the PDF and CDF for a variety of distributions, and use them to answer questions regarding probabilities.

3. Identify situations in which the Poisson distribution is relevant.

4. Given a data set, compute the three measures of central tendency: mean, median and mode.

5. Given a data set, compute two measures of dispersion: variance (and standard deviation), and interquartile range.

6. Given a data set, compute skewness and kurtosis, and interpret the results.

7. Use DistributionFitTest to test whether a data set is drawn from a normal distribution.

Overview

The field of data analysis is divided into two broad areas: descriptive statistics and statistical inference. Descriptive statistics involves techniques for summarizing data sets so as to highlight their salient features. Statistical inference, on the other hand, involves drawing conclusions from data in a precise and systematic way. In a traditional statistics course, statistical inference is the primary focus. An example of statistical inference is determining whether two populations differ, given limited samples from each. A null hypothesis that there is no difference is tested. The answer often takes the form of a p-value, which is the probability of seeing a difference between the two samples at least as large as the one observed if the null hypothesis were true. A small p-value thus allows one to reject the null hypothesis that there is no difference.

In fields in which data is scarce, sophisticated methods of statistical inference are of the utmost importance. For example, in testing a new cancer drug, the trials are very expensive, so the researchers will look for reliable techniques that allow them to draw conclusions with as little data as possible. Another area in which statistical inference is important is quality control. One of the pioneers of statistical inference, William Gosset, was a chemist in the Guinness brewery who was responsible for ensuring a consistent quality of the beer. He needed to find a way of determining the characteristics of a batch without testing every bottle. Gosset wrote a paper in 1908, under the pseudonym Student, to describe his method. His paper includes the sampling distribution which has come to be known as Student's t-distribution.

In the modern business environment, in which transactions and other bits of information are recorded electronically, the role of statistical inference is of less importance than it was previously. In many contexts, data is plentiful rather than scarce. A common lament is that businesses now have more data than they know what to do with. In this new environment, descriptive statistics and its ally, data visualization, come to the fore. With so much data available, it is necessary to sift through it in a systematic way in order to narrow down a set of questions to investigate further. The term exploratory data analysis is sometimes used to describe this activity.

Probability distributions come into play in both descriptive statistics and statistical inference. Distributions can be used to describe data sets. For example, if a relative frequency histogram takes on a bell shape, we might infer that the data were drawn from a normal distribution. In this chapter, we will learn how to use a built-in function to check.

Correctly identifying the probability distribution underlying a data set is important for statistical inference. With limited data, an analyst might need to make an assumption regarding the probability distribution. If the assumption is incorrect, the conclusions drawn from even the most sophisticated techniques of statistical inference will be off the mark, sometimes drastically so. For example, errors made by financial analysts in valuing mortgage securities during the run-up to the financial crisis have been attributed to their tendency to assume that losses follow a normal distribution. If the distribution had been normal, the losses actually experienced would have been virtually impossible. To explain what happened, financial analysts now talk about fat tails, meaning that the tail of the distribution showed a higher probability than what would be expected under a normal distribution. Familiarity with non-normal distributions, and the ability to identify them, is important in order to avoid such mistakes.

In this chapter, we begin with a discussion of probability distributions, covering the key ones that you will likely encounter in your data analysis work, including those that are important in statistical inference. Next we discuss the three types of measures used in descriptive statistics: location, dispersion and shape. Using a data set drawn from the built-in financial data collection, we show how the measures can be computed.

Probability Distributions

Probability distributions are divided into two broad classes: those that apply when the data is continuous and those that apply when the data is discrete. In business applications, the data is usually discrete. Prices and quantities are measured in discrete units. But often the data, if it is finely grained, can be approximated by a continuous distribution.

A continuous univariate distribution can be represented by a function known as a PDF, or probability density function. The integral of the PDF from the lower end of its range up to a given value is known as the CDF, or cumulative distribution function. The argument of the functions represents a random variable and is usually denoted by x. In data analysis, the random variable is the data point. The functions also involve constants that do not vary with the data selected. These constants are called parameters.
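Stated as formulas, with f for the PDF and F for the CDF (symbols chosen here for illustration, not taken from the text), the relationship for a continuous distribution can be written as:

```latex
F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(t)\,dt,
\qquad
\Pr(a < X \le b) = F(b) - F(a) = \int_{a}^{b} f(t)\,dt
```

The second identity is why a probability over a range of values can be found either by integrating the PDF or by differencing the CDF.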

In this chapter we will look at four distributions that describe a large variety of data sets. The most common distribution in nature, and in the social sciences, is the normal distribution, also known as the Gaussian distribution. The binomial distribution is a discrete distribution that applies when each outcome can take only two values, for example making a sale or not making a sale. The Poisson distribution is important when counting how many events occur in a fixed interval of time or space. And finally, the lognormal distribution is relevant for variables that experience compound growth over time. Wealth, income and market shares tend to follow this distribution.

Normal Distribution

With the normal distribution the parameters are the mean, μ, and the standard deviation, σ. (To input the Greek letters, type esc, the corresponding roman letter, esc.) In order to get Mathematica to return the function, it is necessary to specify the parameters and the variable whose probability is being investigated. The following returns the function in symbolic form, that is, without specific values for either the parameters or the variable.

PDF[NormalDistribution[μ, σ], x]

$$\frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma}$$

We can visualize the PDF by selecting values for μ and σ and a range for x. Consider the PDF for women's height, where we consider heights from 36 inches (3 feet) to 84 inches (7 feet).


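Plot[PDF[NormalDistribution[63.5, 2.5], x], {x, 36, 84}, PlotRange -> All, Filling -> Axis]

[Plot: bell-shaped PDF of women's height, with height in inches on the horizontal axis and density (up to about 0.16) on the vertical axis]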

The height of the function at a given x indicates how likely heights near x are, but it is a density rather than a probability. To compute the probability that a woman's height will fall within a range, we need to compute the area under the curve. For example, if we wanted to know the probability that a woman is 64 inches tall, rounded to the nearest inch, we would take the area under the curve between 63.5 and 64.4. To find the area, we need to integrate the function between those two points. First, define the function:

womensHeightPDF = PDF[NormalDistribution[63.5, 2.5], x]

0.159577 e^(-0.08 (-63.5 + x)^2)

Now we can integrate between 63.5 and 64.4:

Integrate[womensHeightPDF, {x, 63.5, 64.4}]

0.140576

To find the probability that a woman will be shorter than 5 feet 6 inches (66 inches), we integrate over the range 0 to 66.

Integrate[PDF[NormalDistribution[63.5, 2.5], x], {x, 0, 66}]

0.841345

To find the probability that a woman is shorter than a given height, we could also use the CDF, or cumulative distribution function. It gives the probability that x takes on a given value or less. So if we wanted to ask what the probability is that a woman will be shorter than 5 feet 6 inches, we can use the following commands:

womensHeightCDF = CDF[NormalDistribution[63.5, 2.5], x]

(1/2) Erfc[0.282843 (63.5 - x)]

p66 = womensHeightCDF /. x -> 66

0.841345
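As a cross-check, the same probabilities can be requested directly with the built-in Probability function instead of integrating by hand. This is a minimal sketch: the names heightDist and h are chosen here for illustration, and the parameters 63.5 and 2.5 are the ones used above.

```mathematica
(* distribution of women's height: mean 63.5 inches, standard deviation 2.5 inches *)
heightDist = NormalDistribution[63.5, 2.5];

(* probability of being shorter than 66 inches *)
Probability[h < 66, h \[Distributed] heightDist]

(* probability of falling between 63.5 and 64.4 inches *)
Probability[63.5 < h < 64.4, h \[Distributed] heightDist]
```

The two results should agree with the values 0.841345 and 0.140576 obtained above from the CDF and from integrating the PDF.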

Plotting the CDF over the same range as we plotted the PDF and showing the value of the CDF at 66 inches,


Plot[CDF[NormalDistribution[63.5, 2.5], x], {x, 36, 84}, Filling -> Axis, AxesOrigin -> {0, 0},
 Epilog -> {Red, Thin, Line[{{0, p66}, {66, p66}, {66, 0}}]}]

[Plot: CDF of women's height rising from 0 to 1, with red guide lines marking the value p66 at 66 inches]

We can also use the CDF to get the probability that x is greater than a given value. It is simply 1 minus the value given by the CDF. Thus the probability that a woman is taller than 5 feet 6 inches is

1 - p66

0.158655

Binomial Distribution

The binomial distribution comes up in situations in which there are two possible outcomes. A good example is a free throw in basketball. The player either succeeds or fails on each throw. Suppose a male basketball player has a 90% chance of success on a free throw. How likely is it that he will get only 75 or fewer successes out of 100 throws? Although the exact answer can be found with the formula given in any textbook on probability and statistics, we can use Mathematica to run simulations with random numbers and discover the answer that way. This approach is known as the Monte Carlo method, after the famous casino in Monaco. We can use the same approach to answer the question of the probability of long streaks of successes.

First create a list of 9 ones and 1 zero, to represent the expected outcome from 10 throws.

freeThrowList = Join[{0}, Table[1, {9}]]

{0, 1, 1, 1, 1, 1, 1, 1, 1, 1}

Now, let's set up a function that will draw randomly from this list to create a random sample. The number of draws is represented by n.

sample[n_] := RandomChoice[freeThrowList, n];

We can use this function to simulate 100 free throws.

sample[100]

{1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1}

Because there are 100 throws, adding together all the 1's gives the percentage of successes.

Total[%]

88

The Monte Carlo method is to repeat the process for a large number of samples. Below are the results for a hundred thousand sets of 100 free throws.

hundredK = Total /@ Table[sample[100], {100000}]

A very large output was generated. Here is a sample of it:

{91, 90, 91, 92, 87, 91, 87, 93, 93, 91, 86, 86, 92, 92, 94, 85, 89, 90, 93, 91, 90, <<99959>>, 91, 90, 89, 90, 92, 91, 90, 92, 95, 94, 89, 90, 96, 91, 91, 90, 94, 89, 88, 90}

A histogram creates a visualization of the output.

Histogram[hundredK]

[Histogram: number of successes per 100 throws across the 100,000 simulated samples, ranging from about 75 to 100 and peaking near 90]

We can check that the simulation returns a mean close to the expected value of 90:

Mean[hundredK] // N

90.0115

Note the use of N to ensure that Mathematica does not return a rational number. Similarly, the standard deviation can be computed and compared with the analytical formula, in which p is the probability of success and n is the sample size:

$$\sqrt{n\,p\,(1-p)}$$

StandardDeviation[hundredK] // N

3.00379

Sqrt[.9*.1*100]

3.
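The same two benchmarks can also be read off the distribution object itself, without typing the formula. A short sketch, where the name binom is introduced here for illustration:

```mathematica
(* binomial distribution for 100 throws with success probability 0.9 *)
binom = BinomialDistribution[100, 0.9];

(* theoretical mean and standard deviation, for comparison with the simulation *)
{Mean[binom], StandardDeviation[binom]}
```

This should return 90 and 3, the values that the simulated mean and standard deviation approximate.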

Now, consider how many of the samples have a success rate of 75% or lower.

Cases[hundredK, x_ /; x < 76]

{75}

To get the probability that there will be 75 or fewer successes in a sample size of 100, recall the formula from an introductory statistics course:

$$\Pr(X \le x) = \sum_{i=0}^{x} \binom{n}{i}\, p^{i} (1-p)^{n-i}$$

Sum[Binomial[100, i] .9^i*.1^(100 - i), {i, 0, 75}]

0.0000130728

Multiply by a hundred thousand to get the number of cases in a hundred thousand samples in which we can expect to see a success rate of 75% or less.

% 10^5

1.30728

To see if we can get close to this result using the Monte Carlo method, let's replicate our simulation of a hundred thousand samples 100 times and count the cases in which the success rate was 75% or lower.

Table[Length[Cases[Total /@ Table[sample[100], {10^5}], x_ /; x < 76]], {100}]

{1, 3, 0, 2, 1, 2, 1, 3, 0, 1, 2, 1, 2, 3, 1, 1, 2, 0, 3, 1, 0, 5, 1, 0, 0, 3, 0, 1, 0, 1, 0, 1, 2, 3, 1, 4, 0, 1, 1, 1, 2, 3, 2, 2, 3, 0, 1, 3, 1, 1, 0, 2, 3, 3, 2, 3, 1, 2, 1, 2, 1, 1, 0, 2, 3, 1, 1, 2, 1, 1, 2, 4, 3, 0, 1, 4, 1, 0, 2, 1, 0, 0, 2, 1, 1, 1, 1, 3, 4, 1, 4, 2, 0, 1, 1, 2, 0, 2, 3, 6}

Histogram[%]

[Histogram: number of samples with 75 or fewer successes in each run of 10^5 samples, across the 100 runs]

Mean[%%] // N

1.58

While Monte Carlo simulations are helpful in developing the intuition behind the binomial distribution, Mathematica has the PDF and CDF for the binomial distribution built in. They can be called up to answer questions about the probability of a number of successes (x) over a number of trials (n), given a probability of success (p). So to answer the question of the probability that over 100 throws a 90% free throw shooter will succeed exactly 75 times, we use

PDF[BinomialDistribution[n, p], x] /. {n -> 100, p -> .9, x -> 75}

8.97293 × 10^-6

To find the probability that he will succeed 75 times or fewer, we use the CDF:

CDF[BinomialDistribution[n, p], x] /. {n -> 100, p -> .9, x -> 75}

0.0000130728

Note that we could get the same result by computing the PDF for values 0 through 75 and summing:

Sum[PDF[BinomialDistribution[n, p], x] /. {n -> 100, p -> .9, x -> i}, {i, 0, 75}]

0.0000130728
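The built-in distribution can also stand in for the hand-built sampler in the Monte Carlo experiment. The sketch below assumes that drawing totals with RandomVariate is an acceptable substitute for summing individual simulated throws; the name totals is introduced here for illustration.

```mathematica
(* 10^5 simulated totals of 100 free throws each, drawn in a single call *)
totals = RandomVariate[BinomialDistribution[100, 0.9], 10^5];

(* number of simulated samples with 75 or fewer successes; varies from run to run *)
Count[totals, t_ /; t <= 75]

(* exact expected count, for comparison *)
10^5 CDF[BinomialDistribution[100, 0.9], 75]
```

Drawing the totals directly avoids building 10^5 lists of 100 throws, so the experiment can be repeated many times at little cost.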

Poisson Distribution

Like the binomial, the Poisson is a discrete distribution. It is used in cases in which one wants to know the probability of a given number of incidents of some event over a period of time, or over some area. For example, it could be used to determine the probability that three students will stop by a professor's office hours within a 15 minute time frame, if one knows the average number of students that stop by every 15 minutes. Or it could be used to determine the probability of having 10 defects in a yard of material if one knows the average number of defects. The average number of incidents is the only parameter in the Poisson distribution.

Assume that on average one student stops by a professor's office hours in a 15 minute period. The probability that 3 will stop by can be determined using the PDF for the Poisson distribution:

PDF[PoissonDistribution[μ], x] /. {μ -> 1, x -> 3}

1/(6 e)

N[%]

0.0613132

The PDF can be visualized using DiscretePlot:

DiscretePlot[PDF[PoissonDistribution[1], x], {x, 0, 7}, PlotMarkers -> Automatic, AxesOrigin -> {-1, 0}]

[Plot: PDF of the Poisson distribution with mean 1, for x from 0 to 7]

To find the probability that more than three students will stop by in a 15 minute period, we can use the CDF to compute the probability of three or fewer, and then subtract that probability from one.

1 - CDF[PoissonDistribution[1], 3]

1 - GammaRegularized[4, 1]

N[%]

0.0189882
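The "one minus the CDF" step is itself available as a built-in, SurvivalFunction. A brief sketch that should reproduce the value above; the variable k is a name chosen here for illustration.

```mathematica
(* probability of more than 3 arrivals when the average is 1 per 15 minutes *)
SurvivalFunction[PoissonDistribution[1], 3] // N

(* the same question phrased as an event *)
Probability[k > 3, k \[Distributed] PoissonDistribution[1]] // N
```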

Lognormal Distribution

Like the normal distribution, the lognormal is a continuous distribution that can be used when the random variable is finely grained. Unlike the normal distribution, however, the lognormal distribution is not symmetrical about the mean. It is skewed to the right, that is, it has a hump to the left of the mean, with a long tail extending out to the right. If a random variable follows a lognormal distribution, then the log of the random variable will follow a normal distribution. The distribution of income and wealth follows approximately a lognormal distribution.

The first economist to recognize the skewed distribution of income was the Italian economist Vilfredo Pareto. His name is associated with the generalization that 20% of a group will be associated with 80% of some other variable. For example, 20% of the population will hold 80% of the wealth, or 20% of one's customers will generate 80% of one's revenues. The rule does not hold perfectly, but as a rough approximation it works often enough to elicit notice.

Plot[PDF[LogNormalDistribution[3, .7], x], {x, 1, 100}, Filling -> Axis]

[Plot: right-skewed lognormal PDF for x from 1 to 100, peaking at a density of about 0.035]

Consider the lognormal distribution represented above. The parameters 3 and 0.7 are the mean and standard deviation of the associated normal distribution. The mean and standard deviation of the lognormal distribution above are:

#[LogNormalDistribution[3, .7]] & /@ {Mean, StandardDeviation}

{25.6617, 20.4058}
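These two numbers follow from the standard closed-form moments of the lognormal distribution. Writing μ = 3 and σ = 0.7 for the mean and standard deviation of the underlying normal distribution, a worked check looks like this:

```latex
\mathrm{E}[X] = e^{\mu + \sigma^2/2} = e^{3 + 0.245} \approx 25.66,
\qquad
\mathrm{SD}[X] = \mathrm{E}[X]\,\sqrt{e^{\sigma^2} - 1}
             = 25.66 \times \sqrt{e^{0.49} - 1} \approx 20.41
```

Note that 0.49 is σ squared, which is consistent with the second parameter, 0.7, being a standard deviation.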

If the distribution above represents the distribution of income in thousands of dollars, we can use it to ask questions like (1) "What percentage of the population earns over $100,000?" or (2) "What level of income puts one in the top 1%?"

For the first question, use the CDF to find the percentage that earns less than $100,000 and subtract from 1.

1 - CDF[LogNormalDistribution[3, 0.7], 100]

0.0109211

The answer to the first question indicates that the answer to the second question will be slightly more than $100,000. To answer the second question we can use the function Solve:

Solve[1 - CDF[LogNormalDistribution[3, 0.7], x] == .01, x]

Solve::ratnz : Solve was unable to solve the system with inexact coefficients. The answer was obtained by solving a corresponding exact system and numericizing the result.

{{x -> 102.355}}
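Since the top 1% threshold is by definition the 99th percentile, the same figure can be obtained without Solve, and so without the Solve::ratnz message. A minimal sketch using two built-in functions:

```mathematica
(* 99th percentile of income, in thousands of dollars *)
Quantile[LogNormalDistribution[3, 0.7], 0.99]

(* equivalent formulation through the inverse of the CDF *)
InverseCDF[LogNormalDistribution[3, 0.7], 0.99]
```

Both calls should return a value of about 102.355, matching the result from Solve.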

Descriptive Statistics

Location

To illustrate the descriptive statistics functions, let's first create a data set of monthly returns over the years 2000 to 2011 for the S&P 500 index, Goldman Sachs and Walmart. Since it will be a lot of output, we suppress it by ending the command with a semicolon.

ch4d = FinancialData[#, "Return", {{2000}, {2011}, "Month"}, "Value"] & /@ {"SP500", "GS", "WMT"};

To see the dimensions of the data set, we can first check the number of "rows" or records. It should be three, since we are considering three different series.

Length[ch4d]

3

To check the number of columns, we need to map Length onto the data set. This command will count up the contents of each row.

Length /@ ch4d

{131, 131, 131}

Measures of location include measures of central tendency such as the mean and median. Both of these functions are built in, but we can also create functions ourselves. We start the name with a lowercase letter to distinguish our function from the one that is built in.

mean[l_List] := Plus @@ l/Length[l]

mean /@ ch4d

{0.00034434, 0.0105041, 0.00255785}

Comparing with the built-in function:

mean /@ ch4d == Mean /@ ch4d

True

Recall that the median is the value at the 50th percentile. To find it we need to sort the data and then find the middle value. If there is an even number of values, the two middle values need to be averaged. Since there are two steps, it makes sense to write a little program. With Module we start with a list of the variables that will be defined in the program. They are enclosed in curly braces and separated from the rest of the program with a comma.

median[l_List] := Module[{data, n},
  data = Sort[l]; (* sorts the data from lowest to highest *)
  n = Length[l]; (* counts up the number of values *)
  If[EvenQ[n], Mean[{data[[n/2]], data[[n/2 + 1]]}], data[[(n + 1)/2]]]]

median /@ ch4d

{0.00712851, 0.0155272, 0.00517751}

median /@ ch4d == Median /@ ch4d

True

Percentiles can be computed using the function Quantile. The desired percentile is put in as the second argument. The following gives the 25th, 50th and 75th percentiles for the S&P 500 returns.

Quantile[ch4d[[1]], {.25, .50, .75}]

{-0.0218461, 0.00712851, 0.031508}

Nearly the same result can be generated using the function Quartiles, which interpolates between neighboring data points and so gives slightly different values for the first and third quartiles.

Quartiles[ch4d[[1]]]

{-0.0215762, 0.00712851, 0.031119}
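The small gap between the two results comes from the interpolation rule each function uses. To the best of my reading of the Quartiles documentation, Quartiles corresponds to Quantile called with the interpolation parameters {{1/2, 0}, {0, 1}}; treat that correspondence as an assumption to verify. The sketch below checks it on a small made-up list, where toyData is hypothetical.

```mathematica
toyData = {2, 4, 7, 9, 12, 15, 21, 30}; (* hypothetical toy data *)

(* default Quantile always returns one of the data points *)
Quantile[toyData, {1/4, 1/2, 3/4}]

(* Quartiles interpolates between neighboring data points *)
Quartiles[toyData]

(* should be True if the documented equivalence holds *)
Quartiles[toyData] == Quantile[toyData, {1/4, 1/2, 3/4}, {{1/2, 0}, {0, 1}}]
```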

For continuous data, computing BinCounts, or alternatively looking at a histogram, is helpful in determining the mode. Selecting the correct bin width is often a trial and error process.

BinCounts[#, .01] & /@ ch4d

{{1, 0, 0, 0, 0, 1, 1, 2, 5, 3, 5, 2, 2, 6, 8, 12, 8, 15, 21, 6, 12, 2, 7, 3, 3, 4, 2},
 {1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1, 0, 0, 4, 1, 3, 3, 2, 2, 3, 5, 5, 3, 3, 1, 7, 2, 8, 5, 8, 8, 3, 6, 8, 5, 3, 1, 3, 3, 5, 3, 4, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1},
 {1, 1, 1, 0, 1, 2, 0, 2, 2, 4, 8, 7, 6, 6, 8, 9, 15, 14, 8, 6, 7, 3, 6, 3, 4, 1, 3, 0, 0, 0, 1, 1, 1}}

Histogram[#, {.01}] & /@ ch4d

[Histograms: monthly returns with bin width 0.01 for the S&P 500, Goldman Sachs and Walmart]

Using a bin width of 0.01 reveals single modes for the S&P 500 and Walmart (the first and third data sets). From the histogram, we see that doubling the bin width for Goldman Sachs (the second data set) will result in a single modal value.

BinCounts[ch4d[[2]], .02]

{1, 0, 0, 2, 1, 1, 4, 4, 5, 5, 10, 6, 8, 10, 13, 11, 14, 8, 4, 8, 7, 3, 1, 0, 1, 1, 0, 0, 1, 2}

Histogram[ch4d[[2]], {.02}, LabelingFunction -> Above]

[Histogram: Goldman Sachs monthly returns with bin width 0.02, each bar labeled with its count; the tallest bar has a count of 14]

Visual inspection reveals a modal bin between 0.04 and 0.06. Verifying:

Count[ch4d[[2]], x_ /; 0.04 < x < .06]

14

For discrete data points, the function Commonest can be used to find the modal value.
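A short illustration of Commonest on made-up discrete data (the lists are hypothetical, for example units sold per order):

```mathematica
(* a single mode: 7 occurs three times *)
Commonest[{3, 5, 5, 7, 7, 7, 9}]
(* returns {7} *)

(* ties are all reported: 1 and 2 each occur twice *)
Commonest[{1, 1, 2, 2, 3}]
(* returns {1, 2} *)
```

Because the result is always a list, a data set with several equally common values is easy to spot.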

Dispersion

Variance and standard deviation are the most common measures of dispersion.

Variance /@ ch4d
StandardDeviation /@ ch4d

{0.00223741, 0.0104518, 0.0032351}

{0.0473013, 0.102234, 0.0568779}

Note that the variance and standard deviations are the sample (not the population) measures.
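The difference between the sample and population measures is only the divisor: n - 1 for the sample variance, n for the population variance. A sketch on a small made-up list; toyData and the other names are hypothetical.

```mathematica
toyData = {2, 4, 4, 4, 5, 5, 7, 9}; (* hypothetical toy data with mean 5 *)
n = Length[toyData];

(* built-in sample variance: sum of squared deviations divided by n - 1 *)
sampleVar = Variance[toyData]
(* 32/7 *)

(* population variance: rescale so the divisor is n *)
popVar = sampleVar (n - 1)/n
(* 4 *)

(* the second central moment also divides by n *)
popVar == CentralMoment[toyData, 2]
(* True *)
```

With 131 monthly observations, as in the returns data, the two versions differ by less than one percent.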

The interquartile range is another measure of dispersion. To compute the interquartile range, subtract the 25th percentile (or first quartile) from the 75th percentile (or third quartile).

Quartiles[#][[3]] - Quartiles[#][[1]] & /@ ch4d

{0.0526952, 0.125918, 0.0671541}

Alternatively, we can use the built-in function:

InterquartileRange /@ ch4d

{0.0526952, 0.125918, 0.0671541}

Shape

The shape of a distribution of data refers to the shape of a histogram. Skewness refers to an absence of symmetry about the mean. A positively skewed distribution has a right hand tail that extends out and a hump to the left of the mean. The lognormal distribution is illustrated below.

[Figure: PDF of a lognormal distribution with the position of the mean marked]

Negatively skewed distributions have a left hand tail that extends out and a hump to the right of the mean.

The function Skewness can be applied both to data sets and to distributions.

Skewness[NormalDistribution[0, 1]]
Skewness[LogNormalDistribution[0, 1]] // N

0

6.18488

Applying the function to our data set:

Skewness /@ ch4d

{-0.529075, 0.149054, -0.0217862}

As we could see from the histograms, they are not perfectly symmetrical but the degree of skewness is small.

Kurtosis refers to the peakedness of the distribution. If the kurtosis measure is greater than three, then we can conclude that, relative to a normal distribution, the distribution has a sharper peak and heavier tails. In other words, there is less probability mass in the area between, sometimes called the shoulders or flanks of the distribution. As with Skewness, the function Kurtosis can be applied to symbolic distributions as well as to data.

Kurtosis[NormalDistribution[0, 1]]

3

Applying it to our data set:

Kurtosis /@ ch4d

{3.61981, 3.62201, 3.55668}
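Both shape measures are ratios of central moments, which can be checked by hand on a small made-up list. The sketch assumes the definitions given in the Skewness and Kurtosis documentation (third and fourth central moments scaled by powers of the second); toyData, skew and kurt are names introduced for illustration.

```mathematica
toyData = {2, 4, 4, 4, 5, 5, 7, 9}; (* hypothetical toy data *)

(* third central moment scaled by the 3/2 power of the second *)
skew = CentralMoment[toyData, 3]/CentralMoment[toyData, 2]^(3/2);

(* fourth central moment scaled by the square of the second *)
kurt = CentralMoment[toyData, 4]/CentralMoment[toyData, 2]^2;

(* compare with the built-in functions; both entries should be True *)
{skew == Skewness[toyData], kurt == Kurtosis[toyData]}

(* excess kurtosis: positive values indicate heavier tails than the normal *)
kurt - 3
```

Subtracting 3 puts the normal distribution at zero, which makes the three returns series, with excess kurtosis of roughly 0.6, easier to compare.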

StandardDeviation[ch4d[[1]]]

0.0473013

DistributionFitTest

Examining the skewness and kurtosis can give us an idea as to the underlying distribution. For example, a small skewness measure and a measure of kurtosis close to 3 suggest that the distribution is normal. Alternatively, one can use the built-in function DistributionFitTest, which performs a goodness of fit test under the null hypothesis that the data was drawn from a specified distribution. The command returns p-values for various tests. Small p-values indicate that it is highly unlikely that the data set was drawn from the specified distribution.

Consider the S&P 500 data:

sp = ch4d[[1]];

To investigate whether it is normally distributed, we can create an object containing the hypothesis test data with:

spObject = DistributionFitTest[sp, NormalDistribution[μ, σ], "HypothesisTestData"];

The properties of the object can be found with:

spObject["Properties"]

{AllTests, AndersonDarling, AutomaticTest, CramerVonMises, DegreesOfFreedom, DistanceToBoundary, FittedDistribution, FittedDistributionParameters, HypothesisTestData, KolmogorovSmirnov, Kuiper, MardiaCombined, MardiaKurtosis, MardiaSkewness, PearsonChiSquare, PValue, PValueTable, ShapiroWilk, ShortTestConclusion, SzekelyEnergy, TestConclusion, TestData, TestDataTable, TestStatistic, TestStatisticTable, WatsonUSquare}

To see all the hypothesis test results:

spObject["TestDataTable", All]

Test                  Statistic    P-Value
Anderson-Darling      1.04419      0.00889915
Cramér-von Mises      0.191412     0.00635848
Jarque-Bera ALM       9.10397      0.0354042
Kolmogorov-Smirnov    0.081682     0.0315315
Kuiper                0.132955     0.0346651
Pearson χ²            26.4427      0.00928683
Shapiro-Wilk          0.97413      0.0132001
Watson U²             0.171451     0.00795134

All eight p-values fall below 0.05, which indicates that it is unlikely that the data were drawn from a normal distribution.
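To get a feel for how the test behaves, it can be run on simulated data whose distribution is known. This is a sketch under stated assumptions: the seed, the sample size of 500 and the variable names are arbitrary choices made here, and when no distribution is named, DistributionFitTest tests against a normal distribution and returns a single p-value.

```mathematica
SeedRandom[1]; (* arbitrary seed so the run is repeatable *)

(* one sample that really is normal, one that is strongly right-skewed *)
normalSample = RandomVariate[NormalDistribution[0, 1], 500];
skewedSample = RandomVariate[LogNormalDistribution[0, 1], 500];

(* p-values for the null hypothesis of normality, one per sample *)
DistributionFitTest /@ {normalSample, skewedSample}
```

One would expect a p-value that is not small for the first sample and a very small one for the second, although any single simulated sample can occasionally mislead.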
