2. Data analysis in one variable - Kansas State …agondem/Statistics_I_files/2...Descriptive Statistics The sample and the variables produce some data (the values of the variables

2. Data analysis in one variable

Descriptive Statistics

The sample and the variables produce some data (the values of the variables when applied to the sample). We want to understand these data. First, one variable at a time!

1. Construct frequency tables.

2. Construct graphs.

3. Compute statistics.

Each of these options gives a different understanding of the data.

Different types of data require different types of analysis!

Descriptive Statistics

Frequency

Distribution of frequencies.

Graphs

Bar charts, pie charts. Stem-and-leaf diagrams (stemplot). Histograms, frequency polygons. Boxplot.

Statistics (numerical values)

Center values: mean, median and mode. Position values: quartiles, percentiles... Spread values: range, interquartile range, standard deviation, variance. Symmetry values: coefficients of symmetry. Kurtosis: kurtosis coefficient.

2.1 Frequencies

Frequency

Idea: count how often each value appears in our data.

It depends on the type of data (type of variable).

Categorical: absolute and relative frequencies.

Discrete: absolute and relative frequencies; cumulative absolute and relative frequencies.

Continuous: group the data by intervals; absolute and relative frequencies; cumulative absolute and relative frequencies; densities and class marks.

Frequency – Categorical Variables

The following table shows the eye color for some people.

Name       Marta  Joan  Laura  Kayla  Brandin  Jay    Reta   Andy   Tina
Eye color  green  blue  brown  blue   green    brown  black  brown  black

Absolute frequency: count how many times each value appears in the table.

Eye color           Green  Blue  Brown  Black
Absolute frequency  2      2     3      2


Relative frequency: the absolute frequency divided by the size of the sample (here, 9 people).

Eye color           Green  Blue   Brown  Black
Relative frequency  0.222  0.222  0.333  0.222
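The counting above can be sketched in a few lines of Python. This is only an illustration using the standard library; the variable names are our own choice.

```python
from collections import Counter

# Eye colors from the table above, in the order the people are listed
eye_colors = ["green", "blue", "brown", "blue", "green",
              "brown", "black", "brown", "black"]

absolute = Counter(eye_colors)     # absolute frequency of each color
n = len(eye_colors)                # size of the sample
relative = {color: count / n for color, count in absolute.items()}

for color in absolute:
    print(f"{color:<6} {absolute[color]}  {relative[color]:.3f}")
```

The relative frequencies always add up to 1, which is a quick way to check the table.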

Frequency – Discrete Variables

The following table shows the number of cars registered to each home in a block of 20 apartments.

1 2 1 0 3 4 0 1 1 1

2 2 3 2 3 2 1 4 0 0

Absolute frequency: count how many times each value appears in the table (how many homes have 2 cars?).

Value               0  1  2  3  4  Total
Absolute frequency  4  6  5  3  2  20




Relative frequency: the absolute frequency divided by the size of the sample (what proportion of the homes have 2 cars?).

Value               0    1    2     3     4    Total
Relative frequency  0.2  0.3  0.25  0.15  0.1  1




Cumulative absolute frequency: sum of the absolute frequencies of the value and of all smaller values (how many homes have 3 cars or fewer?).

Value               0  1   2   3   4
Absolute frequency  4  6   5   3   2
C. Ab. Freq.        4  10  15  18  20




Cumulative relative frequency: the cumulative absolute frequency divided by the size of the sample (what proportion of the homes have 3 cars or fewer?).

Value               0    1    2     3     4
Relative frequency  0.2  0.3  0.25  0.15  0.1
C. Rel. Freq.       0.2  0.5  0.75  0.9   1


We can combine all the frequencies together in a single table.

Value  Ab. Freq.  C. Ab. Freq.         Rel. Freq.  C. Rel. Freq.
       ni         Ni = n1 + … + ni     fi = ni/N   Fi = Ni/N
0      4          4                    0.2         0.2
1      6          10                   0.3         0.5
2      5          15                   0.25        0.75
3      3          18                   0.15        0.9
4      2          20                   0.1         1
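As a rough sketch, the four kinds of frequency for the cars example can be computed as follows. The column names mirror ni, Ni, fi and Fi; everything else is our own naming.

```python
from collections import Counter
from itertools import accumulate

# Number of cars registered to each of the 20 homes
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1,
        2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

n = len(cars)
counts = Counter(cars)
values = sorted(counts)                      # distinct values, ascending
abs_freq = [counts[v] for v in values]       # ni
cum_abs = list(accumulate(abs_freq))         # Ni = n1 + ... + ni
rel_freq = [c / n for c in abs_freq]         # fi = ni / N
cum_rel = [c / n for c in cum_abs]           # Fi = Ni / N

for row in zip(values, abs_freq, cum_abs, rel_freq, cum_rel):
    print("{:>3} {:>3} {:>3} {:>5.2f} {:>5.2f}".format(*row))
```

The last cumulative absolute frequency must equal the sample size, and the last cumulative relative frequency must equal 1.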

Cumulative frequencies do not make sense when applied to categorical variables!!

Frequency – Continuous Variables

The following table is a sample of glucose concentration in blood for 20 patients in hospital (in mmol/L).

3.182 5.317 4.115 5.578 7.398 6.377 6.310 4.916 5.556 7.071

5.738 3.124 4.652 6.753 6.698 6.637 5.034 3.425 3.958 3.696

This is a continuous variable, and (as was to be expected) no value appears twice in the above table. It does not make much sense to draw a table of frequencies… yet.

In this case, it is convenient to group the values in intervals before analyzing the data. Here is a suggestion (you may try other intervals):

[3.000, 4.500) [4.500, 6.000) [6.000, 6.500) [6.500,7.500)



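Counting the values that fall in each interval gives the following table (the density is ni/li and the class mark is the midpoint of the interval).

Interval        ni  Ni  fi    Fi    Length li  Density ni/li  Class mark
[3.000, 4.500)  6   6   0.3   0.3   1.500      4              3.750
[4.500, 6.000)  7   13  0.35  0.65  1.500      4.667          5.250
[6.000, 6.500)  2   15  0.1   0.75  0.500      4              6.250
[6.500, 7.500)  5   20  0.25  1     1.000      5              7.000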
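A minimal sketch of how such a grouped table can be built from the raw glucose values. The function name `grouped_table` and the dictionary keys are our own; intervals are closed on the left and open on the right, as in the slides.

```python
glucose = [3.182, 5.317, 4.115, 5.578, 7.398, 6.377, 6.310, 4.916, 5.556, 7.071,
           5.738, 3.124, 4.652, 6.753, 6.698, 6.637, 5.034, 3.425, 3.958, 3.696]
edges = [3.0, 4.5, 6.0, 6.5, 7.5]   # boundaries of the intervals [a, b)

def grouped_table(data, edges):
    """Return one row (as a dict) per interval [a, b)."""
    rows, running = [], 0
    for a, b in zip(edges, edges[1:]):
        count = sum(1 for x in data if a <= x < b)
        running += count
        rows.append({
            "interval": (a, b),
            "n": count,                    # absolute frequency
            "N": running,                  # cumulative absolute frequency
            "f": count / len(data),        # relative frequency
            "F": running / len(data),      # cumulative relative frequency
            "length": b - a,
            "density": count / (b - a),
            "mark": (a + b) / 2,           # class mark (midpoint)
        })
    return rows

table = grouped_table(glucose, edges)
for row in table:
    print(row)
```

Trying other values for `edges` is an easy way to see how the choice of intervals changes the table.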


Now you try! The following table shows the duration (in seconds) of phone calls to a certain business. Construct the corresponding table of frequencies.

77 289 128 59 19 148 157 203 126 118

104 141 290 48 3 2 372 140 438 56

44 274 479 211 179 1 68 386 2631 90

30 57 89 116 225 700 40 73 75 51

148 9 115 19 76 138 178 76 67 102

35 80 143 951 106 55 4 54 137 367

277 201 52 9 700 182 73 199 325 75

103 64 121 11 9 88 1148 2 465 25


For example, we may use the intervals specified in the leftmost column. This would be the resulting table of frequencies.

Interval      ni  Ni  fi      Fi      Length li  Density ni/li  Class mark
[0, 50)       17  17  0.2125  0.2125  50         0.34           25
[50, 100)     21  38  0.2625  0.475   50         0.42           75
[100, 150)    17  55  0.2125  0.6875  50         0.34           125
[150, 250)    9   64  0.1125  0.8     100        0.09           200
[250, 1000)   14  78  0.175   0.975   750        0.0187         625
[1000, 3000)  2   80  0.025   1       2000       0.001          2000

2.2 Graphs

Graphs

A graph is worth a thousand words! Graphs allow for a quick understanding of the distribution of the data.

Again, the type of data (variable) is important when choosing the type of graph we want.

Categorical: bar chart; pie chart (circle graph).

Discrete: bar chart; cumulative bar chart.

Continuous: stem-and-leaf diagram; histogram; cumulative frequencies graph; boxplot.

Graphs – Categorical Variables

The following table shows the blood types of a sample of 100 people.

Blood type  O+  O-  A+  A-  B+  B-  AB+  AB-
People      37  8   33  7   9   2   3    1

[Figures: bar chart of absolute frequencies; bar chart of relative frequencies]

[Figure: pie chart (circle graph) for the blood type data above]

Graphs – Discrete Variables

For discrete variables we can also draw charts of cumulative frequencies.

The following table shows the final grades of 40 students from this course last year.

7 4 9 8 6 6 3 7 6 4 5 8 9 7 10 6 5 4 6 7 3 8 10 6 7 6 8 2 3 8 0 5 6 8 7 6 9 1 8 1


Here is the corresponding table of frequencies.

Grade  Abs. Freq.  C. Abs. Freq.  Rel. Freq.  C. Rel. Freq.
0      1           1              0.025       0.025
1      2           3              0.05        0.075
2      1           4              0.025       0.1
3      3           7              0.075       0.175
4      3           10             0.075       0.25
5      3           13             0.075       0.325
6      9           22             0.225       0.55
7      6           28             0.15        0.7
8      7           35             0.175       0.875
9      3           38             0.075       0.95
10     2           40             0.05        1


We can now draw the corresponding bar charts for absolute frequencies and cumulative absolute frequencies (you could do the same with relative frequencies instead).

[Figures: bar chart; cumulative bar chart]


Yet another example: the bar charts for the example about number of cars per home.

[Figures: bar chart; cumulative bar chart]

Graphs – Continuous Variables

Let’s see now some useful graphs to deal with continuous variables. First, the stem-and-leaf plot (or stemplot).

Country       Literacy level  Fertility
Azerbaijan    98              2.8
Afghanistan   29              6.9
Germany       99              1.5
Saudi Arabia  62              6.7
Argentina     95              2.8
Armenia       98              3.2
Australia     99              1.9
Austria       99              1.5
Bahrain       77              4
Bangladesh    35              4.7

First, split each value into a stem and a leaf. For the literacy level, 98 becomes stem 9 and leaf 8, 29 becomes stem 2 and leaf 9, and so on. For the fertility, 2.8 becomes stem 2. and leaf 8, 6.9 becomes stem 6. and leaf 9, and so on.

Then collect, in increasing order, the leaves that share the same stem.

Literacy level

Stem  Leaves
9     5, 8, 8, 9, 9, 9
8
7     7
6     2
5
4
3     5
2     9
1
0

Fertility

Stem  Leaves
9.
8.
7.
6.    7, 9
5.
4.    0, 7
3.    2
2.    8, 8
1.    5, 5, 9
0.
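The construction of the literacy stemplot can be sketched as below. The helper `stemplot` is a hypothetical name, and it assumes non-negative integers with the tens digit as the stem.

```python
from collections import defaultdict

literacy = [98, 29, 99, 62, 95, 98, 99, 99, 77, 35]

def stemplot(values):
    """Return {stem: sorted leaves}, using the tens digit as the stem."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    top, bottom = max(leaves), min(leaves)
    # Keep the empty stems between the largest and the smallest one
    return {s: sorted(leaves[s]) for s in range(top, bottom - 1, -1)}

plot = stemplot(literacy)
for stem, leaf_list in plot.items():
    print(stem, "|", ", ".join(str(leaf) for leaf in leaf_list))
```

Keeping the empty stems matters: without them the gaps in the distribution would not be visible.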


Stem-and-leaf diagram: now you try!

The following table shows the percentage of people older than 65 living in each state of the USA. Form the corresponding stemplot.

5.7 12.0 13.8 11.0 9.7 12.4 13.0 12.8 13.6 9.9 12.5 11.3 13.3 11.2 11.7 12.1 13.0 8.5 13.5 12.1 14.0 13.4 9.6 11.3 12.9 11.6 13.5 13.0 12.1 13.2 14.3 12.3 15.6 11.7 12.1 15.3 11.2 12.4 12.7 12.0 14.4 10.6 13.1 13.3 14.9 13.3 17.6 13.2 14.7 14.5 13.5


Stem  Leaves
17.   6
16.
15.   3, 6
14.   0, 3, 4, 5, 7, 9
13.   0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 6, 8
12.   0, 0, 1, 1, 1, 1, 3, 4, 4, 5, 7, 8, 9
11.   0, 2, 2, 3, 3, 6, 7, 7
10.   6
9.    6, 7, 9
8.    5
7.
6.
5.    7


Let’s now look at the histogram. The following table displays the CO2 emissions per person in countries with a population of over 20 million.

Algeria 2.3 Germany 10.0 Mexico 3.7 South Africa 8.1

Argentina 3.9 Ghana 0.2 Morocco 1.0 Spain 6.8

Australia 17.0 India 0.9 Myanmar 0.2 Sudan 0.2

Bangladesh 0.2 Indonesia 1.2 Nepal 0.1 Tanzania 0.1

Brazil 1.8 Iran 3.8 Nigeria 0.3 Thailand 2.5

Canada 16.0 Iraq 3.6 Pakistan 0.7 Turkey 2.8

China 2.5 Italy 7.3 Peru 0.8 Ukraine 7.6

Colombia 1.4 Japan 9.1 Philippines 0.9 United Kingdom 9.0

Congo 0.0 Kenya 0.3 Poland 8.0 United States 19.9

Egypt 1.7 Korea, North 9.7 Romania 3.9 Uzbekistan 4.8

Ethiopia 0.0 Korea, South 8.8 Russia 10.2 Venezuela 5.1

France 6.1 Malaysia 4.6 Saudi Arabia 11.0 Vietnam 0.5


First, build the table of frequencies.

Interval  Abs. Freq.  C. Abs. Freq.  Rel. Freq.  C. Rel. Freq.  Length  Density  Class mark
[0, 3)    24          24             0.5         0.5            3       8        1.5
[3, 6)    8           32             0.16667     0.66667        3       2.6667   4.5
[6, 9)    7           39             0.14583     0.8125         3       2.3333   7.5
[9, 12)   6           45             0.125       0.9375         3       2        10.5
[12, 15)  0           45             0           0.9375         3       0        13.5
[15, 18)  2           47             0.04167     0.97917        3       0.66667  16.5
[18, 21)  1           48             0.02083     1              3       0.33333  19.5


Finally, draw the histogram:

- Area of each rectangle = absolute frequency (on the corresponding interval).
- Height of each rectangle = density (on the corresponding interval).

[Figure: histogram with the density on the vertical axis; the area of each rectangle equals the (absolute) frequency]


Some important remarks about histograms:

Choose intervals of the same length (or else you’ll make your life very miserable)!!

They look a lot like bar charts, but histograms contain a lot more information than bar charts!!

The vertical axis ALWAYS measures the density!!
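To see why the height must be the density when the intervals have different lengths, here is a small sketch using the intervals and frequencies of the phone-call table from section 2.1. The variable names are our own.

```python
# (interval, absolute frequency) pairs copied from the phone-call table
classes = [((0, 50), 17), ((50, 100), 21), ((100, 150), 17),
           ((150, 250), 9), ((250, 1000), 14), ((1000, 3000), 2)]

# Height of each rectangle: density = ni / li
heights = [n / (b - a) for (a, b), n in classes]

# Area of each rectangle: height * length, which recovers ni
areas = [h * (b - a) for h, ((a, b), _) in zip(heights, classes)]

for ((a, b), n), h in zip(classes, heights):
    print(f"[{a}, {b}): frequency {n}, height {h:.4f}")
print("total area:", sum(areas))
```

Plotting the raw frequencies as heights would make the wide interval [250, 1000) look almost as important as [0, 50), although its calls are spread over an interval 15 times longer.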


We can also use this example to draw the polygon of frequencies:

- Mark the class mark of each interval on top of each rectangle.
- Link consecutive marks with straight segments.


Here’s an example for you to try. The following table shows the average property damage caused by tornadoes (in millions of dollars, in 50 years).

Alabama 51.88 Indiana 53.13 Montana 2.27 Pennsylvania 17.11

Arizona 3.47 Iowa 49.51 Nebraska 30.26 South Carolina 17.19

Arkansas 40.96 Kansas 49.28 Nevada 0.10 South Dakota 10.64

California 3.68 Kentucky 24.84 New Hampshire 0.66 Tennessee 23.47

Colorado 4.62 Louisiana 27.75 New Jersey 2.94 Texas 88.60

Connecticut 2.26 Maine 0.53 New Mexico 1.49 Utah 3.57

Delaware 0.27 Maryland 2.33 New York 15.73 Vermont 0.24

Florida 37.32 Massachusetts 4.42 North Carolina 14.90 Virginia 7.42

Georgia 51.68 Michigan 29.88 North Dakota 14.69 Washington 2.37

Hawaii 0.34 Minnesota 84.84 Ohio 44.36 West Virginia 2.14

Idaho 0.26 Mississippi 43.62 Oklahoma 81.94 Wisconsin 31.33

Illinois 62.94 Missouri 68.93 Oregon 5.52 Wyoming 1.78


There are many other types of graphs, some more suitable than others depending on the situation. The key point of using one graph or another is to clearly reflect the data.

Be careful with misleading graphs! The link below collects many examples. The worst part is that, in general, the mistakes are made on purpose.

http://www.statisticshowto.com/misleading-graphs/

Graphs – Symmetry

An important feature to detect with graphs is the symmetry of the distribution of the data (notice that this doesn’t apply to categorical variables).

Here are some examples of symmetrical and asymmetrical distributions.

[Figures: unimodal symmetrical distribution; bimodal symmetrical distribution]

[Figures: distribution asymmetrical to the right; distribution asymmetrical to the left]

2.3 Descriptive statistics

Descriptive statistics

The descriptive statistics measure different features of a variable. We will see statistics describing four such aspects of a variable:

1.  Central tendencies. They aim to locate the “central” point(s) of a data set. Depending on the point of view such central point(s) may be different.

2.  Statistical dispersion. They describe how stretched or squeezed the data set is.

3.  Skewness. This describes the level of symmetry (or asymmetry) of the distribution of the data set.

4.  Kurtosis. This shows if the distribution has some “peak” or if it is “too flat”.

Central tendencies

There are several discrete statistics that we may use to describe central tendencies of a (discrete) variable. The idea is to find the values that are central (in some sense) in the sample.

- Mode: most frequent value(s). A sample could have more than one mode.

- Mean: average of the data.

- Median: middle point of the data.

Mean and median don’t make sense for categorical variables!!

Central tendencies

The following table shows the result of rolling a die 40 times.

1 4 3 2 6 6 3 1 6 4 5 2 3 1 4 4 5 4 6 1 3 2 4 6 1 5 2 2 3 2 6 5 4 2 1 6 3 1 2 1

And here is the corresponding table of (absolute) frequencies.

Result              1  2  3  4  5  6
Absolute frequency  8  8  6  7  4  7

Central tendencies

The descriptive statistics for this example are the following.

- Mode: the most frequent value(s) in the table. In this case, there are two modes: 1 and 2 (as they appear 8 times each).

- Mean: average of all values. The general formula is

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

where x1, x2,... are each of the observations in the data set.

For this particular example, we have:

$$\bar{x} = \frac{1 + 4 + 3 + \dots + 2 + 1}{40} = \frac{(8 \times 1) + (8 \times 2) + (6 \times 3) + (7 \times 4) + (4 \times 5) + (7 \times 6)}{40} = \frac{132}{40} = 3.3$$

Central tendencies

- Median: middle point of the data. To find the median,
  a) list the values in the original table in ascending (or descending) order;
  b) if the sample has odd size, the median is the value in the middle of the list;
  c) if the sample has even size, there will be two values in the middle, and the median is the average of these two values.

In this particular example:

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 6 6 6 6 6 6 6

The median is 3.
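The three central values for the die example can be checked with Python's `statistics` module (`multimode` needs Python 3.8 or later). This is just a quick verification sketch.

```python
from statistics import mean, median, multimode

# The 40 die rolls from the table above
rolls = [1, 4, 3, 2, 6, 6, 3, 1, 6, 4, 5, 2, 3, 1, 4, 4, 5, 4, 6, 1,
         3, 2, 4, 6, 1, 5, 2, 2, 3, 2, 6, 5, 4, 2, 1, 6, 3, 1, 2, 1]

modes = multimode(rolls)   # every value that reaches the highest frequency
center = mean(rolls)       # sum of the values divided by their number
middle = median(rolls)     # average of the two middle values (even size)

print("modes:", modes, "mean:", center, "median:", middle)
```

Note that `statistics.mode` would report a single value only, so `multimode` is the safer choice when a sample may have several modes.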

Central tendencies

Some remarks about the mode:
+ easy to compute from the table of frequencies;
- usually it is less representative than the mean or the median.

Some remarks about the mean:
+ it is easy to compute;
+ uses all the values in the database for its computation;
- it is very sensitive to outliers in the database.

Some remarks about the median:
+ useful when there are outliers in the database;
- does not use all the values in the database in its computation;
- depends on the order of the data but not on the actual values.

Dispersion – Quartiles and Percentiles

Measuring the center is usually not enough to get a good description of a distribution.

- Median: it divides the data into two halves.

- Quartiles: they divide the data into quarters. They are usually called first quartile (Q1), second quartile (Q2, which is the median!) and third quartile (Q3).

- Percentiles: they indicate the point below which a given percentage of the data is contained.


The following table shows the weights (in kg.) of a sample of 100 people.

78.38 90.70 56.23 59.74 74.11 72.24 93.73 72.65 119.00 115.27

111.26 73.24 72.07 87.05 76.97 84.97 60.03 93.09 86.01 76.49

94.14 89.26 66.80 68.73 85.38 69.10 63.45 93.01 54.40 91.77

73.82 50.25 88.17 109.74 106.30 84.23 66.98 76.84 97.50 55.43

109.96 49.94 62.52 55.55 41.49 57.24 68.61 85.06 47.48 81.55

82.18 103.77 76.98 61.74 57.42 114.29 99.76 91.92 71.21 93.37

68.03 93.94 48.58 102.73 91.18 85.82 88.11 69.44 120.60 85.00

102.56 64.12 102.88 32.11 88.59 57.09 78.50 94.70 83.33 69.71

64.29 73.77 55.54 57.90 44.88 67.00 65.91 73.21 62.43 84.15

83.66 61.69 72.75 87.63 63.65 50.17 68.00 99.34 105.23 86.47


The corresponding table of frequencies is

Interval    Abs. Freq.  C. Abs. Freq.  Rel. Freq.  C. Rel. Freq.  Length  Density  Class mark
[31, 41)    1           1              0.01        0.01           10      0.1      36
[41, 51)    7           8              0.07        0.08           10      0.7      46
[51, 61)    11          19             0.11        0.19           10      1.1      56
[61, 71)    19          38             0.19        0.38           10      1.9      66
[71, 81)    16          54             0.16        0.54           10      1.6      76
[81, 91)    20          74             0.20        0.74           10      2.0      86
[91, 101)   13          87             0.13        0.87           10      1.3      96
[101, 111)  8           95             0.08        0.95           10      0.8      106
[111, 121)  5           100            0.05        1              10      0.5      116


32.11 55.54 61.74 67.00 72.24 76.97 84.97 88.17 93.73 103.77

41.49 55.55 62.43 68.00 72.65 76.98 85.00 88.59 93.94 105.23

44.88 56.23 62.52 68.03 72.75 78.38 85.06 89.26 94.14 106.30

47.48 57.09 63.45 68.61 73.21 78.50 85.38 90.70 94.70 109.74

48.58 57.24 63.65 68.73 73.24 81.55 85.82 91.18 97.50 109.96

49.94 57.42 64.12 69.10 73.77 82.18 86.01 91.77 99.34 111.26

50.17 57.90 64.29 69.44 73.82 83.33 86.47 91.92 99.76 114.29

50.25 59.74 65.91 69.71 74.11 83.66 87.05 93.01 102.56 115.27

54.40 60.03 66.80 71.21 76.49 84.15 87.63 93.09 102.73 119.00

55.34 61.69 66.98 72.07 76.84 84.23 88.11 93.37 102.88 120.69

Before we find the quartiles, we need to sort the data. The table above lists the sorted values, reading down each column.


We can now find all the descriptive statistics for this sample.

- The modal class is [81, 91).

- The mean is

$$\bar{x} = \frac{32.11 + 41.49 + 44.88 + \dots + 119.00 + 120.69}{100} = 78.0526$$

- The quartiles are Q1 = 63.89, Q2 = 76.91 and Q3 = 91.48.
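There are several conventions for computing quartiles, and different software may give slightly different answers. The sketch below uses the "median of each half" convention, which appears consistent with the quartiles reported above; that choice, the function name `quartiles` and the use of the small cars sample from section 2.1 are our assumptions.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the 'median of each half' convention.

    For an odd sample size the middle value is left out of both halves.
    Needs at least two values.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]      # the smallest half of the values
    upper = xs[-half:]     # the largest half of the values
    return median(lower), median(xs), median(upper)

cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1,
        2, 2, 3, 2, 3, 2, 1, 4, 0, 0]
q1, q2, q3 = quartiles(cars)
print(q1, q2, q3)
```

Python's own `statistics.quantiles` uses a different interpolation rule by default, so its results need not coincide with this sketch.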


Outliers

An outlier is an individual value that falls outside the overall pattern (an extreme value). Outliers tend to distort the dataset. The usual causes for outliers to appear are

- variability in the measurement;
- experimental errors.

In the latter case, these values should be omitted from the database.

We need methods to

- detect outliers;
- decide if they are errors (and therefore can be omitted).

Outliers

To decide if a value is an outlier, we use the interquartile range:

$$IQR = Q_3 - Q_1$$

The interval from Q1 to Q3, whose length is the IQR, contains the central 50% of the values of the data set!

With the IQR we now find the region of admissible values: a value xi in the database is an outlier if either

$$x_i < Q_1 - 1.5 \times IQR$$

or

$$Q_3 + 1.5 \times IQR < x_i$$

That is, anything below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR is considered an outlier.

Outliers

In the previous example, we have

$$IQR = 91.48 - 63.89 = 27.59$$

and the resulting borders of the admissible region are

$$Q_1 - 1.5 \times IQR = 22.505, \qquad Q_3 + 1.5 \times IQR = 132.865$$

It is easy to check now in the table: in this example there are no outliers.
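The outlier rule is easy to automate. In this sketch the function names and the factor parameter `k` are our own; the quartiles are the ones computed for the weights example, and the value 140.0 in the last line is an invented weight used only to show the rule firing.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the lower and upper borders of the admissible region."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3):
    """Return the values that fall outside the admissible region."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

low, high = outlier_fences(63.89, 91.48)
print(round(low, 3), round(high, 3))

# Smallest and largest weights of the sample, plus one made-up extreme value
print(find_outliers([32.11, 120.69, 140.0], 63.89, 91.48))
```

Since both the smallest and the largest observed weights lie inside the admissible region, no value of the sample is flagged.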

Practice session

Let’s practice all we have learnt so far. In this session we will combine R commander and Excel.

1.  Load the database “Children” from the webpage.

2.  What would you do with the missing data?

3.  Build tables of frequencies with Excel and R commander. Compare them.

4.  Draw a stemplot with R commander.

5.  Draw histograms with Excel and R commander. Compare them. Do you see any symmetry or asymmetry in the graphs?

6.  Find the modal class, the mean and the median, both in Excel and R commander.

7.  Find the 10-th, 20-th, 30-th and 70-th percentiles of the sample.

8.  Compute the IQR and find if there is any outlier. If so, what would you do?

9.  What conclusions do you draw from this data? Write a few lines explaining your observations.

The Boxplot

The Boxplot is a very useful graph, which is built from the quartiles and some related information that we have already seen.

The Boxplot

Some remarks about the boxplot.

- The central box and the lengths of the “whiskers” can be taken as a measure of dispersion as well.

- The outliers are represented by circles, dots, asterisks, etc., always beyond the whiskers.

- The width of the box is irrelevant!

- Older versions of Excel do not offer boxplots (a box-and-whisker chart was added in Excel 2016). Otherwise you need some more advanced software, like R commander.

Dispersion – Variance, Standard Deviation and Pearson’s Coefficient of Variation

- Variance: it is the average of the squares of the differences between each of the observations and the mean. Wow... Easier with a formula?

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

- Standard Deviation: it is the square root of the variance (piece of cake!).

$$S = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

- Pearson’s Coefficient of variation: it is the quotient of the standard deviation by the absolute value of the mean.

$$CV = \frac{S}{|\bar{x}|}$$


The following table shows the weight of 100 people (see previous slides)

78.38 90.70 56.23 59.74 74.11 72.24 93.73 72.65 119.00 115.27

111.26 73.24 72.07 87.05 76.97 84.97 60.03 93.09 86.01 76.49

94.14 89.26 66.80 68.73 85.38 69.10 63.45 93.01 54.40 91.77

73.82 50.25 88.17 109.74 106.30 84.23 66.98 76.84 97.50 55.43

109.96 49.94 62.52 55.55 41.49 57.24 68.61 85.06 47.48 81.55

82.18 103.77 76.98 61.74 57.42 114.29 99.76 91.92 71.21 93.37

68.03 93.94 48.58 102.73 91.18 85.82 88.11 69.44 120.60 85.00

102.56 64.12 102.88 32.11 88.59 57.09 78.50 94.70 83.33 69.71

64.29 73.77 55.54 57.90 44.88 67.00 65.91 73.21 62.43 84.15

83.66 61.69 72.75 87.63 63.65 50.17 68.00 99.34 105.23 86.47


The mean is

$$\bar{x} = \frac{32.11 + 41.49 + 44.88 + \dots + 119.00 + 120.69}{100} = 78.0526$$

and we can now compute the variance, the standard deviation, and Pearson’s coefficient of variation:

$$S^2 = \frac{1}{100}\left[(32.11 - 78.053)^2 + (41.49 - 78.053)^2 + \dots + (120.69 - 78.053)^2\right] = 350.90$$

$$S = \sqrt{350.90} = 18.73$$

$$CV = \frac{S}{|\bar{x}|} = \frac{18.73}{78.05} = 0.24$$


Notice that

- The standard deviation has the same units as the data. This makes it convenient for describing dispersion.

- However, taking the square root increases the bias with respect to the original dispersion of the whole population (this is a technical inconvenience). Thus, the variance is a more accurate, yet somewhat strange, measure of dispersion.

- S = 0 if and only if S2 = 0. In this case, for all i = 1,..., n we have

$$x_i = \bar{x}$$

The lower (higher) these numbers are, the less (more) dispersed the variable is!!


You may find alternative (equivalent) formulas for the variance and the standard deviation. Using some easy arithmetical manipulations you should be able to see that any other formula you may find is equivalent to the formulas in the previous slide. For example:

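$$
\begin{aligned}
S^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}\,\frac{1}{n}\sum_{i=1}^{n}x_i + \frac{1}{n}\,n\,\bar{x}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\bar{x}\,\bar{x} + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2
\end{aligned}
$$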
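A quick numerical check that the definition and the shortcut formula agree, using the die rolls from the central tendencies example. The variable names are our own, and all quantities divide by n, as in the formulas of this chapter.

```python
from math import sqrt

rolls = [1, 4, 3, 2, 6, 6, 3, 1, 6, 4, 5, 2, 3, 1, 4, 4, 5, 4, 6, 1,
         3, 2, 4, 6, 1, 5, 2, 2, 3, 2, 6, 5, 4, 2, 1, 6, 3, 1, 2, 1]

n = len(rolls)
mean = sum(rolls) / n

# Definition: average of the squared differences to the mean
var_definition = sum((x - mean) ** 2 for x in rolls) / n

# Shortcut: average of the squares minus the square of the mean
var_shortcut = sum(x * x for x in rolls) / n - mean ** 2

sd = sqrt(var_definition)      # standard deviation
cv = sd / abs(mean)            # Pearson's coefficient of variation

print(var_definition, var_shortcut, sd, cv)
```

The two variances agree up to rounding. With floating-point numbers the shortcut can lose precision when the mean is very large compared with the spread, so the definition is usually the safer one to program.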

Chebyshev’s Inequality

The idea is that most observations (values in your database) lie within a few standard deviations of the mean. More precisely, at least the following percentage of the observations lies within the given distance from the mean:

Distance from mean  Percentage of observations (at least)
2S                  75%
3S                  89%
4S                  94%
5S                  96%
6S                  97%

This is not an absolute rule in statistics (although it is in probability)!!
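The percentages in the table come from the bound 1 - 1/k² for a distance of k standard deviations. The sketch below computes the bound and, as an illustration, compares it with the observed fraction in the cars sample from section 2.1; the function names are our own.

```python
from statistics import mean, pstdev

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of values within k standard deviations of the mean."""
    m, s = mean(data), pstdev(data)   # pstdev divides by n, like S
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1,
        2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

for k in (2, 3, 4, 5, 6):
    print(k, round(100 * chebyshev_bound(k)), "%")
print(fraction_within(cars, 2))
```

In practice the observed fraction is often much larger than the bound, which only gives a guaranteed minimum.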


Some remarks about Pearson’s Coefficient of Variation.

- Unlike the variance and the standard deviation, Pearson’s coefficient of variation has no units.

- This makes it very suitable for comparing dispersion between two variables.

- Also, if CV is less than or equal to 0.3 then we may consider that the variable is nicely distributed (otherwise it is dispersed, and the higher the value of CV the more dispersed the variable is).

- If the mean is close to zero, then CV is not very useful.

Estimating statistics for grouped variables

Imagine that you have a continuous variable for which you only know its frequency table (but not the original data). How can you (reasonably) estimate any of its descriptive statistics?

As an example, consider the following table (it has appeared before)

Interval  Abs. Freq.  C. Abs. Freq.  Rel. Freq.  C. Rel. Freq.  Length  Density  Class mark
[0, 3)    24          24             0.5         0.5            3       8        1.5
[3, 6)    8           32             0.16667     0.66667        3       2.6667   4.5
[6, 9)    7           39             0.14583     0.8125         3       2.3333   7.5
[9, 12)   6           45             0.125       0.9375         3       2        10.5
[12, 15)  0           45             0           0.9375         3       0        13.5
[15, 18)  2           47             0.04167     0.97917        3       0.66667  16.5
[18, 21)  1           48             0.02083     1              3       0.33333  19.5


The descriptive statistics for this example are the following.

- Modal class: the interval with highest absolute frequency. In this case: [0,3) is the modal class.

- Mean estimate: average of the class marks, weighted by the absolute frequencies! The general formula is

$$\bar{x} = \frac{(n_1 \times c_1) + (n_2 \times c_2) + \dots + (n_k \times c_k)}{n} = \frac{1}{n}\sum_{j=1}^{k} n_j \times c_j$$

where nj stands for the absolute frequency and cj stands for the class mark for the j-th interval.

In this example:

$$\bar{x} = \frac{(24 \times 1.5) + (8 \times 4.5) + \dots + (2 \times 16.5) + (1 \times 19.5)}{48} = 5$$
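The mean estimate is a weighted average of the class marks, which can be sketched as follows (the function name `grouped_mean` is our own).

```python
# Class marks and absolute frequencies from the table above
marks = [1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5]   # cj
freqs = [24, 8, 7, 6, 0, 2, 1]                    # nj

def grouped_mean(marks, freqs):
    """Estimate the mean from class marks weighted by their frequencies."""
    return sum(c * n for c, n in zip(marks, freqs)) / sum(freqs)

print(grouped_mean(marks, freqs))
```

This is only an estimate: it treats every observation in an interval as if it were equal to the class mark.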


What about the median and the quartiles?

- Median estimate: first, we need the median class, which is the first class whose cumulative frequency reaches at least half of the total frequency. Once we have the median class, the formula is

$$Med = L_M + \frac{\frac{n}{2} - B_M}{F_M} \times l_M$$

where
LM is the lower bound of the median class;
n is the total frequency;
BM is the cumulative frequency before the median class;
FM is the absolute frequency of the median class;
lM is the length of the median class.


In our example, the median class is [0,3) as it already contains half of the individuals. Thus,

LM = 0; n = 48; BM = 0; FM = 24; lM = 3.

The estimate for the median is

$$Med = 0 + \frac{\frac{48}{2} - 0}{24} \times 3 = \frac{24}{24} \times 3 = 3$$


In order to estimate the quartiles, there are similar formulas:

- First quartile estimate: first, find the first quartile class (Q1 class), which is the first interval whose cumulative frequency reaches at least 25% of the individuals. Then, apply the formula

$$Q_1 = L_{Q1} + \frac{\frac{n}{4} - B_{Q1}}{F_{Q1}} \times l_{Q1}$$

where
LQ1 is the lower bound of the Q1 class;
n is the total frequency;
BQ1 is the cumulative frequency before the Q1 class;
FQ1 is the absolute frequency of the Q1 class;
lQ1 is the length of the Q1 class.


- Third quartile estimate: first, find the third quartile class (Q3 class), which is the first interval whose cumulative frequency reaches at least 75% of the individuals. Then, apply the formula

$$Q_3 = L_{Q3} + \frac{\frac{3n}{4} - B_{Q3}}{F_{Q3}} \times l_{Q3}$$

where
LQ3 is the lower bound of the Q3 class;
n is the total frequency;
BQ3 is the cumulative frequency before the Q3 class;
FQ3 is the absolute frequency of the Q3 class;
lQ3 is the length of the Q3 class.


For example, to estimate the third quartile in the example above, the Q3 class is [6,9), as it is the first class whose cumulative frequency reaches 75% of the individuals. Then,

LQ3 = 6; n = 48; BQ3 = 32; FQ3 = 7; lQ3 = 3.

The estimate is

$$Q_3 = 6 + \frac{\frac{3 \times 48}{4} - 32}{7} \times 3 = 6 + \frac{36 - 32}{7} \times 3 = 7.714$$
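The median and quartile formulas share the same shape, so a single helper can cover all of them. In this sketch `grouped_quantile` is a hypothetical name and `p` is the fraction of individuals (0.25, 0.5 or 0.75 for the quartiles).

```python
def grouped_quantile(edges, freqs, p):
    """Estimate the point below which a fraction p of the data lies.

    edges: interval boundaries; freqs: absolute frequency of each interval.
    """
    n = sum(freqs)
    target = p * n                 # n/4, n/2 or 3n/4 for the quartiles
    before = 0                     # cumulative frequency before the class
    for (lower, upper), f in zip(zip(edges, edges[1:]), freqs):
        if f > 0 and before + f >= target:
            # lower bound + (target - B) / F * length of the class
            return lower + (target - before) / f * (upper - lower)
        before += f
    raise ValueError("p must be between 0 and 1")

# Intervals and frequencies of the example above
edges = [0, 3, 6, 9, 12, 15, 18, 21]
freqs = [24, 8, 7, 6, 0, 2, 1]

print(grouped_quantile(edges, freqs, 0.25))
print(grouped_quantile(edges, freqs, 0.50))
print(grouped_quantile(edges, freqs, 0.75))
```

The same helper also estimates percentiles, for instance with `p = 0.9` for the 90-th percentile.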


Next we want to estimate the variance, the standard deviation and Pearson’s coefficient of variation.

For the variance, we have the following formula

$$S^2 = \left(\frac{1}{n}\sum_{j=1}^{k} c_j^2 \times n_j\right) - \bar{x}^2$$

and then we deduce the standard deviation and Pearson’s coefficient

$$S = \sqrt{S^2}, \qquad CV = \frac{S}{|\bar{x}|}$$

For the example above, we obtain

$$S^2 = \left(\frac{1}{48}\sum_{j=1}^{7} c_j^2 \times n_j\right) - 5^2 = 45.75 - 25 = 20.75$$

$$S = \sqrt{S^2} = \sqrt{20.75} = 4.56$$

$$CV = \frac{S}{|\bar{x}|} = \frac{4.56}{5} = 0.91$$
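The same estimates can be reproduced with a short sketch (the function name `grouped_dispersion` is our own; the class marks and frequencies are those of the example).

```python
from math import sqrt

marks = [1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5]   # class marks cj
freqs = [24, 8, 7, 6, 0, 2, 1]                    # absolute frequencies nj

def grouped_dispersion(marks, freqs):
    """Return the estimated variance, standard deviation and CV."""
    n = sum(freqs)
    mean = sum(c * f for c, f in zip(marks, freqs)) / n
    variance = sum(c * c * f for c, f in zip(marks, freqs)) / n - mean ** 2
    sd = sqrt(variance)
    return variance, sd, sd / abs(mean)

variance, sd, cv = grouped_dispersion(marks, freqs)
print(variance, round(sd, 2), round(cv, 2))
```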

Symmetry

If a distribution is symmetric, the mean and the median agree; in other words, the mean separates the database into two halves of the same size. Thus a first indicator of the (a)symmetry of a database is the difference between the mean and the median. When they differ:

- A database is asymmetric to the right if the majority of the data is concentrated on the left side of the mean.

- A database is asymmetric to the left if the majority of the data is concentrated on the right side of the mean.

Symmetry

Let’s see two descriptive statistics to detect asymmetry to either side.

$$g_1 = \frac{\bar{x} - Me}{S}$$

$$g_2 = \frac{1}{n \times S^3}\sum_{i=1}^{n}(x_i - \bar{x})^3$$

Here Me denotes the median. Both of them work similarly:

- If g1 = 0 (or g2 = 0) then the distribution is symmetric.

- If g1 > 0 (or g2 > 0) then the distribution is asymmetric to the right.

- If g1 < 0 (or g2 < 0) then the distribution is asymmetric to the left.

Kurtosis

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed (compared to a normal distribution). This way:

- High kurtosis implies heavy tails, or outliers.

- Low kurtosis implies light tails, or lack of outliers.

Another interpretation of kurtosis is the following: higher kurtosis means that a greater part of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.

Kurtosis

To measure kurtosis we use the following statistic

$$g_3 = \frac{1}{n \times S^4}\sum_{i=1}^{n}(x_i - \bar{x})^4 - 3$$

- If g3 = 0, the distribution is called mesokurtic.

- If g3 > 0, the distribution is called leptokurtic (“lepto” means slender). A leptokurtic distribution tends to have fatter tails.

- If g3 < 0, the distribution is called platykurtic (“platy” means broad). A platykurtic distribution tends to have thinner tails.
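The three shape statistics of this section (g1 and g2 for symmetry, g3 for kurtosis) can be sketched together. The function name `shape_statistics` and the two tiny samples are made up for illustration; S is computed dividing by n, as elsewhere in this chapter.

```python
from math import sqrt
from statistics import median

def shape_statistics(data):
    """Return (g1, g2, g3) for a sample with non-zero standard deviation."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / n)
    g1 = (mean - median(data)) / s
    g2 = sum((x - mean) ** 3 for x in data) / (n * s ** 3)
    g3 = sum((x - mean) ** 4 for x in data) / (n * s ** 4) - 3
    return g1, g2, g3

symmetric = [1, 2, 3, 4, 5]          # illustrative symmetric sample
right_skewed = [1, 1, 1, 2, 10]      # illustrative sample with a long right tail

print(shape_statistics(symmetric))
print(shape_statistics(right_skewed))
```

For the symmetric sample both g1 and g2 vanish, while the sample with a long right tail gives positive values for both.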

Symmetry and Kurtosis

Let’s see how these statistics work with an example. The following table shows the average property damage caused by tornadoes (in millions of dollars, in 50 years).

Alabama 51.88 Indiana 53.13 Montana 2.27 Pennsylvania 17.11

Arizona 3.47 Iowa 49.51 Nebraska 30.26 South Carolina 17.19

Arkansas 40.96 Kansas 49.28 Nevada 0.10 South Dakota 10.64

California 3.68 Kentucky 24.84 New Hampshire 0.66 Tennessee 23.47

Colorado 4.62 Louisiana 27.75 New Jersey 2.94 Texas 88.60

Connecticut 2.26 Maine 0.53 New Mexico 1.49 Utah 3.57

Delaware 0.27 Maryland 2.33 New York 15.73 Vermont 0.24

Florida 37.32 Massachusetts 4.42 North Carolina 14.90 Virginia 7.42

Georgia 51.68 Michigan 29.88 North Dakota 14.69 Washington 2.37

Hawaii 0.34 Minnesota 84.84 Ohio 44.36 West Virginia 2.14

Idaho 0.26 Mississippi 43.62 Oklahoma 81.94 Wisconsin 31.33

Illinois 62.94 Missouri 68.93 Oregon 5.52 Wyoming 1.78

Symmetry and Kurtosis

First, get the table of frequencies.

Interval  Abs. Freq.  C. Abs. Freq.  Class mark  Interval length
[0,10)    22          22             5           10
[10,20)   6           28             15          10
[20,30)   4           32             25          10
[30,40)   3           35             35          10
[40,50)   5           40             45          10
[50,60)   3           43             55          10
[60,70)   2           45             65          10
[70,80)   0           45             75          10
[80,90)   3           48             85          10

Symmetry and Kurtosis

These are the relevant statistics:

Statistic           Value    Quartile  Value    Statistic  Value
Mean                23.32    Q1        2.315    g1         0.34
Median              14.795   Q2        14.795   g2         1.02
Variance            639.44   Q3        41.625   g3         0.01
Standard deviation  25.29    IQR       39.31

The database is asymmetric to the right, as both g1 and g2 are positive. This can be observed in the histogram in the next slide. The kurtosis is pretty low, which indicates that most likely there aren’t any outliers. You can confirm that by computing the usual “outlier boundaries”.

Symmetry and Kurtosis

Here is the histogram.

Comparing two databases

So far we have studied features of a single database, with the goal of giving numerical descriptors of the distribution of the database. Using these descriptors, we can now compare the distributions of two databases.

- If both databases are expressed in the same units, then we can compare them straight away.

- If each database has a different unit, then we must normalize the databases first. Data normalization is a very standard process (there are actually several different normalization techniques). It consists of replacing each value in the database as follows:

$$x_i \longmapsto \frac{x_i - \bar{x}}{S}$$
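A minimal sketch of this normalization, using a small invented sample so that the numbers are easy to follow. The function name `normalize` is our own, and `pstdev` is used because it divides by n, like S.

```python
from statistics import mean, pstdev

def normalize(data):
    """Replace each value x by (x - mean) / S."""
    m, s = mean(data), pstdev(data)
    return [(x - m) / s for x in data]

sample = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up data: mean 5, S = 2
z = normalize(sample)
print(z)
print(mean(z), pstdev(z))            # mean 0 and standard deviation 1
```

After normalization the values have no units: each one says how many standard deviations the original value lies above or below the mean.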


The following tables are samples of time (in minutes) spent studying for students on a first-year course in college. Both databases have the same units, so there is no need to normalize the data.

Women
180 120 180 360 240 120 180 120 240 170
150 120 180 180 150 200 150 180 150 180
120 60 120 180 180 90 240 180 115 120

Men
90 120 30 90 200 90 45 30 120 75
150 120 60 240 300 240 60 120 60 30
30 230 120 95 150 0 200 120 120 180


First, compute all the descriptive statistics for each sample.

Statistic      Women    Men
Mean           165.17   117.17
Median         175      120
Variance       3087.42  5327.81
Standard dev.  55.57    72.99
CV             0.34     0.62
g1             -0.18    -0.04
g2             1.23     0.63
g3             3.04     -0.26


In general, women seem to study more (both the mean and the median are greater). Moreover, the dispersion is lower for women: Pearson’s coefficient of variation (CV) shows that the data for women is less dispersed than the data for men.

Regarding the symmetry of each database, men seem to follow a more symmetrical distribution, although that is not so apparent from the corresponding histograms (in the next slide). The measures of symmetry (g1 and g2) are pretty low in both cases, and they do not show great differences.

The female distribution is leptokurtic (g3 = 3.04), while the male distribution is close to mesokurtic (g3 = -0.26, so very slightly platykurtic). The indices in both cases seem to be low enough so that no big difference is appreciated (check again the histograms in the next slide).



Consider now the following situation: among all new students, Tecnocampus offers a grant for the student with the best academic record from high school. However...

- There is one student with a grade of 8 points out of 10 (coming from country A).

- There is one student with a grade of B+ (coming from country B).

Who should get the grant? The units are different, and we can’t compare directly. Instead, we have to analyze samples of students from countries A and B, normalize the samples, and only then we will be able to decide which student should get the grant.


Data normalization causes (some of the) descriptive statistics to change!!

Statistic      Value after normalization
Mean           0
Median         (Median – Mean)/S
Variance       1
Standard dev.  1
Q1             (Q1 – Mean)/S
Q2             (Q2 – Mean)/S
Q3             (Q3 – Mean)/S
g1             same
g2             same
g3             same

Some examples and exercises

1.  Give an example of a sample of size 5 such that
    a)  The mean is 0.
    b)  The median is 0.
    c)  The standard deviation is 0.

2.  Draw a boxplot for the following data set

    2.1 1.8 3.7 5.4 6.1 7.9 8.9 1.7 6.7

3.  An accounting firm pays each of its five clerks $35,000, two junior accountants $80,000 each, and the firm’s owner $320,000. What is the mean salary? How many employees earn less than the mean? What is the median salary? Is the distribution of salaries symmetrical?


4.  The following two tables show samples of the length (in millimeters) of two varieties of the same flower.

a)  Study each sample in full detail. Describe each sample in your own words.

b)  Compare both samples. Which are the most important differences?

Variety 1
47.12 46.75 46.81 47.12 46.67 47.43 46.44 46.64
48.07 48.34 48.15 50.26 50.12 46.34 46.94 48.36

Variety 2
39.63 42.01 41.93 37.83 41.47 37.40 38.20 40.57
38.10 42.18 38.79 38.23 41.69 39.78 38.01 43.09


5.  This table shows the most common types of spam email.

a)  What kind of variable is this? What statistics can you compute in order to analyze the data?

b)  Draw a pie chart. Draw also a bar chart where the bars are ordered in increasing (or decreasing) order by their height.

Type of spam  %
Adult         14.5
Financial     16.2
Health        7.3
Leisure       7.8
Products      21.0
Scams         14.2


6.  A survey about health habits asks about how many hours of sport people practice per week (without including weekends). Below is the resulting table of frequencies.

a)  Estimate all the descriptive statistics.
b)  Draw a histogram.
c)  Describe in your own words the results of the survey.

Interval  Abs. Freq.
[0,2)     23
[2,4)     37
[4,6)     18
[6,8)     13
[8,10)    5

Bonus Level!

Links of interest

The following links may be of interest to you...

- Glossary of statistics terms http://www.stat.berkeley.edu/~stark/SticiGui/Text/gloss.htm#p

- English-Spanish dictionary of statistics terms http://www.tribunamedica.com/glosario.htm

- R manual https://cran.r-project.org/doc/manuals/r-release/R-intro.html

- Math is fun https://www.mathsisfun.com/data/index.html

- More coming soon...
