Psy B07, Chapter 2

DESCRIBING AND EXPLORING DATA


Outline

Plotting data
Grouping data
Terminology
Notation
Measures of Central Tendency
Measures of Variability
Properties of a Statistic


Plotting Data

Once a set of data has been collected, the raw numbers must be manipulated in some fashion to make them more informative.

Several options are available, including plotting the data or calculating descriptive statistics.


Raw data of typical age and weight in a second-year course (made-up data)

Age  Weight    Age  Weight
18   107       20   108
26   115       21   110
21   108       20   109
21   111       19   127
25   163       19   143
18   119       21   121
20   119       22   112
21   200       19   136
18   178       20   161
21   135       20   131
21   143       19   144
21   113       19   123
20   103       19   101
21   166       20   193
20   112       20   127
23   151       19   158
22   192       20   149
20   135       20   138
21   117       20   129
22   138       22   138
24   137       22   137
26   161       19   156
19   117       23   122
19   142       20   132


Often, the first thing one does with a set of raw data is to plot frequency distributions.

Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.


Example: Typical age in a second-year course

Age  Frequency
18   3
19   10
20   14
21   10
22   5
23   2
24   1
25   1
26   2

Note: The frequencies in the table above were calculated by simply counting the number of subjects having the specified value for the age variable.
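The counting step can be sketched in a few lines of Python. This is only an illustration: the `ages` list is typed in from the made-up raw data, and the function name `frequency_table` is a label chosen for this sketch, not something from the course.

```python
from collections import Counter

# Ages of the 48 students in the made-up raw data set.
ages = [
    18, 26, 21, 21, 25, 18, 20, 21, 18, 21, 21, 21,
    20, 21, 20, 23, 22, 20, 21, 22, 24, 26, 19, 19,
    20, 21, 20, 19, 19, 21, 22, 19, 20, 20, 19, 19,
    19, 20, 20, 19, 20, 20, 20, 22, 22, 19, 23, 20,
]


def frequency_table(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)  # counts how many times each value occurs
    return sorted(counts.items())


for age, freq in frequency_table(ages):
    print(f"{age:>3} {freq:>3}")
```

Each printed row is one row of the frequency table, so the counts can be checked directly against the table above.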


[Figure: frequency histogram of the age data. x-axis: Age (18 to 26); y-axis: Frequency (0 to 16).]


Grouping Data

Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did).

However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner.


For example, our weight variable ranges from 100 lb. to 200 lb. If we used the previously described technique, we would end up with 100 bars, most of which would have a frequency of less than 2 or 3 (and many a frequency of zero).

We can get around this problem by grouping our values into bins. Try for around 10 bins with natural splits.


Weight Bin   Midpoint   Frequency
100 - 109    104.5      6
110 - 119    114.5      10
120 - 129    124.5      6
130 - 139    134.5      10
140 - 149    144.5      5
150 - 159    154.5      3
160 - 169    164.5      4
170 - 179    174.5      1
180 - 189    184.5      0
190 - 199    194.5      2
200 - 209    204.5      1
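As a rough sketch of the grouping step, the snippet below sorts the made-up weights into bins of width 10 starting at 100. The function name `grouped_frequencies` and its parameters (`start`, `width`, `n_bins`) are assumptions made for this illustration; changing `width` is one way to see how the choice of bin affects the table.

```python
from collections import Counter

# Weights (lb) of the 48 students in the made-up raw data set.
weights = [
    107, 115, 108, 111, 163, 119, 119, 200, 178, 135, 143, 113,
    103, 166, 112, 151, 192, 135, 117, 138, 137, 161, 117, 142,
    108, 110, 109, 127, 143, 121, 112, 136, 161, 131, 144, 123,
    101, 193, 127, 158, 149, 138, 129, 138, 137, 156, 122, 132,
]


def grouped_frequencies(values, start=100, width=10, n_bins=11):
    """Return (low, high, midpoint, frequency) for each bin."""
    # Integer division maps each value to the index of its bin.
    counts = Counter((v - start) // width for v in values)
    table = []
    for i in range(n_bins):
        low = start + i * width
        high = low + width - 1
        midpoint = (low + high) / 2
        table.append((low, high, midpoint, counts.get(i, 0)))
    return table


for low, high, midpoint, freq in grouped_frequencies(weights):
    print(f"{low} - {high}  {midpoint:6.1f}  {freq:3d}")
```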


Grouping Data

[Figure: frequency histogram of the grouped weight data. x-axis: Weight (lbs), bin midpoints 104.5 to 204.5; y-axis: Frequency (0 to 12).]

Check out this demo, which shows how the bin width you select can affect the “look” of the data.

Here is another, similar demonstration of the effects of bin width.

See the section in the text on cumulative frequency distributions.


Terminology

Often, frequency histograms have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.

[Figure: frequency histogram of height. x-axis: Height (inches), bin midpoints 60.5 to 76.5; y-axis: Frequency (0 to 14).]

Midpoint   Frequency
60.5       3
62.5       8
64.5       7
66.5       12
68.5       7
70.5       6
72.5       4
74.5       0
76.5       1


Sometimes, the bell shape is not symmetrical.

The term positive skew refers to the situation where the “tail” of the distribution is to the right; negative skew is when the “tail” is to the left.


[Figure: frequency histogram of a positively skewed distribution. x-axis: bin midpoints 0.75 to 20.75; y-axis: Frequency (0 to 14).]

Midpoint   Frequency
0.75       7
2.75       13
4.75       12
6.75       5
8.75       5
10.75      2
12.75      0
14.75      1
16.75      1
18.75      0
20.75      1


Notation

Variables

When we describe a set of data corresponding to the values of some variable, we will refer to that set using a letter such as X or Y.

When we want to talk about specific data points within that set, we specify those points by adding a subscript to the letter, like $X_1$.


Notation

$X_1$  $X_2$  $X_3$  $X_4$  $X_5$  $X_6$  $X_7$
5      8      12     3      6      8      7


Notation

The Greek letter sigma, which looks like Σ, means “add up” or “sum” whatever follows it.

Thus, $\sum X_i$ means “add up all the $X_i$s”.

If we use the $X_i$s from the previous example, $\sum X_i = 49$ (or just $\sum X$).


Nasty Example

Student   Midterm Mark (X)   Real Mark (Y)
1         82                 84
2         66                 51
3         70                 72
4         81                 56
5         61                 73


Nasty Example

$\sum X = 360$

$\sum Y = 336$

$\sum (X - Y) = 24$

$\sum X^2 = 26262$

$(\sum X)^2 = 129600$


Your turn

$\sum (XY) = 24283$

$(\sum (X - Y))^2 = 576$

$\sum (X^2 - Y^2) = 2956$
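One way to check these sums is to compute them directly. The sketch below uses the five pairs of marks from the example; the variable names are labels chosen for this illustration. The thing to watch is the position of the square: squaring each score before summing is not the same as summing first and then squaring.

```python
# Midterm marks (X) and real marks (Y) for the five students.
X = [82, 66, 70, 81, 61]
Y = [84, 51, 72, 56, 73]

sum_x = sum(X)                                          # sum of X
sum_y = sum(Y)                                          # sum of Y
sum_diff = sum(x - y for x, y in zip(X, Y))             # sum of (X - Y)
sum_x_squared = sum(x ** 2 for x in X)                  # square first, then sum
squared_sum_x = sum(X) ** 2                             # sum first, then square
sum_xy = sum(x * y for x, y in zip(X, Y))               # sum of the products
squared_sum_diff = sum_diff ** 2                        # (sum of (X - Y)) squared
sum_sq_diff = sum(x ** 2 - y ** 2 for x, y in zip(X, Y))

print(sum_x, sum_y, sum_diff, sum_x_squared, squared_sum_x)
print(sum_xy, squared_sum_diff, sum_sq_diff)
```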


Notation

Sometimes things are made more complicated because letters (e.g., X) are sometimes used to refer to entire data sets (as opposed to single variables) and multiple subscripts are used to specify specific data points.


Notation

          Week
Student   1  2  3  4  5
1         7  6  4  2  2
2         3  4  4  3  4
3         3  4  5  4  6

$X_{24} = 3$

$\sum X$ or $\sum X_{ij} = 61$
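A small sketch of the double-subscript idea, assuming (as the table suggests) that the first subscript picks the student (row) and the second picks the week (column). The helper name `element` is hypothetical; it only converts the 1-based subscripts of the notation into Python's 0-based indices.

```python
# Rows are students 1-3, columns are weeks 1-5 (values from the table).
X = [
    [7, 6, 4, 2, 2],
    [3, 4, 4, 3, 4],
    [3, 4, 5, 4, 6],
]


def element(data, i, j):
    """Return X_ij using the 1-based subscripts of the notation."""
    return data[i - 1][j - 1]


x_24 = element(X, 2, 4)                    # student 2, week 4
grand_total = sum(sum(row) for row in X)   # sum over every i and j

print(x_24, grand_total)
```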


Measures of Central Tendency

While distributions provide an overall picture of some data set, it is sometimes desirable to represent the entire data set using descriptive statistics.

The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies.


[Figure: frequency histogram of height. x-axis: Height (inches), bin midpoints 60.5 to 76.5; y-axis: Frequency (0 to 14).]


There are, in fact, three different measures of central tendency.

The first of these is called the mode.

The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.


Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.

However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).


Example: Class height.

Create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.

Value  Freq    Value  Freq
61     3       69     3
62     4       70     2
63     4       71     4
64     4       72     4
65     3       73     0
66     7       74     0
67     5       75     0
68     4       76     1
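The "find the greatest frequency" step can be sketched as follows. The dictionary holds the ungrouped class-height table; the function name `modes` is an assumption for this sketch, and it returns a list because more than one value can tie for the highest frequency.

```python
# Ungrouped frequency table for class height: value -> frequency.
height_freq = {
    61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
    69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1,
}


def modes(freq):
    """Return every value that shares the highest frequency."""
    top = max(freq.values())
    return [value for value, count in freq.items() if count == top]


print(modes(height_freq))
```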


A second measure of central tendency is called the median.

The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).


To find the median, the data points must first be sorted into either ascending or descending numerical order.

The position of the median value can then be calculated using the following formula:

$$\text{Median Location} = \frac{N + 1}{2}$$


1) If there is an odd number of data points:

(1, 3, 3, 4, 4, 5, 6, 7, 12)

$$\text{Median Location} = \frac{9 + 1}{2} = 5$$

The median is the item in the fifth position of the ordered data set; therefore the median is 4.


2) If there is an even number of data points:

(1, 3, 3, 3, 5, 5, 6, 7)

$$\text{Median Location} = \frac{8 + 1}{2} = 4.5$$

The location falls between the fourth and fifth positions, so we take the average of the two adjacent values (3 and 5), in this case giving us 4.
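The two cases can be sketched as one small function built on the median-location formula. The function names `median_location` and `median` are labels chosen for this illustration; Python's own `statistics.median` gives the same answers.

```python
def median_location(n):
    """Position of the median in the ordered data: (N + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Median via the median-location rule (positions are 1-based)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    location = median_location(len(ordered))
    if location.is_integer():
        # Odd N: the location points at a single data point.
        return ordered[int(location) - 1]
    # Even N: the location falls halfway between two adjacent points.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


print(median([1, 3, 3, 4, 4, 5, 6, 7, 12]))
print(median([1, 3, 3, 3, 5, 5, 6, 7]))
```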


Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population).

The mean is the same as what most of us call the average, and it is calculated in the following manner:

$$\bar{X} = \frac{\sum X}{N}$$


For example, given the data set that we used to calculate the median (odd number example), the corresponding mean would be:

$$\bar{X} = \frac{\sum X}{N} = \frac{45}{9} = 5$$


When a distribution is fairly symmetrical, the mean, median, and mode will be quite similar.

However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.


Example: Pizza Eating

Value  Freq    Value  Freq
0      4       8      5
1      2       10     2
2      8       15     1
3      6       16     1
4      6       20     1
5      6       40     1
6      5

Mode = 2 slices per week

Median = 4 slices per week

Mean = 5.7 slices per week

Note that if you were calculating these values, you would show all your steps (it’s good to be a prof!).

This raises the issue of which measure is best.




Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.

As you use the demo, you should easily be able to think about how these changes are also affecting the mode, right?


Measures of Variability

In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre.

This is known as variability.


There are various measures of variability, the most straightforward being the range of the sample:

Highest value minus lowest value

While the range provides a good first pass at variability, it is not the best measure because of its sensitivity to extreme scores (see text).


One approach to estimating variability is to directly measure the degree to which individual data points differ from the mean and then average those deviations.

This is known as the average deviation:

$$\frac{\sum (X - \bar{X})}{N}$$


However, if we try to do this with real data, the result will always be zero, because the negative deviations exactly cancel the positive ones:

Example: (2, 3, 3, 4, 4, 6, 6, 12)

$$\frac{\sum (X - \bar{X})}{N} = \frac{(-3) + (-2) + (-2) + (-1) + (-1) + 1 + 1 + 7}{8} = \frac{0}{8} = 0$$


One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves.

The absolute value of some number is just the number without any sign:

For example: |-3| = 3
And: |+3| = 3


Thus, we could re-write and solve our average deviation question as follows:

$$\text{MAD} = \frac{\sum |X - \bar{X}|}{N} = \frac{3 + 2 + 2 + 1 + 1 + 1 + 1 + 7}{8} = \frac{18}{8} = 2.25$$

Therefore, this data set has a mean of 5 and a MAD of 2.25.
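The “zero problem” and its fix can be seen side by side in a short sketch using the same eight data points. The function names are labels chosen for this illustration.

```python
data = [2, 3, 3, 4, 4, 6, 6, 12]


def mean(values):
    return sum(values) / len(values)


def average_deviation(values):
    """Average of the signed deviations from the mean (always zero)."""
    m = mean(values)
    return sum(x - m for x in values) / len(values)


def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean (MAD)."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


print(mean(data))
print(average_deviation(data))
print(mean_absolute_deviation(data))
```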


Although the MAD is an acceptable measure of variability, the most commonly used measure is variance (denoted $s^2$ for a sample and $\sigma^2$ for a population) and its square root, termed the standard deviation (denoted $s$ for a sample and $\sigma$ for a population).


The computation of variance is also based on the basic notion of the average deviation; however, instead of getting around the “zero problem” by using absolute deviations (as in MAD), the “zero problem” is eliminated by squaring the differences from the mean:

$$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$$


Example: (2, 3, 4, 4, 4, 5, 6, 12)

$$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8} = \frac{66}{8} = 8.25$$


To convert the variance into the SD, we simply take its square root:

$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8}} = \sqrt{8.25} = 2.87$$
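The variance and SD calculation for this example can be sketched directly from the formula with N in the denominator. The function name `population_variance` is a label chosen for this illustration.

```python
import math

data = [2, 3, 4, 4, 4, 5, 6, 12]


def population_variance(values):
    """Average squared deviation from the mean (N in the denominator)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


variance = population_variance(data)
sd = math.sqrt(variance)  # the SD is the square root of the variance

print(variance, round(sd, 2))
```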


This demonstration allows you to play with the mean and standard deviation of a distribution. Note that changing the mean of the distribution simply moves the entire distribution to the left or right without changing its shape. In contrast, changing the standard deviation alters the spread of the data but does not affect where the distribution is “centered”.


Population vs. Sample

As mentioned, we usually deal with statistics, not parameters. $\sigma^2$ and $\sigma$ are parameters. Their counterparts, when dealing with samples, are $s^2$ and $s$. The formulae are slightly different:

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$
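Python's standard `statistics` module happens to provide both versions, which makes the difference easy to see: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by N - 1. The sketch below reuses the eight data points from the variance example, purely as an illustration.

```python
import statistics

data = [2, 3, 4, 4, 4, 5, 6, 12]

pop_var = statistics.pvariance(data)   # sum of squares / N
samp_var = statistics.variance(data)   # sum of squares / (N - 1)
samp_sd = statistics.stdev(data)       # square root of the sample variance

print(pop_var, samp_var, samp_sd)
```

The sum of squared deviations is 66 in both cases; only the denominator changes (8 versus 7), so the sample version is always the larger of the two.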


Properties of a Statistic

So, the mean ($\bar{X}$) and variance ($s^2$) are the descriptive statistics that are most commonly used to represent the data points of some sample.

The real reason that they are the preferred measures of central tendency and variability is because of certain properties they have as estimators of their corresponding population parameters, $\mu$ and $\sigma^2$.


Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance.

Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.

To understand these properties, you first need to understand a concept in statistics called the sampling distribution.




We will discuss sampling distributions off and on throughout the course, and I only want to touch on the notion now.

Basically, the idea is this – in order to examine the properties of a statistic we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.

Check out this demonstration which I hope makes the concept of sampling distributions more clear.
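A toy version of the idea, under loudly stated assumptions: the "population" below is a hypothetical one with just three values (6, 8, 10), and instead of drawing samples at random the sketch lists every possible sample of size 2 drawn with replacement. The statistic examined is the sample mean.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [6, 8, 10]   # hypothetical tiny population
sample_size = 2

# Every possible sample of size 2, drawn with replacement.
samples = list(product(population, repeat=sample_size))

# The statistic of interest here is the sample mean.
sample_means = [Fraction(sum(s), len(s)) for s in samples]

# The sampling distribution: how often each value of the statistic occurs.
sampling_distribution = Counter(sample_means)
mean_of_means = sum(sample_means) / len(sample_means)

for value in sorted(sampling_distribution):
    print(value, sampling_distribution[value])
print("mean of the sample means:", mean_of_means)
```

With real data one would draw many random samples rather than enumerate them all, but the question asked of the resulting distribution is the same.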


1) Sufficiency

A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.


2) Unbiasedness

A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of a number of sample means) is equal to the population parameter it is estimating.

Explanation of N-1 in the $s^2$ formula.


Using this procedure, the mean can be shown to be an unbiased estimator (see p 47).

However, if the $\sigma^2$ formula is used to calculate $s^2$, it turns out to underestimate $\sigma^2$.


The reason for this bias is that, when we calculate $s^2$, we use $\bar{X}$, an estimator of the population mean.

The chances of $\bar{X}$ being EXACTLY the same as $\mu$ are virtually nil, which results in the bias.

To compensate, we use N-1.

Note that this is only true when calculating $s^2$; if you have a measurable population and you want to calculate $\sigma^2$, you use N in the denominator, not N-1.
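The bias can be checked exactly on a toy case. The sketch below assumes a hypothetical population of three values (6, 8, 10) and enumerates every sample of size 2 drawn with replacement; it then averages the variance estimates obtained with N and with N - 1 in the denominator. Fractions are used so the comparison is exact.

```python
from fractions import Fraction
from itertools import product

population = [6, 8, 10]   # hypothetical tiny population
n = 2                     # sample size

mu = Fraction(sum(population), len(population))
sigma_sq = sum((x - mu) ** 2 for x in population) / len(population)


def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)


samples = list(product(population, repeat=n))

# Average estimate over all possible samples, for each denominator.
avg_with_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_with_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print("population variance:", sigma_sq)
print("average estimate using N:", avg_with_n)
print("average estimate using N - 1:", avg_with_n_minus_1)
```

In this toy case the N version averages to half the population variance, while the N - 1 version averages to it exactly.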


Degrees of Freedom

The mean of 6, 8, & 10 is 8.

If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary?


The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample; this is like actually subtracting 1 from the number of observations in your sample.

It is for exactly this reason that we use N-1 in the denominator when we calculate $s^2$ (i.e., the calculation requires that the mean be fixed first, which effectively removes, or fixes, one of the data points).
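The exercise can be sketched as a one-line calculation: once the mean and N - 1 of the values are chosen, the last value is forced. The function name `forced_last_value` is hypothetical, chosen only for this illustration.

```python
def forced_last_value(free_values, mean, n):
    """Return the one value that is determined once the other n - 1 are chosen."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values can be chosen freely")
    # The n values must sum to n * mean, so the last one is whatever is left.
    return n * mean - sum(free_values)


print(forced_last_value([6, 8], mean=8, n=3))
print(forced_last_value([1, 20], mean=8, n=3))
```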


3) Efficiency

The efficiency of a statistic is reflected in the variance that is observed when one computes the statistic (e.g., the mean) on a number of independently chosen samples. The smaller the variance, the more efficient the statistic is said to be.


4) Resistance

The resistance of an estimator refers to the degree to which that estimate is affected by extreme values.

As mentioned previously, both $\bar{X}$ and $s^2$ are highly sensitive to extreme values.
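A quick sketch of what sensitivity to extreme values looks like in practice. It takes the nine data points from the odd-number median example and, as a hypothetical change, replaces the largest score (12) with 120; the median is included only as a more resistant point of comparison.

```python
import statistics

original = [1, 3, 3, 4, 4, 5, 6, 7, 12]
with_outlier = [1, 3, 3, 4, 4, 5, 6, 7, 120]  # hypothetical extreme score

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "s^2:", statistics.variance(data),
    )
```

The mean and the sample variance both jump when the single score changes, while the median does not move.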


4) Resistance

Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency.
