LSP 121 Week 2 Intro to Statistics and SPSS/PASW


Descriptive Statistics: Mean, Median, Percentile, Range

• Mean – the arithmetic average of the data values
• Median – the middle score
  – The score with an equal number of data points above and below
  – If there is an even number of data points, take the average of the middle two
• Percent Rank – the position of a data point in a data set. More precisely, it tells you approximately what percentage of the data is less than the data point.
  – e.g. 86th percentile means that 86 percent of data points / people / etc. were below that number
• Range – the difference between the maximum and minimum values in the data set
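
The definitions above can be sketched in a few lines of Python. The exam scores are hypothetical values invented for this illustration, and `percent_rank` and `value_range` are helper names chosen here, not functions from SPSS or Excel. Statistics packages use slightly different percentile conventions, so treat this as a rough sketch of the idea only.

```python
from statistics import mean, median

# Hypothetical exam scores, invented for this illustration.
scores = [55, 62, 70, 70, 78, 81, 86, 90, 93, 98]


def percent_rank(values, x):
    """Percent of the values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


def value_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)


print("mean:", mean(scores))
print("median:", median(scores))
print("range:", value_range(scores))
print("percent rank of 90:", percent_rank(scores, 90))
```

Seven of the ten scores are below 90, so a score of 90 sits at roughly the 70th percentile of this small data set.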


Median

• Median for bank 1 = the middle value of 11 data points

• Median for bank 2: even number of data points – there is no middle.
  – Take the average of the two middle values


Bank 1: 4.1 5.2 5.6 6.2 6.7 7.2 7.7 7.7 8.5 9.3 11.0

Bank 2: 6.6 6.7 6.7 6.9 6.9 7.1 7.2 7.3 7.4 7.7 7.8 7.8
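
As a rough sketch of the two cases (odd and even number of data points), the median rule can be written out by hand in Python. The function name `median_of` is an assumption made for this example. In practice Python's built-in `statistics.median` does the same job.

```python
def median_of(values):
    """Middle value, or the average of the two middle values."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value.
        return data[mid]
    # Even count: average the two middle values.
    return (data[mid - 1] + data[mid]) / 2


bank1 = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
bank2 = [6.6, 6.7, 6.7, 6.9, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print("Bank 1 median:", median_of(bank1))
print("Bank 2 median:", round(median_of(bank2), 2))
```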

Descriptive Statistics: Quartiles

• Lower quartile: aka first quartile - the median of the data values in the lower half of a data set (do not include the median)
• Middle quartile: aka second quartile - this is the overall median
• Upper quartile: aka third quartile - the median of the data values in the upper half of a data set (do not include the median)
  – Note: Some statistical software packages use the 25th, 50th, and 75th percentiles as their quartiles (instead of median values). SPSS determines quartiles in this way. On an exam, you would use the medians.


Quartiles

• For example (bank waiting times):


Bank 1: 4.1 5.2 5.6 6.2 6.7 7.2 7.7 7.7 8.5 9.3 11.0

Bank 2: 6.6 6.7 6.7 6.9 6.9 7.1 7.2 7.3 7.4 7.7 7.8 7.8

Bank 2 (12 data points):
– median = (7.1 + 7.2)/2 = 7.15
– lower quartile = median of the lower six values = (6.7 + 6.9)/2 = 6.8
– upper quartile = median of the upper six values = (7.4 + 7.7)/2 = 7.55
– range = 7.8 – 6.6 = 1.2
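
The "median of each half" rule for quartiles can be sketched as follows. This is a minimal illustration of the exam method described in these slides, not how SPSS computes quartiles. The function name `quartiles_by_medians` is made up for this example.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (lower quartile, median, upper quartile) using half-medians."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # With an odd count, skip the median itself when forming the upper half.
    upper_half = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower_half), median(data), median(upper_half)


bank1 = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
bank2 = [6.6, 6.7, 6.7, 6.9, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print("Bank 1:", quartiles_by_medians(bank1))
print("Bank 2:", [round(q, 2) for q in quartiles_by_medians(bank2)])
```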

Descriptive Statistics: The Five-Number Summary

• The five number summary consists of:
  – The minimum value
  – The lower quartile (first quartile)
  – The median (second quartile)
  – The upper quartile (third quartile)
  – The maximum value

• As mentioned earlier, SPSS determines quartiles using the percentiles: First quartile is 25th percentile, second quartile is 50th percentile, and third quartile is 75th percentile
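
To see how percentile-based quartiles can differ from the half-median method, here is a small sketch using Python's `statistics.quantiles`. Python's default percentile convention is only one of several in use, so the numbers may not match SPSS output digit for digit. The function name `five_number_summary` is an assumption made for this example.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, 25th/50th/75th percentiles, and maximum."""
    data = sorted(values)
    # n=4 splits the data into four groups and returns the three cut points.
    q1, q2, q3 = quantiles(data, n=4)
    return data[0], q1, q2, q3, data[-1]


bank1 = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
bank2 = [6.6, 6.7, 6.7, 6.9, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print("Bank 1:", [round(v, 3) for v in five_number_summary(bank1)])
print("Bank 2:", [round(v, 3) for v in five_number_summary(bank2)])
```

For the 12 values of Bank 2, the percentile-based first and third quartiles are interpolated between neighboring data points, so they can land on slightly different values than the medians of the lower and upper halves.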


Standard Deviation

• Quartiles are OK for characterizing data, but standard deviation is preferred by statisticians
• It is a measure of how far data values are spread around the mean of a data set
• Formula:
  – Std dev = sqrt( sum of (deviations from the mean)^2 / (total number of data values - 1) )
  – You don’t need to know this formula!
  – Don’t calculate by hand, use statistical software such as SPSS (which we’ll do in a few minutes)
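
For the curious, the formula above can be spelled out step by step in Python and checked against the built-in `statistics.stdev`. The function name `sample_std_dev` is invented for this sketch. The data are the Bank 1 waiting times from the earlier slides.

```python
from math import sqrt
from statistics import stdev


def sample_std_dev(values):
    """Standard deviation with (number of data values - 1) in the denominator."""
    n = len(values)
    m = sum(values) / n
    # Square each deviation from the mean, then add them up.
    sum_of_squares = sum((x - m) ** 2 for x in values)
    return sqrt(sum_of_squares / (n - 1))


bank1 = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

print("by formula:", round(sample_std_dev(bank1), 2))
print("built-in:  ", round(stdev(bank1), 2))
```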


Standard Deviation - Guesstimate

• A simple way to estimate standard deviation is the range estimate
• Don’t rely on estimation – use only to get a very quick and general idea of the value of sd.
• Divide range by 4
• Watch for outliers. They can ruin your range estimate
• What is an outlier?
  – A data value two or more standard deviations from the mean (above OR below)


Standard Deviation

• Go back to Big Bank / Best Bank example
• Big Bank: range = 6.9
  – 6.9 / 4 = 1.7
  – Actual standard deviation is 1.96
• Best Bank: range = 1.2
  – 1.2 / 4 = 0.3
  – Actual standard deviation is 0.44
• Any outliers? The mean of the values listed below is 7.2 for both banks

Big Bank: 4.1 5.2 5.6 6.2 6.7 7.2 7.7 7.7 8.5 9.3 11.0

Best Bank: 6.6 6.7 6.7 6.9 7.1 7.2 7.3 7.4 7.7 7.8 7.8
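
The range estimate and the outlier check can be tried on the two lists above with a short Python sketch. The helper names `range_estimate` and `outliers` are assumptions for this example, and the cutoff of two standard deviations follows the definition given in these slides.

```python
from statistics import mean, stdev


def range_estimate(values):
    """Quick guess at the standard deviation: range divided by 4."""
    return (max(values) - min(values)) / 4


def outliers(values, k=2):
    """Values that are k or more standard deviations from the mean."""
    m, sd = mean(values), stdev(values)
    return [x for x in values if abs(x - m) >= k * sd]


big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, data in [("Big Bank", big_bank), ("Best Bank", best_bank)]:
    print(name)
    print("  range estimate:", round(range_estimate(data), 1))
    print("  actual std dev:", round(stdev(data), 2))
    print("  outliers:", outliers(data))
```

Neither list contains a value that is two or more standard deviations from its mean, so both outlier lists come back empty.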


* Histograms

• Nice way to view a data set
• A histogram is a chart created by defining a set of bins and counting how many data points lie in each bin. Bars are drawn with height proportional to the number of data points in each bin.
  – * Note: The histogram does not keep track of the value of each data point – it only keeps track of which bin a data point is contained in.
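
The binning step can be sketched in Python as follows. The bin start (4) and bin width (2) are arbitrary choices made for this illustration, the function name `histogram_counts` is invented here, and the data are the Bank 1 waiting times from the earlier slides.

```python
from collections import Counter


def histogram_counts(values, start, width):
    """Count how many values fall in each bin [left, left + width)."""
    counts = Counter()
    for x in values:
        index = int((x - start) // width)   # which bin x belongs to
        left = start + index * width
        counts[(left, left + width)] += 1   # the value itself is not stored
    return dict(sorted(counts.items()))


waits = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
counts = histogram_counts(waits, start=4, width=2)

# Draw a rough text histogram: one '#' per data point in the bin.
for (left, right), count in counts.items():
    print(f"{left:>2} to {right:<2} {'#' * count} ({count})")
```

Only the counts per bin survive, which is exactly the point of the note above: once the data are binned, the individual values are gone.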


Example Histogram: Salaries of 26 Men’s Basketball Coaches


What is the most common salary according to this graph? How many coaches make this amount?
– Between $50,000 and $100,000. Most of the coaches (15).

How many coaches make less than $50,000?
– Only 1.

How many make more than $100,000?
– About 10.

These would make for good exam questions…

Statistics and SPSS/PASW

• While Excel can do some basic statistics, it is not considered a serious statistics tool

• You really should use something like SPSS/PASW or SAS

• We’ll use SPSS/PASW since DePaul has a site license


Let’s Try An Example

• Copy the dataset grades.xls (from the QRC web page: Excel Files -> Older Data) to My Documents and start SPSS
  – or try the file IncomeGaps.xls
• Open the Grades.xls spreadsheet
  – Note: SPSS looks for files with an extension of .sav. However, Excel files have an .xls extension. You must select the ‘Files of Type’ dropdown to tell SPSS to search for XLS (i.e. Excel) files.
• Change the variable names and make sure the data is numeric, not text
  – Click on the ‘Variable View’ tab at the bottom
  – For each of the two rows, click the cell under ‘Type’ and choose Numeric.
  – Then click back to ‘Data View’
• Click on Analyze -> Descriptive Statistics -> Frequencies
  – Copy any variables that you want to analyze (e.g. exam 1 and exam 2) into the box on the right


Let’s Try An Example

• Be careful! If the numeric fields in the dataset contain any $, %, or # characters, SPSS will have difficulty converting them to numeric
• In particular, if the data has dollar signs, have SPSS first convert the field to Dollar, then convert it to Numeric (IncomeGaps.xls)
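
To show why symbols such as $ get in the way of a numeric conversion, here is a small Python sketch of the cleanup that has to happen before text can be treated as a number. This is not what SPSS does internally. The sample strings are hypothetical and are not taken from IncomeGaps.xls, and `to_number` is a name made up for this example.

```python
def to_number(text):
    """Strip currency and formatting symbols, then convert to a float."""
    cleaned = text.strip()
    for symbol in ("$", "%", "#", ","):
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned)


# Hypothetical cell contents as they might be typed into a spreadsheet.
raw_values = ["$41,250", "$38,900.50", "12%", "#7"]

for raw in raw_values:
    print(f"{raw!r:>14} -> {to_number(raw)}")
```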


Let’s Try An Example

• Using the grades for Exam 2, find the
  – 5 number summary (minimum, 1st quartile, median, 3rd quartile, maximum)
    • See this link for instructions: http://condor.depaul.edu/~sjost/it223/documents/spss-tutorial.htm#PartJ
  – Mean
  – Range
  – What is the standard deviation?

Listing Z-Values

• A good stats package will make it easy to determine z-values
• Click on Analyze -> Descriptive Statistics -> Descriptives
• Choose the variable, let’s use Exam2
• Be sure to check ‘Save standardized values as variables’ at the bottom
• When you return to the ‘Data View’ you will see that a new column has appeared giving you the z-score for every value in the Exam2 data set
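
A z-score says how many standard deviations a value lies above or below the mean: z = (value - mean) / standard deviation. The sketch below computes the same kind of column in Python, assuming the sample standard deviation (the "- 1" version from the Standard Deviation slide). The Exam 2 scores are hypothetical values invented for this illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize each value: (value - mean) / standard deviation."""
    m, sd = mean(values), stdev(values)
    return [(x - m) / sd for x in values]


# Hypothetical Exam 2 scores, invented for this illustration.
exam2 = [55, 62, 70, 70, 78, 81, 86, 90, 93, 98]

for score, z in zip(exam2, z_scores(exam2)):
    print(f"{score:>3}  z = {z:+.2f}")
```

Scores below the mean get negative z-scores and scores above the mean get positive ones. Using the definition from the Guesstimate slide, a z-score of 2 or more (or -2 or less) would flag an outlier.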


Pivot Tables

• Let’s say you have just performed a survey.
• One of the questions you ask is: “What type of home computer Internet connection do you have?”
  – Answers can be: None, Dial-up, DSL, Cable, Other, Not Sure.


Pivot Tables

• Here are some of your results


Respondent ID   Cable Type
11111           no
11112           ds
11113           cm
11114           dk
11115           du
11116           du

Where no = none; ds = dsl; cm = cable modem; du = dial up; dk = don’t know; ot = other

Pivot Tables

• You can use SPSS to count the occurrences of data items, just like a pivot table

• Open a new file: File -> New
• Enter your data into SPSS (you can leave out the IDs for now)
• Click on Analyze / Descriptive Statistics / Frequencies
• Move the variable that you want to count from the left box to the right box
• Make sure Display Frequencies Table is checked
• Run it (Click ‘OK’)
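
The counting that the Frequencies table performs can be sketched in Python with `collections.Counter`, using the six survey responses shown earlier. The layout of the printed table is an arbitrary choice for this example and does not reproduce the SPSS output format.

```python
from collections import Counter

# Code labels from the survey key.
labels = {
    "no": "none", "ds": "dsl", "cm": "cable modem",
    "du": "dial up", "dk": "don't know", "ot": "other",
}

# The six responses listed in the results table (IDs left out).
responses = ["no", "ds", "cm", "dk", "du", "du"]

counts = Counter(responses)
total = len(responses)

print(f"{'Type':<12} {'Count':>5} {'Percent':>8}")
for code, count in counts.most_common():
    print(f"{labels[code]:<12} {count:>5} {100 * count / total:>7.1f}%")
```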


Crosstabulations (Crosstabs)

• Crosstabs are an extension of pivot tables
• Let’s say you have asked a number of students: How many schools did you apply to?
• You get results something like the following (in a spreadsheet):


Crosstabs


Respondent ID   Sex   # of schools
1               F     6
2               M     2
3               F     7
4               M     4
5               F     9
6               F     10
7               M     3
8               M     2
9               F     7
10              F     5

Crosstabs

• Now open the data in SPSS
• Then pull down the menu Analyze and click on Descriptive Statistics, then Crosstabs
• What variable do you want in the row? The column?
  – We are probably interested in examining how many schools females apply to relative to males
• When ready, click OK to perform the crosstab.