
Lectures on SPSS 2010


1. What is SPSS?

SPSS stands for the "Statistical Package for the Social Sciences". It is composed of two inter-related facets: the statistical package itself and the SPSS language, a system of syntax used to execute commands and procedures. When you use SPSS, you work in one of several windows: the Data View, the Variable View, the output viewer, the draft output viewer, and the script window.

2. Data File for Practical Sessions

For the purpose of the lectures we are going to use the SPSS file called File to be used for SPSS Lectures.sav. A survey was conducted among 100 students (that is why the Data View has 100 rows: one row for each student) and data were collected for 8 variables (each variable occupies one column):

1. Percentage Marks obtained on SPSS exam in January
2. Percentage Marks obtained on SPSS exam in April
3. Percentage Marks obtained in Computer Studies
4. Percentage of lectures attended from January to April
5. Numeracy level on a scale of 1 to 20
6. University: UOM or UTM?
7. Socio Economic Status of Student
8. Overall Grade in previous semester

A survey was conducted among 100 singers who have sold CDs during the year 2007. Data was collected for the following 4 variables:

1. Advertising Budget (thousands of rupees) for the CD
2. No. of CDs Sold (thousands)
3. No. of times songs are played on Radio 1 during the week prior to its release
4. Attractiveness of the Singer on a scale of 1 to 10

When you first open SPSS for Windows, the first thing you will see is the Data Editor. The Data Editor consists of two windows. The default is the Data View (Figure 1), which has a spreadsheet-like interface, much like Excel, in which data can be entered and viewed. This is where you can start inputting your data. The other window is the Variable View (Figure 3), in which the types of the variables are specified and viewed. This is where you type the questions from your questionnaire into SPSS and define the codes or labels (e.g. 0, 1, 2, ...) for categorical variables (e.g. males, females: Male = 1, Female = 0). The user can toggle between the two windows by clicking the appropriate tabs on the bottom left of the screen (Figure 2). By default SPSS aligns numerical data entries to the right-hand side of the cells and text (string) entries to the left-hand side, and uses a period/full stop to indicate missing numerical values. You may also use numbers to represent missing data. For example, data were missing for two students for Numeracy. The missing data have been represented by 98 (data missing because the student was absent) and 99 (data missing because the student had an exemption).

Figure 1: Data View
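If you prefer SPSS syntax to the menus, the same missing-value definition can be made with a short command. This is only a sketch: it assumes the Numeracy column has been given the (hypothetical) name numeracy in Variable View. Paste it into a Syntax window (File – New – Syntax) and run it:

MISSING VALUES numeracy (98, 99).
VALUE LABELS numeracy
  98 'Missing Data because student was absent'
  99 'Missing Data because student was exempted'.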

When labels have been assigned to the category codes of a categorical variable, these can be displayed by checking Value Labels under the View menu (or by clicking the Value Labels button on the toolbar). For example, in the column uni you can either display the codes 0 and 1 or display University of Mauritius and University of Technology. The appearance of the Data View spreadsheet is controlled by the View drop-down menu; this can be used to change the font in the cells, remove lines, and make value labels visible.

Figure 2: Toggle between Data/Variable View

Figure 3: Variable View


3. Variable View

In Variable View each variable definition occupies a row of the spreadsheet. (The first column in Data View is defined in the first row of Variable View, the second column in the second row, and so on.) As soon as data are entered under a column in the Data View, a default name for that column appears as a row in the Variable View. That is why you should NOT type the names of the columns into the Data View as shown in Figure 4: the names will appear automatically once you define the variables in Variable View, as shown in Figure 3.

Figure 4

There are 10 characteristics to be specified under the columns of the Variable View:

Name — the chosen variable name. This can be up to eight alphanumeric characters but must begin with a letter. While the underscore (_) is allowed, hyphens (-), ampersands (&), and spaces cannot be used. Variable names are not case sensitive. Name is used for internal processing by SPSS and does not appear on the output SPSS generates.

Label — a label attached to the variable name. In contrast to the name, this is not confined to eight characters and spaces can be used. It is generally a good idea to assign variable labels (here you can paste the sentences from your questionnaire which you have typed in MS Word). They are helpful for reminding users of the meaning of variables (placing the cursor over the variable name in the Data View will make the variable label appear) and are displayed in the output from statistical analyses.

Type — the type of data. SPSS provides a default variable type once variable values have been entered in a column of the Data View. The type can be changed by highlighting the respective entry in the second column of the Variable View and clicking the three-period symbol (…) appearing on the right-hand side of the cell. This opens the Variable Type box, which offers a number of data types, including various formats for numerical data, dates, and currencies. (Note that a common mistake made by first-time users is to enter categorical variables as type "string" by typing text into the Data View. To enable later analyses, categories should be given artificial number codes and defined to be of type "numeric".)

Decimals — the number of digits displayed to the right of the decimal point. This is not relevant for string data, for which the entry under the fourth column is a greyed-out zero. The value can be altered in the same way as the value of Width. For example, the value 879.45 has 2 decimals. Note that Decimals must be adjusted before adjusting Width.

Width — the width of the actual data entries. The default width of numerical variable entries is eight. The width can be increased or decreased by highlighting the respective cell in the third column and using the upward or downward arrows appearing on the right-hand side of the cell, or simply by typing a new number into the cell. For example, the value 879.45 has a width of 6.


Values — labels attached to category codes. For categorical variables, an integer code should be assigned to each category and the variable defined to be of type "numeric". When this has been done, clicking the respective cell under the sixth column of the Variable View makes the three-period symbol appear; clicking it opens the Value Labels dialogue box, which in turn allows assignment of labels to category codes. For example, our data set includes the categorical variable uni indicating the university of the student. Clicking the three-period symbol opens the dialogue box shown in Figure 5, where numerical code "0" was declared to represent University of Mauritius and code "1" University of Technology.

Figure 5: Value Labels

Missing — missing value codes. SPSS recognizes the period symbol as indicating a missing value. If other codes have been used (e.g. 98, 99), these have to be declared to represent missing values by highlighting the respective cell in the seventh column, clicking the three-period symbol, and filling in the resulting Missing Values dialogue box accordingly.

Figure 6: Defining Missing Values

Columns — the width of the variable column in the Data View. The default cell width for numerical variables is eight. Note that when the Width value is larger than the Columns value, only part of a data entry might be visible in the Data View. The cell width can be changed in the same way as the width of the data entries, or simply by dragging the relevant column boundary. (Place the cursor on the right-hand boundary of the title of the column to be resized; when the cursor changes into a vertical line with left and right arrows, drag it to the right or left to increase or decrease the column width.)

Align — alignment of variable entries. The SPSS default is to align numerical variables to the right-hand side of a cell and string variables to the left. It is generally helpful to adhere to this default, but if necessary alignment can be changed by highlighting the relevant cell in the ninth column and choosing an option from the drop-down list.

Measure — the measurement scale of the variable. The default chosen by SPSS depends on the data type: for variables of type "numeric" the default measurement scale is a continuous or interval scale (referred to by SPSS as "scale"), while for variables of type "string" the default is a nominal scale. The third option, "ordinal", is for categorical variables with ordered categories, but is not used by default. It is good practice to assign each variable the highest appropriate measurement scale ("scale" > "ordinal" > "nominal") since this has implications for the statistical methods that are applicable. The default setting can be changed by highlighting the respective cell in the tenth column and choosing an appropriate option from the drop-down list.
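All ten characteristics can also be set from syntax instead of the Variable View grid. A sketch, assuming the university and numeracy columns are named uni and numeracy (hypothetical names; check your own file):

VARIABLE LABELS uni 'University: UOM or UTM?'.
VALUE LABELS uni 0 'University of Mauritius' 1 'University of Technology'.
MISSING VALUES numeracy (98, 99).
VARIABLE LEVEL uni (NOMINAL) /numeracy (SCALE).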

4. Practice session 1 – Data input

Description of Data. In this session, we consider two data sets. The first, shown in Table 2.1, involves the lifespans of two groups of rats, one group given a restricted diet and the other an ad libitum diet (that is, "free eating"). Interest lies in assessing whether lifespan is affected by diet. The labels used are 1 = restricted diet and 2 = ad libitum diet. The second data set, shown in Table 2.2, gives the ages at marriage for a sample of 100 couples who applied for marriage licences in Cumberland County, PA, in 1993. Some of the questions of interest about these data are as follows: How is age at marriage distributed? Is there a difference in average age at marriage between men and women? How are the ages at marriage of husband and wife related? The labels used are 0 = female and 1 = male.

[Tables 2.1 and 2.2 are not reproduced here. In Table 2.1 the restricted diet group has n = 106 rats and the ad libitum group n = 89.]


You have to input the data in Table 2.1 as follows.

Launch SPSS for Windows.

Switch to Data View. In the first column input the rat number. There are 195 rats, therefore this column contains the numbers 1, 2, 3, …, 195.

The second column would contain the lifespans (i.e. the data given in Table 2.1).

The third column should contain the code 1 (restricted diet) for the first 106 rats and the code 2 (ad libitum diet) for the remaining 89 rats.


STEP 1 Figure 7: Entering Data in Data View

STEP 2 Then go to Variable View and define the Name and Label of each variable as shown in Figure 8.

Figure 8: Defining Variables

STEP 3 Click on the three-period symbol (…) in the cell where the row diet and the column Values meet, as shown in Figure 9. This opens the dialogue box shown in Figure 10.

Figure 9


Step 4 Type 1 in Value and Restricted diet in Value Label, then click on Add. Repeat for 2 and Ad libitum diet, then click on OK. Switch to Data View.
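Steps 3 and 4 can equally be done in one syntax command (assuming the diet column is named diet, as in Figure 8):

VALUE LABELS diet 1 'Restricted diet' 2 'Ad libitum diet'.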

Click on the Value Labels button on the toolbar to see the labels:


Note: The Statistics Menus The drop-down menus available after selecting Data, Transform, Analyze, or Graphs from the menu bar provide procedures concerned with different aspects of a statistical analysis. They allow manipulation of the format of the data spreadsheet to be used for analysis (Data), generation of new variables (Transform), running of statistical procedures (Analyze), and construction of graphical displays (Graphs). Most statistics menu selections open dialogue boxes. The dialogue boxes are used to select variables and options for analysis. A main dialogue for a statistical procedure has several components: A source variables list is a list of variables from the Data View spreadsheet that can be used in the requested analysis. Only variable types that are allowed by the procedure are displayed in the source list. Variables of type “string” are often not allowed.

5. Practice session 2 – Univariate Analyses

The analysis of almost every data set should begin with an examination of relevant summary statistics and a variety of graphical displays. SPSS supplies standard summary measures of the location and spread of the distribution of a continuous variable, together with a variety of useful graphical displays. The easiest way to obtain the data summaries is to select the commands:

Analyze – Descriptive statistics – Explore

Figure 3.1


Transfer Life Spans of Rats to the Dependent List; the Dependent List declares the continuous variables. We would like output generated for the diets separately, so diet is transferred to the Factor List. Labeling the observations by the rat's ID number will enable possible outlying observations to be identified.

Figure 3.2
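The whole Explore specification, including the plots chosen in the next step, corresponds to a single EXAMINE command. A sketch, assuming the variables are named lifespan, diet, and ratno (the ID column; hypothetical names):

EXAMINE VARIABLES=lifespan BY diet
  /ID=ratno
  /PLOT BOXPLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.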


For graphical displays of the data we again need the Explore dialogue box; in fact, by selecting Both under Display in this box, we can get our descriptive statistics and the plots we require. Here we select Boxplots and Histogram to display the distributions of the lifespans of the rats, and probability plots to assess more directly the assumption of normality within each dietary group. We can now move on to examine the graphical displays we have selected. The box plots are shown in Figure 3.3.

Figure 3.3

This type of plot (also known as a box-and-whisker plot) provides a "picture" of a five-point summary of the sample observations in each group. The lower end of the box represents the lower quartile and the upper end the upper quartile; thus the box covers the IQR, the middle 50% of the data. The horizontal line within the box is placed at the median of the sample. The bottom "whisker" extends to the minimum data point in the sample, unless this point is deemed an outlier by SPSS. (SPSS calls a point an "outlier" if it is more than 1.5 × IQR away from the box and considers it an "extreme value" when it is more than 3 × IQR away.)


If the minimum is an outlier, the whisker extends to the second lowest value, unless that too is found to be an outlier, and so on. The top whisker extends to the maximum value in the sample, again provided this value is not an outlier. The box plots in Figure 3.3 lead to the same conclusions as the descriptive summaries: lifespans in the restricted diet group appear to be longer "on average" but also more variable.

How to report? A number of rats have been indicated as possible outliers and, for the ad libitum diet, some are even marked as extreme observations. Since we have employed case labels, we can identify the rats with very short lifespans; here the rat with the shortest lifespan (89 days) is rat number 107. (Lifespans that are short relative to the bulk of the data can arise as a result of negative skewness of the distributions. Observations labeled "outliers" by SPSS do not necessarily have to be removed before further analyses, although they do merit careful consideration; here we shall not remove any of the suspect observations.) The evidence from both the summary statistics for the observations in each dietary group and the box plot is that the distributions of the lifespans in the underlying populations are non-symmetric and that the variances of the lifespans differ between the diet groups. Such information is important in deciding which statistical tests are most appropriate for testing hypotheses of interest about the data, as we shall see later. For a data set containing outliers, it is NOT recommended to use the mean, standard deviation, or variance: use the median instead of the mean, and the inter-quartile range instead of the standard deviation/variance. Figure 3.4 shows the descriptive statistics supplied by default (further statistics can be requested from Explore via the Statistics sub-dialogue box).


Figure 3.4

How to report? Write one statement on the measures of central tendency (mean, median, or mode): The median lifespan is shorter for rats on the ad libitum diet (710 days compared with 1035.5 days for rats on the restricted diet). A similar conclusion is reached when either the mean or the 5% trimmed mean is used as the measure of location. Write one statement on the measures of dispersion (interquartile range, standard deviation, or variance): The "spread" of the lifespans as measured by the interquartile range (IQR) appears to vary with diet, with lifespans in the restricted diet group being more variable (the IQR in the restricted diet group is 311.5 days, but only 121 days in the ad libitum diet group). Other measures of spread, such as the standard deviation and the range of the sample, confirm the increased variability in the restricted diet group. Another way to state this is to say that the lifespans in the ad libitum diet group are more consistent.


Write one statement on the measure of symmetry (index of skewness) and one statement on the measure of peakedness (index of kurtosis). SPSS provides measures of two aspects of the "shape" of the lifespan distributions in each dietary group, namely skewness and kurtosis. The index of skewness takes the value zero for a symmetrical distribution; a negative value indicates a negatively skewed distribution and a positive value a positively skewed distribution (Figure 3.5 shows an example of each type). Both distributions show some degree of negative skewness: there is a concentration of smaller values. The kurtosis index measures the extent to which the peak of a unimodal frequency distribution departs from the shape of a normal distribution. A value of zero corresponds to a normal distribution (A); positive values indicate a distribution that is more pointed than a normal distribution (C), and a negative value a flatter distribution (B). Figure 3.6 shows examples of each type. For both data sets, the distributions are more pointed than a normal distribution. Such findings have possible implications for later analyses that may be carried out on the data.

Figure 3.5

Figure 3.6: Curves with different degrees of kurtosis



7. Histogram An alternative to the box plot for displaying sample distributions is the histogram. Figure 3.7 shows the histograms for the lifespans under each diet. Each histogram displays the frequencies with which certain ranges (or “bins”) of lifespans occur within the sample. SPSS chooses the bin width automatically, but here we chose both our own bin width (100 days) and the range of the x-axis (100 days to 1500 days) so that the histograms in the two groups were comparable. To change the default settings to reflect our choices, we go through the following steps: Figure 3.7


As we might expect the histograms indicate negatively skewed frequency distributions with the left-hand tail being more pronounced in the restricted diet group.

8. Practice session 3 – Univariate Analyses (Contd.)

For the purpose of this session use the SPSS file called: File to be used for SPSS Lectures.sav.

Carry out the univariate analyses on the variable Percentage Marks obtained on SPSS exam in January in order to compare the performance of the students of UOM and UTM.

Figure 3.8


Figure 3.8b

You have to generate the following output and compare the performances of the students of UTM and UOM. Generate the Q-Q plot. Carry out the normality tests.


Descriptives

Percentage Marks obtained on SPSS exam in January

University: UOM or UTM?                                 Statistic   Std. Error
University of Mauritius
  Mean                                                  54.440      2.7779
  95% Confidence Interval for Mean   Lower Bound        48.858
                                     Upper Bound        60.022
  5% Trimmed Mean                                       53.900
  Median                                                53.000
  Variance                                              385.843
  Std. Deviation                                        19.6429
  Minimum                                               22.0
  Maximum                                               99.0
  Range                                                 77.0
  Interquartile Range                                   32.750
  Skewness                                              .259        .337
  Kurtosis                                              -.893       .662
University of Technology
  Mean                                                  61.760      3.1774
  95% Confidence Interval for Mean   Lower Bound        55.375
                                     Upper Bound        68.145
  5% Trimmed Mean                                       62.333
  Median                                                67.500
  Variance                                              504.798
  Std. Deviation                                        22.4677
  Minimum                                               15.0
  Maximum                                               97.0
  Range                                                 82.0
  Interquartile Range                                   40.250
  Skewness                                              -.482       .337
  Kurtosis                                              -.931       .662

9.1 Normality tests

Note that the Kolmogorov-Smirnov and Shapiro-Wilk tests are used to test whether a variable is normally distributed. If the Sig. value of the test is less than 5% (0.05), we conclude that the variable does not follow a Normal distribution. In case of conflict between the two tests, we report the Shapiro-Wilk test. To obtain the normality tests, click on Analyze – Descriptive Statistics – Explore, then click on Plots and tick Normality plots with tests as shown in Figure 3.8b.
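The same tests can be requested from syntax through the NPPLOT keyword of EXAMINE. A sketch, assuming the January marks and university variables are named janmark and uni (hypothetical names):

EXAMINE VARIABLES=janmark BY uni
  /PLOT NPPLOT.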


Tests of Normality

Percentage Marks obtained on SPSS exam in January

University: UOM or UTM?       Kolmogorov-Smirnov(a)           Shapiro-Wilk
                              Statistic   df    Sig.          Statistic   df    Sig.
University of Mauritius       .109        50    .192          .962        50    .111
University of Technology      .140        50    .016          .935        50    .008

a Lilliefors Significance Correction

How to report? (a) H0: Percentage Marks obtained in the SPSS exam in January by UOM students follows a Normal distribution H1: Percentage Marks obtained in the SPSS exam in January by UOM students does not follow a Normal distribution Test: Kolmogorov-Smirnov

Statistic = .109 p-value = 0.192 Conclusion: The Percentage Marks obtained on SPSS exam in January for UOM students follows a Normal distribution as its Sig. = 0.192 > 0.05. Accept H0. (b) H0: Percentage Marks obtained in the SPSS exam in January by UTM students follows a Normal distribution H1: Percentage Marks obtained in the SPSS exam in January by UTM students does not follow a Normal distribution Test: Kolmogorov-Smirnov

Statistic = .140 p-value = 0.016 Conclusion: The Percentage Marks obtained on the SPSS exam in January by UTM students do not follow a Normal distribution, as Sig. = 0.016 < 0.05. Reject H0.

9.2 Q-Q Plot

Finally, normality can be assessed more formally with the help of a quantile-quantile probability plot (Q-Q plot); this involves a plot of the quantiles expected from a standard normal distribution against the observed quantiles. Such a plot for the observations in each group is shown in Figure 3.9. A graph in which the points lie approximately on the reference line indicates normality; points above the line indicate that the observed quantiles are lower than expected, and vice versa. For the rat lifespan data we find that the very small and very large quantiles are smaller than would be expected, with this being most pronounced for the lowest three quantiles in the ad libitum group. Such a picture is characteristic of distributions with a heavy left tail; thus again we detect some degree of negative skewness.

Figure 3.9

10. Frequency Analysis

Frequency analyses are used both for cleaning the data and for counting. Let us carry out a frequency analysis to find out how many students were awarded the various numeracy levels. Click on Analyze – Descriptive Statistics – Frequencies as shown in Figure 10.1.

Figure 10.1

Move the variable Numeracy level on a scale of 1 to 20 to the list of Variable(s) as shown in Figure 10.2. Finally click on OK.


Figure 10.2
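Equivalently, the whole frequency analysis is one line of syntax (again assuming the variable is named numeracy):

FREQUENCIES VARIABLES=numeracy.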

The following output will be generated. There were 4 students who obtained Numeracy level 1.00 and 15 students who obtained level 2.00. Note that Percent takes missing data into consideration when calculating the percentage, whereas Valid Percent ignores missing data. There are two missing cases for the variable Numeracy level. For example, for Numeracy level 1.00: Percent = 4/100 × 100 = 4.0% and Valid Percent = 4/98 × 100 = 4.1%.

Numeracy level on a scale of 1 to 20

                                                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1.00                                     4           4.0       4.1             4.1
          2.00                                     15          15.0      15.3            19.4
          3.00                                     15          15.0      15.3            34.7
          4.00                                     17          17.0      17.3            52.0
          5.00                                     13          13.0      13.3            65.3
          6.00                                     8           8.0       8.2             73.5
          7.00                                     9           9.0       9.2             82.7
          8.00                                     9           9.0       9.2             91.8
          9.00                                     2           2.0       2.0             93.9
          10.00                                    3           3.0       3.1             96.9
          12.00                                    1           1.0       1.0             98.0
          13.00                                    1           1.0       1.0             99.0
          14.00                                    1           1.0       1.0             100.0
          Total                                    98          98.0      100.0
Missing   Missing Data because student was absent  1           1.0
          Missing Data because student was
          exempted                                 1           1.0
          Total                                    2           2.0
Total                                              100         100.0

11. Comparing two groups.

11.1 Comparing two independent groups.


11.2 Comparing two dependent/related groups


12. Practice session 4 – Comparing two independent groups: Mann Whitney U-test and Independent Samples t-test

The two groups of rats are independent (two rats of the same family are not in different groups). Who lives longer: those on the restricted diet or those on the ad libitum diet? We wish to compare the lifespan (the only variable) of the two groups (restricted diet and ad libitum diet).

12.1 Mann Whitney U-test

H0: Population median of lifespan in the restricted diet group = population median of lifespan in the ad libitum group
H1: Population median of lifespan in the restricted diet group ≠ population median of lifespan in the ad libitum group

Step 1: Using normality tests, verify that the variable lifespan does NOT follow a Normal distribution.

Tests of Normality

                     Kolmogorov-Smirnov(a)            Shapiro-Wilk
                     Statistic   df     Sig.          Statistic   df     Sig.
Lifespan of rats     .086        195    .001          .974        195    .001

a Lilliefors Significance Correction

Since Sig. = 0.001 < 0.05, we conclude that at the 5% significance level lifespan does not follow a Normal distribution. This means that we cannot use the independent samples t-test to test H0, so we will use the Mann-Whitney U-test.

Step 2: Click

Analyze – Non-parametric Tests – 2 Independent-Samples…


Step 3: Transfer lifespan to Test Variable. Transfer diet to Grouping Variable. The Test Variable(s) list contains the variables that are to be compared between two levels of the Grouping Variable. Step 4: Click on Define Groups. Step 5: Type in 1 and 2 which are the codes we have used for the two diets. Here the variable lifespan is to be compared between level “1” and “2” of the grouping variable diet. The Define Groups… sub-dialogue box is used to define the levels of interest. In this example, the grouping variable has only two levels, but pair-wise group comparisons are also possible for variables with more than two group levels.

Step 6: Click on Continue and then OK. Note that the Mann-Whitney U test is selected by default. The output includes the following table.

Ranks

                     Diet               N     Mean Rank   Sum of Ranks
Lifespan of rats     Restricted diet    106   128.70      13642.50
                     Ad libitum diet    89    61.43       5467.50
                     Total              195

In the table, the group with the higher mean rank is the group with the greater number of high scores within it. Together with the significant test result below, this allows us to conclude that the restricted diet group has significantly longer lifespans.


Test Statistics(a)

                           Lifespan of rats
Mann-Whitney U             1462.500
Wilcoxon W                 5467.500
Z                          -8.291
Asymp. Sig. (2-tailed)     .000

a Grouping Variable: Diet

How to report? H0: Population median of lifespan in the restricted diet group = population median of lifespan in the ad libitum group. H1: Population median of lifespan in the restricted diet group ≠ population median of lifespan in the ad libitum group. Test: Mann-Whitney U test. Statistics: Z = -8.291, p = 0.000 (note p = Asymp. Sig. (2-tailed)). Conclusion: Since p = 0.000 < 5%, we reject H0. At the 5% level of significance, there is a significant difference in the lifespans of rats on the two diets. Note we are using a 2-tail test (as indicated by ≠ in H1). Since the restricted diet group has the higher mean rank, it has significantly longer lifespans.
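Steps 2 to 6 reduce to the following syntax (a sketch under the same assumed variable names lifespan and diet):

NPAR TESTS
  /M-W= lifespan BY diet(1 2).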

12.2 Independent Samples t-test

We shall apply an independent samples Student's t-test, conveniently ignoring for the moment the indication given by our preliminary examination of the data that two of the assumptions of the test, normality and homogeneity of variance, might not be strictly valid. H0: Population mean of lifespan in the restricted diet group = population mean of lifespan in the ad libitum group. H1: Population mean of lifespan in the restricted diet group ≠ population mean of lifespan in the ad libitum group. Step 1: Using normality tests, verify that the variable lifespan follows a Normal distribution. (Let us ignore for the moment that lifespan does not in fact follow a normal distribution.) Step 2: Click

Analyze – Compare Means – Independent-Samples T Test


Figure 12.1

Step 3: Transfer lifespan to Test Variable. Transfer diet to Grouping Variable. The Test Variable(s) list contains the variables that are to be compared between two levels of the Grouping Variable. Step 4: Click on Define Groups. Step 5: Type in 1 and 2 which are the codes we have used for the two diets. Here the variable lifespan is to be compared between level “1” and “2” of the grouping variable diet. The Define Groups… sub-dialogue box is used to define the levels of interest. In this example, the grouping variable has only two levels, but pair-wise group comparisons are also possible for variables with more than two group levels. Figure 12.2


Step 6: Click on Continue, then on OK. The following output will be generated:

This begins with a number of descriptive statistics for each group. (Note that the standard errors of the means are given, i.e., the standard deviation of lifespan divided by the square root of the group sample size.) The next part of the display gives the results of applying two versions of the independent samples t-test. The first is the usual form, based on assuming equal variances in the two groups (i.e., homogeneity of variance); it uses the estimated mean difference (about 284.7 days) and the standard error of this estimate (32.9 days) to construct a 95% CI for the mean difference (from 219.9 to 349.6 days). The mean lifespan in the restricted diet group is therefore between about 220 and 350 days longer than the corresponding value in the ad libitum diet group. The "Independent Samples Test" table also includes the statistical significance test proposed by Levene (1960) for testing the null hypothesis that the variances in the two groups are equal. In this instance, the test suggests that there is a significant difference in the size of the within-diet variances (p < 0.001).

How to report? H0: The variances of lifespan in the two groups are homogeneous. H1: The variances of lifespan in the two groups are not homogeneous. Test: Levene's Test. Statistic: F = 33.433, p = 0.000 < 5%. Conclusion: At the 5% level of significance, we reject H0 and conclude that the variances cannot be assumed to be equal.


Consequently, it may be more appropriate here to use the alternative version of the t-test given in the second row of the table.

This version of the t-test uses separate variances instead of a pooled variance to construct the standard error and reduces the degrees of freedom to account for the extra variance. t = 9.161 p = 0.000 < 5%

How to report? H0: Population mean of lifespan in the restricted diet group = population mean of lifespan in the ad libitum group. H1: Population mean of lifespan in the restricted diet group ≠ population mean of lifespan in the ad libitum group. Test: Independent Samples t-test. Statistics: t = 9.161, p = 0.000 < 5% (note p = Sig. (2-tailed)). Conclusion: Since p = 0.000 < 5%, we reject H0. At the 5% level of significance, there is a significant difference in the lifespans of rats on the two diets. (Note we are using a 2-tail test, as indicated by ≠ in H1.) Since the restricted diet group has the higher mean (968.75 compared to 684.01), it has significantly longer lifespans. Note: One-tail test. H0: Population mean of lifespan in the restricted diet group = population mean of lifespan in the ad libitum group

H1: Population mean of lifespan in the restricted diet group > population mean of lifespan in the ad libitum group Statistics: t = 9.161

For a one-tail t-test, p = Sig. (1-tailed) = Sig. (2-tailed) ÷ 2 = 0.000 ÷ 2 = 0.000 < 0.05. At the 5% level of significance, we reject H0 and conclude that rats on the restricted diet live significantly longer than rats on the ad libitum diet.
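The independent samples t-test of Steps 2 to 6 can likewise be run from syntax (same assumed names):

T-TEST GROUPS=diet(1 2)
  /VARIABLES=lifespan.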


Since earlier analyses showed there was some evidence of non-normality in the lifespan data, it may be useful to look at the results of the appropriate nonparametric Mann-Whitney U-test (instead of the t-test), which does not rely on this assumption.

13. Practice session 5 – Comparing two independent groups: Mann Whitney U-test and Independent Samples t-test

Use the data file: File to be used for SPSS Lectures.sav Carry out the test

(a) Using Normality tests verify if the variable Percentage Marks obtained on SPSS exam in January follows a Normal distribution.

(b) Decide whether we should use Independent Samples t-test* or Mann-Whitney U test** to test:

H0: Mean*/Median** Percentage Marks obtained on SPSS exam in January by UOM students = Mean*/Median** Percentage Marks obtained on SPSS exam in January by UTM students H1: Mean*/Median** Percentage Marks obtained on SPSS exam in January by UOM students ≠ Mean*/Median** Percentage Marks obtained on SPSS exam in January by UTM students (c) Carry out the test(s) and interpret your results. Carry out the test

(a) Using Normality tests verify if the variable Percentage Marks obtained on SPSS exam in April follows a Normal distribution.

(b) Decide whether we should use Independent Samples t-test or Mann-Whitney U test to test:

H0: Mean*/Median** Percentage Marks obtained on SPSS exam in April by UOM students = Mean*/Median** Percentage Marks obtained on SPSS exam in April by UTM students H1: Mean*/Median** Percentage Marks obtained on SPSS exam in April by UOM students ≠ Mean*/Median** Percentage Marks obtained on SPSS exam in April by UTM students (c) Carry out the test(s) and interpret your results.


14. Practice session 6 – Comparing two dependent groups: Wilcoxon Signed Ranks test ** and Paired Samples t-test *

The SPSS file to be used: File to be used for SPSS Lectures.sav. A survey was conducted among 100 students (that is why the Data View has 100 rows: one row for each student). All 100 students had to take a test in January. They were then given further lectures in SPSS, and the SAME 100 students had to take a second test in April. Each student therefore has a pair of marks (January and April); that is, we have assessed the students twice, which is why we have two columns of data: one for January and one for April. These two sets of data are called dependent or paired. We would like to know if there has been a change in performance from January to April. We may write this as H0: Mean*/Median** Percentage Marks obtained on SPSS exam in January by ALL 100 students = Mean*/Median** Percentage Marks obtained on SPSS exam in April by ALL 100 students. H1: Mean*/Median** Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Mean*/Median** Percentage Marks obtained on SPSS exam in April by ALL 100 students. Step 1: For each student, calculate the difference between the marks obtained in January and April and store it in a column called diffmark. Click on Transform and Compute:

Type in diffmark in Target Variable. Transfer Percentage Marks obtained on SPSS exam in April in Numeric Expression:

Click on the minus (-) button. Transfer Percentage Marks obtained on SPSS exam in January into Numeric Expression. Click on Type & Label.


Type Difference in Jan and Apr Marks

Click on Continue, then on OK. A column will be automatically added to the data file.
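Step 1 corresponds to the following syntax (a sketch assuming the January and April mark columns are named janmark and aprmark; hypothetical names):

COMPUTE diffmark = aprmark - janmark.
VARIABLE LABELS diffmark 'Difference in Jan and Apr Marks'.
EXECUTE.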

Step 2: Using Normality tests verify that the variable Difference in Jan and Apr Marks follows a Normal distribution.


Tests of Normality

                                   Kolmogorov-Smirnov(a)           Shapiro-Wilk
                                   Statistic   df     Sig.         Statistic   df     Sig.
Difference in Jan and Apr Marks    .106        100    .007         .962        100    .006

a Lilliefors Significance Correction

How to report? H0: Difference in Jan and Apr Marks follows a Normal distribution H1: Difference in Jan and Apr Marks does not follow a Normal distribution Test: Kolmogorov-Smirnov

Statistic = .106 p-value = 0.007 Conclusion: The Difference in Jan and Apr Marks does not follow a Normal distribution, as Sig. = 0.007 < 0.05. Reject H0. As the condition of normality is not satisfied, we cannot use the Paired Samples t-test. We should use the Wilcoxon Signed Ranks test to test: H0: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students = Median Percentage Marks obtained on SPSS exam in April by ALL 100 students. H1: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Median Percentage Marks obtained on SPSS exam in April by ALL 100 students.


14.1 Wilcoxon Signed Ranks test Step 1: Click

Analyze – Non-parametric Tests – 2 Related-Samples…

Step 2: Click on Percentage Marks obtained on SPSS exam in January and immediately after click Percentage Marks obtained on SPSS exam in April

Step 3: Then click on the arrow button to transfer the two variables simultaneously to the Test Pair(s) List.


Click on OK. Step 4: The following output will be generated:

Ranks

                                                N       Mean Rank   Sum of Ranks
Percentage Marks obtained    Negative Ranks     24(a)   43.13       1035.00
on SPSS exam in April -      Positive Ranks     64(b)   45.02       2881.00
Percentage Marks obtained    Ties               12(c)
on SPSS exam in January      Total              100

a Percentage Marks obtained on SPSS exam in April < Percentage Marks obtained on SPSS exam in January
b Percentage Marks obtained on SPSS exam in April > Percentage Marks obtained on SPSS exam in January
c Percentage Marks obtained on SPSS exam in April = Percentage Marks obtained on SPSS exam in January

Test Statistics(b)

                          Percentage Marks obtained on SPSS exam in April -
                          Percentage Marks obtained on SPSS exam in January
Z                         -3.848(a)
Asymp. Sig. (2-tailed)    .000

a Based on negative ranks.
b Wilcoxon Signed Ranks Test


How to report? There were 24 students whose Percentage Marks obtained on SPSS exam in April < Percentage Marks obtained on SPSS exam in January. There were 64 students whose Percentage Marks obtained on SPSS exam in April > Percentage Marks obtained on SPSS exam in January. There were 12 students whose Percentage Marks obtained on SPSS exam in April = Percentage Marks obtained on SPSS exam in January. H0: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students = Median Percentage Marks obtained on SPSS exam in April by ALL 100 students. H1: Median Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Median Percentage Marks obtained on SPSS exam in April by ALL 100 students. Test: Wilcoxon Signed Ranks Test. Statistics: Z = -3.848, p = 0.000. Conclusion: Since p = 0.000 < 5%, at the 5% level of significance we reject H0. As the positive ranks dominate (the April marks exceed the January marks for 64 of the 100 students), we conclude that there has been a significant increase in the marks from January to April.
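The same test in syntax form (same assumed column names janmark and aprmark):

NPAR TESTS
  /WILCOXON=janmark WITH aprmark (PAIRED).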

14.2 Paired Samples t-test

We shall apply a paired samples Student's t-test, conveniently ignoring for the moment the fact that the variable Difference in Jan and Apr Marks does not follow a Normal distribution. H0: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students = Mean Percentage Marks obtained on SPSS exam in April by ALL 100 students. H1: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Mean Percentage Marks obtained on SPSS exam in April by ALL 100 students. Step 1: Click Analyze – Compare Means – Paired-Samples T Test


Step 2: Click on Percentage Marks obtained on SPSS exam in January and immediately after, click Percentage Marks obtained on SPSS exam in April

Step 3: Then click on the arrow button to transfer the two variables simultaneously to the Test Pair(s) List. Click on OK. Step 4: The following output will be generated:

Paired Samples Statistics

                                                 Mean     N     Std. Deviation   Std. Error Mean
Pair 1   Percentage Marks obtained
         on SPSS exam in January                 58.100   100   21.3156          2.1316
         Percentage Marks obtained
         on SPSS exam in April                   60.610   100   19.8611          1.9861


Paired Samples Test

Pair 1: Percentage Marks obtained on SPSS exam in January - Percentage Marks obtained on SPSS exam in April

Paired Differences
  Mean                                           -2.510
  Std. Deviation                                 6.3477
  Std. Error Mean                                .6348
  95% Confidence Interval of the Difference
    Lower                                        -3.770
    Upper                                        -1.250
t                                                -3.954
df                                               99
Sig. (2-tailed)                                  .000

How to report? The mean Percentage Marks obtained on SPSS exam in January was 58.100. The mean Percentage Marks obtained on SPSS exam in April was 60.610. The mean difference in scores is -2.510. H0: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students = Mean Percentage Marks obtained on SPSS exam in April by ALL 100 students H1: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students ≠ Mean Percentage Marks obtained on SPSS exam in April by ALL 100 students Test: Paired Samples t- Test Statistics: t = -3.954 p = 0.000 Conclusion: Since p=0.000 < 5%, at 5% level of significance, we reject H0. As mean Percentage Marks obtained on SPSS exam in April was higher than that obtained in January, we conclude that there has been a significant increase in the marks from January to April. Note: One-tail test H0: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students = Mean Percentage Marks obtained on SPSS exam in April by ALL 100 students H1: Mean Percentage Marks obtained on SPSS exam in January by ALL 100 students < Mean Percentage Marks obtained on SPSS exam in April by ALL 100 students Statistics: t = -3.954

For a one-tail t-test, p = Sig. (1-tailed) = Sig. (2-tailed) ÷ 2 = 0.000 ÷ 2 = 0.000 < 0.05. At the 5% level of significance, we reject H0 and conclude that the April marks are significantly higher than the January marks.
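The paired test can also be run from syntax (same assumed column names):

T-TEST PAIRS=janmark WITH aprmark (PAIRED).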


15. Practice session 7 – Comparing two dependent groups: Wilcoxon Signed Ranks test and Paired Samples t-test

Data file to be used: Enter the data for the ages at marriage for the sample of 100 couples who applied for marriage licences in Cumberland County, PA, in 1993 (the data set described in Practice session 1, Table 2.2).

(a) Compute the difference between the age of every husband and wife and store the difference in the column diffage (label it difference in age of husband and wife). (b) Using normality tests, verify whether the variable difference in age of husband and wife follows a Normal distribution. (c) Decide whether we should use the Wilcoxon Signed Ranks test or the Paired Samples t-test to test:

H0: Husbands' age at marriage = Wives' age at marriage
H1: Husbands' age at marriage ≠ Wives' age at marriage

(d) Carry out the test(s) and interpret your results.


16. Practice session 8 – Correlation Pearson’s Correlation Coefficient and Spearman’s rho

The SPSS file to be used: File to be used for SPSS Lectures.sav.

Pearson's correlation coefficient is used when two variables have a linear relationship and are measured at the ratio or interval level. Pearson's correlation coefficient assumes that each pair of variables is bivariate normal, especially for small samples (size less than 30); however, it is considered to be a robust statistic. If the variables are not normally distributed but take values that can be ranked, then we should use the non-parametric Spearman's rho. Kendall's tau is another non-parametric correlation, and it should be used rather than Spearman's rho when you have a small data set with a large number of tied ranks. The correlation coefficient (usually denoted r) ranges in value from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship). A value of 0 indicates no linear relationship. When interpreting your results, be careful not to draw any cause-and-effect conclusions from a significant correlation: correlation coefficients say nothing about which variable causes the other to change. For example, r = +0.397 between X and Y indicates a positive correlation between X and Y; that is, as X increases, Y increases. r = -0.441 between X and Y indicates a negative correlation between X and Y; that is, as X increases, Y decreases. Although we cannot make direct conclusions about causality, we can take the correlation coefficient one step further by squaring it. The squared correlation coefficient (R²) is the measure of the amount of variability in one variable that is explained by the other. For example, if r = -0.441, then R² = 0.194 = 19.4%. This means X accounts for 19.4% of the variability in Y, and 80.6% of the variability in Y is still to be accounted for by other variables. Note: even if X can account for 19.4% of the variability in Y, X does not necessarily cause this variation.

Generating a Scatter Diagram. One of the assumptions of correlational analysis is that the two variables have a linear relationship. Suppose we would like to investigate the relationship between (a) Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained on SPSS exam in April, and (b) Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained in Computer Studies. Step 1. Click on Graphs and then Scatter.

Step 2. Then click on Simple.

Step 3. Transfer Percentage Marks obtained on SPSS exam in January to the X-axis. Step 4. Transfer Percentage Marks obtained on SPSS exam in April to the Y-axis. Step 5. Transfer University to Set Markers by, as we would like to distinguish UOM and UTM by different colours or symbols on the same scatter plot.


Step 6. Click on Titles and Type

Step 7. Click on Continue and then on OK.
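Steps 1 to 7 correspond to a single GRAPH command. A sketch, assuming the marks and university columns are named janmark, aprmark, and uni:

GRAPH
  /SCATTERPLOT(BIVAR)=janmark WITH aprmark BY uni
  /TITLE='Scatter Plot of January and April SPSS Exam marks'.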


How to report?

[Scatter plot, titled 'Scatter Plot of January and April SPSS Exam marks': Percentage Marks obtained on SPSS exam in January (x-axis, 0 to 100) against Percentage Marks obtained on SPSS exam in April (y-axis, 0 to 100), with markers set by University (University of Mauritius / University of Technology).]

The scatter diagram shows that there is a positive linear relationship between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained on SPSS exam in April. YOU should repeat the procedure for Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained in Computer Studies to obtain the following scatter diagram.


[Scatter plot: Percentage Marks obtained on SPSS exam in January (x-axis, 0 to 100) against Percentage Marks obtained in Computer Studies (y-axis, 20 to 80), with markers set by University (University of Mauritius / University of Technology).]

The scatter diagram shows that there is no linear relationship between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained in Computer Studies.

Correlational Analyses. Suppose we would like to investigate the correlation (a) between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained on SPSS exam in April; and (b) between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained in Computer Studies. Step 1. Analyze – Correlate – Bivariate

Step 2. Transfer Percentage Marks obtained on SPSS exam in January, Percentage Marks obtained on SPSS exam in April, and Percentage Marks obtained in Computer Studies to the list of Variables.


Step 3. Click on Options and select Means and Std deviations only if you are using Pearson’s correlation.

Missing Values. You can choose one of the following:

Exclude cases pairwise. Cases with missing values for one or both of a pair of variables for a correlation coefficient are excluded from the analysis. Since each coefficient is based on all cases that have valid codes on that particular pair of variables, the maximum information available is used in every calculation. This can result in a set of coefficients based on a varying number of cases.

Exclude cases listwise. Cases with missing values for any variable are excluded from all correlations.

Step 4. Select Exclude cases pairwise. Step 5. Select Continue and OK.
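Steps 1 to 5 reduce to the following syntax (a sketch assuming the three mark columns are named janmark, aprmark, and compmark):

CORRELATIONS
  /VARIABLES=janmark aprmark compmark
  /PRINT=TWOTAIL NOSIG
  /STATISTICS DESCRIPTIVES
  /MISSING=PAIRWISE.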

How to report? SPSS presents the correlations in symmetric matrix form, as shown in Table 16.1. The elements along the diagonal are always 1.00, as there is a perfect correlation between a variable and itself. There are three values in each cell: the first value at the top is the correlation coefficient, the second value is the significance level, and the last value is the sample size. The correlation between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained on SPSS exam in April is +0.955, which indicates a significant strong positive correlation between the two variables (significance is indicated by **: correlation is significant at the 0.01 level, 2-tailed; significant means significantly different from zero!). This indicates that those who obtained high scores in January also obtained high scores in April. The correlation between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained in Computer Studies is +0.064; however, this correlation is not significantly different from zero. This indicates that there is no linear relation between Percentage Marks obtained on SPSS exam in January and Percentage Marks obtained in Computer Studies; that is, those scoring high marks on the SPSS exam in January did not always score high marks on the Computer Studies exam. Similarly, the correlation between Percentage Marks obtained on SPSS exam in April and Percentage Marks obtained in Computer Studies is not significantly different from zero.

Table 16.1: Correlations

                                                   Percentage Marks    Percentage Marks    Percentage Marks
                                                   obtained on SPSS    obtained on SPSS    obtained in
                                                   exam in January     exam in April       Computer Studies
Percentage Marks obtained   Pearson Correlation    1                   .955(**)            .064
on SPSS exam in January     Sig. (2-tailed)        .                   .000                .526
                            N                      100                 100                 100
Percentage Marks obtained   Pearson Correlation    .955(**)            1                   .035
on SPSS exam in April       Sig. (2-tailed)        .000                .                   .730
                            N                      100                 100                 100
Percentage Marks obtained   Pearson Correlation    .064                .035                1
in Computer Studies         Sig. (2-tailed)        .526                .730                .
                            N                      100                 100                 100

** Correlation is significant at the 0.01 level (2-tailed).

Note: Flag significant correlations. Correlation coefficients significant at the 0.05 level are identified with a single asterisk, and those significant at the 0.01 level are identified with two asterisks.


17. Practice session 9 – Correlation Pearson’s Correlation Coefficient and Spearman’s rho

17.1 Using Pearson’s Correlation coefficient, investigate the relationship between each pair of variable: Advertising Budget (thousands of rupees) for the CD No of CDs Sold (thousands) No. of times Songs are played on Radio 1 during the week prior to its release Draw Scatter diagram for each pair of variable. 17.2 Using Spearman’s rho, investigate the relationship between Numeracy level on a scale of 1 to 20 and Overall Grade in previous semester.

Correlations

                                                              Numeracy level on    Overall Grade in
                                                              a scale of 1 to 20   previous semester
Spearman's rho   Numeracy level on     Correlation
                 a scale of 1 to 20    Coefficient            1.000                .324(**)
                                       Sig. (2-tailed)        .                    .001
                                       N                      98                   98
                 Overall Grade in      Correlation
                 previous semester     Coefficient            .324(**)             1.000
                                       Sig. (2-tailed)        .001                 .
                                       N                      98                   100

** Correlation is significant at the 0.01 level (2-tailed).


18. Practice session 10 – Multiple Linear Regression

In this chapter, we shall deal with two sets of data where interest lies either in examining how one variable relates to a number of others or in predicting one variable from others. The first data set is shown in Table 4.1 and includes four variables, sex, age, extroversion, and car, the latter being the average number of minutes per week a person spends looking after his or her car. According to a particular theory, people who score higher on a measure of extroversion are expected to spend more time looking after their cars, since a person may project their self-image through themselves or through objects of their own. At the same time, car-cleaning behavior might be related to demographic variables such as age and sex. Therefore, one question here is how the variables sex, age, and extroversion affect the time that a person spends cleaning his or her car.


Multiple Linear Regression Multiple linear regression is a method of analysis for assessing the strength of the relationship between each of a set of explanatory variables (sometimes known as independent variables, although this is not recommended since the variables are often correlated), and a single response (or dependent) variable. When only a single explanatory variable is involved, we have what is generally referred to as simple linear regression. Applying multiple regression analysis to a set of data results in what are known as regression coefficients, one for each explanatory variable. These coefficients give the estimated change in the response variable associated with a unit change in the corresponding explanatory variable, conditional on the other explanatory variables remaining constant. The fit of a multiple regression model can be judged in various ways, for example, calculation of the multiple correlation coefficient or by the examination of residuals, each of which will be illustrated later. (Further details of multiple regression are given below)


In the car cleaning data set in Table 4.1, each of the variables — extroversion (extrover in the Data View spreadsheet), sex (sex), and age (age) — might be correlated with the response variable, amount of time spent car cleaning (car). In addition, the explanatory variables might be correlated among themselves. All these correlations can be found from the correlation matrix of the variables, obtained by using the commands

Analyze – Correlate – Bivariate… and including all four variables under the Variables list in the resulting dialogue box. This generates the output shown in Display 4.1. The output table provides Pearson correlations between each pair of variables and the associated significance tests. We find that car cleaning is positively correlated with extroversion (r = 0.67, p < 0.001) and with being male (r = 0.661, p < 0.001). The positive relationship with age (r = 0.234) does not reach statistical significance (p = 0.15). The correlations between the explanatory variables imply that both older people and men are more extroverted (r = 0.397 and r = 0.403, respectively).
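The same correlation matrix can be obtained from syntax; a sketch, using the variable names quoted in the text:

CORRELATIONS
  /VARIABLES=car extrover sex age
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.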

Since all the variables are correlated to some extent, it is difficult to give a clear answer to whether, for example, extroversion is really related to car cleaning time, or whether the observed correlation between the two variables arises from the relationship of extroversion to both age and sex, combined with the relationships of each of the latter two variables to car cleaning time. (A technical term for such an effect would be confounding.) Similarly the observed relationship between car cleaning time and gender could be partly attributable to extroversion. In trying to disentangle the relationships involved in a set of variables, it is often helpful to calculate partial correlation coefficients. Such coefficients measure the strength of the linear relationship between two continuous variables that cannot be attributed to one or more confounding variables (for more details, see Rawlings, Pantula, and Dickey, 1998). For example, the partial correlation between car cleaning time and extroversion rating “partialling out” or “controlling for” the effects of age and gender measures the strength of relationship between car cleaning times and extroversion that cannot be attributed to relationships with the other explanatory variables. We can generate this correlation coefficient in SPSS by choosing

Analyze – Correlate – Partial… from the menu and filling in the resulting dialogue box as shown in Display 4.2. The resulting output shows the partial correlation coefficient together with a significance test (Display 4.3). The estimated partial correlation between car cleaning and extroversion, 0.51, is smaller than the unadjusted correlation coefficient, 0.67, because part of the relationship is attributable to gender and/or age. We leave it as an exercise for the reader to generate the reduced partial correlation, 0.584, between car cleaning time and gender after controlling for extroversion and age.
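A syntax sketch of the partial correlation requested above:

PARTIAL CORR
  /VARIABLES=car extrover BY age sex
  /SIGNIFICANCE=TWOTAIL
  /MISSING=LISTWISE.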

Thus far, we have quantified the strength of relationships between our response variable, car cleaning time, and each explanatory variable after adjusting for the effects of the other explanatory variables. We now proceed to use the multiple linear regression approach, with dependent variable car and explanatory variables extrover, sex, and age, to quantify the nature of the relationships between the response and explanatory variables after adjusting for the effects of other variables. (This is a convenient point to note that categorical explanatory variables, such as gender, can be used in multiple linear regression modeling as long as they are represented by dummy variables. To "dummy-code" a categorical variable with k categories, k - 1 binary dummy variables are created. Each of the dummy variables relates to a single category of the original variable and takes the value "1" when the subject falls into the category and "0" otherwise. The category that is ignored in the dummy-coding represents the reference category. Here sex is the dummy variable for the category "male"; hence the category "female" represents the reference category.) A multiple regression model can be set up in SPSS by using the commands

Analyze – Regression – Linear…

This results in the Linear Regression dialogue box shown in Display 4.4:


We specify the dependent variable and the set of explanatory variables under the headings Dependent and Independent(s), respectively.

The regression output is controlled via the Statistics… button. By default, SPSS only prints estimates of regression coefficients and some model fit tables. Here we also ask for confidence intervals to be included in the output (Display 4.4).
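For reference, a sketch of the syntax equivalent of this specification (the names car, extrover, sex and age follow the text; the CI keyword requests the 95% confidence intervals selected above):

* Sex is assumed to be already coded 0 = female, 1 = male, as described above.
REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /DEPENDENT car
  /METHOD=ENTER extrover sex age.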

The resulting SPSS output tables are shown in Display 4.5 and Display 4.6. The model fit output consists of a "Model Summary" table and an "ANOVA" table (Display 4.5). The former includes the multiple correlation coefficient, R, its square, R2, and an adjusted version of this coefficient as summary measures of model fit (see Box 4.1). The multiple correlation coefficient R = 0.799 indicates that there is a strong correlation between the observed car cleaning times and those predicted by the regression model. In terms of variability in observed car cleaning times accounted for by our fitted model, this amounts to a proportion of R2 = 0.638, or 63.8%. Since by definition R2 will increase when further terms are added to the model even if these do not explain variability in the population, the adjusted R2 is an attempt at an improved estimate of R2 in the population. The index is adjusted down to compensate for chance increases in R2, with bigger adjustments for larger sets of explanatory variables (see Der and Everitt, 2001). Use of this adjusted measure leads to a revised estimate that 60.8% of the variability in car cleaning times in the population can be explained by the three explanatory variables.
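(For reference, the adjustment is computed as adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of cases, here n = 40 as implied by the F(3,36) degrees of freedom reported below, and p = 3 is the number of explanatory variables; substituting R2 = 0.638 reproduces the adjusted value of 0.608.)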


The error terms in multiple regression measure the difference between an individual's car cleaning time and the mean car cleaning time of subjects of the same age, sex, and extroversion rating in the underlying population. According to the regression model, the mean deviation is zero (positive and negative deviations cancel each other out); but the more variable the error, the larger the absolute differences between observed cleaning times and those predicted. The "Model Summary" table provides an estimate of the standard deviation of the error term (under "Std. Error of the Estimate"). Here the error standard deviation is estimated as 13.02 min, which is small considering that the observed car cleaning times range from 7 to 97 min per week. The fitted equation is:

Time Spent (in minutes per week) = 11.306 + 0.464 (extroversion) + 0.156 (age) + 20.071 (gender)

Finally, the "ANOVA" table provides an F-test for the null hypothesis that none of the explanatory variables are related to car cleaning time, or in other words, that R2 is zero (see Box 4.1). Here we can clearly reject this null hypothesis (F(3,36) = 21.1, p < 0.001), and so conclude that at least one of age, sex, and extroversion is related to car cleaning time.

The output shown in Display 4.6 provides estimates of the regression coefficients, standard errors of the estimates, t-tests that a coefficient takes the value zero, and confidence intervals (see Box 4.1). The estimated regression coefficients are given under the heading "Unstandardized Coefficients B"; these give, for each explanatory variable, the predicted change in the dependent variable when that variable is increased by one unit, conditional on all the other variables in the model remaining constant. For example, here we estimate that weekly car cleaning time increases by 0.464 min for every additional point on the extroversion scale (or by 4.64 min per week for an increase of 10 points), provided that the individuals are of the same age and sex. Similarly, the estimated effect of a ten-year increase in age is 1.56 min per week. The interpretation of regression coefficients associated with dummy variables is also straightforward: they give the predicted difference in the dependent variable between the dummy-coded category and the reference category. For example, here we estimate that males spend 20.07 min more per week car washing than females after adjusting for age and extroversion rating.

The regression coefficient estimate for extroversion has a standard error (heading "Std. Error") of 0.13 min per week, and a 95% confidence interval for the coefficient is given by [0.200, 0.728]; in other words, the increase in car cleaning time per increase of ten in extroversion rating is estimated to be in the range 2.00 to 7.28 min per week. (Those interested in p-values can use the associated t-test to test the null hypothesis that extroversion has no effect on car cleaning times.) Finally, the Coefficients table provides standardized regression coefficients under the heading "Standardized Coefficients Beta". These coefficients are standardized so that they measure the change in the dependent variable in units of its standard deviation when the explanatory variable increases by one standard deviation. The standardization enables comparison of effects across explanatory variables (more details can be found in Everitt, 2001b). For example, here increasing extroversion by one standard deviation (SD = 19.7) is estimated to increase car cleaning time by 0.439 standard deviations (SD = 20.8 min per week). The set of beta coefficients suggests that, after adjusting for the effects of the other explanatory variables, gender has the strongest effect on car cleaning behavior. (Note that checking Descriptives and Part and partial correlations in the Statistics sub-dialogue box in Display 4.4 provides summary statistics of the variables involved in the multiple regression model, including the Pearson correlation and partial correlation coefficients shown in Displays 4.1 and 4.3.)

For the car cleaning data, where there are only three explanatory variables, using the ratio of an estimated regression coefficient to its standard error to identify those variables that are predictive of the response and those that are not is a reasonable approach to developing a possibly simpler model for the data (that is, a model that contains fewer explanatory variables). But, in general, where a larger number of explanatory variables is involved, this approach will not be satisfactory. The reason is that the regression coefficients and their associated standard errors are estimated conditional on the other explanatory variables in the current model. Consequently, if a variable is removed from the model, the regression coefficients of the remaining variables (and their standard errors) will change when estimated from the data excluding this variable. As a result of this complication, other procedures have been developed for selecting a subset of explanatory variables most associated with the response. The most commonly used of these methods are:

Forward selection. This method starts with a model containing none of the explanatory variables. In the first step, the procedure considers variables one by one for inclusion and selects the variable that results in the largest increase in R2. In the second step, the procedure considers variables for inclusion in a model that only contains the variable selected in the first step. At each step, the variable giving the largest increase in R2 is selected, until, according to an F-test, further additions are judged not to improve the model.

Backward selection. This method starts with a model containing all the variables and eliminates variables one by one, at each step choosing for exclusion the variable leading to the smallest decrease in R2.
Again, the procedure is repeated until, according to an F-test, further exclusions would represent a deterioration of the model.

Stepwise selection. This method is, essentially, a combination of the previous two approaches. Starting with no variables in the model, variables are added as with the forward selection method. In addition, after each inclusion step, a backward elimination process is carried out to remove variables that are no longer judged to improve the model.

Automatic variable selection procedures are exploratory tools, and the results from a multiple regression model selected by a stepwise procedure should be interpreted with caution. Different automatic variable selection procedures can lead to different variable subsets, since the importance of variables is evaluated relative to the variables included in the model at the previous step of the procedure. A further criticism relates to the fact that a number of tests are employed during the course of the automatic procedure, increasing the chance of false positive findings in the final model. Certainly none of the automatic procedures for selecting subsets of variables is foolproof; they must be used with care, and warnings such as the following, given in Agresti (1996), should be noted:

Computerized variable selection procedures should be used with caution. When one considers a large number of terms for potential inclusion in a model, one or two of them that are not really important may look impressive simply due to chance. For instance, when all the true effects are weak, the largest sample effect may substantially overestimate its true effect. In addition, it often makes sense to include certain variables of special interest in a model and report their estimated effects even if they are not statistically significant at some level.

In addition, the comments given in McKay and Campbell (1982a, b) concerning the validity of the F-tests used to judge whether variables should be included in or eliminated from a model need to be considered. Here, primarily for illustrative purposes, we carry out an automatic forward selection procedure to identify the most important predictors of car cleaning times out of age, sex, and extroversion, although previous results give, in this case, a very good idea of what we will find.

An automatic forward variable selection procedure is requested from SPSS by setting the Method option in the Linear Regression dialogue box to Forward (see Display 4.4). When evaluating consecutive models, it is helpful to measure the change in R2 and to consider collinearity diagnostics (see later), both of which can be requested in the Statistics sub-dialogue box (see Display 4.4). The Options sub-dialogue box defines the criteria used for variable inclusion and exclusion. The default settings are shown in Display 4.7. The inclusion (and exclusion) criteria can be specified either in terms of the significance level of an F-test (check Use probability of F) or in terms of a threshold value of the F-statistic (check Use F-value). By default, SPSS chooses a less stringent criterion for removal than for entry, although here, for the automatic forward selection, only the entry criterion (a significant increase in R2 according to an F-test at the 5% level) is relevant.

Display 4.8 shows the results from the automatic forward variable selection. SPSS repeats the entry criterion used and lists the variables selected in each step. With three potential predictor variables, the procedure iterated through three steps. In the first step, the variable extroversion was included. In the second step, gender was added to the model. No variable was added in the third step, since the remaining potential predictor variable, age, did not improve the model according to our chosen inclusion criterion. The F-tests employed in each step are shown in the "Model Summary" table. Here the model selected after the first step (extroversion only) explained 44.9% of the variance in car cleaning times, and the test for the single regression coefficient is highly significant (F(1,38) = 31, p < 0.001). Adding gender to the model increases the percentage of variance explained by 18.3% (F(1,37) = 18.4, p < 0.001). (You should check that backward and stepwise variable selection lead to the same subset of variables for this data example; but remember, this may not always be the case.)

For stepwise procedures, the "Coefficients" table shows the regression coefficients estimated for the model at each step. Here we note that the unadjusted effect of extroversion on car cleaning time was estimated to be an increase of 7.08 min per week per 10-point increase on the extroversion scale. When adjusting for gender (model 2), this effect reduces to 5.09 min per week per 10 points (95% CI from 2.76 to 7.43 min per week per 10 points). SPSS also provides information about the variables not included in the regression model at each step. The "Excluded Variables" table provides standardized regression coefficients (under "Beta In") and t-tests for significance. For example, under Model 1 we see that gender, which had not been included in the model at this stage, might be an important variable: its standardized effect after adjusting for extroversion is of moderate size (0.467), and there also remains a moderate-sized partial correlation between gender and car cleaning after controlling for extroversion (0.576).

Multicollinearity

Approximate linear relationships between the explanatory variables, called multicollinearity, can cause a number of problems in multiple regression, including:

It severely limits the size of the multiple correlation coefficient because the explanatory variables are primarily attempting to explain much of the same variability in the response variable (see Dizney and Gromen, 1967, for an example).

It makes determining the importance of a given explanatory variable difficult because the effects of explanatory variables are confounded due to their intercorrelations.

It increases the variances of the regression coefficients, making use of the estimated model for prediction less stable. The parameter estimates become unreliable (for more details, see Belsley, Kuh, and Welsch, 1980).

Spotting multicollinearity among a set of explanatory variables might not be easy. The obvious course of action is simply to examine the correlations between these variables, but while this is a good initial step that is often helpful, more subtle forms of multicollinearity involving more than two variables might exist. A useful approach is the examination of the variance inflation factors (VIFs) or the tolerances of the explanatory variables. The tolerance of an explanatory variable is defined as the proportion of variance of the variable in question not explained by a regression on the remaining explanatory variables, with smaller values indicating stronger relationships. The VIF of an explanatory variable measures the inflation of the variance of the variable's regression coefficient relative to a regression where all the explanatory variables are independent. The VIFs are inversely related to the tolerances, with larger values indicating involvement in more severe relationships (according to a rule of thumb, VIFs above 10 or tolerances below 0.1 are seen as a cause for concern).

Since we asked for Collinearity diagnostics in the Statistics sub-dialogue box, the "Coefficients" table and the "Excluded Variables" table in Display 4.8 include columns labeled "Collinearity Statistics." In the "Coefficients" table, the multicollinearities involving the explanatory variables of the respective model are assessed. For example, the model selected in the second step of the procedure included extroversion and gender as explanatory variables, so a multicollinearity involving these two variables (or, more simply, their correlation) has been assessed. In the "Excluded Variables" table, multicollinearities involving the excluded variable and those included in the model are assessed. For example, under "Model 2," multicollinearities involving age (which was excluded) and extroversion and gender (which were included) are measured. Here none of the VIFs give reason for concern. (SPSS provides several other collinearity diagnostics, but we shall not discuss these because they are less useful in practice than the VIFs.) We usually also calculate the average VIF, which should not be substantially greater than 1; an average VIF greater than 10 is definitely a cause for concern.

It might be helpful to visualize our regression of car cleaning times on gender and extroversion rating by constructing a suitable graphical display of the fitted model. Here, with only one continuous explanatory variable and one categorical explanatory variable, this is relatively simple, since a scatterplot of the predicted values against extroversion rating can be used. First, the predicted (or fitted) values for the subjects in our sample need to be saved via the Save… button on the Linear Regression dialogue box (see Display 4.4). This opens the Save sub-dialogue box shown in Display 4.9, where Unstandardized Predicted Values can be requested. Executing the command includes a new variable, pre_1, on the right-hand side of the Data View spreadsheet. This variable can then be plotted against the extroversion variable using the following instructions:

- The predicted value variable, pre_1, is declared as the Y Axis and the extroversion variable, extrover, as the X Axis in the Simple Scatterplot dialogue box.
- The gender variable, sex, is included under the Set Markers by list to enable later identification of the gender groups.
- The resulting graph is then opened in the Chart Editor and the commands Format – Interpolation… – Straight used to connect the points.
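A sketch of the syntax equivalent (pre_1 is the default name SPSS gives to the first saved set of unstandardized predicted values):

* Model 2 (extroversion and gender), saving the fitted values.
REGRESSION
  /DEPENDENT car
  /METHOD=ENTER extrover sex
  /SAVE PRED.
* Plot the fitted values against extroversion, marking the gender groups.
GRAPH
  /SCATTERPLOT(BIVAR)=extrover WITH pre_1 BY sex.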


The final graph shown in Display 4.10 immediately conveys that the amount of time spent car cleaning is predicted to increase with extroversion rating, with the strength of the effect given by the slope of the two parallel lines (5.1 min per week per 10 points on the extroversion scale). It also shows that males are estimated to spend more time cleaning their cars, with the increase in time given by the vertical distance between the two parallel lines (19.18 min per week).

Autocorrelation

Autocorrelation exists if adjacent residuals are correlated, i.e. the residuals are not independent. Autocorrelation violates one of the assumptions of the regression model. It is measured using the Durbin-Watson statistic; values between 1 and 3 are taken to indicate that autocorrelation is not a problem, and the closer the value is to 2, the better.

Homoscedasticity

At each level of the predictor variables, the variance of the residual terms should be constant. This just means that the residuals at each level of the predictor(s) should have the same variance (homoscedasticity); when the variances are very unequal, there is said to be heteroscedasticity. To verify this assumption, we plot *ZRESID against *ZPRED. If the assumption of homoscedasticity is met, this plot should be a random array of dots evenly dispersed around zero. If the graph funnels out, the chances are that there is heteroscedasticity in the data.
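A syntax sketch pulling these diagnostics together: forward selection with R2-change and collinearity statistics, the Durbin-Watson statistic, and the *ZRESID against *ZPRED plot used to check homoscedasticity (variable names as before):

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA CHANGE COLLIN TOL
  /DEPENDENT car
  /METHOD=FORWARD extrover sex age
  /RESIDUALS DURBIN
  /SCATTERPLOT=(*ZRESID ,*ZPRED).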


How to report?

(a) Write the regression equation. Interpretation of standardized coefficients:
(b) R2 = . Interpretation of R2:
(c) ANOVA: F = , p-value = ; Interpretation (recall that the "ANOVA" table provides an F-test for the null hypothesis that none of the explanatory variables are related to car cleaning time, or in other words, that R2 is zero). Average VIF = ; Comment on multicollinearity:
(d) Durbin-Watson statistic = ; Comment on autocorrelation:
(e) Plot *ZRESID against *ZPRED to verify that the assumption of homoscedasticity is met.

19. Practice session 11 – Multiple Linear Regression

For the purpose of this session use the SPSS file called: File to be used for SPSS Lectures.sav. A survey was conducted among 100 singers who have sold CDs during the year 2007. Data was collected for the following 4 variables:

1. Advertising Budget (thousands of rupees) for the CD
2. No of CDs Sold (thousands)
3. No. of times Songs are played on Radio 1 during the week prior to its release
4. Attractiveness of the Singer on a scale 1 to 10

Use multiple linear regression to come up with a model:

No of CDs Sold (thousands) = B0 + B1(Advertising Budget (thousands of rupees) for the CD) + B2(No. of times Songs are played on Radio 1 during the week prior to its release) + B3(Attractiveness of the Singer on a scale 1 to 10)
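A syntax sketch for this model; advert, sales, airplay and attract are assumed placeholder names, so substitute the names in the lecture file:

REGRESSION
  /STATISTICS COEFF OUTS CI R ANOVA
  /DEPENDENT sales
  /METHOD=ENTER advert airplay attract.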


20. Practice session 12 – Cross Tab and Chi-Square test for Independence

Very often, we are not interested in test scores or continuous measures, but in categorical variables. Categorical variables are also called grouping variables. Categorical variables lead to categorical data, which are generally non-numerical: the data are placed in exclusive categories, and cases are counted rather than measured. E.g. people can be classified in categories according to their occupation, and cars can be classified in categories according to their make or colour.

We will use the data file on rats (practice session 1). Let us categorise the lifespan of rats. We classify all those who have a:

- Lifespan of up to 500 into a category that we will call short (we will give this category the code 0)
- Lifespan of 501 to 999 into a category that we will call medium (we will give this category the code 1)
- Lifespan of 1000 and above into a category that we will call long (we will give this category the code 2)

This will be carried out in SPSS as follows:

Step 1: Transform – Recode – Into Different Variables…
Step 2: Transfer lifespan. Type in the name lifgroup and the label Categories of lifespan.


Step 3: Click on Old and New Values.

Step 4: Click on Range and type in 500 as shown and 0 in Value.

Step 5: Click on Add.

Step 6: Click on Range and type in 501 through 999 as shown and 1 in Value.


Step 7: Click on Add.
Step 8: Click on Range and type in 1000 through highest as shown and 2 in Value.

Step 9: Click on Add.

Step 10: Click on Continue.

Step 11: Click on Change.

Step 12: Click on OK.


Step 13: Go to Data View to note the column lifgroup that has been added.

Step 14: Go to Variable View to label the Values.

Step 15: Click on OK.

The lifespan categories have now been created. This procedure can be used to create categories for variables like age, salary, …..
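Steps 1–15 can also be carried out in syntax; a sketch, assuming the original variable is named lifespan:

RECODE lifespan (LOWEST THRU 500=0) (501 THRU 999=1) (1000 THRU HIGHEST=2) INTO lifgroup.
VARIABLE LABELS lifgroup 'Categories of lifespan'.
VALUE LABELS lifgroup 0 'Short' 1 'Medium' 2 'Long'.
EXECUTE.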


Contingency tables and chi-squared test of independence

Contingency tables are one of the most common ways to summarize observations on two categorical variables. For all such tables, interest generally lies in assessing whether or not there is any relationship or association between the row variable and the column variable that make up the table. Most commonly, a chi-squared test of independence is used to answer this question, although alternatives such as Fisher's exact test or McNemar's test may be needed when the sample size is small (Fisher's test) or the data consist of matched samples (McNemar's test). In addition, in 2 X 2 tables, it may be required to calculate a confidence interval for the ratio of population proportions. For a series of 2 X 2 tables, the Mantel-Haenszel test may be appropriate (see later). (Brief accounts of each of the tests mentioned are given below.)


Note: For the Chi-Square test to be meaningful:

1. The categories must be mutually exclusive, so that each case or person contributes to one cell/category only.

2. The expected frequencies (or expected counts) should be greater than 5. In larger contingency tables it is acceptable to have up to 20% of expected frequencies below 5, but even then no expected frequency should be less than 1. If this condition is not satisfied, we usually merge adjacent cells; this, of course, decreases the number of categories.

Generating a Contingency table or Crosstab

Lifespan has been classified into three categories: short, medium, and long. Diet has been classified into two categories: restricted and ad libitum.

Let us generate a 2 x 3 (2 rows and 3 columns) contingency table. We will put diet into the rows and lifespan into the columns.

Step 1: Analyze – Descriptive Statistics – Crosstabs

Step 2: Transfer diet into rows and lifgroup into columns.


Step 3: Click Statistics. Select Chi-square. Select Phi and Cramer's V.
Step 4: Click Continue.
Step 5: Click Cells. Select Observed and Expected. Select Row, Column and Total.
Step 6: Click Continue.
Step 7: Click OK.

The output generated includes:


Table 20.1 Diet * Categories of lifespan Crosstabulation

                                                 Categories of lifespan
                                                 Short     Medium    Long      Total
Diet   Restricted   Count                        11        34        61        106
       diet         Expected Count               9.8       63.1      33.2      106.0
                    % within Diet                10.4%     32.1%     57.5%     100.0%
                    % within Categories
                    of lifespan                  61.1%     29.3%     100.0%    54.4%
                    % of Total                   5.6%      17.4%     31.3%     54.4%
       Ad libitum   Count                        7         82        0         89
       diet         Expected Count               8.2       52.9      27.8      89.0
                    % within Diet                7.9%      92.1%     .0%       100.0%
                    % within Categories
                    of lifespan                  38.9%     70.7%     .0%       45.6%
                    % of Total                   3.6%      42.1%     .0%       45.6%
Total               Count                        18        116       61        195
                    Expected Count               18.0      116.0     61.0      195.0
                    % within Diet                9.2%      59.5%     31.3%     100.0%
                    % within Categories
                    of lifespan                  100.0%    100.0%    100.0%    100.0%
                    % of Total                   9.2%      59.5%     31.3%     100.0%

Table 20.2 Chi-Square Tests

                                Value        df    Asymp. Sig. (2-sided)
Pearson Chi-Square              80.884(a)    2     .000
Likelihood Ratio                104.448      2     .000
Linear-by-Linear Association    40.893       1     .000
N of Valid Cases                195

a 0 cells (.0%) have expected count less than 5. The minimum expected count is 8.22.


Table 20.3 Symmetric Measures

                                  Value    Approx. Sig.
Nominal by Nominal   Phi          .644     .000
                     Cramer's V   .644     .000
N of Valid Cases                  195

a Not assuming the null hypothesis.
b Using the asymptotic standard error assuming the null hypothesis.
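For reference, a syntax sketch of Steps 1–7 (diet and lifgroup as in the rats data file):

CROSSTABS
  /TABLES=diet BY lifgroup
  /STATISTICS=CHISQ PHI
  /CELLS=COUNT EXPECTED ROW COLUMN TOTAL.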

How to report?

(a) Check that Table 20.1 satisfies the criterion of having no more than 20% of expected frequencies below 5. Here all the cells have an expected frequency (Expected Count) greater than 5.

(b) Write one statement for each cell in Table 20.1. Note we have 2 X 3 = 6 cells in all. There are four possible statements that we can formulate for one cell. Although all the statements are mathematically correct, not all of them are logically correct, so you are advised to choose the statement that best describes the cell. We will consider the cell Restricted diet – Short:

                                                 Categories of lifespan
                                                 Short               Total
Diet   Restricted   Count                        11                  106
       diet         Expected Count               9.8
                    % within Diet                10.4%
                    % within Categories
                    of lifespan                  61.1%
                    % of Total                   5.6%
Total               Count                        18                  195

(i) Count: 11 rats on restricted diet had a short life.
(ii) % within Diet (also called the row %): calculated by dividing the Count by the row total and multiplying by 100: (11 ÷ 106) x 100 = 10.4%. This is interpreted as: 10.4% of those on a restricted diet (as this row is the restricted-diet row) had a short life.
(iii) % within Categories of lifespan (also called the column %): calculated by dividing the Count by the column total and multiplying by 100: (11 ÷ 18) x 100 = 61.1%. This is interpreted as: 61.1% of those having a short life (as this column is the short-life column) were on a restricted diet.


(iv) % of Total (also called the total %): calculated by dividing the Count by the grand total and multiplying by 100: (11 ÷ 195) x 100 = 5.6%. This is interpreted as: 5.6% of the total number of rats included in the study were on a restricted diet and had a short life.

(c) Write the hypothesis:

H0: There is no association between lifespan of rats and their diet.
H1: There is an association between lifespan of rats and their diet.
(Or we could have written it as H0: Lifespan of rats and their diet are independent; H1: Lifespan of rats depends on their diet.)
Test: Chi-Square (from Table 20.2). Statistic: Chi-Square = 80.884, p = 0.000.
Conclusion: Since p = 0.000 < 5%, we reject H0. We conclude that, at the 5% level of significance, there is an association between lifespan of rats and their diet.

(d) Strength of association. After rejecting H0, we accept H1 and therefore conclude that there is an association. The question that follows naturally is: how strong is the association? The measures of the strength of association are shown in Table 20.3. These measures are restricted to the range 0 to 1; the closer the value is to 1, the stronger the association.
Phi: this statistic is accurate for a 2 X 2 contingency table.
Cramer's V: when the variables in the crosstab have only two categories, phi and Cramer's V have exactly the same value. However, when the variables have more than two categories, Cramer's V is more useful.
In our case (from Table 20.3), Cramer's V = 0.644 out of a possible maximum value of 1. This represents quite a strong association between diet and lifespan.

21. Practice session 13 – Cross Tab and Chi-Square test for Independence

Use the file File to be used for SPSS Lectures.sav

(a) Generate a crosstab for Socio Economic Status of Student and Overall Grade in previous semester.

(b) Test the Hypothesis

H0: There is no association between Socio Economic Status of Student and Overall Grade in previous semester.

H1: There is an association between Socio Economic Status of Student and Overall Grade in previous semester.
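A syntax sketch for this exercise; the names ses and grade are assumed placeholders for the two variables in the lecture file:

CROSSTABS
  /TABLES=ses BY grade
  /STATISTICS=CHISQ PHI
  /CELLS=COUNT EXPECTED ROW COLUMN TOTAL.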


22. Practice session 14 – One-Way ANOVA and Kruskal Wallis test

If two groups of participants perform a task under different conditions, an independent samples t test can be used to test the null hypothesis (H0) of equality of the two population means:

H0: μ1 = μ2

If the test shows significance, we can reject H0 and conclude that there is a difference between the two population means. The t-test is applied when we have only two means to compare. The same null hypothesis, however, can also be tested using the analysis of variance (ANOVA for short). Like the t-test, the ANOVA was designed for the purpose of comparing means, but it is more versatile than the t-test. Suppose that in an investigation of the effects of four supposedly performance-enhancing drugs (Drugs A, B, C, and D) upon skilled performance, five groups of subjects/persons are tested:

1. A control group, who have received a Placebo.
2. A group who have received Drug A.
3. A group who have received Drug B.
4. A group who have received Drug C.
5. A group who have received Drug D.

Do any of the drugs affect the level of performance? Our scientific hypothesis is that at least one of them does. The null hypothesis is the negation of this possibility: H0 states, in effect, that none of them does: in the population, the mean performance score is the same under all five conditions:

H0:μ1 = μ2 = μ3 = μ4 = μ5

H1: H0 is false.

The ANOVA provides a direct test of this hypothesis. The results of the experiment are shown in Table 22.1.


Table 22.1 Performance scores of persons who were given Drugs A, B, C, and D, and the Placebo.

It is apparent from Table 22.1 that there are considerable differences among the five samples. The question is: could the null hypothesis actually be true, and the differences we see in the table have come about merely through sampling error?

The analysis of variance (ANOVA): some basic terms

Factors, levels and measures. In ANOVA, a factor is a set of related conditions or categories. The conditions or categories making up a factor are known as its levels, even though, as in the qualitative factors of gender or blood group, there is no sense in which one category can be said to be 'higher' or 'lower' than another. The terms factor and level are the ANOVA equivalents of independent variable (IV) and value, respectively. In the ANOVA, the dependent variable (DV) is known as a measure. In our current example, the measure (or dependent variable) is Score.

Between subjects and within subjects factors. We observed that between subjects experiments, in which different groups of participants (subjects) are tested under the different conditions, result in independent samples of scores, whereas within subjects experiments, in which each participant is tested under all conditions, result in related samples of scores. In ANOVA designs, a factor is said to be between subjects if each participant is either tested under only one condition or has been selected from one of a set of mutually exclusive natural categories. In our drugs experiment, Drug Condition (Placebo, Drug A, Drug B, Drug C, Drug D) is a between subjects factor. Between subjects factors must be distinguished from within subjects factors, in which the participant is tested at all levels (i.e. under all the conditions making up the factor). In ANOVA designs, an experiment with a within subjects factor is also said to have repeated measures on that factor: the measure or DV is taken at all levels. Our drug experiment is a one-factor between subjects experiment; the one-way ANOVA is applicable here.

Running the ANOVA. Step 1: Entering the data.

As with the independent samples t-test, you will need to define two variables:

1. a grouping variable, with a simple name such as Group, which identifies the condition under which a score was achieved. (The grouping variable should also be given a more meaningful variable label, such as Drug Condition, which will appear in the output.)
2. a variable with a name such as Score, which contains all the scores in the data set. This is the dependent variable.

The grouping variable will consist of five values (one for the placebo condition and one for each of the four drugs). We shall arbitrarily assign numerical values thus: 0 = Placebo; 1 = Drug A; 2 = Drug B; 3 = Drug C; 4 = Drug D.
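The value labels can also be attached in syntax; a one-line sketch, assuming the grouping variable is named group as suggested above:

VALUE LABELS group 0 'Placebo' 1 'Drug A' 2 'Drug B' 3 'Drug C' 4 'Drug D'.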


Step 2: Verifying the conditions

1. For each subpopulation (for each drug and placebo), the dependent variable must follow a Normal distribution. Carry out a normality test to verify that this condition is satisfied.
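A sketch of the syntax for these normality tests (the NPPLOT keyword of the Explore procedure produces the Kolmogorov-Smirnov and Shapiro-Wilk tests shown below):

EXAMINE VARIABLES=score BY group
  /PLOT NPPLOT
  /STATISTICS NONE.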

Tests of Normality

        Drug        Kolmogorov-Smirnov(a)          Shapiro-Wilk
                    Statistic   df   Sig.          Statistic   df   Sig.
Score   Placebo     .142        9    .200(*)       .978        9    .951
        Drug A      .236        9    .159          .932        9    .502
        Drug B      .220        9    .200(*)       .904        9    .277
        Drug C      .305        9    .015          .852        9    .078
        Drug D      .185        9    .200(*)       .958        9    .782

* This is a lower bound of the true significance.
a Lilliefors Significance Correction


Since all the Shapiro-Wilk Sig. values are greater than 0.05 (the Kolmogorov-Smirnov Sig. for Drug C falls below 0.05, but Shapiro-Wilk is the more reliable test for samples as small as these), the condition of normality is taken to be satisfied.

2. The subpopulations must have the same variance. Levene's test will be used to verify this; it is obtained together with the ANOVA test.

3. The subpopulations must be independent of each other.

Step 3: Running the analyses

Step 4: Analyze – Compare Means – One-Way ANOVA…: transfer Score into the Dependent List and Group into the Factor box.

Step 5: Click on Contrasts.


Note: Contrasts are used to investigate whether a particular mean, a set of means, or a linear combination of means differs from other means.
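As an illustration, the following sketch requests a contrast comparing Drug A with the Placebo; the contrast coefficients must sum to zero, and the group coding 0–4 defined in Step 1 is assumed.

* One coefficient per group, in the order Placebo, Drug A, Drug B, Drug C, Drug D.
ONEWAY score BY group
  /CONTRAST=-1 1 0 0 0.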


From the contrast output we conclude that the effect of Drug A on performance is not significantly different from the effect of the Placebo.

Step 6: Click on Post Hoc…

Suppose we reject H0: μ1 = μ2 = μ3 = μ4 = μ5; then we would like to know which pair(s) of means led to this rejection.


Planned and unplanned comparisons

Before running a drug experiment such as the current example, the experimenter may have some very specific questions in mind. It might be expected, for example (perhaps on theoretical grounds), that the mean score of every group who have ingested one of the drugs will be greater than the mean score of the Placebo group. This expectation would be tested by comparing each drug group with the Placebo group. Perhaps, on the other hand, the experimenter has theoretical reasons to suspect that Drugs A and B should enhance performance, but Drugs C and D should not. That hypothesis would be tested by comparing the Placebo mean with the average score for groups A and B combined and with the average score for groups C and D combined. These are examples of planned comparisons. Often, however, the experimenter, perhaps because the field has been little explored, has only a sketchy idea of how the results will turn out. There may be good reason to expect that some of the drugs will enhance performance; but it may not be possible, a priori, to be more specific. Unplanned (a posteriori, or post hoc) comparisons are part of the 'data-snooping' that inevitably follows the gathering of a data set.


Step 7: Click on Options.

Step 8: Click on Continue.
Step 9: Click on OK.
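For reference, a syntax sketch of the complete run, with Levene's homogeneity test and Tukey post hoc comparisons as selected in the steps above (the variable names score and group follow Step 1):

ONEWAY score BY group
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /POSTHOC=TUKEY ALPHA(0.05).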


Test of Homogeneity of Variances

Score
Levene Statistic    df1    df2    Sig.
2.464               4      40     .061

F = 2.464, p = 0.061 > 0.05, so we accept that the variances are homogeneous. The non-significance of the Levene statistic for the test of equality of error variances (homogeneity of variances) indicates that the assumption of homogeneity of variance is tenable; however, considerable differences among the variances are apparent from inspection.

To obtain the following table, run the ANOVA analysis without selecting any contrasts.

ANOVA

Score
                  Sum of Squares    df    Mean Square    F       Sig.
Between Groups    337.422           4     84.356         7.888   .000
Within Groups     427.778           40    10.694
Total             765.200           44

H0: μ1 = μ2 = μ3 = μ4 = μ5
Test: ANOVA. F = 7.888, p = 0.000 < 0.05; we reject H0.

How to report?

(a) H0: μ1 = μ2 = μ3 = μ4 = μ5
    H1: H0 is false.
    Test: ANOVA. F = 7.888, p = 0.000 < 0.05; we reject H0.


(b) Tests of Normality

        Drug        Kolmogorov-Smirnov(a)          Shapiro-Wilk
                    Statistic   df   Sig.          Statistic   df   Sig.
Score   Placebo     .142        9    .200(*)       .978        9    .951
        Drug A      .236        9    .159          .932        9    .502
        Drug B      .220        9    .200(*)       .904        9    .277
        Drug C      .305        9    .015          .852        9    .078
        Drug D      .185        9    .200(*)       .958        9    .782

* This is a lower bound of the true significance.
a Lilliefors Significance Correction

Since all the Shapiro-Wilk Sig. values are greater than 0.05, the condition of normality is satisfied.

(c) Levene's test: F = 2.464, p = 0.061 > 0.05, so we accept that the variances are homogeneous. The non-significance of the Levene statistic for the test of equality of error variances (homogeneity of variances) indicates that the assumption of homogeneity of variance is tenable; however, considerable differences among the variances are apparent from inspection.

(d) Post hoc test: Tukey test. The output shows that there are two subgroups of means. Within each subgroup there are no significant pairwise differences; on the other hand, any member of either subgroup is significantly different from any member of the other subgroup. For example, there are no differences among Drugs B, C and D, but each of those is significantly different from both the Placebo and Drug A. In a word, of the four drugs tested, the only one not to produce an improvement over the Placebo was Drug A.

(e) Contrasts: comment on any planned contrasts (see Step 5).

Kruskal-Wallis test

If the normality condition is not satisfied, we should use the Kruskal-Wallis test instead.
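A minimal syntax sketch, assuming the score and group variables defined earlier (the range 0 4 gives the minimum and maximum group codes):

NPAR TESTS
  /K-W=score BY group(0 4).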
