Mqc Stat170 Assg3 2011c Soln


Macquarie City Campus STAT170 Introductory Statistics

Semester 3, 2011 Assignment 3 SOLUTIONS

Due: Week 12 (in your tutorial class)

This assignment is worth 5% of your final assessment of the unit.

Instructions for submission:

1. You can either word-process this assignment, or write neatly by hand.
2. The assignment may be done individually or by a group of TWO students.
3. Each student should attempt ALL questions in the entire assignment independently in the first instance. This should be done during the week after the assignment has been distributed.
4. When all students in the group have attempted all questions, the group should meet to discuss their solutions. The group should then write up a final version of their solution.
5. Only ONE assignment should be submitted per group. Each student in the group will receive the same mark allocated for that assignment, provided each student contributed equally.

Note: As each part of this assignment covers different materials from the unit, it is important that each student attempts all questions. The purpose of group work is to give students an opportunity to work together as a team by discussing their solutions with fellow students. Under NO circumstances should one student in the group attempt one question and another student attempt another. You are reminded that mistakes are also shared among all students in your group.

Declaration: All students signing below certify that they have contributed equally to the attached work and take responsibility for the answers to ALL questions. We carried out this assignment without significant assistance from anyone else outside our group apart from general discussion.

Student ID Surname Given name(s) Signature

1.

2.

In the case where one member’s contribution was significantly less than the other members’ contributions, this should be drawn to the attention of the lecturer.


Introduction

A study of students who were enrolled in a second year bioinformatics unit was carried out to investigate students’ performance in their assignments as well as in the final exam. A tutorial class of 26 students was used for the study. During the semester students were asked to complete two assignments. Both assignments were marked out of a total of 30. Marks in the final exam were also recorded for those students.

Data from this study include:

Sex: 1 = male, 2 = female
Ass1: First assignment mark (out of 30)
Ass2: Second assignment mark (out of 30)
Exam: Final exam mark (out of 100)
Attendance: 1 = attended at least 50%, 2 = attended less than 50% of classes

The data file is marks.xls.


Question 1

Research Question: Is there a change in students’ performance (i.e. change in marks) between Assignment 1 (Ass1) and Assignment 2 (Ass2)?

Perform an appropriate hypothesis test to answer the above research question. Remember to justify any assumptions.

Note:

1. You have to do some work on the Excel file, even though the hypothesis test itself is to be done by hand.
2. Find from the data file the required statistics in order to perform your calculations. Do NOT use EcStat’s hypothesis testing output.

This is a paired t-test, and we need to form a new variable (column) in Excel:

diff = Ass2 - Ass1 (or Ass1 - Ass2)

The numerical summary for diff (from EcStat) is:

Numerical Summary: diff

Variable Size Mean StDev

diff 26 1.7500 2.9572
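As a rough illustration of how the diff column and its numerical summary could be produced outside Excel, here is a minimal Python sketch. The marks in `ass1` and `ass2` are hypothetical values invented for the example (they are NOT the contents of marks.xls), so the printed summary will not match the EcStat figures above.

```python
import statistics

# Hypothetical marks for five students (NOT the values in marks.xls).
ass1 = [18.0, 22.5, 15.0, 27.0, 20.0]
ass2 = [20.0, 23.0, 18.5, 26.0, 24.0]

# Paired differences, oriented as Ass2 - Ass1.
diff = [a2 - a1 for a1, a2 in zip(ass1, ass2)]

n = len(diff)
mean_diff = statistics.mean(diff)
sd_diff = statistics.stdev(diff)  # sample standard deviation (divisor n - 1)

print(f"Variable Size Mean StDev")
print(f"diff {n} {mean_diff:.4f} {sd_diff:.4f}")
```

The sample standard deviation (divisor n - 1) is used because that is what the paired t-test formula requires.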

H: H0: µd = 0

A: The histogram of diff suggests that the differences could come from a normal population. (Alternatively, n = 26 ≥ 25, and by the CLT, $\bar{y}_d$ is approximately normally distributed.)

T: $t = \dfrac{\bar{y}_d - \mu_d}{s_d/\sqrt{n}} = \dfrac{1.75 - 0}{2.9572/\sqrt{26}} = 3.01727$

df = n - 1 = 25

P: From t-table with df=25, 0.005 < p-val < 0.01. Hence reject Ho.

C: Evidence shows that students had higher marks in Assignment 2 than in Assignment 1 on average.

95% C.I. for µd: $\bar{y}_d \pm t_{n-1}\,\dfrac{s_d}{\sqrt{n}} = 1.75 \pm 2.060 \times \dfrac{2.9572}{\sqrt{26}} = (0.555,\ 2.945)$

We are 95% confident that the average difference in marks between Assignment 2 and Assignment 1 (Ass2 - Ass1) lies between 0.555 and 2.945.

(Check that the CI above excludes the null value 0.)
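The hand calculation for the paired t-test can be cross-checked with a short Python sketch. It uses only the summary statistics reported for diff and the t-table critical value 2.060 (df = 25) quoted in the solution; the variable names are illustrative. Small differences in the later decimals of t are expected because the inputs are rounded.

```python
import math

# Summary statistics for diff, taken from the EcStat numerical summary.
n = 26
mean_diff = 1.7500
sd_diff = 2.9572

# Two-sided 5% critical value from the t-table with df = 25 (as used in the solution).
t_crit = 2.060

df = n - 1
se = sd_diff / math.sqrt(n)        # standard error of the mean difference
t_stat = (mean_diff - 0) / se      # test statistic under H0: mu_d = 0

# 95% confidence interval for mu_d.
ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)

reject_h0 = abs(t_stat) > t_crit   # two-sided test at the 5% level

print(f"df = {df}, t = {t_stat:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Reject H0 at the 5% level: {reject_h0}")
```

Rejecting H0 when |t| exceeds the critical value is equivalent to the 95% CI excluding 0, which is the check suggested above.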


Question 2

Research Question: Is there a difference in the marks obtained in Assignment 2 (Ass2) between students who attended at least 50% of classes and those who did not?

Perform, by hand, an appropriate hypothesis test to address the above research question. Use the following information to help you. Do NOT use EcStat to do the hypothesis test.

Attendance   Size  Mean    SE     StDev
Attend >50%  15    19.467  1.088  4.764
Attend <50%  11    15.500  1.271  3.294

H: H0: µ1=µ2

A:

• The 2 histograms indicate that the 2 samples could come from 2 normal populations.

• The two sample standard deviations are close, and so are the 2 corresponding IQRs (boxes). Thus it is reasonable to assume the 2 population standard deviations are equal, i.e. σ1 = σ2.

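T: $s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\dfrac{14 \times 4.764^2 + 10 \times 3.294^2}{14 + 10}} = 4.2143$

$t = \dfrac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \dfrac{19.467 - 15.5}{4.2143\sqrt{\dfrac{1}{15} + \dfrac{1}{11}}} = 2.3713$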

df = 15+11-2=24

P: From t-table with df=24, 0.02 < p-val < 0.05 Hence reject Ho.

C: The average Assignment 2 mark is higher for those students who had 50% or more attendance than for those who had less than 50% attendance.

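95% CI for µ1-µ2: $(\bar{y}_1 - \bar{y}_2) \pm t_{\nu} \times SE = (\bar{y}_1 - \bar{y}_2) \pm t_{24} \times s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

$= (19.467 - 15.5) \pm 2.064 \times 4.2143\sqrt{\dfrac{1}{15} + \dfrac{1}{11}}$

= (0.514, 7.420)

[Figure: box plots of Ass2 (axis 5 to 30) by Attendance (Attend >50%, Attend <50%)]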

We are 95% confident that the average Assignment 2 mark is between 0.514 and 7.420 marks higher for those who had 50% or more attendance than for those who had less than 50% attendance.

(Check that the CI excludes the null value 0.)
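The pooled two-sample calculation can be reproduced with the following Python sketch. It assumes equal population standard deviations (as justified in the assumptions), takes its inputs from the summary table for Ass2 by Attendance, and uses the t-table critical value 2.064 (df = 24) quoted in the solution. Variable names are illustrative.

```python
import math

# Summary statistics for Ass2 by attendance group (from the table in the question).
n1, mean1, sd1 = 15, 19.467, 4.764   # Attend >50%
n2, mean2, sd2 = 11, 15.500, 3.294   # Attend <50%

# Two-sided 5% critical value from the t-table with df = 24 (as used in the solution).
t_crit = 2.064

df = n1 + n2 - 2

# Pooled standard deviation under the equal-spread assumption.
s_p = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)

# Standard error of the difference in sample means.
se = s_p * math.sqrt(1 / n1 + 1 / n2)

t_stat = (mean1 - mean2) / se        # test statistic under H0: mu1 = mu2

diff = mean1 - mean2
ci = (diff - t_crit * se, diff + t_crit * se)   # 95% CI for mu1 - mu2

print(f"s_p = {s_p:.4f}, df = {df}, t = {t_stat:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note that the SE column in the summary table matches s_p divided by the square root of each group size, which suggests EcStat reports group standard errors based on the pooled standard deviation.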


Question 3

Research Question: Is the mark obtained by a student in Assignment 1 (Ass1) a useful predictor for his or her mark in the final exam (Exam)?

(a) Perform, by hand, an appropriate hypothesis test to address the above research question. Use the above information to help you. Do NOT use EcStat to do the hypothesis test.

H: Ho: β = 0

A: From the scatter plot, the relation looks linear. The residuals seem to have a normal distribution and constant spread.

T: $t = \dfrac{b}{SE(b)} = \dfrac{3.276}{0.461} = 7.106$

df = 26-2 = 24

P: From t-table, using df=24, p-val < 0.0005. Hence reject Ho.

C: There is a positive linear relation between Exam and Assignment 1 marks.

For each extra mark in Assignment 1, there corresponds an average increase of 3.276 marks in Exam.

95% CI for β: $b \pm t_{24} \times SE(b) = 3.276 \pm 2.064 \times 0.461 = (2.325,\ 4.228)$

We are 95% confident that the true increase β in the population lies between 2.325 and 4.228.

(b) Write down the value of the goodness-of-fit statistic. Interpret the meaning of this value.

$r^2 = 0.677$

67.7% of the variation in Exam marks can be explained (accounted for) by the variation in Assignment 1 marks.

(c) Calculate the value of the correlation coefficient. Interpret the meaning of this value.

$r = +\sqrt{0.677} = +0.8228$ (positive, because the slope b is positive)

There is a strong positive linear relationship between Exam marks and Assignment 1 marks.
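The slope test, the confidence interval for β and the correlation coefficient can be checked together with a short Python sketch. It takes b, SE(b) and r-sq = 0.677 from the regression output for Exam on Ass1, and the t-table critical value 2.064 (df = 24) quoted in the solution; variable names are illustrative.

```python
import math

# Values read from the regression output (outcome: Exam, predictor: Ass1).
n = 26
b = 3.276        # estimated slope
se_b = 0.461     # standard error of the slope
r_sq = 0.677     # goodness-of-fit statistic (r-sq)

# Two-sided 5% critical value from the t-table with df = 24 (as used in the solution).
t_crit = 2.064

df = n - 2
t_stat = b / se_b                            # test statistic under H0: beta = 0
ci = (b - t_crit * se_b, b + t_crit * se_b)  # 95% CI for beta

# The correlation coefficient takes the sign of the slope.
r = math.copysign(math.sqrt(r_sq), b)

print(f"df = {df}, t = {t_stat:.3f}")
print(f"95% CI for beta = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"r = {r:+.4f}")
```

The interval computed from the rounded b and SE(b) can differ from EcStat's in the third decimal place, since EcStat works with unrounded values.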

[Figure: scatter plot of Exam (axis 30 to 100) against Ass1 (axis 8 to 28)]

Regression output (outcome: Exam; df: 24)

predictor  coeff    SE     t       p-value  95% C.I.
constant   13.6238  7.573  1.7990  0.085    (-2.006, 29.254)
Ass1       3.2760   0.461

r-sq: 0.677  Resid SS: 1602.188  s: 8.171


Question 4

Research Question: Which of the following 4 variables, Ass1, Ass2, Gender and Attendance, are significant in affecting Exam?

(a) Use EcStat to perform analysis on each of the independent variables with Exam. Paste the outputs in the spaces below. Do NOT write anything here.

[Figure: scatter plot of Exam (axis 30 to 100) against Ass1 (axis 8 to 28)]

Regression output (outcome: Exam; df: 24)

predictor  coeff    SE     t       p-value  95% C.I.
constant   13.6238  7.573  1.7990  0.085    (-2.006, 29.254)
Ass1       3.2760   0.461  7.0989  0.000    (2.324, 4.229)

r-sq: 0.677  Resid SS: 1602.188  s: 8.171

Fitted line: Exam = 13.6238 + 3.276 Ass1

[Figure: scatter plot of Exam (axis 30 to 100) against Ass2 (axis 5 to 30)]

Regression output (outcome: Exam; df: 24)

predictor  coeff    SE     t       p-value  95% C.I.
constant   17.9088  5.402  3.3149  0.003    (6.759, 29.059)
Ass2       2.7129   0.294  9.2138  0.000    (2.105, 3.321)

r-sq: 0.780  Resid SS: 1094.590  s: 6.753

Fitted line: Exam = 17.9088 + 2.7129 Ass2

[Figure: box plots of Exam (axis 30 to 90) by Gender (male, female)]

Two-sample t-test: Exam by Gender

Gender  Size  Mean    SE     StDev
male    12    63.290  4.075  14.662
female  14    68.633  3.773  13.636

factor  df  t      p-val   s       diff
Gender  24  0.962  0.3456  14.116  5.343

Resid SS: 4781.95  r-sq:

[Figure: box plots of Exam (axis 30 to 90) by Attendance (Attend >50%, Attend <50%)]

Two-sample t-test: Exam by Attendance

Attendance   Size  Mean    SE     StDev
Attend >50%  15    67.606  3.687  14.830
Attend <50%  11    64.204  4.305  13.469

factor      df  t      p-val   s       diff   CI/2
Attendance  24  0.600  0.5540  14.278  3.402  11.698

Resid SS: 4892.98  r-sq: 0.01


(b) Using your EcStat outputs in (a), write a brief statistical report to address the research question. Your report MUST contain the four sections: Introduction, Methods, Results and Conclusion. Some marks will be allocated to the organization of your report. You are advised to word-process the report on A4 paper and limit the length to at most 2 pages.

Hints:

1. Although not compulsory, it is advisable to summarize the results into an appropriate table.
2. To cull the “bad” variables and to select the “good” ones, we suggest that you follow these steps:

Step 1: Look for any case where assumptions of the relevant tests are violated, and then “disqualify” those variables.
Step 2: To select the relevant independent variables affecting Exam, discard those having p-values > 0.05.

INTRODUCTION

Researchers are interested in determining which of the 4 independent variables, Ass1, Ass2, Gender and Attendance, are significant in affecting the dependent variable Exam.

METHODS

The sample consisted of 26 students, assumed randomly selected from all students enrolled in a second year bioinformatics unit. The target population is all students enrolled in the bioinformatics unit. Of the 4 independent variables, Ass1 and Ass2 are numerical, while Gender and Attendance are categorical (and binary). The first two require regressions, while the latter two require 2-sample t-tests.

RESULTS

We shall look at the two methods separately.

A. Regression

For the 2 regressions involving Ass1 and Ass2 with Exam, the 2 scatter plots show that the 3 conditions for regression, namely linearity, constant spread for residuals, and normality of residuals, are satisfied. The results are summarized in the table below.

Independent variable  Assumptions satisfied?  p-val  Significant predictor? (Reject Ho: β=0?)  r2     Result
Ass1                  Yes                     0.000  Yes                                       0.677  sig predictor
Ass2                  Yes                     0.000  Yes                                       0.780  sig predictor

Both variables Ass1 and Ass2 have p-values < 5%, so both are significant predictors for Exam.

(Note: r2 is actually NOT required here since r2 is only used to select the best predictor. But the research question does not ask for the BEST predictor.)

B. 2-sample t-test

The results are summarized in the table below.

Independent variable  Assumptions satisfied?            p-val   Significant variable? (Reject Ho?)  Result
Gender                Equal spread: Yes; Normality: ?   0.3456  No (p-val > 5%)                     -----
Attendance            Equal spread: Yes; Normality: ?   0.5540  No (p-val > 5%)                     -----

In each of the 2-sample t-tests, the equal spread assumption seems to be satisfied, according to the box plots and the corresponding sample standard deviations. The normality condition is not directly verifiable, as the sample sizes are small and no histograms or stem-and-leaf plots are available (unless we draw them ourselves). However, the p-values are larger than 5% in both cases. Hence both variables Gender and Attendance are discarded, and whether the normality condition is met or not thus becomes irrelevant.

CONCLUSION

Of the 4 given independent variables Assignment 1, Assignment 2, Gender and Attendance, only Assignment 1 and Assignment 2 are significant in affecting the dependent variable Exam.
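The selection rule from Step 2 of the hints (discard variables with p-values > 0.05) can be sketched in Python as follows. The p-values are those reported in the EcStat outputs in part (a); the dictionary layout and names are illustrative, and the sketch assumes the test assumptions in Step 1 have already been judged acceptable.

```python
ALPHA = 0.05  # significance level used in the hints

# p-values read from the EcStat outputs in part (a).
results = {
    "Ass1": {"method": "regression", "p_value": 0.000},
    "Ass2": {"method": "regression", "p_value": 0.000},
    "Gender": {"method": "2-sample t-test", "p_value": 0.3456},
    "Attendance": {"method": "2-sample t-test", "p_value": 0.5540},
}

# Keep a variable only if its p-value does not exceed ALPHA.
significant = [name for name, res in results.items() if res["p_value"] <= ALPHA]
discarded = [name for name in results if name not in significant]

for name, res in results.items():
    verdict = "significant" if name in significant else "not significant"
    print(f"{name}: {res['method']}, p-val = {res['p_value']:.4f} -> {verdict}")
```

The p-values of 0.000 are as displayed by EcStat, meaning smaller than 0.0005 after rounding rather than exactly zero.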
