Macquarie City Campus STAT170 Introductory Statistics
Semester 3, 2011 Assignment 3 SOLUTIONS
Due: Week 12 (in your tutorial class)
This assignment is worth 5% of your final assessment of the unit.
Instructions for submission:
1. You can either word-process this assignment, or write neatly by hand.
2. The assignment may be done individually or by a group of TWO students.
3. Each student should attempt ALL questions in the entire assignment independently in the first instance. This should be done during the week after the assignment has been distributed.
4. When all students in the group have attempted all questions, the group should meet to discuss their solutions. The groups should then write up a final version of their solution.
5. Only ONE assignment should be submitted per group. Each student in the group will receive the same mark allocated for that assignment, provided each student contributed equally.
Note: As each part of this assignment covers different materials from the unit, it is important that each student attempts all questions. The purpose of group work is to give students an opportunity to work together as a team by discussing their solutions with fellow students. Under NO circumstances should one student in the group attempt one question and another student attempt another. You are reminded that mistakes are also shared among all students in your group.
Declaration: All students signing below certify that they have contributed equally to the attached work and take responsibility for the answers to ALL questions. We carried out this assignment without significant assistance from anyone else outside our group apart from general discussion.
Student ID Surname Given name(s) Signature
1.
2.
In the case where one member’s contribution was significantly less than the other members’ contributions, this should be drawn to the attention of the lecturer.
Introduction
A study of students who were enrolled in a second year bioinformatics unit was carried out to investigate students’ performance in their assignments as well as in the final exam. A tutorial class of 26 students was used for the study. During the semester students were asked to complete two assignments. Both assignments were marked out of a total of 30. Marks in the final exam were also recorded for those students.
Data from this study include:
Sex: 1 = male, 2 = female
Ass1: First assignment mark (out of 30)
Ass2: Second assignment mark (out of 30)
Exam: Final exam mark (out of 100)
Attendance: 1 = attended at least 50%, 2 = attended less than 50% of classes
The data file is marks.xls.
Question 1
Research Question: Is there a change in students’ performance (i.e. change in marks) between Assignment 1 (Ass1) and Assignment 2 (Ass2)?
Perform an appropriate hypothesis test to answer the above research question. Remember to justify any assumptions.
Note:
1. You have to do some work on the Excel file, even though you will do your hypothesis test by hand.
2. Then find from the data file the required statistics in order to perform your calculations. But do your hypothesis test by hand; do NOT use EcStat’s hypothesis testing output.
This is a paired t-test, and we need to form a new variable (column) in
Excel:
diff = Ass2 – Ass1 (or Ass1 – Ass2)
The numerical summary for diff (from EcStat) is:
Numerical Summary: diff
Variable Size Mean StDev
diff 26 1.7500 2.9572
H: H0: µd = 0 versus H1: µd ≠ 0 (two-sided, since the question asks about a change in either direction)
A: The histogram of diff suggests that the differences could come from a
normal population. (Alternatively, n = 26 ≥ 25, so by the CLT the sample
mean difference $\bar{y}_d$ is approximately normally distributed.)
T: $t = \dfrac{\bar{y}_d - \mu_d}{s_d/\sqrt{n}} = \dfrac{1.75 - 0}{2.9572/\sqrt{26}} = 3.01727$
df = n-1 = 25
P: From t-table with df=25, 0.005 < p-val < 0.01. Hence reject Ho.
C: Evidence shows that students had higher marks in Assignment 2 than
in Assignment 1 on average.
95% C.I. for µd = $\bar{y}_d \pm t_{n-1}\,\dfrac{s_d}{\sqrt{n}} = 1.75 \pm 2.060 \times \dfrac{2.9572}{\sqrt{26}}$
= (0.555, 2.945)
We are 95% confident that average difference in marks between
Assignment 2 and Assignment 1 lies between 0.555 and 2.945.
(Check that the CI above excludes the null value 0.)
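The hand calculation above can be checked with a short script. This is only a sketch that works from the summary statistics quoted in the solution (mean 1.75, StDev 2.9572, n = 26) and the t-table critical value 2.060 for df = 25; the function name `paired_t_summary` is made up for illustration and is not part of EcStat.

```python
import math

def paired_t_summary(mean_diff, sd_diff, n, t_crit, null_value=0.0):
    """Paired t-test from summary statistics of the differences.

    t_crit is the t-table critical value for df = n - 1 (looked up by hand).
    Returns the t statistic and the lower/upper confidence limits.
    """
    se = sd_diff / math.sqrt(n)              # standard error of the mean difference
    t_stat = (mean_diff - null_value) / se   # test statistic under H0
    margin = t_crit * se                     # half-width of the confidence interval
    return t_stat, mean_diff - margin, mean_diff + margin

# Values taken from the EcStat numerical summary for diff
t_stat, ci_low, ci_high = paired_t_summary(1.75, 2.9572, 26, t_crit=2.060)
print(f"t = {t_stat:.3f}, df = {26 - 1}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The script reproduces t ≈ 3.017 and the interval (0.555, 2.945); the p-value range still has to be read from the t-table.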
Question 2
Research Question: Is there a difference in the marks obtained in Assignment 2 (Ass2) between students who attended at least 50% of classes and those who did not?
Perform, by hand, an appropriate hypothesis test to address the above research question. Use the following information to help you. Do NOT use EcStat to do the hypothesis test.
Attendance    Size   Mean     SE      StDev
Attend >50%   15     19.467   1.088   4.764
Attend <50%   11     15.500   1.271   3.294
H: H0: µ1 = µ2 versus H1: µ1 ≠ µ2
A:
• The 2 histograms indicate that the 2 samples could come from 2
normal populations.
• The two sample standard deviations are close, and so are the 2
corresponding IQRs (boxes). Thus it is reasonable to assume the 2
population standard deviations are equal, i.e. σ1 = σ2.
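T: $s_p = \sqrt{\dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} = \sqrt{\dfrac{14 \times 4.764^2 + 10 \times 3.294^2}{14 + 10}} = 4.2143$
$t = \dfrac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} = \dfrac{19.467 - 15.5}{4.2143\sqrt{\dfrac{1}{15}+\dfrac{1}{11}}} = 2.3713$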
df = 15+11-2=24
P: From t-table with df=24, 0.02 < p-val < 0.05 Hence reject Ho.
C: The average Assignment 2 mark is higher for students who attended at
least 50% of classes than for those who attended less than 50%.
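[Figure: box plots of Ass2 (axis from 5 to 30) by Attendance (Attend >50%, Attend <50%)]
95% CI for µ1-µ2 = $(\bar{y}_1 - \bar{y}_2) \pm t_{\nu} \times SE = (\bar{y}_1 - \bar{y}_2) \pm t_{24} \times s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}$
$= (19.467 - 15.5) \pm 2.064 \times 4.2143\sqrt{\dfrac{1}{15}+\dfrac{1}{11}}$
= (0.514, 7.420)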
We are 95% confident that the average Assignment 2 mark is between 0.514
and 7.420 marks higher for students who attended at least 50% of classes
than for those who attended less than 50%.
(Check that the CI excludes the null value 0.)
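As a cross-check of the pooled two-sample calculation, the following sketch recomputes the pooled standard deviation, the t statistic and the confidence interval from the summary table (sizes, means and StDevs) and the t-table critical value 2.064 for df = 24. The function name `pooled_t_summary` is hypothetical; it is not an EcStat routine.

```python
import math

def pooled_t_summary(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Equal-variance (pooled) two-sample t-test from summary statistics.

    t_crit is the t-table critical value for df = n1 + n2 - 2.
    Returns the pooled SD, t statistic, df and the confidence interval.
    """
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two sample variances
    s_p = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    se = s_p * math.sqrt(1 / n1 + 1 / n2)    # standard error of the difference
    diff = mean1 - mean2
    t_stat = diff / se
    margin = t_crit * se
    return s_p, t_stat, df, (diff - margin, diff + margin)

# Group 1: Attend >50%, Group 2: Attend <50% (values from the summary table)
s_p, t_stat, df, ci = pooled_t_summary(19.467, 4.764, 15, 15.5, 3.294, 11, t_crit=2.064)
print(f"s_p = {s_p:.4f}, t = {t_stat:.4f}, df = {df}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

This reproduces s_p ≈ 4.2143, t ≈ 2.3713 and the interval (0.514, 7.420) obtained by hand.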
Question 3
Research Question: Is the mark obtained by a student in Assignment 1 (Ass1) a useful predictor for his or her mark in the final exam (Exam)?
(a) Perform, by hand, an appropriate hypothesis test to address the above research question. Use the above information to help you. Do NOT use EcStat to do the hypothesis test.
H: H0: β = 0 versus H1: β ≠ 0
A: From the scatter plot, the relation looks linear. The residuals seem to
have a normal distribution and constant spread.
T: $t = \dfrac{b}{SE(b)} = \dfrac{3.276}{0.461} = 7.106$
df = 26-2 = 24
P: From t-table, using df=24, p-val < 0.0005. Hence reject Ho.
C: There is a positive linear relation between Exam and Assignment 1
marks.
For each extra mark in Assignment 1, the Exam mark increases by 3.276
marks on average.
95% CI for β = $b \pm t_{24} \times SE(b)$ = 3.276 ± 2.064 × 0.461 = (2.325, 4.228)
We are 95% confident that the true increase β in the population lies
between 2.325 and 4.228.
(b) Write down the value of the goodness-of-fit statistic. Interpret the meaning of this value.
r2 = 0.677
67.7% of the variation in Exam marks can be explained (accounted for) by
the variation in Assignment 1 marks.
(c) Calculate the value of the correlation coefficient. Interpret the meaning of this value.
r = √0.677 = +0.8228 (positive, because the slope b is positive)
There is a strong positive linear relationship between Exam marks and
Assignment 1 marks.
[Figure: scatter plot of Exam (vertical axis, 30 to 100) against Ass1 (horizontal axis, 8 to 28)]
Regression output (outcome: Exam; df: 24)
predictor   coeff     SE      t        p-value   95% C.I.
constant    13.6238   7.573   1.7990   0.085     (-2.006, 29.254)
Ass1         3.2760   0.461
r-sq: 0.677   Resid SS: 1602.188   s: 8.171
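The slope test, its confidence interval and the correlation coefficient for Question 3 can be recomputed from the quoted output values b = 3.276, SE(b) = 0.461 and r-sq = 0.677, with the t-table critical value 2.064 for df = 24. The sketch below is an illustration only; `slope_inference` is a made-up helper name.

```python
import math

def slope_inference(b, se_b, r_squared, t_crit):
    """Inference for a simple linear regression slope from summary output.

    t_crit is the t-table critical value for df = n - 2.
    Returns the t statistic, the confidence interval for the slope,
    and the correlation coefficient r (given the sign of the slope).
    """
    t_stat = b / se_b                               # tests H0: beta = 0
    ci = (b - t_crit * se_b, b + t_crit * se_b)
    r = math.copysign(math.sqrt(r_squared), b)      # r takes the sign of b
    return t_stat, ci, r

t_stat, ci, r = slope_inference(3.276, 0.461, 0.677, t_crit=2.064)
print(f"t = {t_stat:.3f}")
print(f"95% CI for beta = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"r = {r:+.4f}")
```

Small differences in the last decimal place from the EcStat figures are expected, because the inputs here are already rounded.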
Question 4
Research Question: Which of the following 4 variables, Ass1, Ass2, Gender and Attendance, are significant in affecting Exam?
(a) Use EcStat to perform analysis on each of the independent variables with Exam. Paste the outputs in the spaces below. Do NOT write anything here.
[Figure: scatter plot of Exam (vertical axis, 30 to 100) against Ass1 (horizontal axis, 8 to 28)]
Regression output (outcome: Exam; df: 24)
predictor   coeff     SE      t        p-value   95% C.I.
constant    13.6238   7.573   1.7990   0.085     (-2.006, 29.254)
Ass1         3.2760   0.461   7.0989   0.000     (2.324, 4.229)
r-sq: 0.677   Resid SS: 1602.188   s: 8.171
Fitted line: Exam = 13.6238 + 3.276 Ass1
[Figure: scatter plot of Exam (vertical axis, 30 to 100) against Ass2 (horizontal axis, 5 to 30)]
Regression output (outcome: Exam; df: 24)
predictor   coeff     SE      t        p-value   95% C.I.
constant    17.9088   5.402   3.3149   0.003     (6.759, 29.059)
Ass2         2.7129   0.294   9.2138   0.000     (2.105, 3.321)
r-sq: 0.780   Resid SS: 1094.590   s: 6.753
Fitted line: Exam = 17.9088 + 2.7129 Ass2
[Figure: box plots of Exam (axis from 30 to 90) by Gender (male, female)]
Two-sample t-test (outcome: Exam; factor: Gender)
Gender   Size   Mean     SE      StDev
male     12     63.290   4.075   14.662
female   14     68.633   3.773   13.636
factor   df   t       p-val    s        diff
Gender   24   0.962   0.3456   14.116   5.343
Resid SS: 4781.95   r-sq:
[Figure: box plots of Exam (axis from 30 to 90) by Attendance (Attend >50%, Attend <50%)]
Two-sample t-test (outcome: Exam; factor: Attendance)
Attendance    Size   Mean     SE      StDev
Attend >50%   15     67.606   3.687   14.830
Attend <50%   11     64.204   4.305   13.469
factor       df   t       p-val    s        diff    CI/2
Attendance   24   0.600   0.5540   14.278   3.402   11.698
Resid SS: 4892.98   r-sq: 0.01
(b) Using your EcStat outputs in (a), write a brief statistical report to address the research question. Your report MUST contain the four sections: Introduction, Methods, Results and Conclusion. Some marks will be allocated to the organization of your report. You are advised to word-process the report on A4 paper and limit the length to at most 2 pages.
Hints:
1. Although not compulsory, it is advisable to summarize the results into an appropriate table.
2. To cull the “bad” variables and to select the “good” ones, we suggest that you follow these steps:
Step 1: Look for any case where assumptions of the relevant tests are violated, and then “disqualify” those variables.
Step 2: To select the relevant independent variables affecting Exam, discard those having p-values > 0.05.
INTRODUCTION
Researchers are interested in determining which of the 4 independent
variables, Ass1, Ass2, Gender and Attendance, are significant in
affecting the dependent variable Exam.
METHODS
The sample consisted of 26 students, assumed randomly selected from all
students enrolled in a second year bioinformatics unit. The target
population is all students enrolled in the bioinformatics unit. Of
the 4 independent variables, Ass1 and Ass2 are numerical, while Gender and
Attendance are categorical (and binary). The first two require
regressions, while the latter two require 2-sample t-tests.
RESULTS
We shall look at the two methods separately.
A. Regression
For the 2 regressions involving Ass1 and Ass2 with Exam, the 2 scatter
plots show that the 3 conditions for regression, namely linearity, constant
spread for residuals, and normality of residuals are satisfied. The
results are summarized in the table below.
Independent   Assumptions   p-val   Significant predictor?   r2      Result
variable      satisfied?            (Reject Ho: β=0?)
Ass1          Yes           0.000   Yes                      0.677   sig predictor
Ass2          Yes           0.000   Yes                      0.780   sig predictor
Both variables Ass1 and Ass2 have p-values < 5%, so both are significant
predictors for Exam.
(Note: r2 is actually NOT required here since r2 is only used to select the
best predictor. But the research question does not ask for the BEST
predictor.)
B. 2-sample t-test
The results are summarized in the table below.
Independent   Assumptions                       p-val    Significant variable?   Result
variable      satisfied?                                 (Reject Ho?)
Gender        Equal spread - Yes; Normality - ?   0.3456   No (p-val>5%)           -----
Attendance    Equal spread - Yes; Normality - ?   0.5540   No (p-val>5%)           -----
In each of the 2-sample t-tests, the equal spread assumption seems to be
satisfied, according to the box plots and the corresponding sample
standard deviations. The normality condition is not directly verifiable,
as the sample sizes are small and no histograms or stem-and-leaf plots
are available unless we draw them ourselves. However, the p-values are
larger than 5% for both cases. Hence both variables Gender and
Attendance are discarded, and whether the normality condition is met or
not thus becomes irrelevant.
CONCLUSION
Of the 4 given independent variables Assignment 1, Assignment 2,
Gender and Attendance, only Assignment 1 and Assignment 2 are
significant in affecting the dependent variable Exam.
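The selection rule in Step 2 of the hints (discard variables with p-values > 0.05) can be written out as a small sketch. The p-values are the ones reported in the EcStat outputs (0.000 is the value as printed, rounded to three decimals); the dictionary and list names are purely illustrative.

```python
# p-values as reported in the EcStat outputs for each independent variable
p_values = {
    "Ass1": 0.000,        # regression slope test
    "Ass2": 0.000,        # regression slope test
    "Gender": 0.3456,     # 2-sample t-test
    "Attendance": 0.5540, # 2-sample t-test
}

ALPHA = 0.05  # 5% significance level used throughout the assignment

# Keep variables whose p-value does not exceed alpha; discard the rest
significant = [name for name, p in p_values.items() if p <= ALPHA]
discarded = [name for name, p in p_values.items() if p > ALPHA]

print("Significant:", significant)
print("Discarded:", discarded)
```

This applies only Step 2; the assumption checks in Step 1 still have to be made from the plots.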