Dr. Veronika Alhanaqtah. Econometrics

HOMEWORK 1. LINEAR REGRESSION (calculate “manually” or use Excel)

Introduction

In the lecture part of the course, numerous concepts were introduced and illustrated, but the analysis was broken into disjoint pieces by section. In this practical assignment we go through a complete example of correlation and regression analysis of data from start to finish, touching in sequence on all the topics of the lecture on Linear Regression.

Problem to solve: In general, educators are convinced that, all other factors being equal, class attendance has a significant bearing on course performance. To investigate the relationship between attendance and performance, an education researcher selects for study a multiple-section “Statistics” course at a large university. Instructors in the course agree to keep an accurate record of attendance throughout one semester. At the end of the semester 26 students are selected at random. For each student in the sample two measurements are taken: x, the number of days the student was absent, and y, the student’s score on the common final exam in the course. The data are summarized in Table 1 below.

Table 1. Absence and score data

Absences (x)    Score (y)
2               76
7               29
2               96
7               63
2               79
7               71
0               88
0               92
6               55
6               70
2               80
2               75
1               63
4               41
5               63
4               88
0               98
1               99
0               89
1               96
3               90
1               90
3               68
1               84
3               80
1               78

Step 1. Construct a scatter plot of the absence and exam score pairs.

How to do this:
Copy the data from Table 1 and insert it into MS Excel;
Select the data (numbers only) and choose a scatter diagram on the panel;
Name the diagram. It should look as follows:

[Figure: Scatter plot of the absence and exam score pairs. Horizontal axis: absences x, from 0 to 8; vertical axis: score y, from 0 to 120.]

The trend observed in the plot makes it reasonable to assume a relationship between x and y of the form

$$y = \beta_2 x + \beta_1 + \varepsilon$$

where $\beta_2$ and $\beta_1$ are unknown parameters and $\varepsilon$ is a normal random variable with mean zero and unknown standard deviation $\sigma$. Note carefully that this model is being proposed for the population of all students taking this course, not just those taking it this semester, and certainly not just those in the sample. The numbers $\beta_2$, $\beta_1$ and $\sigma$ are parameters relating to this large population.

Step 2. Perform preliminary computations that you will need later.

How to do this:
Construct Table 2 in MS Excel, with 5 columns and 27 rows (one header row and 26 data rows):

x      y      $x^2$      $xy$      $y^2$
…      …      …          …         …

Copy the data on x and y from Table 1 into Table 2. Compute $x^2$, $xy$ and $y^2$ for each row.

Step 3. Using the data from Table 2 and the formulas from the lecture on “Topic 1. Linear Regression”, compute: $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$, $\bar{x}$, $\bar{y}$, $SS_{xx}$, $SS_{yy}$, $SS_{xy}$.

How to do this:
On the basis of Table 2 construct Table 3, adding a 28th and a 29th row and an additional column on the left:

                        x      y      $x^2$      $xy$      $y^2$
                        …      …      …          …         …
                        …      …      …          …         …
$\sum$                  …      …      …          …         …
$\bar{x}$, $\bar{y}$    …      …      (no computations in the last three cells of this row)

Using Excel formulas, compute the sums and averages: $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$, $\bar{x}$, $\bar{y}$. Using the formulas below, compute $SS_{xx}$, $SS_{yy}$, $SS_{xy}$.

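$$SS_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2$$

$$SS_{xy} = \sum xy - \frac{1}{n} \left( \sum x \right) \left( \sum y \right)$$

$$SS_{yy} = \sum y^2 - \frac{1}{n} \left( \sum y \right)^2$$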
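For readers who want to cross-check the spreadsheet with a script, the column totals and the three sums of squares from Steps 2 and 3 can be sketched in Python as below. This is only an illustrative check, not part of the assignment; the variable names (`sum_x`, `ss_xx` and so on) are chosen here and do not come from the lecture.

```python
# Absence (x) and score (y) pairs copied from Table 1 (26 students).
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

n = len(x)

# Column totals (the "sum" row of the extended table).
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_y2 = sum(yi * yi for yi in y)

# Sample means (the "mean" row of the extended table).
x_bar = sum_x / n
y_bar = sum_y / n

# Shortcut formulas for the sums of squares.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_y2 - sum_y ** 2 / n

print("n =", n)
print("sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print("means:", round(x_bar, 4), round(y_bar, 4))
print("SSxx, SSxy, SSyy:", round(ss_xx, 4), round(ss_xy, 4), round(ss_yy, 4))
```

Comparing the printed totals with the Excel cells is a quick way to catch a mistyped data row before moving on.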

Question 1. What are the values of $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$, $\bar{x}$, $\bar{y}$, $SS_{xx}$, $SS_{yy}$, $SS_{xy}$?

Step 4. We begin the actual modeling by finding the least squares regression line (the line that best fits the data). Compute the slope and y-intercept, and construct the regression line.

How to do this:
Using the formulas below, compute the estimates $\hat{\beta}_2$ and $\hat{\beta}_1$ in MS Excel:

$$\hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}}$$

$$\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x}$$

Construct the regression line using the formula:

$$\hat{y} = \hat{\beta}_2 x + \hat{\beta}_1$$

You may also compute the coefficients $\hat{\beta}_2$ and $\hat{\beta}_1$ more easily with the built-in MS Excel formula LINEST. To compute just the slope ($\hat{\beta}_2$) you may use the MS Excel formula SLOPE.
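As a hedged cross-check of the Excel result, the slope and intercept formulas of this step can be sketched in Python. The names `slope`, `intercept` and `predict` are illustrative choices made here, not notation from the lecture.

```python
# Absence (x) and score (y) pairs copied from Table 1.
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares from the shortcut formulas of Step 3.
ss_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Least squares estimates: slope first, then intercept.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar


def predict(x_new):
    """Fitted value on the regression line at x_new absences."""
    return slope * x_new + intercept


print(f"y_hat = {slope:.2f} x + {intercept:.2f}")
```

Rounded to two decimals, the printed coefficients agree with the regression equation quoted later in Step 10.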

Question 2. Write down the regression line with the estimated parameters $\hat{\beta}_2$ and $\hat{\beta}_1$.

Step 5. Find the goodness of fit of this line to the scatter plot (the sum of its squared errors, SSE).

How to do this:
Use the data processed in the previous steps and the formula:

$$SSE = SS_{yy} - \hat{\beta}_2 SS_{xy}$$

This number is not particularly important in itself, but we use it to compute the important statistic $s_\varepsilon$.

Question 3. What is the value of SSE?

Step 6. Compute the statistic $s_\varepsilon$.

How to do this:
Use the formula:

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}}$$
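Steps 5 and 6 can be sketched together in Python as below. As a sanity check that is not part of the assignment, the sketch also adds up the squared residuals directly, which should agree with the shortcut formula for SSE. All variable names are illustrative.

```python
import math

# Absence (x) and score (y) pairs copied from Table 1.
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# Shortcut formula: SSE = SSyy - slope * SSxy.
sse = ss_yy - slope * ss_xy

# Direct definition: sum of squared vertical distances to the line.
sse_direct = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Sample standard deviation of errors, with n - 2 degrees of freedom.
s_eps = math.sqrt(sse / (n - 2))

print(f"SSE (shortcut) = {sse:.4f}")
print(f"SSE (direct)   = {sse_direct:.4f}")
print(f"s_eps          = {s_eps:.4f}")
```

If the two SSE values differ by more than rounding noise, one of the sums in the table has most likely been mistyped.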

Question 4. What is the value of $s_\varepsilon$?

Step 7. Interpret the results by answering the questions (choose True or False):

Question 5. The statistic $s_\varepsilon$ estimates the standard deviation $\sigma$ of the normal random variable $\varepsilon$ in the model.

(a) True; (b) False.

Question 6. The meaning of the statistic $s_\varepsilon$ in the context of the problem is that, among all students with the same number of absences, the standard deviation of their scores on the final exam is about 12.6 points.

(a) True; (b) False.

Question 7. The size and sign of the slope $\hat{\beta}_2 = -5.23$ indicate that, for every class missed, students tend to score about 5.23 points lower on the final exam on average.

(a) True; (b) False.

Question 8. Similarly, for every two classes missed students tend to score on average 2 × 5.23 = 10.46 fewer points on the final exam, or about a letter grade worse on average.

(a) True; (b) False.

Question 9. Since 0 is in the range of x-values in the data set, the y-intercept also has meaning in this problem. It is an estimate of the average grade on the final exam of all students who have perfect attendance. The predicted average for such students is $\hat{\beta}_1 = 91.24$.

(a) True; (b) False.

Step 8. Examine the utility of the linear regression model. We can do this in two ways:

1) by computing the correlation coefficient r to see how strongly the number of absences x and the score y on the final exam are correlated:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$$

We may also use the Excel formula CORREL.

Question 10. Compute the correlation coefficient and interpret the result, choosing True or False: there is a moderate negative correlation.

(a) True; (b) False.

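2) by testing the null hypothesis

$H_0: \beta_2 = 0$ (the slope of the population regression line is zero, so x is not a good predictor of y) vs.

$H_a: \beta_2 < 0$ (the slope of the population regression line is negative, so final exam scores y go down as absences x go up).

Test the hypothesis $H_0: \beta_2 = 0$ vs. $H_a: \beta_2 < 0$ at significance level $\alpha = 0.05$. Because the alternative is one-sided, the critical value is $t_\alpha$ rather than $t_{\alpha/2}$. From the Appendix "Critical Values of t", with df = 26 − 2 = 24 (degrees of freedom), find the critical value $t_{cr}$, i.e.

$t_{0.05} = …$

Calculate the observed t statistic, where $B_0 = 0$ is the value of the slope under $H_0$:

$$t_{obs} = \frac{\hat{\beta}_2 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}}$$

Compare $t_{obs}$ and $t_{cr}$. If $|t_{obs}| > t_{cr}$ (for this left-tailed test, $t_{obs} < -t_{cr}$), we reject $H_0$.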

Question 11. Which hypothesis do the data support?

(a) $H_0$; (b) $H_a$.

Question 12. Choose True or False: The data provide sufficient evidence (at the 5% level of significance) to conclude that $\beta_2$ is negative, meaning that as the number of absences increases the average score on the final exam decreases.

(a) True; (b) False.
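The two checks of Step 8 can be sketched in Python as below. The sketch computes r and the observed t statistic but deliberately leaves the critical value as an argument, to be read from the Appendix table. Because the alternative hypothesis is that the slope is negative, the hypothetical helper `reject_left_tailed` rejects the null hypothesis when the observed statistic falls below minus the tabulated value. All names are illustrative assumptions.

```python
import math

# Absence (x) and score (y) pairs copied from Table 1.
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

n = len(x)
ss_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

slope = ss_xy / ss_xx
sse = ss_yy - slope * ss_xy
s_eps = math.sqrt(sse / (n - 2))

# 1) Linear correlation coefficient.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

# 2) Observed t statistic for H0: slope = 0.
hypothesized_slope = 0.0
t_obs = (slope - hypothesized_slope) / (s_eps / math.sqrt(ss_xx))


def reject_left_tailed(t_observed, t_critical):
    """Return True when H0 is rejected in favour of a negative slope.

    t_critical is the positive table value for df = n - 2.
    """
    return t_observed < -t_critical


print(f"r = {r:.4f}")
print(f"t_obs = {t_obs:.4f} with df = {n - 2}")
```

Calling `reject_left_tailed(t_obs, t_cr)` with the critical value taken from the Appendix then gives the decision asked for in Question 11.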

Step 9. Compute a confidence interval for $\beta_2$.

How to do this:
The value $\hat{\beta}_2 = -5.23$ gives a point estimate of how much one additional absence is reflected in the average score on the final exam: for each additional absence the average drops by about 5.23 points. We can widen this point estimate to a confidence interval for $\beta_2$. From the Appendix "Critical Values of t", with df = 26 − 2 = 24 degrees of freedom, find the critical value: $t_{\alpha/2} = t_{0.025} = …$

The 95% confidence interval for $\beta_2$ is:

$$\hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}}$$
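A minimal Python sketch of the interval formula above follows. The critical value is left as an argument so that it can be read from the Appendix table; the function name `slope_confidence_interval` is a hypothetical helper introduced here for illustration.

```python
import math

# Absence (x) and score (y) pairs copied from Table 1.
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

n = len(x)
ss_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

slope = ss_xy / ss_xx
s_eps = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))

# Standard error of the slope estimate.
se_slope = s_eps / math.sqrt(ss_xx)


def slope_confidence_interval(t_critical):
    """Return (lower, upper) limits: slope -/+ t_critical * se_slope."""
    margin = t_critical * se_slope
    return slope - margin, slope + margin


print(f"slope = {slope:.4f}, standard error = {se_slope:.4f}")
```

Passing the two-sided critical value for 24 degrees of freedom to `slope_confidence_interval` returns the limits asked for in Question 13.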

Question 13. Compute the 95% confidence interval for $\beta_2$.

Question 14. Interpret the results: We are 95% confident that, among all students who ever take this course, for each additional class missed the average score on the final exam goes down by between 2.99 and 7.47 points.

(a) True; (b) False.

Step 10. Estimate the average score on the final exam for students who have exactly 5 absences.

How to do this:
Use the least squares regression equation $\hat{y} = -5.23x + 91.24$ and substitute $x = 5$.

Question 15. What is the estimated average score on the final exam for students with 5 absences?

Step 11. Compute a 95% confidence interval for the average score on the final exam for all students with 5 absences.

How to do this:
Use the formula for the confidence interval, where $x_p = 5$ and $\hat{y}_p$ is the estimate from Step 10:

$$\hat{y}_p \pm t_{\alpha/2} \, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

Question 16. What is the 95% confidence interval for the average score on the final exam for students with 5 absences?

Question 17. Interpret the results: This confidence interval suggests that the true mean score on the final exam for all students who are absent from class exactly 5 times during the semester is likely to lie between the lower and upper limits of the confidence interval computed above.

(a) True; (b) False.

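Step 12. Compute a 95% prediction interval for the final exam score of a particular student who misses exactly 5 classes during the semester.

How to do this:
Use the formula for the prediction interval:

$$\hat{y}_p \pm t_{\alpha/2} \, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$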
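The point estimate of Step 10 and the two intervals of Steps 11 and 12 differ only by the extra 1 under the square root, so they can be sketched with one helper. The function `interval_at` and its `individual` flag are hypothetical names introduced here, and the critical value is again left as an argument to be read from the Appendix.

```python
import math

# Absence (x) and score (y) pairs copied from Table 1.
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
s_eps = math.sqrt((ss_yy - slope * ss_xy) / (n - 2))


def interval_at(x_p, t_critical, individual=False):
    """Return (lower, upper) limits around the fitted value at x_p.

    individual=False: confidence interval for the mean score.
    individual=True: prediction interval for one student's score.
    """
    y_p = slope * x_p + intercept
    extra = 1.0 if individual else 0.0
    half_width = t_critical * s_eps * math.sqrt(
        extra + 1.0 / n + (x_p - x_bar) ** 2 / ss_xx
    )
    return y_p - half_width, y_p + half_width


# Point estimate at 5 absences, using unrounded coefficients.
y_hat_5 = slope * 5 + intercept
print(f"estimated mean score at x = 5: {y_hat_5:.2f}")
```

Because the sketch uses unrounded coefficients, its point estimate can differ in the second decimal from a hand calculation with the rounded equation. For the same critical value, the prediction interval is always wider than the confidence interval for the mean.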

Question 18. What is the 95% prediction interval for the final exam score of a particular student who misses exactly 5 classes during the semester?

Question 19. Interpret the results (choose True or False): Just as the 95% confidence interval for the average score of all students with 5 absences gives real information, this interval also gives useful information about what an individual student’s final exam score might be.

(a) True; (b) False.

Step 13. The proportion of the variability in the scores of students on the final exam that is explained by the linear relationship between that score and the number of absences is estimated by the coefficient of determination, $R^2$. Compute $R^2$.

How to do this:
Since we have already computed r above, we easily find that:

$$R^2 = r^2$$
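As a cross-check, the same quantity can be obtained from SSE. Writing $\hat{\beta}_2$ for the estimated slope and combining the formula for r from Step 8 with the formulas of Steps 4 and 5, a short derivation (a sketch using only formulas already stated in this assignment) gives:

```latex
\begin{aligned}
R^2 = r^2 &= \frac{SS_{xy}^2}{SS_{xx}\, SS_{yy}}
           = \frac{\hat{\beta}_2\, SS_{xy}}{SS_{yy}}
           && \text{since } \hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}} \\
          &= \frac{SS_{yy} - SSE}{SS_{yy}}
           = 1 - \frac{SSE}{SS_{yy}}
           && \text{since } SSE = SS_{yy} - \hat{\beta}_2\, SS_{xy}
\end{aligned}
```

So $R^2$ is the share of the total variation $SS_{yy}$ that is not left over in SSE, and both routes should give the same percentage.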

Question 20. What is the coefficient of determination (in %)?

Question 21. Interpret the results: Although there is a significant correlation between attendance and performance on the final exam, less than half of the total variation of the exam scores in the sample is explained by the number of absences.

(a) True; (b) False.

Question 22. Interpret the results (choose the right answer):

(a) Class attendance alone is sufficient to explain students’ performance on exams.
(b) There are many factors besides attendance that bear on students’ performance on exams; other factors have to be included in the regression model too.

It is a good idea to attend classes

Deadline: October 9 (full mark); October 11 (50% of the total mark). After October 11 the homework will not be graded!

Answers are accepted in written form, with computations shown.

APPENDIX: Critical Values of t
