Simple Linear Regression



Simple Linear Regression and the Least Squares Line

Often two variables are related in some way. The annual revenue of a company should be related to the company's expenditure on advertising, for example.

Similarly, the age of a car and its retail price should be related.

Other examples include a student's mark in 1st year and their mark in 4th year, or the temperature of a chemical reaction and its completion time.

The branch of statistics that studies the relationship between two or more variables is known as regression analysis. The simplest form of regression analysis is simple linear regression, which is what we consider here.


We now describe the main points of the model we consider here.

1 We have two variables: Y (the dependent or response variable) and x (the independent or regressor variable).

2 For each fixed value of x, Y|x is itself a random variable - not all cars of the same age have the same retail price!

3 We are interested in seeing if the mean of this random variable is a function g(x) of x.

4 We shall only consider the simple case where this is a linear function:

g(x) = β0 + β1x

for real numbers β0, β1.


The simplest way to see if there is a relationship between two variables x and Y is to collect a sample of pairs of data (x1, y1), . . . , (xn, yn) and plot these.

[Scatter plot of y against x in which a linear relationship seems appropriate.]


Of course, two variables may be related even when a linear relationship does not exist.

[Scatter plot of y against x in which a nonlinear relationship seems more appropriate.]


1 For each fixed value x, Y is a random variable that satisfies E(Y|x) = β0 + β1x for some real numbers β0, β1. Thus there is a linear relationship between the value x and the mean value of the random variable Y.

2 The simple linear regression model for Y on x is:

Y = β0 + β1x + ε

where ε represents the error in the regression estimate of Y given x.

3 We shall assume that ε is a random variable which follows a normal distribution with mean 0 and variance σ².

4 β0, β1 and σ² are parameters of the model that we need to estimate from sample data; a small simulation of this model is sketched below.
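To make these assumptions concrete, the short sketch below simulates data from the model Y = β0 + β1x + ε with normally distributed errors. It is only an illustration: the parameter values, the range of x and the sample size are assumptions of mine, not values from these notes.

    # A minimal simulation of the model Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2).
    # All numerical values below are assumed purely for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma = 4.0, 3.0, 1.0       # assumed "true" parameter values
    x = rng.uniform(1.0, 5.0, size=50)        # fixed values of the regressor
    eps = rng.normal(0.0, sigma, size=50)     # normal errors: mean 0, variance sigma^2
    y = beta0 + beta1 * x + eps               # each Y|x has mean beta0 + beta1*x

Plotting y against x for such simulated data gives a scatter plot like the first one above, with points scattered about the line β0 + β1x.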


Least Squares Estimators

Given a set of pairs of data points (xi, yi), 1 ≤ i ≤ n, we wish to estimate the regression parameters β0, β1.

The estimators we use are the values β̂0, β̂1 of β0 and β1 that minimise the sum of squared deviations

∑ᵢ₌₁ⁿ (yi − β0 − β1xi)².


[Scatter plot with fitted line: the least squares line minimises the sum of squared vertical distances from the data points to the line.]


Given the data, the values of β̂0, β̂1 can be found using multi-variable calculus and are given by

β̂0 = ȳ − β̂1x̄,    β̂1 = Sxy / Sxx.


The notation used above denotes

Sxx = ∑ᵢ₌₁ⁿ (xi − x̄)²,    Sxy = ∑ᵢ₌₁ⁿ (xi − x̄)(yi − ȳ).

β̂0 and β̂1 are known as the least squares estimators of the intercept and the slope.

To estimate the third parameter of the model, σ², we use the error sum of squares:

SSE := ∑ᵢ₌₁ⁿ (yi − β̂0 − β̂1xi)².
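As a minimal sketch of these formulas in code, the snippet below computes Sxx, Sxy, the least squares estimates and SSE. The arrays x and y are placeholders of mine; any paired numeric data of equal length will do.

    # Least squares estimates computed directly from the formulas above.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # placeholder regressor data
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])    # placeholder response data

    x_bar, y_bar = x.mean(), y.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    Sxy = np.sum((x - x_bar) * (y - y_bar))

    beta1_hat = Sxy / Sxx                       # estimated slope
    beta0_hat = y_bar - beta1_hat * x_bar       # estimated intercept
    SSE = np.sum((y - beta0_hat - beta1_hat * x) ** 2)   # error sum of squares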


Properties of the Estimators

The key properties of the estimators β̂0 and β̂1 introduced above are summarised below.

β̂0 has mean β0 and variance σ²(1/n + x̄²/Sxx).

β̂1 has mean β1 and variance σ²/Sxx.

SSE has mean (n − 2)σ².

We shall use the sample statistic

σ̂² = SSE / (n − 2)

to estimate σ².
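Continuing the sketch given after the least squares formulas (the names x, x_bar, Sxx and SSE are carried over from it and are my own), σ̂² and the estimated standard deviations of β̂0 and β̂1 follow directly from the variance formulas above.

    # Estimate of sigma^2 and estimated standard deviations of the estimators,
    # reusing x, x_bar, Sxx and SSE from the previous sketch.
    import numpy as np

    n = len(x)
    sigma2_hat = SSE / (n - 2)                                     # estimate of sigma^2
    se_beta1 = np.sqrt(sigma2_hat / Sxx)                           # estimated sd of beta1_hat
    se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / Sxx))  # estimated sd of beta0_hat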


Hypothesis Test for Simple Linear Regression

We can now use the various estimators introduced above to perform hypothesis tests on the values of the regression parameters.

Hypothesis Tests on the Slope

1 We wish to test the null hypothesis H0: β1 = β1,0 against the alternative hypothesis H1: β1 ≠ β1,0.

2 If we take a sample of n data pairs (xi, yi) then

t0 = (β̂1 − β1,0) / √(σ̂²/Sxx)

follows a t-distribution with n − 2 degrees of freedom.

3 Therefore we would reject H0 at a significance level α if t0 < −tn−2,α/2 or t0 > tn−2,α/2; a code sketch of this test is given below.
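A minimal sketch of this two-sided test, written as a function of the summary quantities introduced above; the function and argument names are my own, and scipy's t distribution supplies the critical value tn−2,α/2.

    # Two-sided t-test of H0: beta1 = beta1_0 against H1: beta1 != beta1_0.
    from scipy import stats

    def slope_t_test(beta1_hat, beta1_0, sigma2_hat, Sxx, n, alpha=0.05):
        t0 = (beta1_hat - beta1_0) / (sigma2_hat / Sxx) ** 0.5   # test statistic
        t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)        # t_{n-2, alpha/2}
        reject = (t0 < -t_crit) or (t0 > t_crit)                 # rejection rule
        return t0, t_crit, reject

Testing the significance of regression, described next, corresponds to calling this with beta1_0 = 0.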


An important special case of hypothesis testing on the slope is testing the null hypothesis

H0 : β1 = 0

against the alternative hypothesis

H1 : β1 ≠ 0.

This is usually referred to as testing the significance of regression. It is equivalent to testing whether or not the two variables are linearly correlated.


Hypothesis Tests on the Intercept

1 We wish to test the null hypothesis H0: β0 = β0,0 against the alternative hypothesis H1: β0 ≠ β0,0.

2 If we take a sample of n data pairs (xi, yi) then

t0 = (β̂0 − β0,0) / √(σ̂²(1/n + x̄²/Sxx))

follows a t-distribution with n − 2 degrees of freedom.

3 Therefore we would reject H0 at a significance level α if t0 < −tn−2,α/2 or t0 > tn−2,α/2; a code sketch is given below.
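The analogous sketch for the intercept (function and argument names again my own); only the standard error in the denominator changes.

    # Two-sided t-test of H0: beta0 = beta0_0 against H1: beta0 != beta0_0.
    from scipy import stats

    def intercept_t_test(beta0_hat, beta0_0, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
        se0 = (sigma2_hat * (1.0 / n + x_bar ** 2 / Sxx)) ** 0.5   # estimated sd of beta0_hat
        t0 = (beta0_hat - beta0_0) / se0                           # test statistic
        t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)
        return t0, t_crit, (t0 < -t_crit) or (t0 > t_crit)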


Confidence Intervals for Simple Linear Regression

It is often necessary to obtain confidence intervals for the regression parameters and for predictions on the mean or future value of the dependent variable given some value of the independent variable.

Using the facts concerning the distributions of the statistics

t0 = (β̂1 − β1,0) / √(σ̂²/Sxx)

and

t0 = (β̂0 − β0,0) / √(σ̂²(1/n + x̄²/Sxx)),

it is possible to construct confidence intervals for β0 and β1.


Confidence Intervals for Slope and Intercept

Definition

A 100(1 − α)% confidence interval for the slope β1 in simple linear regression is given by

β̂1 − tn−2,α/2 √(σ̂²/Sxx) ≤ β1 ≤ β̂1 + tn−2,α/2 √(σ̂²/Sxx).

Definition

A 100(1 − α)% confidence interval for the intercept β0 in simple linear regression is given by

β̂0 − tn−2,α/2 √(σ̂²(1/n + x̄²/Sxx)) ≤ β0 ≤ β̂0 + tn−2,α/2 √(σ̂²(1/n + x̄²/Sxx)).
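A minimal sketch of both intervals, again as functions of the summary quantities (names my own). Each returns the lower and upper endpoints of the 100(1 − α)% interval.

    # Confidence intervals for the slope and the intercept.
    from scipy import stats

    def slope_ci(beta1_hat, sigma2_hat, Sxx, n, alpha=0.05):
        half = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2) * (sigma2_hat / Sxx) ** 0.5
        return beta1_hat - half, beta1_hat + half

    def intercept_ci(beta0_hat, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
        se0 = (sigma2_hat * (1.0 / n + x_bar ** 2 / Sxx)) ** 0.5
        half = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2) * se0
        return beta0_hat - half, beta0_hat + half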


Confidence Intervals for the Mean Response

If we are given a value x0 of X, then the mean value of the response variable, µY|x0, is

µY|x0 = β0 + β1x0.

Using our sample data, we estimate this with

µ̂Y|x0 = β̂0 + β̂1x0.

The variance of µ̂Y|x0 is

σ²(1/n + (x0 − x̄)²/Sxx).


Using this fact and using σ̂² to estimate σ², a 100(1 − α)% confidence interval for the mean response µY|x0 is

µ̂Y|x0 − tn−2,α/2 √(σ̂²(1/n + (x0 − x̄)²/Sxx)) ≤ µY|x0 ≤ µ̂Y|x0 + tn−2,α/2 √(σ̂²(1/n + (x0 − x̄)²/Sxx)).
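As a sketch, the interval for the mean response at a given x0 can be computed as below (function and argument names my own).

    # Confidence interval for the mean response at a given x0.
    from scipy import stats

    def mean_response_ci(x0, beta0_hat, beta1_hat, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
        fit = beta0_hat + beta1_hat * x0                                   # estimated mean response
        se_fit = (sigma2_hat * (1.0 / n + (x0 - x_bar) ** 2 / Sxx)) ** 0.5
        half = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2) * se_fit
        return fit - half, fit + half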


Confidence Intervals for Prediction Values

One of the major uses of simple linear regression is to predict future values of the variable Y given a value x0 of the regressor variable X.

Writing Y|x0 for this random variable, it can be shown that the variance of Y|x0 − µ̂Y|x0 is given by

σ²(1 + 1/n + (x0 − x̄)²/Sxx).


From this we can construct confidence intervals for the future value of the response variable given the value x0 of the regressor variable. A 100(1 − α)% confidence interval for the value of the response Y|x0 is

µ̂Y|x0 − tn−2,α/2 √(σ̂²(1 + 1/n + (x0 − x̄)²/Sxx)) ≤ Y|x0 ≤ µ̂Y|x0 + tn−2,α/2 √(σ̂²(1 + 1/n + (x0 − x̄)²/Sxx)).
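The corresponding sketch for a future observation differs from the mean-response interval only by the extra 1 under the square root (names my own).

    # Prediction-type confidence interval for a future response at a given x0.
    from scipy import stats

    def prediction_interval(x0, beta0_hat, beta1_hat, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
        fit = beta0_hat + beta1_hat * x0
        se_pred = (sigma2_hat * (1.0 + 1.0 / n + (x0 - x_bar) ** 2 / Sxx)) ** 0.5
        half = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2) * se_pred
        return fit - half, fit + half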


Example

An Internet service provider is interested in modelling the relationship between the total monthly download volume of a user (in GB) (Y) and the largest single file download of a user (in MB) (X). A sample of 10 users is selected and the results for this sample are recorded below.

User   Monthly Download (GB)   Largest File Download (MB)
1      17.2                    3.4
2      11.8                    2.3
3       8.6                    1.3
4      18.7                    4.6
5      14.4                    3.4
6      13.8                    2.7
7       9.1                    1.8
8      12.6                    2.6
9      16.9                    3.3
10     13.5                    2.8


(i) Fit the simple linear regression model to these data using the method of least squares.

(ii) Test for the significance of regression at a significance level of 5%.

(iii) Test the hypothesis that β1 = 3.5 at a 5% level of significance.

(iv) Construct a 95% confidence interval for the mean monthly download volume of users whose largest single download is 3 MB.

(v) Construct a 95% confidence interval for the monthly download volume of a user whose largest single download is 3 MB.


(i) We need to calculate the estimates β̂0 and β̂1. For this we need Sxx, Sxy, x̄ and ȳ.

x̄ = 2.82, ȳ = 13.66, Sxx = 7.76, Sxy = 26.54.

Thus

β̂1 = Sxy/Sxx = 26.54/7.76 = 3.42

β̂0 = ȳ − β̂1x̄ = 13.66 − 3.42(2.82) = 4.02.

The fitted regression line is therefore ŷ = 4.02 + 3.42x.


(ii) We need to calculate σ̂². First, we need SSE. Using the values of β̂0 and β̂1 above, we find that SSE = 9.2. Thus

σ̂² = 9.2/8 = 1.15.

1 H0: β1 = 0;
2 H1: β1 ≠ 0;
3 α = 0.05;
4 Test statistic:

t0 = β̂1 / √(σ̂²/Sxx)

5 Reject H0 if t0 < −t8,0.025 or t0 > t8,0.025 (t8,0.025 = 2.306).
6 For our data, the value of t0 is 3.42/0.385 = 8.88.
7 We reject H0 at the 5% level of significance: the regression is significant.


(iii)

1 H0: β1 = 3.5;
2 H1: β1 ≠ 3.5;
3 α = 0.05;
4 Test statistic:

t0 = (β̂1 − 3.5) / √(σ̂²/Sxx)

5 Reject H0 if t0 < −t8,0.025 or t0 > t8,0.025.
6 For our data, the value of t0 is (3.42 − 3.5)/0.385 = −0.21.
7 We cannot reject H0 at the 5% level of significance.


(iv) The estimate of the mean response is µ̂Y|3 = β̂0 + β̂1(3) = 4.02 + 3.42(3) = 14.28. Thus a 95% confidence interval for the mean monthly download volume when x0 = 3 is

14.28 − 2.306 √(1.15(1/10 + (3 − 2.82)²/7.76)) ≤ µY|3 ≤ 14.28 + 2.306 √(1.15(1/10 + (3 − 2.82)²/7.76))

which is

13.48 ≤ µY|3 ≤ 15.08.


(v) Similarly, a 95% confidence interval for the value of the response Y when x0 = 3 is

14.28 − 2.306 √(1.15(1 + 1/10 + (3 − 2.82)²/7.76)) ≤ Y|3 ≤ 14.28 + 2.306 √(1.15(1 + 1/10 + (3 − 2.82)²/7.76))

which is

11.68 ≤ Y|3 ≤ 16.88.
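As a check, the self-contained sketch below runs the whole example in code using the data from the table. Up to rounding of the intermediate values (the hand calculation above rounds β̂1, σ̂² and the fitted value before substituting), the printed quantities should agree with the results obtained above.

    # Worked example: fit, significance test and intervals for the ISP data.
    import numpy as np
    from scipy import stats

    x = np.array([3.4, 2.3, 1.3, 4.6, 3.4, 2.7, 1.8, 2.6, 3.3, 2.8])          # largest file (MB)
    y = np.array([17.2, 11.8, 8.6, 18.7, 14.4, 13.8, 9.1, 12.6, 16.9, 13.5])  # monthly volume (GB)
    n = len(x)

    x_bar, y_bar = x.mean(), y.mean()
    Sxx = np.sum((x - x_bar) ** 2)                  # approx. 7.76
    Sxy = np.sum((x - x_bar) * (y - y_bar))         # approx. 26.54
    beta1_hat = Sxy / Sxx                           # approx. 3.42
    beta0_hat = y_bar - beta1_hat * x_bar           # approx. 4.01

    SSE = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
    sigma2_hat = SSE / (n - 2)                      # approx. 1.15
    t_crit = stats.t.ppf(0.975, df=n - 2)           # approx. 2.306

    # (ii) significance of regression: t0 is approx. 8.9, so H0: beta1 = 0 is rejected.
    t0 = beta1_hat / np.sqrt(sigma2_hat / Sxx)

    # (iv) and (v): intervals at x0 = 3
    x0 = 3.0
    fit = beta0_hat + beta1_hat * x0
    half_mean = t_crit * np.sqrt(sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / Sxx))
    half_pred = t_crit * np.sqrt(sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / Sxx))
    print(beta1_hat, beta0_hat, sigma2_hat, t0)
    print(fit - half_mean, fit + half_mean)         # approx. (13.48, 15.07)
    print(fit - half_pred, fit + half_pred)         # approx. (11.68, 16.87)

Any small discrepancies in the final decimal place come from rounding β̂1 and σ̂² before substituting, as in the hand calculation.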
