Assignment in Statistics and Research Methods
Indian Retail Stores

Winter 2013/2014

Kirsten Marie Simonsen – XXXXXX-XXXX
Kia Slæbæk Jensen – XXXXXX-XXXX
Jasmin Sharzad – XXXXXX-XXXX
Malene Louise Thomasen – XXXXXX-XXXX

CTU: 25,230


BSc IBP Statistics and Research Methods Winter 13/14

Table of Contents

Question 1
Question 2
    Two Group Comparison
    Three Group Comparison
Question 3
    Two Group Comparison
    Three Group Comparison
Question 4
Question 5
Question 6
    a) Additive Model
    b) Fulfillment of Model Assumptions
    c) Statistical Significance of Predictors
Question 7
    a) Logistic Regression Model
    b) Multiple Regression Model
    c) Significance: Perception


Question 1

This question concerns a description of the variables in the data set. Furthermore, the minimum and

maximum store size will be calculated in both square feet and square meter.

Variables are the characteristics observed in a study. Thus, the variables in the data file from the research

papers: “Competition and labor productivity in India’s retail stores” and “Are labor regulations driving

computer usage in India’s retail stores?” are LogSize, Competition, Perception, Efficiency3yr, Efficiency,

Logsales3yr, Logsales, Store type, City and Computer use.

A variable can be either quantitative or categorical. The quantitative variables in the data set are

LogSize, Competition, Perception, Efficiency3yr, Efficiency, Logsales3yr and Logsales, since their values can

take any value within a certain interval. Furthermore the variables are continuous. LogSize, Logsales3yr

and Logsales have been converted into base 10 logarithm in order to diminish the spread of the data set.

The categorical variables in the data set are Store type, City and Computer use, since each observation falls into exactly one of a fixed set of categories.

Descriptive statistics for both kinds of variable rely on graphs and numerical summaries that capture their main features. For quantitative variables, the key features are the center and the variability, which is why a histogram is usually used to describe the data. For categorical variables, by contrast, the key feature is the relative number of observations in each category, so a bar graph is commonly used.

Minimum and maximum

We will now describe the minimum and maximum of the logSize variable. The store size is first calculated in square feet and then converted into square metres.

By constructing a histogram in SAS JMP (see appendix 1), we get the maximum and minimum values of the logSize variable. Since the data are given as base-10 logarithms, $x = \log_{10} y$, the store size in square feet is obtained as $y = 10^{x}$:

Min: $10^{1.079} = 11.995$ sq. ft.

Max: $10^{4.176} = 14996.848$ sq. ft.

These values can be converted into square meter by using the following formula:


$$\text{m}^2 = \frac{\text{ft}^2}{10.764}$$

Inserting the above results into the formula gives:

Min: $\frac{11.995}{10.764} = 1.114$ sq. m

Max: $\frac{14996.848}{10.764} = 1393.241$ sq. m

Thus it can be concluded that there is a wide spread in the LogSize variable, meaning that there is a large

difference in the store sizes.
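The back-transformation and unit conversion above can be checked with a few lines of Python. This is only a sketch: the function name `log10_sqft_to_sqm` is ours, and the inputs are the minimum and maximum logSize values quoted in the text.

```python
SQFT_PER_SQM = 10.764  # conversion factor used in the text

def log10_sqft_to_sqm(log_size):
    """Return (square feet, square metres) for a base-10 log store size."""
    sqft = 10 ** log_size          # undo the log10 transformation
    return sqft, sqft / SQFT_PER_SQM

for label, log_size in [("min", 1.079), ("max", 4.176)]:
    sqft, sqm = log10_sqft_to_sqm(log_size)
    print(f"{label}: {sqft:.3f} sq ft = {sqm:.3f} sq m")
```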

Question 2

This question regards a comparison of the store types in terms of their level of efficiency.

We will begin with constructing a significance test and a confidence interval for the two group comparison,

followed by a significance test and a confidence interval for the three group comparison.

Two Group Comparison

Significance Test

1. Assumptions

Since we are comparing means, the first assumption of the significance test is that the response variable

has to be quantitative, which is the case of the efficiency variable.

The second assumption suggests that the sample must be collected using randomization. The sampling

methodology for Enterprise Surveys is stratified random sampling. In a stratified random sample, all

population units are grouped within homogeneous groups and simple random samples are selected within

each group.

The third assumption states that each group must have an approximately normal distribution. By looking at

the histograms (see appendix 2) for each group, it can be observed that this is the case of both store types.

2. Hypotheses

The null hypothesis assumes that the two means are equal, and thus that there is no difference in the level of efficiency: $H_0: \mu_1 = \mu_2$

The two-sided alternative hypothesis states that the means differ from one another, thereby implying an association between efficiency and store type: $H_a: \mu_1 \neq \mu_2$


3. Test Statistic

The test statistic measures the distance between the point estimate and the null hypothesis value, expressed as a number of standard errors. The test statistic has approximately a t distribution if H0 is true.

The following formula is used to construct the test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$$

The standard error can be calculated by using the following formula:

$$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

The standard error is then derived to be:

$$se = \sqrt{\frac{0.450465^2}{68} + \frac{0.45865^2}{278}} = 0.061161988 \approx 0.06$$

And the t test statistic is calculated to be:

$$t = \frac{(5.7624265 - 5.3656043) - 0}{0.061161988} = 6.488052677 \approx 6.488$$

4. P-Value

The p-value is the probability that the test statistic takes the observed value or a value more extreme, assuming H0 is true. Here the two-tail probability from the t distribution is used, and it has to be smaller than the significance level of α=0.05 if H0 is to be rejected. We use Table B on page A-3. Since df is larger than 100, we use the row for infinite degrees of freedom. With a t test statistic of 6.488, the table shows that the right-tail probability is smaller than 0.001. The two-sided P-value must therefore be smaller than 0.002.

5. Conclusion

Seeing as the p-value is smaller than the significance level of 0.05 we can reject H0. Therefore we can

conclude that there is a difference in efficiency between the two store types.


Confidence Interval

Since the response variable is quantitative, we are comparing means. Using SAS JMP (see appendix 2) to

obtain the means, we have constructed a confidence interval for the difference in efficiency between the

means of the store types Traditional and Consumer Durables.

The difference in means is estimated with the following formula for a confidence interval:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{0.025} \cdot se$$

Here the degrees of freedom for the t-score $t_{0.025}$ are df = n - 1 = (68+278) - 1 = 345. Since the degrees of freedom exceed 100, we use $t_{0.025} = 1.96$.

The standard error (se) is calculated to be:

$$se = \sqrt{[se(\bar{x}_1)]^2 + [se(\bar{x}_2)]^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{0.450465^2}{68} + \frac{0.45865^2}{278}} = 0.061161988 \approx 0.06$$

The confidence interval is then:

$$(5.7624265 - 5.3656043) \pm 1.96 \cdot 0.06 = 0.3968222 \pm 0.1176 = (0.2792222; 0.5144222) \approx (0.279; 0.514)$$

Thus we can be 95% confident that the population mean difference in efficiency (μ1 - μ2) between Consumer Durable Stores and Traditional FMCGs falls between 0.279 and 0.514. We can then infer that the efficiency in the Consumer Durable Stores is between 0.279 and 0.514 points larger than the efficiency in the Traditional FMCGs measured on a scale from 1 to 4.
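As a cross-check, the two-group test statistic and confidence interval can be recomputed from the summary statistics quoted in this section. The sketch below makes two assumptions of its own: the function name `two_sample_summary` is hypothetical, and the two-sided p-value uses the standard normal approximation, which is reasonable here because df is large. Because it keeps the unrounded standard error, the interval comes out slightly wider than the one above, which rounds se to 0.06.

```python
import math

def two_sample_summary(mean1, s1, n1, mean2, s2, n2, crit=1.96):
    """Unpooled two-sample comparison of means from summary statistics."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = mean1 - mean2
    t = diff / se
    # Two-sided p-value via the normal approximation (assumption: large df).
    p_value = math.erfc(abs(t) / math.sqrt(2))
    ci = (diff - crit * se, diff + crit * se)
    return se, t, p_value, ci

# Consumer Durable Stores (n=68) vs. Traditional FMCGs (n=278)
se, t, p_value, ci = two_sample_summary(5.7624265, 0.450465, 68,
                                        5.3656043, 0.45865, 278)
print(f"se={se:.6f}, t={t:.3f}, p={p_value:.2e}")
print(f"95% CI: ({ci[0]:.3f}; {ci[1]:.3f})")
```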

Three Group Comparison

Significance Test

When comparing several means, the analysis of variance method is used by constructing a one-way ANOVA test (see appendix 3). It tests whether efficiency is independent of the three store types.

1. Assumptions

The first assumption states that the distributions for the groups are normal with the same standard deviation in each group. The standard deviations are not completely identical, with a difference of 0.064 between the largest and the smallest, but since the largest is less than twice the smallest, the general formula can be used for calculation. The second assumption is randomization, which is fulfilled (see two group comparison).


2. Hypotheses

The null hypothesis states that the means for all groups are equal, so that there is no difference in the level of efficiency between the store types: $H_0: \mu_1 = \mu_2 = \dots = \mu_g$

The alternative hypothesis states that at least two of the means are unequal, thus suggesting an association between efficiency and store type.

3. Test Statistic

The ANOVA test statistic has an F distribution with two degrees of freedom values and a mean of approximately 1 when H0 is true.

F test statistic:

$$F = \frac{\text{between-groups variability}}{\text{within-groups variability}} = \frac{5.93626}{0.20309} = 29.2303 \approx 29.23$$

4. P-Value

The p-value is the right-tail probability from the F distribution with

$$df_1 = g - 1 = 3 - 1 = 2$$

$$df_2 = n - g = 388 - 3 = 385$$

Using Table D on page A-5, we find that an F test statistic of about 3 or above corresponds to a right-tail probability of 0.05 or smaller. Since the F test statistic of 29.23 is far larger than 3, the P-value must be much smaller than the significance level of 0.05.

5. Conclusion

JMP reports the P-value to be smaller than 0.001, therefore we can reject the null hypothesis. Thus it can be concluded that at least two of the groups have different means.
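The F ratio and its degrees of freedom follow directly from the mean squares reported by JMP. The short sketch below only restates that arithmetic; the variable names are ours, and the critical value of 3.0 is the approximate table value mentioned above rather than an exact quantile.

```python
# Mean squares as quoted in the text (from the JMP ANOVA table)
ms_between = 5.93626
ms_within = 0.20309

g, n = 3, 388                 # number of groups and sample size used in the text
df1, df2 = g - 1, n - g       # numerator and denominator degrees of freedom

f_stat = ms_between / ms_within
approx_critical = 3.0         # approximate 0.05 critical value read from Table D

print(f"F = {f_stat:.2f} with df = ({df1}, {df2})")
print("reject H0" if f_stat > approx_critical else "do not reject H0")
```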

Confidence Interval

The confidence interval is used to estimate the differences between population means. The following formula is used to construct a 95% confidence interval for the difference between the means of two groups:

$$\bar{y}_i - \bar{y}_j \pm t_{0.025} \, s \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$$

The t-score has df = N - g.

We will use the formula to obtain the confidence intervals for the differences in means between the three store types.


Consumer Durable Stores (CDS) vs. Modern Format Stores (MFS)

The common standard deviation is obtained from the mean square error reported by JMP (see appendix 4):

$$s = \sqrt{MS} = \sqrt{0.20309} = 0.4506550787$$

$$\bar{y}_{CDS} - \bar{y}_{MFS} \pm t_{0.025} \, s \sqrt{\frac{1}{n_{CDS}} + \frac{1}{n_{MFS}}} = 5.7624265 - 5.7359535 \pm 1.96 \cdot 0.4506550787 \sqrt{\frac{1}{68} + \frac{1}{43}} = (-0.1456239392; 0.1985699392) \approx (-0.146; 0.199)$$

Since the 95% confidence interval (-0.146; 0.199) contains zero, we cannot reject that the mean of CDS equals the mean of MFS.

MFS vs. Traditional FMCGs (TFMCG)

$$\bar{y}_{MFS} - \bar{y}_{TFMCG} \pm t_{0.025} \, s \sqrt{\frac{1}{n_{MFS}} + \frac{1}{n_{TFMCG}}} = 5.7359535 - 5.3656043 \pm 1.96 \cdot 0.4506550787 \sqrt{\frac{1}{43} + \frac{1}{278}} = (0.225606646; 0.515091754) \approx (0.226; 0.515)$$

Seeing as the confidence interval does not contain zero, we can conclude that the two means differ, and that the efficiency of the Modern Format Stores is between 0.226 and 0.515 larger than the efficiency of the Traditional FMCGs.

CDS vs. Traditional FMCGs

We have already constructed a confidence interval for the difference between Consumer Durable Stores and Traditional FMCGs, and concluded with 95% confidence that the difference between the population means of the two store types lies in the interval (0.279; 0.514). Here it is also evident that the Consumer Durable Stores are more efficient than the Traditional FMCGs.

Since the confidence interval for the Consumer Durable Stores and the Modern Format Stores contains zero, we cannot reject that their population means are equal, so they might be equally efficient. However, we can conclude that the Traditional FMCGs are statistically significantly less efficient than the Consumer Durable Stores and the Modern Format Stores.
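The pairwise intervals can be reproduced in one loop from the group means, group sizes and the mean square error quoted above. This is a sketch with names of our own choosing (`groups`, `pairwise_ci`). Note that it applies the pooled standard deviation to all three pairs, so the CDS vs. TFMCG interval it prints differs slightly from (0.279; 0.514), which was taken from the two-group analysis.

```python
import math
from itertools import combinations

MSE = 0.20309  # mean square error from the ANOVA table quoted in the text

# (sample mean, sample size) per store type, as quoted in the text
groups = {
    "CDS": (5.7624265, 68),
    "MFS": (5.7359535, 43),
    "TFMCG": (5.3656043, 278),
}

def pairwise_ci(name1, name2, crit=1.96):
    """95% CI for the difference in means using the pooled standard deviation."""
    (mean1, n1), (mean2, n2) = groups[name1], groups[name2]
    margin = crit * math.sqrt(MSE) * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin

for a, b in combinations(groups, 2):
    low, high = pairwise_ci(a, b)
    verdict = "contains 0" if low <= 0 <= high else "excludes 0"
    print(f"{a} - {b}: ({low:.3f}; {high:.3f}) {verdict}")
```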


Question 3

This question considers a comparison of the probability of computer use in relation to store type.

Two Group Comparison

Significance Test

1. Assumptions

Since this question concerns a comparison of proportions, the first assumption states that the response

variable for the two groups has to be categorical. The second assumption is concerned with the sample size,

which has to be large enough so that there are at least five successes and five failures for each of the two

groups. The third assumption states that the data must be collected by using randomization. All the

assumptions are fulfilled.

2. Hypotheses

The null hypothesis assumes that the two proportions are equal, thereby suggesting that there is no association: $H_0: p_1 = p_2$

The two-sided alternative hypothesis assumes that the two proportions are different from one another, thereby suggesting that there is an association between computer use and store type: $H_a: p_1 \neq p_2$

3. Test Statistic

When comparing proportions we use a z test statistic, which describes the distance between the sample estimate and the null hypothesis value, measured in the number of standard errors. We obtained the proportion of computer use in the different store types by using SAS JMP (see appendix 5).

$$se = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$

$$se = \sqrt{\frac{0.0537(1 - 0.0537)}{71} + \frac{0.0311(1 - 0.0311)}{283}} = 0.028674012 \approx 0.03$$

$$z = \frac{(p_1 - p_2) - 0}{se}$$

$$z = \frac{(0.0537 - 0.0311) - 0}{0.028674012} \approx 0.79$$


4. P-Value

The P-value is the two-tail probability of a standard normal distribution, which can be obtained from Table A on page A-2 in the book.

For z = 0.79 we obtain a cumulative probability of 0.7852. This is the area under the standard normal curve to the left of z, whereas we are interested in the probability to the right of z, so we subtract the cumulative probability from 1. Because we need the two-tail probability of values at least as extreme as the observed z test statistic, we then multiply by 2 and get the following p-value:

$$(1 - 0.7852) \cdot 2 = 0.4296$$

5. Conclusion

Since the P-value is larger than the significance level, we cannot reject the null hypothesis. Thus there is not necessarily an association between computer use and store type for Traditional FMCGs and Consumer Durable Stores.

NOTE

We made a serious mistake when handing in the assignment, which we only noticed afterwards: we should have used the pooled estimate. With the pooled estimate we get a new z-score of 6.19 (p < 0.00001).

The calculations are:

pooled estimate = (19+11)/(71+283) = 0.084746

se = sqrt(0.084746 * (1 - 0.084746) * (1/71 + 1/283)) = 0.036967

z = (0.2676 - 0.0389 - 0) / 0.036967 = 6.19

The P-value is extremely low, so we reject H0. Our conclusion is therefore that there IS an association between computer use and store type.
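The corrected test in the note can be reproduced from the counts it quotes (19 of 71 Consumer Durable Stores and 11 of 283 Traditional FMCGs with a computer). The sketch below assumes a standard pooled two-proportion z test; the function name `pooled_z_test` is ours, and the two-sided p-value is obtained from `math.erfc`.

```python
import math

def pooled_z_test(x1, n1, x2, n2):
    """Two-proportion z test using the pooled estimate under H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

z, p_value = pooled_z_test(19, 71, 11, 283)
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```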

Confidence Interval

We will now construct a confidence interval, using the following formulas for the confidence interval and the standard error.

$$(p_1 - p_2) \pm z_{0.025} \cdot se$$

$$se = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$

$$se = \sqrt{\frac{0.0537(1 - 0.0537)}{71} + \frac{0.0311(1 - 0.0311)}{283}} = 0.028674012 \approx 0.03$$

$$(0.0537 - 0.0311) \pm 1.96 \cdot 0.03 = 0.0226 \pm 0.0588 = (-0.0362; 0.0814)$$

Since the confidence interval contains zero, it is plausible that $(p_1 - p_2)$ equals zero, which means that the population proportions might be equal. This indicates that there is not necessarily an association between computer use and store type for Traditional FMCGs and Consumer Durable Stores.

NOTE

Given the previous note, the confidence interval above is also wrong, as we should have used the conditional proportions (row % in JMP). The corrected interval is (0.123; 0.334), which does NOT contain 0.

Calculations:

se = sqrt(0.2676 * (1 - 0.2676) / 71 + 0.0389 * (1 - 0.0389) / 283) = 0.0538

The CI is then:

(0.2676 - 0.0389) ± 1.96 * 0.0538 = (0.123; 0.334)

The CI tells us that the probability of having a computer is approximately 12 to 33 percentage points higher in a Consumer Durable Store than in a Traditional FMCG.
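The corrected interval uses the unpooled standard error, because no null hypothesis of equal proportions is assumed when estimating the difference. A minimal sketch, again based on the counts from the note and with a function name of our own (`prop_diff_ci`):

```python
import math

def prop_diff_ci(x1, n1, x2, n2, crit=1.96):
    """95% CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - crit * se, diff + crit * se

# 19 of 71 Consumer Durable Stores vs. 11 of 283 Traditional FMCGs
low, high = prop_diff_ci(19, 71, 11, 283)
print(f"95% CI for p1 - p2: ({low:.3f}; {high:.3f})")
```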

Three Group Comparison

Chi Squared Test

This is a test of independence, which compares the observed counts with the expected counts by taking the difference between the two, squaring it, and summing over all cells relative to the expected counts.

When H0 is true, thus assuming independence, the chi-squared statistic tends to be small, whereas an association produces a large value.

$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

The expected cell count can be calculated by using:

$$\text{expected cell count} = \frac{(\text{row total}) \cdot (\text{column total})}{\text{total sample size}}$$

SAS JMP (see appendix 6) provides the Chi Squared value (Pearson) to be 116.147. With df = (3 - 1)(2 - 1) = 2, the 0.05 critical value of the chi-squared distribution is 5.99, so this is a very large value and we can conclude that there is an association between the three store types and computer use. Seeing as our original two group comparison between Traditional FMCGs and Consumer Durable Stores gave us a result where there was not necessarily an association, there must be a bigger chance of having a computer in a Modern Format Store.
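The mechanics of the statistic can be illustrated with a small function that computes the expected counts and sums the cell contributions. The Modern Format Store counts are not given in the text, so the sketch below uses only the two rows whose counts appear in the note to Question 3 (19 of 71 and 11 of 283). It therefore does not reproduce the three-group value of 116.147 and is meant purely as an illustration; the function name `chi_squared` is ours.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and df for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: CDS, TFMCG; columns: computer, no computer (two known groups only)
two_group_table = [[19, 71 - 19], [11, 283 - 11]]
stat, df = chi_squared(two_group_table)
print(f"X^2 = {stat:.2f} with df = {df}")
```

For a 2x2 table the statistic equals the square of the pooled z statistic, which gives a quick consistency check against the corrected two-group test.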

Question 4

This question concerns a comparison between the two variables Efficiency3yr and Efficiency.

In order to compare the two variables, we construct a new column in SAS JMP called Efficiency-

Efficiency3yr, for which we get the change in efficiency. This will be followed by a significance test and a

confidence interval, in order to confirm whether or not there has been a significant change in efficiency.

Significance Test

1. Assumptions

Three assumptions apply to the comparison of means. The first assumption states that the variables need

to be quantitative, which is the case of the efficiency variables. The second assumption states that the data

set has to be collected using randomization. The third assumption states that the population distribution

has to be approximately normal. All the assumptions are fulfilled.

2. Hypothesis

The null hypothesis assumes that the means are equal, which indicates that there is no change in efficiency during the three years: $H_0: \mu = \mu_0$

The two-sided alternative hypothesis assumes that the means differ, thus indicating that there has been a change in efficiency during the three years: $H_a: \mu \neq \mu_0$

3. Test Statistic

The test statistic uses the standard error to measure the distance between the sample mean $\bar{x}$ (see appendix 7) and the value of the null hypothesis $\mu_0$.

The following formula can be used to calculate the t test statistic:

$$t = \frac{\bar{x} - \mu_0}{se} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{0.066003 - 0}{0.3096432/\sqrt{338}} = 3.918866589 \approx 3.92$$

4. P-Value

The p-value is the two-tail probability of getting values more extreme than the observed t test statistic. We can obtain the right-tail probability by using Table B on page A-3. With df = n - 1 = 337, df is larger than 100, so we use the row for infinite degrees of freedom. For a t test statistic of 3.92 the right-tail probability is smaller than 0.001. The P-value must therefore be smaller than 0.002, so we can reject H0 because the P-value is smaller than the significance level of α=0.05. Furthermore, SAS JMP reports a p-value of 0.0001, which is far smaller than the significance level.

5. Conclusion

Since the p-value is smaller than the significance level, it is evident that H0 must be rejected. This means

that there has been a significant increase in efficiency during the three years.

Confidence Interval

After rejecting H0, we construct a confidence interval in order to determine the interval within which the population mean change in efficiency over the past three years lies.

To construct a 95% confidence interval, we use the following formula and the values given to us by SAS JMP (see appendix 7):

$$\bar{x} \pm t_{0.025} \cdot se, \quad \text{where } se = \frac{s}{\sqrt{n}}$$

The sample mean is found to be 0.066003 by using SAS JMP (see appendix 7). The confidence interval will now be calculated.

$$se = \frac{0.3096432}{\sqrt{338}} = 0.016842369728331$$

$$0.066003 \pm 1.96 \cdot 0.016842369728331 = (0.0329919553; 0.0990140447) \approx (0.03299; 0.099)$$

This is approximately the same as the confidence interval given by SAS JMP of (0.03287; 0.09913).

Seeing as the confidence interval does not contain zero, we can conclude that there has been a positive change in efficiency during the last three years. Furthermore we can say with 95% confidence that the average increase in efficiency lies within the interval (0.03299; 0.099).
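The test statistic and interval for the change in efficiency can be verified from the three summary values used above (mean difference, standard deviation and n). The sketch below is only a check of the arithmetic: `one_sample_summary` is a name of our own, and it uses the critical value 1.96 as in the text rather than the exact t quantile that JMP would use.

```python
import math

def one_sample_summary(mean, s, n, mu0=0.0, crit=1.96):
    """Standard error, t statistic and 95% CI for a single mean (paired differences)."""
    se = s / math.sqrt(n)
    t = (mean - mu0) / se
    return se, t, (mean - crit * se, mean + crit * se)

se, t, ci = one_sample_summary(0.066003, 0.3096432, 338)
print(f"se = {se:.6f}, t = {t:.2f}, 95% CI = ({ci[0]:.5f}; {ci[1]:.5f})")
```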


Question 5

This question regards fitting a simple linear regression with efficiency as the response variable and logSize

as the explanatory variable. We will start out by discussing whether the assumptions of the model are

violated, and then compute a 95% confidence interval.

Model Assumptions

The first assumption states that the population means satisfy the regression line $\mu_y = \alpha + \beta x$. SAS JMP (see appendix 8) gives the fitted regression line $\hat{y} = 4.7291198 + 0.3390093x$. This regression has an R2 value of 0.114002, meaning that logSize explains only about 11% of the variability in efficiency, so the linear association between efficiency and logSize is not very strong; a strong association would give a value close to 1.

The second assumption concerns randomization, which is fulfilled as stated in Question 2. The third assumption states that the population y values at each x value have a normal distribution with the same standard deviation at each x value. The data set approximately fulfills the model assumptions.

Further violations of the model could arise from a small sample size, but this is not relevant for this data set.

Confidence Interval

We construct a confidence interval for the slope to determine whether it contains 0, and thereby whether x and y can be considered statistically independent. A 95% confidence interval for the slope has the formula:

$$b \pm t_{0.025} \cdot se = 0.3390092 \pm 1.96 \cdot 0.048042 = (0.244847; 0.433172)$$

The standard error is supplied by SAS JMP (see appendix 8).

The t-score has df = n - 2 = 389 - 2 = 387, thus $t_{0.025} = 1.96$ (Table B on page A-3).

Since the 95% confidence interval does not contain 0, we can infer that H0: β = 0 would be rejected in a significance test. As the slope is not equal to 0, the variables are linearly associated, meaning that they are not independent of each other. We infer that mean efficiency increases by at least 0.244847 and at most 0.433172 for each one-unit increase in logSize.
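The slope interval and the check for zero can be written out as follows. This is a small sketch using the estimate and standard error quoted from the JMP output; the helper name `slope_ci` is ours.

```python
def slope_ci(b, se, crit=1.96):
    """95% confidence interval for a regression slope."""
    return b - crit * se, b + crit * se

low, high = slope_ci(0.3390092, 0.048042)
contains_zero = low <= 0.0 <= high  # True would mean independence is plausible

print(f"95% CI for the slope: ({low:.6f}; {high:.6f})")
print("zero inside interval:", contains_zero)
```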

Question 6

a) Additive Model

We will now construct an additive model and explain the most important parts of the output.

The bivariate regression can be extended to a multiple regression equation.

$$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$


For practical matters we use SAS JMP to calculate the multiple regression model (see appendix 9). We can

explain the most important parts of the output by looking at the prediction expression. The interpretation

of the estimates of β depends on whether the term is categorical or quantitative. If it is quantitative, the

given estimates of β describes what one unit increase in the term adds or subtracts to the overall efficiency.

If it is categorical, only the estimate related to the particular information is used e.g. if the store type is a

consumer durable store 0.33 will be added to the efficiency, whereas the estimate of modern format store

will not be a part of the equation.

R2

Furthermore we can look at R2, which SAS JMP has calculated to be 0.26. This indicates that predictions from the multiple regression equation have 26% less error than predictions using ȳ (the sample mean), which was found to equal 5.476. Additionally, we could look at how much R2 increases as variables are included. It cannot get smaller when a variable is added; it can only stay the same or increase.

b) Fulfillment of Model Assumptions

Model Assumptions:

First assumption concerns each explanatory variable which has a straight-line relation with μy, with the

same slope for all combinations of values of other predictors in the model. The second assumption states

that the data must be gathered using randomization. The third assumption states that data must have a

normal distribution for y with same standard deviation at each combination of values of other predictors in

the model.

The model assumptions are fairly satisfied, since the standard errors are somewhat equal, with the biggest

difference of about 0.1 standard errors. Furthermore the sample is collected using randomization and the

distributions are approximately normal.

Nonlinear effects

In order to test whether there are any nonlinear effects we use SAS JMP (see appendix 10) to see if any of

the terms have a polynomial tendency. From the data we can see that logSize has a polynomial tendency

since it has a p-value lower than the significance level. On the other hand competition has a linear tendency

since the p-value is a lot higher than the significance level. In regards to the nonlinear tendencies of the

logSize variable, this might lead to violations of the model assumptions, but it cannot be concluded with

certainty that the assumption has been violated.


Interactions

Interactions between two explanatory variables occur when there is a change in the slope of the

relationship between μy and one of the explanatory variables, as the other explanatory variable changes.

There is an interaction when the change in one variable affects the outcome of the other variable. Thus we

can test for interactions in SAS JMP (see appendix 11). If the obtained probability of the crossed estimates

is below the significance level of 0.05 there is an interaction. Thus there is an interaction between logSize

and storetype as the value of 0.0002 is below the significance level.

As there is only an interaction between logSize and storetype, we would have to eliminate one of them

from the model in order to make it more accurate, since they affect each other.

c) Statistical Significance of Predictors

We could have added the parameters to the model stepwise in order to see how much each parameter increases R2 and thereby enhances the predictive power of the model. Since R2 cannot decrease when a parameter is added, a clear increase suggests that the parameter contributes to the model, whereas an essentially unchanged R2 suggests that it adds little. A formal conclusion about statistical significance still requires a test. We can obtain R2 by using the following formula:

We can also look at the effect tests in JMP. We set a significance level of 0.05 and conclude from the above table that the city parameter is not statistically significant, since 0.5197 > 0.05. Conversely, storetype, logSize and competition are all statistically significant, since their p-values are below 0.05.

Confidence interval:

The 95% confidence interval for the effect of competition is given by the formula below:

$$\text{Estimated slope} \pm t_{0.025}(se)$$

The t-score has df = n − number of parameters in the regression equation, so df = 389 − 9 = 380. With this many degrees of freedom the t-score is approximately 1.96, as seen from Table B on page A-3 in the book. We obtain the estimated slope from the "Indicator Parameterization" estimates in JMP and conclude that b = 0.5250564.

$$R^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$
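
The R2 formula can be sketched in a few lines of Python. This is only an illustration: the function name `r_squared` and the small data set are hypothetical and are not taken from the retail store data.

```python
def r_squared(observed, predicted):
    """Proportional reduction in squared error from using the model's
    predictions instead of the sample mean."""
    mean_observed = sum(observed) / len(observed)
    # Total sum of squares: error when predicting every y by the mean.
    total_ss = sum((y - mean_observed) ** 2 for y in observed)
    # Residual sum of squares: error when predicting y by the fitted value.
    residual_ss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return (total_ss - residual_ss) / total_ss


# Hypothetical observations and fitted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 4))
```

Predictions that match the observations exactly give R2 = 1, while predicting the mean for every observation gives R2 = 0.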

Thus the 95% confidence interval for the effect of competition is (0.2334871, 0.8166256). We can therefore be 95% confident that increasing competition by 1 unit changes the mean efficiency by between 0.2334871 and 0.8166256, holding the other explanatory variables fixed. Since the interval does not contain 0, the effect of competition is statistically significant at the 0.05 level.
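
The interval can be reproduced with a short calculation. The sketch below uses the slope and t-score stated above; the standard error is not quoted in the text, so it is backed out of the reported interval, which is an assumption made purely for illustration (the function name is hypothetical as well).

```python
def slope_confidence_interval(slope, standard_error, t_score):
    """Return (lower, upper) for slope ± t_score * standard_error."""
    margin = t_score * standard_error
    return slope - margin, slope + margin


# Degrees of freedom: sample size minus number of parameters in the model.
n_observations = 389
n_parameters = 9
degrees_of_freedom = n_observations - n_parameters

slope = 0.5250564  # estimated slope for competition, from the text
t_score = 1.96     # t-score used in the text for df = 380

# Assumption: the standard error is recovered from the reported interval,
# because it is not quoted in the text.
reported_lower, reported_upper = 0.2334871, 0.8166256
standard_error = (reported_upper - reported_lower) / (2 * t_score)

lower, upper = slope_confidence_interval(slope, standard_error, t_score)
print(degrees_of_freedom, round(standard_error, 4), round(lower, 4), round(upper, 4))
```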

Question 7

a) Logistic Regression Model

We have fitted a logistic regression model in SAS JMP in order to predict computer usage based on logSize (see appendix 12). This gives the model estimates below.

Odds ratio:

As seen from the unit odds ratio in the SAS JMP output above, the probability of having a computer increases as the store size increases. The unit odds ratio is 0.035066 and thus below 1, which means that the odds of the response level modelled by JMP fall as logSize rises. JMP models the first level of the response by default, so if that level is "no computer", an odds ratio below 1 implies that the chance of having a computer increases with store size.

The odds ratio of 0.035066 corresponds to a tenfold increase in size, since size is measured on a base-10 log scale.
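
To illustrate what the unit odds ratio implies for other changes in size, the sketch below rescales it. It assumes only that logSize is the base-10 logarithm of size, as stated above; the function name is hypothetical.

```python
import math


def odds_ratio_for_size_factor(unit_odds_ratio, size_factor):
    """Odds ratio when size is multiplied by size_factor.

    Multiplying size by size_factor raises logSize by log10(size_factor),
    and the odds ratio for a change of d units in logSize is
    unit_odds_ratio ** d.
    """
    change_in_log_size = math.log10(size_factor)
    return unit_odds_ratio ** change_in_log_size


unit_odds_ratio = 0.035066  # unit odds ratio reported by JMP, from the text
tenfold = odds_ratio_for_size_factor(unit_odds_ratio, 10)
doubling = odds_ratio_for_size_factor(unit_odds_ratio, 2)
print(round(tenfold, 6), round(doubling, 4))
```

A doubling of size changes the odds by a smaller factor than a tenfold increase, because it corresponds to an increase of only about 0.30 in logSize.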

Confidence Interval

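From SAS JMP (see the above graph) we can see that the 95% confidence interval for the odds ratio is (0.014086; 0.07909). This means that we can be 95% confident that the population odds ratio lies between 0.014086 and 0.07909. Since the whole interval lies below 1, the effect of logSize is statistically significant at the 0.05 level.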
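
Because the odds ratio for the complementary outcome is the reciprocal of the reported one, the interval can also be expressed on the reciprocal scale. The sketch below does this with the numbers from the text; which outcome each scale refers to depends on the response level modelled in JMP, which is not stated here, and the function name is hypothetical.

```python
def invert_odds_ratio_interval(lower, upper):
    """Return the interval for 1/OR given an interval (lower, upper) for OR.

    Taking reciprocals reverses the order of the endpoints.
    """
    return 1.0 / upper, 1.0 / lower


odds_ratio = 0.035066                  # point estimate from the text
ci_lower, ci_upper = 0.014086, 0.07909  # 95% interval from the text

inverse_point = 1.0 / odds_ratio
inverse_lower, inverse_upper = invert_odds_ratio_interval(ci_lower, ci_upper)
print(round(inverse_lower, 2), round(inverse_point, 2), round(inverse_upper, 2))
```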

b) Multiple Regression Model

We will now use SAS JMP to construct a logistic regression with multiple explanatory variables. The graph

below shows the estimates.

The logistic model with the predictors logSize, Perception, City and StoreType is thus given by the parameter estimates from SAS JMP above.
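
In general form, a logistic regression model with these predictors can be written as below. The symbols are illustrative assumptions: the numerical estimates are those in the JMP output and are not reproduced here, p denotes the probability of the response level modelled by JMP, and City and StoreType are assumed to enter through indicator variables.

```latex
\eta = \alpha + \beta_1\,\text{logSize} + \beta_2\,\text{Perception}
     + \sum_j \gamma_j\,\text{City}_j + \sum_k \delta_k\,\text{StoreType}_k,
\qquad
\log\!\left(\frac{p}{1-p}\right) = \eta
\quad\Longleftrightarrow\quad
p = \frac{e^{\eta}}{1 + e^{\eta}}
```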

c) Significance: Perception

We will describe the effect of Perception both statistically and in real-world terms. From the table below, it can be seen that Perception is statistically significant, as its p-value is below the significance level (0.0412 < 0.05).

Statistical significance does not by itself mean that there is a practically important or causal connection between perception and computer use. It is therefore important to distinguish between statistical significance and real-world significance; for example, it does not necessarily make sense that computer use is determined by how people perceive a store.

Appendix

Appendix 1 – Question 1

Appendix 2 – Question 2

Appendix 3 – Question 2

Appendix 4 – Question 2

Appendix 5 – Question 3

Appendix 6 – Question 3

Appendix 7 – Question 4

Appendix 8 – Question 5

Appendix 9 – Question 6

Appendix 10 – Question 6B

Appendix 11 – Question 6B

Appendix 12 – Question 6B

Appendix 13 – Question 7
