
Introduction - umu.se · the book Extending the Linear Model with R – Generalized Linear, Mixed Effects and Nonparametric regression, written by Julian J. Faraway (2006). The numbering


Introduction

The purpose of this report is to solve three exercises, or more specifically to analyze three sets of data, as an examination assignment in the course Linear Models and Extensions. The exercises are taken from the book Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, written by Julian J. Faraway (2006). The numbering in this document is consistent with the book's numbering; hence exercise 5:2 is the second exercise in chapter 5.

A generalized linear model (GLM) is, as the name suggests, a generalization of the standard linear regression model. Instead of directly associating the response, μ, with the predictor variables in a linear way, a GLM relates a linear predictor, η, which is a function of the response, to the covariates, xi, in the following way: ηi = β0 + β1xi1 + … + βqxiq. This is done because there may be some restriction on the response, for example that μ is a count or a proportion. By using an appropriate link function, g, we can make sure that valid values are assigned to the response, for example a value between 0 and 1 if the response is a proportion. Thus the link function describes how the response is linked to the covariates through the linear predictor: η = g(μ).

This assignment treats three types of generalizations of the standard linear model. The first exercise concerns binomial data, or more specifically binary data; the second exercise treats ordered multinomial responses; and in the third exercise the linear model is extended to include both fixed and random effects in a so-called split-plot design. The statistical software R is used to analyze the datasets in all three exercises, and the code can be found in the Appendix, once again with consistent numbering. Complete datasets are not included in this document due to their size, but they can be found in the package faraway in R, and references to the datasets can be found at the end of this document.

Exercise 2:2

Background: The dataset wbca comes from a study of breast cancer in Wisconsin. It consists of medical data from 681 women with potentially cancerous tumors, of which 238 are actually malignant while the remaining 443 are benign. Determining whether a tumor is really malignant is traditionally done by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure, called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status. The response variable in the dataset is Class, a binary variable that is 0 if the tumor is malignant and 1 if it is benign. The nine predictor variables are: Adhes - marginal adhesion, BNucl - bare nuclei, Chrom - bland chromatin, Epith - epithelial cell size, Mitos - mitoses, NNucl - normal nucleoli, Thick - clump thickness, UShap - cell shape uniformity and USize - cell size uniformity. A doctor who has observed the cells from the small sample of tissue determines these predictor values by rating them on a scale from 1 to 10 with respect to the particular characteristic, where the value 1 is assigned if the characteristic is normal and the value 10 if it is most abnormal. The first six lines of the dataset are displayed below:


  Class Adhes BNucl Chrom Epith Mitos NNucl Thick UShap USize
1     1     1     1     3     2     1     1     5     1     1
2     1     5    10     3     7     1     2     5     4     4
3     1     1     2     3     2     1     1     3     1     1
4     1     1     4     3     3     1     7     6     8     8
5     1     3     1     3     2     1     1     4     1     1
6     0     8    10     9     7     1     7     8    10    10

Part a) Fit a binomial regression with Class as the response and the other nine variables as predictors. Report the residual deviance and the associated degrees of freedom, and explain why this information can or cannot be used to determine whether the model fits the data.

Solution: A binomial regression is a GLM where the response is a probability, in this case the probability that a tumor is benign, so a link function that guarantees 0 ≤ p ≤ 1 must be used. One common choice of such a function is the logit link function, η = log(p/(1-p)), which is used in this exercise. The following binomial model, called Model 1, is the one that is fitted:

η = β0+β1*Adhes+β2*BNucl+β3*Chrom+β4*Epith+β5*Mitos+β6*NNucl+β7*Thick+β8*UShap+β9*USize
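The logit link and its inverse, which maps the linear predictor back to a valid probability, can be illustrated with a few lines of code (shown here in Python for concreteness; the report's own analysis code is in R and can be found in the Appendix):

```python
import math

def logit(p):
    # Maps a probability in (0, 1) to the whole real line.
    return math.log(p / (1 - p))

def ilogit(eta):
    # Inverse logit: maps any real eta back to a probability in (0, 1).
    return 1 / (1 + math.exp(-eta))

# The inverse link always returns a valid probability:
print(ilogit(-10), ilogit(0), ilogit(10))  # ~0.0000454, 0.5, ~0.99995
```

Whatever value the linear predictor takes, the fitted probability stays strictly between 0 and 1, which is exactly the restriction on the response that motivates the link.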

The output below gives the estimates of the β-coefficients, the standard errors for the coefficients, the residual deviance, the corresponding degrees of freedom and also the AIC-value:

Call:

glm(formula = Class ~ Adhes + BNucl + Chrom + Epith + Mitos +

NNucl + Thick + UShap + USize, family = binomial(logit),

data = wbca)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 11.16678 1.41491 7.892 2.97e-15 ***

Adhes -0.39681 0.13384 -2.965 0.00303 **

BNucl -0.41478 0.10230 -4.055 5.02e-05 ***

Chrom -0.56456 0.18728 -3.014 0.00257 **

Epith -0.06440 0.16595 -0.388 0.69795

Mitos -0.65713 0.36764 -1.787 0.07387 .

NNucl -0.28659 0.12620 -2.271 0.02315 *

Thick -0.62675 0.15890 -3.944 8.01e-05 ***

UShap -0.28011 0.25235 -1.110 0.26699

USize 0.05718 0.23271 0.246 0.80589

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 881.388 on 680 degrees of freedom

Residual deviance: 89.464 on 671 degrees of freedom

AIC: 109.46

The residual deviance is 89.46 for this model and the corresponding degrees of freedom are 671. The deviance is a measure of fit since it compares fitted values to the data. Provided that Y (Class in this case) is truly binomial and that the numbers of "trials", ni, are relatively large, the deviance is approximately χ2-distributed with n - s degrees of freedom (where s is the number of parameters). A p-value can then be calculated for the reported deviance with the corresponding degrees of freedom, and if the p-value is larger than 0.05 the conclusion is that the model fits sufficiently well. However, in this case Y is binary, which means that ni = 1 since Class is either 0 or 1. When the ni are small, the deviance is not approximately χ2-distributed, so it cannot be used to judge goodness of fit for binary data. Other methods, such as the Hosmer-Lemeshow test, can be used instead.


Part b) Use AIC as the criterion to determine the best subset of variables.

Solution: AIC (Akaike Information Criterion) is a measure of model fit defined as AIC = deviance + 2*dim(β), where a smaller AIC indicates a better model. Starting from Model 1, the step function in R is used to determine the best subset of main effects when AIC is used as the criterion:

Call:

glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl +

Thick + UShap, family = binomial(logit), data = wbca)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 11.0333 1.3632 8.094 5.79e-16 ***

Adhes -0.3984 0.1294 -3.080 0.00207 **

BNucl -0.4192 0.1020 -4.111 3.93e-05 ***

Chrom -0.5679 0.1840 -3.085 0.00203 **

Mitos -0.6456 0.3634 -1.777 0.07561 .

NNucl -0.2915 0.1236 -2.358 0.01837 *

Thick -0.6216 0.1579 -3.937 8.27e-05 ***

UShap -0.2541 0.1785 -1.423 0.15461

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 881.388 on 680 degrees of freedom

Residual deviance: 89.662 on 673 degrees of freedom

AIC: 105.66
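As a quick check of the definition, both reported AIC values can be reproduced from the residual deviances and the parameter counts (10 parameters for Model 1, 8 for Model 2, intercepts included); a small arithmetic sketch:

```python
def aic(deviance, n_params):
    # AIC = deviance + 2 * dim(beta)
    return deviance + 2 * n_params

# Model 1: intercept + 9 predictors = 10 parameters
print(round(aic(89.464, 10), 2))  # 109.46, as reported
# Model 2: intercept + 7 predictors = 8 parameters
print(round(aic(89.662, 8), 2))   # 105.66, as reported
```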

From the output it can be seen that Epith and USize are no longer in the model, called Model 2, and the AIC is now 105.66, compared to 109.46 for Model 1. Thus Model 2 is the main-effects model with the lowest AIC value. An ANOVA test comparing Model 1 and Model 2 gives the p-value 0.9, indicating that Epith and USize have no significant effect on the model; hence Model 2 is "best". A model, called Model 3, with all nine main effects and all 36 possible two-way interaction terms is also defined, and once again AIC is used as the criterion for reducing the model. The lowest AIC then found is 85.39 and belongs to Model 4, which includes all nine main effects and 19 different two-way interactions. Further simplifications of Model 4 are tried, where non-significant effects are deleted from the model, but they do not result in any lower AIC value. A comparison between Model 2 and Model 4, using ANOVA, gives a p-value much smaller than 0.001, indicating that the included two-way interactions add some significant predictive information to the model. Despite this result, Model 2 is chosen for further analysis, since it is much smaller than Model 4 (which is overfitted) and hence easier to interpret, and also since the diagnostic plots, see Figure 1 and Figure 2, are more satisfactory for Model 2.

Figure 1: Diagnostic plots for Model 2
Figure 2: Diagnostic plots for Model 4


Part c) Use the reduced model to predict the outcome for a new patient with predictor variables: Adhes= 1, BNucl= 1, Chrom= 3, Epith= 2, Mitos= 1, NNucl= 1, Thick= 4, UShap= 1 and USize = 1. Give a confidence interval for the prediction.

Solution: The reduced model is Model 2:

η = β0+β1*Adhes+β2*BNucl+β3*Chrom+β4*Mitos+β5*NNucl+β6*Thick+β7*UShap

Since Epith and USize are not in the model, the values for these covariates are not taken into account for the prediction. Using the estimated coefficients for Model 2 (see part b), a predicted value on the logit scale is calculated to be 4.8344, which corresponds to the response value 0.9921. This implies that the predicted probability that the tumor is benign for a woman with the predictor variables specified above is about 99 %, or the other way, the predicted probability that this woman’s tumor is malignant is about 1 %. An approximate 95 % confidence interval for the predicted probability can be obtained by using normal approximation. The interval on logit scale is then: [η0 – 1.96*se(η0), η0 + 1.96*se(η0)], where η0 is the predicted value 4.8344 and the standard error is 0.5815 = se(η0). Using these numbers and the relation p = eη/(1-eη), a confidence interval in the probability scale is found to be: (0.9757, 0.9975).
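The prediction and its confidence interval follow directly from the Model 2 coefficients reported in part b); a numerical sketch (in Python, with the coefficients hard-coded from the R output above):

```python
import math

def ilogit(eta):
    # Inverse logit: probability corresponding to a value on the logit scale.
    return 1 / (1 + math.exp(-eta))

# Model 2 coefficients from part b) and the new patient's covariate values
eta0 = (11.0333 - 0.3984 * 1   # Adhes
                - 0.4192 * 1   # BNucl
                - 0.5679 * 3   # Chrom
                - 0.6456 * 1   # Mitos
                - 0.2915 * 1   # NNucl
                - 0.6216 * 4   # Thick
                - 0.2541 * 1)  # UShap
se = 0.5815                    # standard error of eta0, from the R output

p_hat = ilogit(eta0)
lo, hi = ilogit(eta0 - 1.96 * se), ilogit(eta0 + 1.96 * se)
print(round(eta0, 4), round(p_hat, 4))  # 4.8344 0.9921
print(round(lo, 4), round(hi, 4))       # 0.9757 0.9975
```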

Part d) Suppose that a cancer is classified as benign if p > 0.5 and malignant if p < 0.5. Compute the number of errors of both types that will be made if this method is applied to the current data with the reduced model.

Solution: The probabilities for all 681 women in the study are predicted with Model 2, and if the predicted probability is smaller than 0.5, the tumor is classified as malignant. The predicted classification is then compared with the original data and the result is shown in Table 1. A total of 20 errors are made, corresponding to 2.9 % misclassifications: 11 malignant tumors are classified as benign and 9 benign tumors are classified as malignant.

Table 1:
                Classified as malignant   Classified as benign
Is malignant    227 (33.3 %)              11 (1.6 %)
Is benign       9 (1.3 %)                 434 (63.7 %)

Part e) Suppose the cutoff is changed to 0.9, so that p < 0.9 is classified as malignant and p > 0.9 is classified as benign. Compute the number of errors in this case and discuss the issues of determining the cutoff.

Solution: The same procedure as above is carried out, but with the cutoff set to 0.9 instead of 0.5, resulting in a total of 17 misclassifications, or 2.5 %; see Table 2. One malignant tumor is classified as benign and 16 benign tumors are classified as malignant.

Table 2:
                Classified as malignant   Classified as benign
Is malignant    237 (34.8 %)              1 (0.15 %)
Is benign       16 (2.3 %)                427 (62.7 %)
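The overall error rates quoted for the two cutoffs follow directly from the off-diagonal counts in Table 1 and Table 2; a quick arithmetic check:

```python
# Off-diagonal counts from Table 1 (cutoff 0.5) and Table 2 (cutoff 0.9), n = 681
n = 681
errors_05 = 11 + 9   # malignant classified benign + benign classified malignant
errors_09 = 1 + 16
print(errors_05, round(100 * errors_05 / n, 1))  # 20 2.9
print(errors_09, round(100 * errors_09 / n, 1))  # 17 2.5
```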


From the results in parts d) and e) it is clear that the choice of cutoff is really important; however, determining it is a hard task. If the cutoff is very high, as in this case where a high probability means a benign tumor, the risk of missing a malignant tumor is small, but more benign tumors will instead be classified as malignant, probably causing unnecessary pain for the affected women. On the other hand, if the cutoff is too low, more malignant tumors will be misclassified, which, at least in my opinion, is more serious. A missed malignant tumor can in the worst case cause the woman to die if she does not get the right treatment in time. My opinion is that an exact cutoff will never be correct and that an interval is better to use than a single point. If the probability is higher than the interval the tumor is classified as benign, if it is lower the tumor is classified as malignant, and if the probability lies within the interval there will be further examinations. However, there will still be cases that end up on the "wrong" side of the interval, but then I think it is better to choose a cutoff that reduces the "classified as benign but actually malignant" errors rather than the opposite error.

Part f) It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, split the data into two parts in the following way: assign every third observation to a test set and the remaining two thirds of the data to a training set. Use the training set to determine the model and the test set to assess its predictive performance. Compare the outcome to the previously obtained results.

Solution: Just as specified above, every third observation is assigned to a test set and the remaining two thirds are assigned to a training set. A main-effects model with all nine covariates is then fitted to the data in the training set; it is called Model 6. For simplicity, concerning both calculation and interpretation, no model including interaction terms is specified, though this would be important to do in a more comprehensive study since the result from part b) indicates that there are some significant interactions between the predictor variables. Model 6 has an AIC value of 77.65. Using the step function to see which subset of the variables gives the smallest AIC value results in the same model as was obtained using the entire dataset, namely the model where Epith and USize are not included as predictors. The output below shows a summary of that model, called Model 7:

Call:

glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl +

Thick + UShap, family = binomial(logit), data = trainingset)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 11.5571 1.8285 6.321 2.60e-10 ***

Adhes -0.4249 0.1441 -2.949 0.00318 **

BNucl -0.3341 0.1187 -2.815 0.00487 **

Chrom -0.5963 0.2422 -2.462 0.01382 *

Mitos -0.5822 0.4872 -1.195 0.23207

NNucl -0.4192 0.1604 -2.614 0.00895 **

Thick -0.6037 0.1924 -3.138 0.00170 **

UShap -0.2943 0.2034 -1.447 0.14795

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 592.796 on 453 degrees of freedom

Residual deviance: 59.536 on 446 degrees of freedom

AIC: 75.536
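The split described above is simple index arithmetic: with the 681 observations numbered 1 to 681, every observation whose index is divisible by three goes to the test set. A sketch of that logic (in Python; the report's own code is in R):

```python
# Hypothetical index arithmetic for the "every third observation" split.
indices = range(1, 682)                          # the 681 observations, numbered 1..681
testset = [i for i in indices if i % 3 == 0]     # every third observation
trainingset = [i for i in indices if i % 3 != 0]
print(len(testset), len(trainingset))            # 227 454
```

The sizes agree with the reported results: the training set has 454 observations, hence the 453 degrees of freedom for the null deviance, and the test set contains the 227 women used for prediction.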


When Model 6 and Model 7 are compared using ANOVA, the resulting p-value is 0.39, indicating that the larger model is not significantly better and that the smaller Model 7 is preferable. Hence Model 7 is the one determined as best for the training set and used, together with the test set, for prediction. Predictions are made for the 227 women in the test set using Model 7 and the result is then compared to the initial classification. Figure 3 shows the predicted probabilities plotted against the true classifications. Remember that the true classification only takes the value 0 if the tumor is malignant and 1 if it is benign, while the predicted values lie in the range from 0 to 1. Hence values close to the points (0, 0) and (1, 1) are good predictions, while the further away from these points a case is, the larger is the risk of a misclassification, depending on the chosen cutoff. Most of the cases are in the desirable corners, but there are also some cases far from them that will hence be misclassified.

Another way to assess the predictive performance, and to compare it with the results when the entire dataset is used, is to apply the same classification method as in parts d) and e), with the cutoff set first to 0.5 and then to 0.9. The resulting errors are shown in Table 3 and Table 4.

Table 3: Cutoff = 0.5
                Classified as malignant   Classified as benign
Is malignant    70 (30.8 %)               5 (2.2 %)
Is benign       2 (0.88 %)                150 (66.1 %)

Table 4: Cutoff = 0.9
                Classified as malignant   Classified as benign
Is malignant    73 (32.2 %)               2 (0.88 %)
Is benign       3 (1.3 %)                 149 (65.6 %)

When the cutoff is 0.5, misclassifications are made in 3.1 % of the cases, and when the cutoff is 0.9, 2.2 % of the tumors are misclassified. Compared to the results when the entire dataset is used, the total error proportion is larger with the test set for the cutoff 0.5 (3.1 % compared to 2.9 %) but smaller for the cutoff 0.9 (2.2 % compared to 2.5 %). The differences are not large, so the conclusion can be drawn that the predictions are rather robust.

Figure 3: Predicted probabilities plotted against the true classifications for the test set.


Exercise 5:2

Background: The data happy were collected from 39 students at the University of Chicago Graduate School of Business to test the hypothesis that 'love' and 'work' are the important factors in determining an individual's happiness. The variables 'money' and 'sex' are also included in the study, where sex refers to sexual activity rather than gender. The first six lines of the data are displayed below:

  happy money sex love work
1    10    36   0    3    4
2     8    47   1    3    1
3     8    53   0    3    5
4     8    35   1    3    3
5     4    88   1    1    2
6     9   175   1    3    4

The response variable happy, representing the student's happiness, is measured on a 10-point scale, with 1 representing a suicidal state, 5 representing a feeling of "just muddling along", and 10 representing a euphoric state. The variable money is measured by annual family income in thousands of dollars. Sex is measured by a dummy variable taking the value 1 if the student has a satisfactory level of sexual activity and 0 otherwise. Love is measured on a 3-point scale, with 1 representing loneliness and isolation, 2 representing a secure relationship, and 3 representing a deep feeling of belonging and caring in the context of some family or community. The last variable, work, is measured on a 5-point scale, with 1 indicating that the individual has no job or is seeking other employment, 3 indicating that the job is "OK", and 5 indicating that the job is great.

Part a) Build a model for the level of happiness as a function of the other variables.

Solution: The multinomial distribution is an extension of the binomial distribution where the response can take more than two values. The response variable happy in the dataset happy (unfortunately they have the same name) is multinomially distributed since it can take one of the finite values 1, 2, …, 10 depending on how happy the student is. The happy variable is ordered, since it is a 10-point scale where 1 is the lowest happiness and 10 the highest. This means that happy is an ordered multinomial response variable, and methods that take this into account must be used for building the model. With an ordered response, Yi, it is often easier to work with the cumulative probabilities, γij = P(Yi ≤ j), where i is the individual and j is the category, j = 1, …, J. The γ:s are then linked to the covariates x in the following way: g(γij) = θj − xiᵀβ. The θj:s are the intercepts, so the vector xi does not include any intercept, and β does not depend on the category j. The latent variable Zi is a continuous variable that can be thought of as the real underlying response; a discretized version of Zi is observed in the form of Yi, where Yi = j if θj-1 < Zi ≤ θj. This means that the θj:s define a grid of thresholds on the Z scale that separates the different response categories, and the effect of the covariates is to move that grid in different directions. When the link function g is the logit link, the model is called the proportional odds model and is defined as:

log( γj(xi) / (1 − γj(xi)) ) = θj − xiᵀβ,  j = 1, …, J−1, where γj(xi) = P(Yi ≤ j | xi).

A proportional odds model with happy as the response and the other four variables (money, sex, love and work) as covariates is fitted to the dataset using functions in R. An AIC-based variable selection is then used to reduce the model:


Start: AIC=122.48

happyF ~ money + sexF + loveF + workF

Df AIC

- sexF 1 121.68

<none> 122.48

- money 1 123.31

- workF 4 123.81

- loveF 2 149.91

Step: AIC=121.68

happyF ~ money + loveF + workF

Df AIC

<none> 121.68

- money 1 122.22

- workF 4 124.43

- loveF 2 148.55

The output shows that the model with the smallest AIC value, 121.68, is the model that does not include sex as a covariate. The two models, the one with and the one without the variable sex, are then compared using ANOVA, and the resulting p-value is 0.27, indicating that sex has no significant effect on the model and can be deleted. The summary for the chosen model is shown below:

Call:

polr(formula = happyF ~ money + loveF + workF, data = happy)

Coefficients:

Value Std. Error t value

money 0.01657 0.01064 1.5575

loveF[T.2] 3.72947 1.55696 2.3954

loveF[T.3] 7.61448 1.81523 4.1948

workF[T.2] -1.35175 1.67099 -0.8090

workF[T.3] 0.17151 1.57972 0.1086

workF[T.4] 1.92834 1.53488 1.2563

workF[T.5] 1.65821 1.93485 0.8570

Intercepts:

Value Std. Error t value

2|3 0.0389 1.6530 0.0236

3|4 0.9184 1.5697 0.5851

4|5 3.3868 1.8364 1.8443

5|6 5.2861 1.9860 2.6617

6|7 5.8676 2.0121 2.9162

7|8 8.1714 2.1389 3.8203

8|9 11.9646 2.5211 4.7457

9|10 13.7166 2.7376 5.0104

Residual Deviance: 91.68405

AIC: 121.6840
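The reported AIC can again be checked against the definition AIC = deviance + 2*dim(β): the polr output has 7 coefficients and 8 intercepts, 15 parameters in all.

```python
# Check: deviance and parameter count from the polr output above.
deviance = 91.68405
n_params = 7 + 8              # coefficients + intercepts
aic = deviance + 2 * n_params
print(round(aic, 3))          # 121.684, as reported
```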

The proportional odds model assumes that the relative odds for moving from one response category to another are the same regardless of the current response category. Whether this assumption holds for this dataset is questionable. For example, an extra $10000 in annual income for an individual who does not have a high income to start with will probably cause a larger change in happiness than an extra $10000 for an individual who is very rich. Another model for ordered multinomial data is the ordered probit model. If the latent variable Zi is assumed to have a standard normal distribution, the probit link function is used, i.e. Φ⁻¹(γj(xi)) = θj − xiᵀβ, j = 1, …, J−1. This model is also fitted to the dataset happy, first with all four predictor variables, but once again both the AIC-based variable selection procedure and the ANOVA test indicate that the variable sex can be deleted from the model. The AIC for the larger model is 121.65 and for the smaller model 120.93, and the p-value from the ANOVA is 0.26. The summary for the model without the covariate sex is shown below:

Call:

polr(formula = happyF ~ money + loveF + workF, data = happy,

method = "probit")

Coefficients:

Value Std. Error t value

money 0.009012 0.005722 1.57491

loveF[T.2] 2.087900 0.878363 2.37703

loveF[T.3] 4.383023 0.995157 4.40435

workF[T.2] -0.916242 0.970893 -0.94371

workF[T.3] 0.045830 0.969656 0.04726

workF[T.4] 1.085030 0.930140 1.16652

workF[T.5] 0.890822 1.161339 0.76706

Intercepts:

Value Std. Error t value

2|3 -0.0939 0.9672 -0.0971

3|4 0.3891 0.9426 0.4129

4|5 1.7742 1.0814 1.6407

5|6 2.8948 1.1546 2.5072

6|7 3.2428 1.1670 2.7787

7|8 4.6104 1.2242 3.7660

8|9 6.8026 1.3804 4.9279

9|10 7.6831 1.4349 5.3544

Residual Deviance: 90.93076

AIC: 120.9308

The conclusions from the analysis are that money, love and work have an effect on an individual's happiness, while sexual activity does not. No models with interactions are fitted, since these would be overfitted given the small size of the dataset compared to the number of possible parameters. Notice that since no one in the original dataset had a happiness level of 1, the model only fits values for the response levels 2, …, 10.

Part b) Interpret the parameters of your chosen model.

Solution: The interpretations are done using the proportional odds model with the covariates money, love and work. For coefficient and intercept numbers, see the second output in part a).

The proportional odds model is log( γj(x) / (1 − γj(x)) ) = θj − xᵀβ, so the intercept terms in the output correspond to the θj:s. The chosen model is parameterized so that the default level is money = 0, love = 1 and work = 1, corresponding to a person who has no annual family income, is lonely and has no job. The log-odds for this default person to be in happiness category 2 or lower against category 3 or higher is 0.0389; hence the odds are exp(0.0389) = 1.04. This means the odds are larger for the person to be in the lower categories than in the higher ones, and the corresponding probability of being in category 2 or lower is ilogit(0.0389) = 0.510, where ilogit is the inverse logit function.

The odds for the same default person to be in category 3 or lower against 4 or higher is exp(0.9184) = 2.51, and the probability of being in category 3 is ilogit(0.9184) − ilogit(0.0389) = 0.205. The intercepts for the remaining categories can be interpreted in a similar way, where a positive log-odds corresponds to odds larger than 1 and hence higher odds of being in the lower happiness categories than in the higher ones. From the output it can be seen that all the intercept terms, i.e. the log-odds, for the default person are positive; hence there is a larger predicted probability of being in the lower categories than in the higher ones, which seems logical since the default person has the lowest possible values on all three covariates.
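These numbers are straightforward to verify from the reported intercepts; a small numerical check (Python, with the intercepts hard-coded from the polr output in part a):

```python
import math

def ilogit(eta):
    # Inverse logit function.
    return 1 / (1 + math.exp(-eta))

theta_23, theta_34 = 0.0389, 0.9184  # intercepts 2|3 and 3|4 from the output

odds = math.exp(theta_23)                    # odds of category <= 2 vs >= 3
p_le2 = ilogit(theta_23)                     # P(happy <= 2) for the default person
p_eq3 = ilogit(theta_34) - ilogit(theta_23)  # P(happy = 3)
print(round(odds, 2), round(p_le2, 3), round(p_eq3, 3))  # 1.04 0.51 0.205
```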


The coefficients in the output correspond to the β:s and can be interpreted in the following way. If the income is increased by one unit ($1000), the odds of moving from a given happiness category to one category higher increase by a factor of exp(0.01657) = 1.0167. This is equivalent to saying that, standing in happiness category 2 (for example), the log-odds of being in that category or lower will be smaller if the money variable is increased by, say, 3 units, i.e. log-odds = 0.0389 - 3*0.01657. In a similar way, when changing from love level 1 to love level 2, the odds of moving one category higher in happiness are increased by a factor of exp(3.72947), and when changing from love level 1 to 3, the odds of being a level happier are increased by a factor of exp(7.61448). The coefficients for the different work levels are interpreted in the same way.

Part c) Predict the happiness distribution for an individual whose parents earn $30000 a year, who is lonely, not sexually active and has no job.

Solution: That a person is lonely corresponds to the covariate value love = 1, that he or she does not have a job corresponds to the covariate value work = 1, and that the parents' annual income is $30000 corresponds to money = 30, since it is measured in thousands of dollars. That a person is not sexually active corresponds to the predictor value sex = 0, but since sex is not included in the model, this is ignored. The happiness distribution is predicted for a person with these values, both for the proportional odds model and for the ordered probit model, with the following results:

Proportional odds model:

Category:      2     3     4     5     6     7     8     9    10
Probability: 0.387 0.216 0.344 0.044 0.004 0.004 0.000 0.000 0.000

Ordered probit model:

Category:      2     3     4     5     6     7     8     9    10
Probability: 0.358 0.189 0.386 0.062 0.003 0.001 0.000 0.000 0.000

Even if the coefficients for the two models differ a bit (see part a), the predictions are rather similar. A person who is lonely, has no job and whose parents earn $30000 a year is most likely to be in happiness category 2, 3 or 4. To test the two models' predictive performance, predictions are made for the original dataset and the predicted happiness is compared to the true happiness level. The results are shown in Tables 5 and 6:

Table 5: Proportional odds model

Predicted \ True    2   3   4   5   6   7   8   9  10
             2      0   0   1   0   0   0   0   0   0
             3      0   0   0   0   0   0   0   0   0
             4      1   1   2   1   0   0   0   0   0
             5      0   0   1   2   1   2   0   0   0
             6      0   0   0   0   0   0   0   0   0
             7      0   0   0   2   0   3   1   0   0
             8      0   0   0   0   1   3  12   2   1
             9      0   0   0   0   0   0   1   1   0
            10      0   0   0   0   0   0   0   0   0

Table 6: Ordered probit model

Predicted \ True    2   3   4   5   6   7   8   9  10
             2      0   0   1   0   0   0   0   0   0
             3      0   0   0   0   0   0   0   0   0
             4      1   1   2   1   0   0   0   0   0
             5      0   0   1   2   1   2   0   0   0
             6      0   0   0   0   0   0   0   0   0
             7      0   0   0   2   1   3   1   0   0
             8      0   0   0   0   0   3  13   3   1
             9      0   0   0   0   0   0   0   0   0
            10      0   0   0   0   0   0   0   0   0

Once again it is clear that the predictions from the two models are rather similar, and also quite good. Not all predictions are correct, but the largest difference, or error, between a true and a predicted happiness category is 2 levels; for example, one true category 10 is predicted as category 8.
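A simple one-number summary of a confusion matrix is the raw classification accuracy, the share of counts on the diagonal. A Python sketch with the counts of Table 5 typed in (the same computation on Table 6 also gives 20 correct out of 39):

```python
# Confusion matrix from Table 5 (rows = predicted, columns = true,
# happiness categories 2..10):
prop_odds = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 2, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 2, 1, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, 3, 1, 0, 0],
    [0, 0, 0, 0, 1, 3, 12, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

n_total = sum(sum(row) for row in prop_odds)         # all 39 individuals
n_correct = sum(prop_odds[i][i] for i in range(9))   # diagonal = correct
print(n_correct, n_total)                             # 20 39
print(round(n_correct / n_total, 3))                  # 0.513
```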


Exercise 8:7

Background: The dataset semicond is from an experiment that was conducted to optimize the manufacture of semiconductors. The response variable is the numeric variable resistance, which is the resistance recorded on the wafer. The experiment was conducted during four different time periods with three different wafers during each period, so the variable ET is a factor with levels 1 to 4 representing the etching time period and the variable Wafer is a factor with levels 1 to 3. The Grp variable is the combination of ET and Wafer. The last variable is position, which is a factor with levels 1 to 4. The first six lines of the data are shown below:

resistance ET Wafer position Grp

1 5.22 1 1 1 1/1

2 5.61 1 1 2 1/1

3 6.11 1 1 3 1/1

4 6.33 1 1 4 1/1

5 6.13 1 2 1 1/2

6 6.14 1 2 2 1/2

Exercise: Analyze the semicond data as a split plot experiment where ET and position are considered fixed effects. Since the wafers differ between experimental time periods, the Grp variable should be regarded as the block or group variable. Determine the best model for the data and check all appropriate diagnostics.

Solution: Split plot designs originated in agriculture but are frequently used in other areas as well. The design arises as a result of a restriction on full randomization. The idea is that main plots are split into several subplots; each main plot receives one level of one factor, while the levels of some other factor are allowed to vary over the subplots. In this case the etching time, ET, can be considered the main plot factor, and hence its levels are fixed, while Wafer can be considered the subplots and is therefore random. This implies that the combination of ET and Wafer, Grp, is also random, because one of its components is random, and the Wafer variable alone should not be included in the model since it is accounted for in Grp. The position variable could actually be thought of as a sub-subplot, but in this exercise it is considered a fixed effect. The first model fitted is:

Yijk = μ + ETi + positionj + (ET*position)ij + (ET*Wafer)ik + εijk,

where μ, ETi, positionj and (ET*position)ij are fixed effects and the remaining terms, (ET*Wafer)ik and εijk, are random effects with variances σw² and σε², respectively. The output below is a summary of the fitted model together with an ANOVA analysis to check the significance of the fixed effects:

Linear mixed model fit by REML

Formula: resistance ~ ET * position + (1 | Grp)

Data: semicond

AIC BIC logLik deviance REMLdev

86.65 120.3 -25.33 30.15 50.65

Random effects:

Groups Name Variance Std.Dev.

Grp (Intercept) 0.10579 0.32525

Residual 0.11115 0.33339

Number of obs: 48, groups: Grp, 12


Fixed effects:

Estimate Std. Error t value

(Intercept) 5.61333 0.26891 20.875

ET[T.2] 0.38000 0.38029 0.999

ET[T.3] 0.52333 0.38029 1.376

ET[T.4] 0.72667 0.38029 1.911

position[T.2] -0.16333 0.27221 -0.600

position[T.3] -0.06000 0.27221 -0.220

position[T.4] 0.27333 0.27221 1.004

ET[T.2]:position[T.2] 0.35667 0.38497 0.926

ET[T.3]:position[T.2] 0.37333 0.38497 0.970

ET[T.4]:position[T.2] 0.37667 0.38497 0.978

ET[T.2]:position[T.3] -0.16667 0.38497 -0.433

ET[T.3]:position[T.3] -0.30333 0.38497 -0.788

ET[T.4]:position[T.3] -0.38333 0.38497 -0.996

ET[T.2]:position[T.4] -0.35000 0.38497 -0.909

ET[T.3]:position[T.4] -0.31667 0.38497 -0.823

ET[T.4]:position[T.4] -0.07333 0.38497 -0.190

Analysis of Variance Table

Df Sum Sq Mean Sq F value

ET 3 0.64726 0.21575 1.9411

position 3 1.12889 0.37630 3.3855

ET:position 9 0.80948 0.08994 0.8092

From the summary it can be seen that the two estimated variance components, σ̂w² = 0.106 and σ̂ε² = 0.111, are similar, and hence the variation between wafers and the variation due to random errors contribute about equally to the model variation. The ANOVA table indicates that the interaction term (ET*position) has little effect on the model, since an F-value smaller than 1 cannot be significant. A look at the t-values for the interaction coefficients leads to the same conclusion, since none of them exceed 1 in absolute value. This implies that the interaction term can be removed, and a model without interaction is fitted:

Linear mixed model fit by REML

Formula: resistance ~ ET + position + (1 | Grp)

Data: semicond

AIC BIC logLik deviance REMLdev

71.09 87.93 -26.54 40.12 53.09

Random effects:

Groups Name Variance Std.Dev.

Grp (Intercept) 0.10724 0.32747

Residual 0.10537 0.32460

Number of obs: 48, groups: Grp, 12

Fixed effects:

Estimate Std. Error t value

(Intercept) 5.64375 0.22607 24.964

ET[T.2] 0.34000 0.29841 1.139

ET[T.3] 0.46167 0.29841 1.547

ET[T.4] 0.70667 0.29841 2.368

position[T.2] 0.11333 0.13252 0.855

position[T.3] -0.27333 0.13252 -2.063

position[T.4] 0.08833 0.13252 0.667

Analysis of Variance Table

Df Sum Sq Mean Sq F value

ET 3 0.61357 0.20452 1.9411

position 3 1.12889 0.37630 3.5714


The difference between the two variance components is now even smaller, and hence the conclusion that the wafer and the error contribute about equally to the model variation is substantiated. Even if the AIC criterion for model selection can be questioned, it suggests that the model without interaction is better, since it has an AIC value of 71.09 compared to 86.65. Together with the previous result that the interaction was non-significant, the conclusion is that the following model is considered "best": Yijk = μ + ETi + positionj + (ET*Wafer)ik + εijk. The diagnostic plots for both models (with and without interaction) are shown in Figure 4. The plots are rather similar and no obvious problems with the model assumptions can be seen. The residuals don't show any obvious trend or pattern (which they shouldn't), and the normality assumption looks appropriate according to the QQ-plots.
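The reported AIC values can be reproduced from the REML deviances in the summaries, assuming AIC = REML deviance + 2p, where p counts 16 fixed-effect coefficients plus 2 variance parameters for the interaction model and 7 plus 2 for the additive model (a sketch of the arithmetic; this appears to be how the lme4 version used here computes it):

```python
# AIC = REML deviance + 2 * (number of parameters)

# Interaction model: 16 fixed-effect coefficients + 2 variance parameters
aic_full = 50.65 + 2 * 18
# Additive model: 7 fixed-effect coefficients + 2 variance parameters
aic_reduced = 53.09 + 2 * 9

print(round(aic_full, 2))     # 86.65, as in the first summary
print(round(aic_reduced, 2))  # 71.09, as in the second summary
```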

Figure 4: Diagnostic plots for the split plot models.

A look at the estimated coefficients and their t-values, together with the ANOVA table, reveals that the fixed effects ET and position are probably both significant. Time period 4 seems to differ most from the other time periods, and position 3 differs most from the other three positions. The intercept term, 5.64, corresponds to the resistance in time period 1 and position 1.


References

Theory and methods: Faraway, J. J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.

Dataset in Exercise 2:2 Bennett, K. P., & Mangasarian, O. L. (1992). Neural network training via linear programming. In P. M. Pardalos (Ed.), Advances in Optimization and Parallel Computing (pp. 56-57). Elsevier Science.

Dataset in Exercise 5:2 George, E. I., & McCulloch, R. E. (1993). Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88, 881-889.

Dataset in Exercise 8:7 Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS System for Mixed Models. SAS Institute (Data Set 2.2(b)).


Appendix

R-Code to exercise 2:2

####################################

####### Ch 2: Exercise 2 #######

####################################

library(faraway)

data(wbca)

help(wbca)

attributes(wbca)

head(wbca)

wbca

summary(wbca)

########## (a) ##########

# Model with all main effects:

modell1 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl +

Thick +UShap + USize, family=binomial(logit), data=wbca)

summary(modell1)

# Diagnostic plots:

halfnorm(residuals(modell1))

par(mfrow=c(2,2))

plot(modell1)

########## (b) ##########

# Reduced model with only main effects:

modell2 <- step(modell1,trace=F)

summary(modell2)

anova(modell2,modell1) # Model 2 is better!

pchisq(0.2, 2, lower.tail=FALSE)

# Model with all two-way interactions:

modell3 <- glm(Class ~ (Adhes + BNucl + Chrom + Epith + Mitos + NNucl +

Thick +UShap + USize)^2, family=binomial(logit), data=wbca)

summary(modell3)

# Reduced model with two-way interactions:

modell4 <- step(modell3, trace=F)

summary(modell4)

# Further reduced model with two-way interactions:

modell5 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick +UShap +

USize + Adhes:BNucl + Adhes:Epith + Adhes:Thick + Adhes:UShap + BNucl:Chrom +

BNucl:UShap + BNucl:USize + Chrom:UShap + Chrom:USize + Epith:Thick,

family=binomial(logit),data=wbca)

summary(modell5)

anova(modell2,modell4)

pchisq(62, 21, lower.tail=FALSE) # Model 4 is better!

# Diagnostic plots:

halfnorm(residuals(modell2))

par(mfrow=c(2,2))

plot(modell2)

halfnorm(residuals(modell4))

par(mfrow=c(2,2))

plot(modell4)

########## (c) ##########

# Prediction:

x0 <- c(1,1,1,3,1,1,4,1)

eta0 <- sum(x0*coef(modell2))


ilogit(eta0)

#Alternative:

predict(modell2, newdata=data.frame(Adhes=1,BNucl=1, Chrom=3, Mitos=1, NNucl=1,

Thick=4, UShap=1),type="response", se=T)

# Confidence interval for the prediction:

modell2sum <- summary(modell2)

(cm <- modell2sum$cov.unscaled)

se <- sqrt(t(x0) %*% cm %*% x0) #Standard error on the logit scale.

ilogit(c(eta0 - 1.96*se,eta0+1.96*se)) #Confidence interval on the probability scale.

# Alternative:

predict(modell2, newdata=data.frame(Adhes=1,BNucl=1, Chrom=3, Mitos=1, NNucl=1,

Thick=4, UShap=1), se=T)

ilogit(c(4.834428-1.96*0.5815185, 4.834428+1.96*0.5815185)) # Same as above.

########## (d) ##########

predsann <- predict(modell2, type = "response")

prsa0.5 <- predsann>0.5

test0.5=wbca$Class-prsa0.5

sum(test0.5==-1) # -1 = Truly malignant but classified as benign.

sum(test0.5==1) # 1 = Truly benign but classified as malignant.

sum(test0.5==0) # 0 = Correct classification.

# Alternative:

table(wbca$Class,1*(predsann>0.5))

########## (e) ##########

prsa0.9 <- predsann>0.9

test0.9=wbca$Class-prsa0.9

sum(test0.9==-1) # -1 = Truly malignant but classified as benign.

sum(test0.9==1) # 1 = Truly benign but classified as malignant.

sum(test0.9==0) # 0 = Correct classification.

# Alternative:

table(wbca$Class,1*(predsann>0.9))

########## (f) ##########

# Every third observation in a test set:

vartredje <- which((1:681)%%3 == 0)

testset <- wbca[vartredje,]

# The remaining two thirds in a training set:

trainingset <- wbca[-vartredje,]

# Fit a main effects-model:

modell6 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl +

Thick +UShap + USize, family=binomial(logit), data=trainingset)

summary(modell6)

# Reduce the main effects-model:

modell7 <- step(modell6,trace=F)

summary(modell7)

anova(modell7,modell6)

pchisq(1.89, 2, lower.tail=FALSE) # The smaller model, Model 7, is better.

# Predict values for test set:

skattn <- predict(modell7, newdata=testset,type="response", se=T)

# Plot predicted values against true values:

plot(skattn$fit, testset$Class, xlab="Predicted probabilities", ylab="True classification")

# Test its predictive performance:

table(testset$Class,1*(skattn$fit>0.5))

table(testset$Class,1*(skattn$fit>0.9))

####################################


R-Code to exercise 5:2

####################################

####### Ch 5: Exercise 2 #######

####################################

library(faraway)

data(happy)

help(happy)

attributes(happy)

head(happy)

happy

summary(happy)

library(MASS)

########## (a) ##########

# Creates factors of the variables that are factors:

happy$happyF <- factor(happy$happy)

happy$sexF <- factor(happy$sex)

happy$loveF <- factor(happy$love)

happy$workF <- factor(happy$work)

# A proportional odds model:

modell1 <- polr(happyF ~ money + sexF + loveF + workF, happy)

summary(modell1)

c(deviance(modell1),modell1$edf)

# AIC-based variable selection method:

modell2 <- step(modell1) # The variable sex doesn't seem to be significant.

summary(modell2)

c(deviance(modell2),modell2$edf)

# Comparison:

anova(modell1,modell2) # Model 2 is better!

# An ordered probit model:

modell3 <- polr(happyF ~ money + sexF + loveF + workF, method="probit", happy)

summary(modell3)

c(deviance(modell3),modell3$edf)

# AIC-based variable selection method:

modell4 <- step(modell3) # The variable sex doesn't seem to be significant.

summary(modell4)

c(deviance(modell4),modell4$edf)

# Comparison:

anova(modell3, modell4) # Model 4 is better!

########## (b) ##########

# Interpret the variables!

########## (c) ##########

# Predict with the proportional odds model:

round(predict(modell2,data.frame(money=30, sexF="0", loveF="1",

workF="1"),type="probs"),3)

# Check the predictive performance for the proportional odds model:

skattningar1 <- predict(modell2)

table(skattningar1,happy$happy)

# Predict with the ordered probit model:

round(predict(modell4,data.frame(money=30, sexF="0", loveF="1",

workF="1"),type="probs"),3)

# Check the predictive performance for the ordered probit model:

skattningar2 <- predict(modell4)

table(skattningar2,happy$happy)

####################################


R-Code to exercise 8:7

####################################

####### Ch 8: Exercise 7 #######

####################################

library(faraway)

data(semicond)

help(semicond)

attributes(semicond)

head(semicond)

semicond

summary(semicond)

library(MASS)

library(lme4)

####### Analysis #######

str(semicond)

contrasts(semicond$ET) <- contr.treatment(4,1)

contrasts(semicond$position) <- contr.treatment(4,1)

contrasts(semicond$Wafer)

contrasts(semicond$Grp)

# Model with interaction between ET and position:

modell1 <- lmer(resistance ~ ET * position + (1|Grp), semicond)

summary(modell1)

# Check the fixed effects for significance:

anova(modell1) # The interaction is not significant since F < 1.

# Model without interaction:

modell2 <- lmer(resistance ~ ET + position + (1|Grp), semicond)

summary(modell2)

# Check the fixed effects for significance:

anova(modell2)

# Check diagnostic plots:

par(mfrow=c(2,2))

plot(fitted(modell1), resid(modell1), xlab="Fitted interaction", ylab="Residuals")

abline(0,0)

qqnorm(resid(modell1),main="Interaction")

plot(fitted(modell2), resid(modell2), xlab="Fitted no interaction", ylab="Residuals")

abline(0,0)

qqnorm(resid(modell2),main="No interaction")

####################################