APPLIED REGRESSION MODELING

Second Edition

IAIN PARDOE
Thompson Rivers University

WILEY
A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2012 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Pardoe, Iain, 1970-
  Applied regression modeling [electronic resource] / Iain Pardoe. -- 2nd ed.
  1 online resource.
  Includes index.
  Description based on print version record and CIP data provided by publisher; resource not viewed.
  ISBN 978-1-118-34502-3 (pdf)
  ISBN 978-1-118-34503-0 (mobi)
  ISBN 978-1-118-34504-7 (epub)
  ISBN 978-1-118-09728-1 (hardback) (print)
  1. Regression analysis. 2. Statistics. I. Title.
  QA278.2
  519.536--dc23    2012006617

Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1

To Tanya, Bethany, and Sierra

CONTENTS

Preface
Acknowledgments
Introduction
  I.1 Statistics in practice
  I.2 Learning statistics

1 Foundations
  1.1 Identifying and summarizing data
  1.2 Population distributions
  1.3 Selecting individuals at random - probability
  1.4 Random sampling
    1.4.1 Central limit theorem - normal version
    1.4.2 Central limit theorem - t-version
  1.5 Interval estimation
  1.6 Hypothesis testing
    1.6.1 The rejection region method
    1.6.2 The p-value method
    1.6.3 Hypothesis test errors
  1.7 Random errors and prediction
  1.8 Chapter summary
  Problems

2 Simple linear regression
  2.1 Probability model for X and Y
  2.2 Least squares criterion
  2.3 Model evaluation
    2.3.1 Regression standard error
    2.3.2 Coefficient of determination - R²
    2.3.3 Slope parameter
  2.4 Model assumptions
    2.4.1 Checking the model assumptions
    2.4.2 Testing the model assumptions
  2.5 Model interpretation
  2.6 Estimation and prediction
    2.6.1 Confidence interval for the population mean, E(Y)
    2.6.2 Prediction interval for an individual Y-value
  2.7 Chapter summary
    2.7.1 Review example
  Problems

3 Multiple linear regression
  3.1 Probability model for (X1, X2, ...) and Y
  3.2 Least squares criterion
  3.3 Model evaluation
    3.3.1 Regression standard error
    3.3.2 Coefficient of determination - R²
    3.3.3 Regression parameters - global usefulness test
    3.3.4 Regression parameters - nested model test
    3.3.5 Regression parameters - individual tests
  3.4 Model assumptions
    3.4.1 Checking the model assumptions
    3.4.2 Testing the model assumptions
  3.5 Model interpretation
  3.6 Estimation and prediction
    3.6.1 Confidence interval for the population mean, E(Y)
    3.6.2 Prediction interval for an individual Y-value
  3.7 Chapter summary
  Problems

4 Regression model building I
  4.1 Transformations
    4.1.1 Natural logarithm transformation for predictors
    4.1.2 Polynomial transformation for predictors
    4.1.3 Reciprocal transformation for predictors
    4.1.4 Natural logarithm transformation for the response
    4.1.5 Transformations for the response and predictors
  4.2 Interactions
  4.3 Qualitative predictors
    4.3.1 Qualitative predictors with two levels
    4.3.2 Qualitative predictors with three or more levels
  4.4 Chapter summary
  Problems

5 Regression model building II
  5.1 Influential points
    5.1.1 Outliers
    5.1.2 Leverage
    5.1.3 Cook's distance
  5.2 Regression pitfalls
    5.2.1 Nonconstant variance
    5.2.2 Autocorrelation
    5.2.3 Multicollinearity
    5.2.4 Excluding important predictor variables
    5.2.5 Overfitting
    5.2.6 Extrapolation
    5.2.7 Missing data
    5.2.8 Power and sample size
  5.3 Model building guidelines
  5.4 Model selection
  5.5 Model interpretation using graphics
  5.6 Chapter summary
  Problems

6 Case studies
  6.1 Home prices
    6.1.1 Data description
    6.1.2 Exploratory data analysis
    6.1.3 Regression model building
    6.1.4 Results and conclusions
    6.1.5 Further questions
  6.2 Vehicle fuel efficiency
    6.2.1 Data description
    6.2.2 Exploratory data analysis
    6.2.3 Regression model building
    6.2.4 Results and conclusions
    6.2.5 Further questions
  6.3 Pharmaceutical patches
    6.3.1 Data description
    6.3.2 Exploratory data analysis
    6.3.3 Regression model building
    6.3.4 Model diagnostics
    6.3.5 Results and conclusions
    6.3.6 Further questions

7 Extensions
  7.1 Generalized linear models
    7.1.1 Logistic regression
    7.1.2 Poisson regression
  7.2 Discrete choice models
  7.3 Multilevel models
  7.4 Bayesian modeling
    7.4.1 Frequentist inference
    7.4.2 Bayesian inference

Appendix A: Computer software help
  Problems

Appendix B: Critical values for t-distributions

Appendix C: Notation and formulas
  C.1 Univariate data
  C.2 Simple linear regression
  C.3 Multiple linear regression

Appendix D: Mathematics refresher
  D.1 The natural logarithm and exponential functions
  D.2 Rounding and accuracy

Appendix E: Answers for selected problems

References

Glossary

Index

PREFACE

The first edition of this book was developed from class notes written for an applied regression course taken primarily by undergraduate business majors in their junior year at the University of Oregon. Since the regression methods and techniques covered in the book have broad application in many fields, not just business, this second edition widens its scope to reflect this. Details of the major changes for the second edition are included below.

The book is suitable for any undergraduate statistics course in which regression analysis is the main focus. A recommended prerequisite is an introductory probability and statistics course. It would also be suitable for use in an applied regression course for non-statistics major graduate students, including MBAs, and for vocational, professional, or other non-degree courses. Mathematical details have deliberately been kept to a minimum, and the book does not contain any calculus. Instead, emphasis is placed on applying regression analysis to data using statistical software, and understanding and interpreting results.

Chapter 1 reviews essential introductory statistics material, while Chapter 2 covers simple linear regression. Chapter 3 introduces multiple linear regression, while Chapters 4 and 5 provide guidance on building regression models, including transforming variables, using interactions, incorporating qualitative information, and using regression diagnostics. Each of these chapters includes homework problems, mostly based on analyzing real datasets provided with the book. Chapter 6 contains three in-depth case studies, while Chapter 7 introduces extensions to linear regression and outlines some related topics. The appendices contain a list of statistical software packages that can be used to carry out all the analyses covered in the book (each with detailed instructions available from the book website), a table of critical values for the t-distribution, notation and formulas used throughout the book, a glossary of important terms, a short mathematics refresher, and brief answers to selected homework problems.

The first five chapters of the book have been used successfully in quarter-length courses at a number of institutions. An alternative approach for a quarter-length course would be to skip some of the material in Chapters 4 and 5 and substitute one or more of the case studies in Chapter 6, or briefly introduce some of the topics in Chapter 7. A semester-length course could comfortably cover all the material in the book.

The website for the book, which can be found at www.iainpardoe.com/arm2e/, contains supplementary material designed to help both the instructor teaching from this book and the student learning from it. There you'll find all the datasets used for examples and homework problems in formats suitable for most statistical software packages, as well as detailed instructions for using the major packages, including SPSS, Minitab, SAS, JMP, Data Desk, EViews, Stata, Statistica, R, and S-PLUS. There is also some information on using the Microsoft Excel spreadsheet package for some of the analyses covered in the book (dedicated statistical software is necessary to carry out all of the analyses). The website also includes information on obtaining a solutions manual containing complete answers to all the homework problems, as well as further ideas for organizing class time around the material in the book.

The book contains the following stylistic conventions:

• When displaying calculated values, the general approach is to be as accurate as possible when it matters (such as in intermediate calculations for problems with many steps), but to round appropriately when convenient or when reporting final results for real-world questions. Displayed results from statistical software use the default rounding employed in R throughout.

• In the author's experience, many students find some traditional approaches to notation and terminology a barrier to learning and understanding. Thus, some traditions have been altered to improve ease of understanding. These include: using familiar Roman letters in place of unfamiliar Greek letters [e.g., E(Y) rather than μ and b rather than β]; replacing the nonintuitive Ȳ for the sample mean of Y with m_Y; using NH and AH for null hypothesis and alternative hypothesis, respectively, rather than the usual H0 and Ha.
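The rounding convention above can be illustrated with a small sketch. The data values and the NH value of 13 are invented purely for illustration and do not come from the book, and Python is used here only for convenience. The sketch computes a t-statistic twice: once keeping full precision until the end, and once rounding every intermediate quantity to one decimal place.

```python
import math
import statistics

# Hypothetical sample values, invented purely for illustration.
y = [12.7, 15.3, 11.9, 14.2, 13.8, 16.1, 12.4, 15.0]
nh_value = 13.0  # hypothetical value of E(Y) under the null hypothesis (NH)

n = len(y)
m_y = statistics.mean(y)    # sample mean of Y
s_y = statistics.stdev(y)   # sample standard deviation (n - 1 divisor)

# Approach 1: keep full precision in intermediate steps.
se_full = s_y / math.sqrt(n)
t_full = (m_y - nh_value) / se_full

# Approach 2: round each intermediate quantity to one decimal place.
se_early = round(round(s_y, 1) / round(math.sqrt(n), 1), 1)
t_early = (round(m_y, 1) - nh_value) / se_early

# Round only when reporting the final results.
print(f"rounded only at the end: t = {t_full:.2f}")
print(f"rounded at every step:   t = {t_early:.2f}")
```

The two printed values differ in the second decimal place, which is the kind of drift that keeping intermediate calculations accurate is meant to avoid.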

Major changes for the second edition

• The first edition of this book was used in the regression analysis course run by Statistics.com from 2008 to 2012. The lively discussion boards provided an invaluable source for suggestions for changes to the book. This edition clarifies and expands on concepts that students found challenging and addresses every question posed in those discussions.

• The foundational material on interval estimation has been rewritten to clarify the mathematics.

• There is new material on testing model assumptions, transformations, indicator variables, nonconstant variance, autocorrelation, power and sample size, model building, and model selection.

• As far as possible, I've replaced outdated data examples with more recent data, and also used more appropriate data examples for particular topics (e.g., autocorrelation). In total, about 40% of the data files have been replaced.

• Most of the data examples now use descriptive names for variables rather than generic letters such as Y and X.

• As in the first edition, this edition uses mathematics to explain methods and techniques only where necessary, and formulas are used within the text only when they are instructive. However, this edition also includes additional formulas in optional sections to aid those students who can benefit from more mathematical detail.

• I've added many more end-of-chapter problems. In total, the number of problems has increased by nearly 25%.

• I've updated and added new references, nearly doubling the total number of references.

• I've added a third case study to Chapter 6.

• The first edition included detailed computer software instructions for five major software packages (SPSS, Minitab, SAS Analyst, R/S-PLUS, and Excel) in an appendix. This appendix has been dropped from this edition; instead, instructions for newer software versions and other packages (e.g., JMP and Stata) are now just updated on the book website.

IAIN PARDOE

Nelson, British Columbia

April 2012