
Fixing problems with the model

Transforming the data so that the simple linear regression model is okay for the transformed data.


Options for fixing problems with the model

• Abandon the simple linear regression model and find a more appropriate – but typically more complex – model.

• Transform the data so that the simple linear regression model works for the transformed data.


Abandoning the model

• If not linear: try a different function, like a quadratic (Ch. 7) or an exponential function (Ch. 13).

• If unequal error variances: use weighted least squares (Ch. 10).

• If error terms are not independent: try fitting a time series model (Ch. 12).

• If important predictor variables omitted: try fitting a multiple regression model (Ch. 6).

• If outlier: use robust estimation procedure (Ch. 10).


Choices for transforming the data

• Transform X values only.

• Transform Y values only.

• Transform both the X and the Y values.


Transforming the X values only


• Appropriate when non-linearity is the only problem with the model, that is, when normality and equal variance appear okay.

• Transforming the Y values would likely change the well-behaved error terms into badly-behaved error terms.

Example 1: Memory retention

• Subjects were asked to memorize a list of disconnected items and then to recall them at various times up to a week later.

• Predictor time = time, in minutes, since the list was initially memorized.

• Response prop = proportion of items recalled correctly.

 time    prop
    1    0.84
    5    0.71
   15    0.61
   30    0.56
   60    0.54
  120    0.47
  240    0.45
  480    0.38
  720    0.36
 1440    0.26
 2880    0.20
 5760    0.16
10080    0.08

Fitted line plot

[Figure: regression plot of prop versus time with the fitted line.]

prop = 0.525870 - 0.0000557 time

S = 0.152284   R-Sq = 57.1%   R-Sq(adj) = 53.2%

Residuals vs. fits plot

[Figure: residuals versus the fitted values (response is prop).]

Normal probability plot

[Figure: normal probability plot of the residuals (RESI1).]

W-test for Normality: R = 0.9751, P-Value (approx) > 0.1000

N = 13   StDev = 0.145801   Average = -0.0000000

Transform the X values

Change (“transform”) the predictor time to log10(time).

 time    prop    log10_time
    1    0.84    0.00000
    5    0.71    0.69897
   15    0.61    1.17609
   30    0.56    1.47712
   60    0.54    1.77815
  120    0.47    2.07918
  240    0.45    2.38021
  480    0.38    2.68124
  720    0.36    2.85733
 1440    0.26    3.15836
 2880    0.20    3.45939
 5760    0.16    3.76042
10080    0.08    4.00346

Fitted line plot using transformed X values

[Figure: regression plot of prop versus log10time with the fitted line.]

prop = 0.846415 - 0.182427 log10time

S = 0.0233881   R-Sq = 99.0%   R-Sq(adj) = 98.9%
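The two fits for the memory retention data can be checked with a few lines of code. The sketch below uses plain Python rather than Minitab; the helper name fit_line is made up for illustration. Run on the 13 data points from the slides, it should come close to the coefficients and R-Sq values reported in the two fitted line plots.

```python
import math

# Memory retention data from the slides:
# minutes since memorizing the list, and proportion of items recalled.
time = [1, 5, 15, 30, 60, 120, 240, 480, 720, 1440, 2880, 5760, 10080]
prop = [0.84, 0.71, 0.61, 0.56, 0.54, 0.47, 0.45, 0.38, 0.36, 0.26, 0.20, 0.16, 0.08]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1 - sse / sst


# Fit 1: untransformed predictor.
raw_fit = fit_line(time, prop)

# Fit 2: predictor transformed to log10(time); the response is left alone.
log_time = [math.log10(t) for t in time]
log_fit = fit_line(log_time, prop)

print("time:        b0=%.6f  b1=%.7f  R-Sq=%.1f%%" % (raw_fit[0], raw_fit[1], 100 * raw_fit[2]))
print("log10(time): b0=%.6f  b1=%.6f  R-Sq=%.1f%%" % (log_fit[0], log_fit[1], 100 * log_fit[2]))
```

Only the predictor is transformed here, so the response and its error terms stay on their original scale.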

Residuals vs. fits plot using transformed X values

[Figure: residuals versus the fitted values (response is prop).]

Normal probability plot using transformed X values

[Figure: normal probability plot of the residuals (RESI1).]

W-test for Normality: R = 0.9786, P-Value (approx) > 0.1000

N = 13   StDev = 0.0223924   Average = -0.0000000

Predicting new proportion

Estimated regression function:

$\hat{Y} = 0.846 - 0.182 \log_{10}(\text{time})$

Therefore, we predict the proportion of words recalled after 1000 minutes is:

$\hat{Y} = 0.846 - 0.182 \log_{10}(1000) = 0.846 - 0.182(3) = 0.30$

Predicted Values for New Observations

New Obs   Fit     SE Fit    95.0% CI          95.0% PI
1         0.299   0.00765   (0.282, 0.316)    (0.245, 0.353)

Values of Predictors for New Observations

New Obs   log10tim
1         3.00

We can be 95% confident that a person will recall between 24.5% and 35.3% of the words after 1000 minutes.
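The fit, its standard error, and both intervals can be reproduced by hand from the usual simple linear regression formulas. The sketch below is one way to do that in plain Python, not Minitab's own procedure. The critical value T_CRIT = 2.201 is an assumption taken from a t table (97.5th percentile, 11 degrees of freedom) rather than computed.

```python
import math

# Memory retention data from the slides.
time = [1, 5, 15, 30, 60, 120, 240, 480, 720, 1440, 2880, 5760, 10080]
prop = [0.84, 0.71, 0.61, 0.56, 0.54, 0.47, 0.45, 0.38, 0.36, 0.26, 0.20, 0.16, 0.08]

x = [math.log10(t) for t in time]  # transformed predictor
n = len(x)
x_bar = sum(x) / n
y_bar = sum(prop) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, prop))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error S, on n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, prop))
s = math.sqrt(sse / (n - 2))

x0 = math.log10(1000)  # new observation: 1000 minutes
fit = b0 + b1 * x0
se_fit = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
se_pred = math.sqrt(s ** 2 + se_fit ** 2)  # adds the variance of a new error term

T_CRIT = 2.201  # assumption: t table value, 97.5th percentile, 11 df

ci = (fit - T_CRIT * se_fit, fit + T_CRIT * se_fit)    # interval for the mean response
pi = (fit - T_CRIT * se_pred, fit + T_CRIT * se_pred)  # interval for a new response

print("Fit=%.3f  SE Fit=%.5f" % (fit, se_fit))
print("95%% CI=(%.3f, %.3f)  95%% PI=(%.3f, %.3f)" % (ci + pi))
```

The prediction interval is wider than the confidence interval because it also accounts for the scatter of an individual response around the mean.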


Transforming the Y values only


• Appropriate when non-normality and/or unequal variances are the problems.

• The transformation on Y may also help to “straighten out” a curved relationship.

Example 2: Gestation time and birth weight for mammals

• Predictor Birthwgt = birth weight, in kg, of mammal.

• Response Gestation = number of days until birth.

Mammal       Birthwgt   Gestation
Goat             2.75         155
Sheep            4.00         175
Deer             0.48         190
Porcupine        1.50         210
Bear             0.37         213
Hippo           50.00         243
Horse           30.00         340
Camel           40.00         380
Zebra           40.00         390
Giraffe         98.00         457
Elephant       113.00         670

Fitted line plot

[Figure: regression plot of Gestation versus Birthwgt with the fitted line.]

Gestation = 187.084 + 3.59137 Birthwgt

S = 66.0943   R-Sq = 83.9%   R-Sq(adj) = 82.1%

Residuals vs. fits plot

[Figure: residuals versus the fitted values (response is Gestation).]

Normal probability plot

[Figure: normal probability plot of the residuals (RESI1).]

W-test for Normality: R = 0.9703, P-Value (approx) > 0.1000

N = 11   StDev = 62.7025   Average = -0.0000000

Transform the Y values

Change (“transform”) the response Gestation to log10(Gestation).

Mammal       Birthwgt   Gestation   log10Gest
Goat             2.75         155     2.19033
Sheep            4.00         175     2.24304
Deer             0.48         190     2.27875
Porcupine        1.50         210     2.32222
Bear             0.37         213     2.32838
Hippo           50.00         243     2.38561
Horse           30.00         340     2.53148
Camel           40.00         380     2.57978
Zebra           40.00         390     2.59106
Giraffe         98.00         457     2.65992
Elephant       113.00         670     2.82607

Fitted line plot using transformed Y values

[Figure: regression plot of log10Gest versus Birthwgt with the fitted line.]

log10Gest = 2.29256 + 0.0045211 Birthwgt

S = 0.0939425   R-Sq = 80.3%   R-Sq(adj) = 78.1%

Residuals vs. fits plot using transformed Y values

[Figure: residuals versus the fitted values (response is log10Gest).]

Normal probability plot using transformed Y values

[Figure: normal probability plot of the residuals (RESI2).]

W-test for Normality: R = 0.9743, P-Value (approx) > 0.1000

N = 11   StDev = 0.0891217   Average = -0.0000000

Predicting new gestation

Estimated regression function:

$\log_{10}(\widehat{\text{Gest}}) = 2.29 + 0.0045\,\text{Birthwgt}$

Therefore, since:

$\log_{10}(\widehat{\text{Gest}}) = 2.29 + 0.0045(50) = 2.515$

we predict the gestation length of another mammal with a birth weight of 50 kg to be:

$\widehat{\text{Gest}} = 10^{\log_{10}(\widehat{\text{Gest}})} = 10^{2.515} = 327.3 \text{ days}$

Predicted Values for New Observations

New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         2.5186   0.0306   (2.4494, 2.5878)    (2.2951, 2.7421)

Values of Predictors for New Observations

New Obs   Birthwgt
1         50.0

Back-transforming the endpoints of the prediction interval:

$10^{2.2951} = 197.3$

$10^{2.7421} = 552.2$

We can be 95% confident that the gestation length for a new mammal with a birth weight of 50 kg will be between 197.3 and 552.2 days.
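The back-transformation step is easy to get wrong, so here is a small sketch in plain Python of the whole path: fit on the log10 scale, predict, then raise 10 to that power to return to days. The endpoints in pi_log are copied from the Minitab output quoted in the slides and are not recomputed.

```python
import math

# Mammal data from the slides: birth weight (kg) and gestation (days).
birthwgt = [2.75, 4.00, 0.48, 1.50, 0.37, 50.00, 30.00, 40.00, 40.00, 98.00, 113.00]
gestation = [155, 175, 190, 210, 213, 243, 340, 380, 390, 457, 670]

# Transform the response only.
log_gest = [math.log10(g) for g in gestation]

n = len(birthwgt)
x_bar = sum(birthwgt) / n
y_bar = sum(log_gest) / n
sxx = sum((x - x_bar) ** 2 for x in birthwgt)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(birthwgt, log_gest))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Predict on the log10 scale, then undo the transformation.
log_fit_50 = b0 + b1 * 50
gest_50 = 10 ** log_fit_50

# Prediction interval endpoints on the log10 scale, copied from the Minitab output.
pi_log = (2.2951, 2.7421)
pi_days = tuple(10 ** endpoint for endpoint in pi_log)

print("log10 fit at 50 kg: %.4f" % log_fit_50)
print("predicted gestation: %.1f days" % gest_50)
print("95%% PI in days: (%.1f, %.1f)" % pi_days)
```

With unrounded coefficients the back-transformed prediction comes out a few days higher than the 327.3 obtained from the rounded equation; the difference is rounding only.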


Transforming both the X and Y values


• Appropriate when the error terms are not normal, have unequal variances, and the function is not linear.

• Transforming the Y values corrects the problems with the error terms (and may help the non-linearity).

• Transforming the X values corrects the non-linearity.

Example 3: Diameter (inches) and volume (cu. ft.) of 70 shortleaf pines

[Figure: regression plot of Volume versus Diameter with the fitted line.]

Volume = -41.5681 + 6.83672 Diameter

S = 9.87485   R-Sq = 89.3%   R-Sq(adj) = 89.1%

Residuals vs. fits plot

[Figure: standardized residuals versus the fitted values (response is Volume).]

Normal probability plot

[Figure: normal probability plot of the standardized residuals (SRES1).]

W-test for Normality: R = 0.9409, P-Value (approx) < 0.0100

N = 70   StDev = 1.02852   Average = 0.0085024

Transform the Y values only

Transform response volume to loge(volume).

Diameter   Volume    logVol
     4.4      2.0   0.69315
     4.6      2.2   0.78846
     5.0      3.0   1.09861
     5.1      4.3   1.45862
     5.1      3.0   1.09861
     5.2      2.9   1.06471
     5.2      3.5   1.25276
     5.5      3.4   1.22378
     5.5      5.0   1.60944
     5.6      7.2   1.97408
     5.9      6.4   1.85630
     5.9      5.6   1.72277
     7.5      7.7   2.04122
     7.6     10.3   2.33214
… and so on …

Fitted line plot using transformed Y values

[Figure: regression plot of logVol versus Diameter with the fitted line.]

logVol = 0.451703 + 0.239531 Diameter

S = 0.322919   R-Sq = 90.5%   R-Sq(adj) = 90.4%

Residuals vs. fits plot using transformed Y values

[Figure: standardized residuals versus the fitted values (response is logVol).]

Normal probability plot using transformed Y values

[Figure: normal probability plot of the standardized residuals (SRES4).]

W-test for Normality: R = 0.9610, P-Value (approx) < 0.0100

N = 70   StDev = 1.01888   Average = -0.0077969

Transform both the X and Y values

Transform predictor diameter to loge(diameter).

Transform response volume to loge(volume).

Diameter   Volume   logDiam    logVol
     4.4      2.0   1.48160   0.69315
     4.6      2.2   1.52606   0.78846
     5.0      3.0   1.60944   1.09861
     5.1      4.3   1.62924   1.45862
     5.1      3.0   1.62924   1.09861
     5.2      2.9   1.64866   1.06471
     5.2      3.5   1.64866   1.25276
     5.5      3.4   1.70475   1.22378
     5.5      5.0   1.70475   1.60944
     5.6      7.2   1.72277   1.97408
     5.9      6.4   1.77495   1.85630
     5.9      5.6   1.77495   1.72277
     7.5      7.7   2.01490   2.04122
     7.6     10.3   2.02815   2.33214
… and so on …
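The two transformed columns are just natural logs of the original ones. The sketch below recomputes them in plain Python for the 14 trees listed on the slide. The other 56 trees are not shown there, so no attempt is made to refit the regression, and fitting on this subset would not reproduce the reported equation.

```python
import math

# The 14 trees listed on the slide (the full data set has 70 trees).
diameter = [4.4, 4.6, 5.0, 5.1, 5.1, 5.2, 5.2, 5.5, 5.5, 5.6, 5.9, 5.9, 7.5, 7.6]
volume = [2.0, 2.2, 3.0, 4.3, 3.0, 2.9, 3.5, 3.4, 5.0, 7.2, 6.4, 5.6, 7.7, 10.3]

# math.log is the natural log (base e), matching the loge transformation.
log_diam = [math.log(d) for d in diameter]
log_vol = [math.log(v) for v in volume]

print("Diameter  Volume  logDiam   logVol")
for d, v, ld, lv in zip(diameter, volume, log_diam, log_vol):
    print("%8.1f  %6.1f  %.5f  %.5f" % (d, v, ld, lv))
```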

Fitted line plot using transformed X and Y values

[Figure: regression plot of logVol versus logDiam with the fitted line.]

logVol = -2.87179 + 2.56442 logDiam

S = 0.170263   R-Sq = 97.4%   R-Sq(adj) = 97.3%

Residual plot using transformed X and Y values

[Figure: standardized residuals versus the fitted values (response is logVol).]

Normal probability plot using transformed X and Y values

[Figure: normal probability plot of the standardized residuals (SRES5).]

W-test for Normality: R = 0.9896, P-Value (approx) > 0.1000

N = 70   StDev = 1.00930   Average = -0.0028401


Transformation strategies


Effects of transformations

• Transforming the Y values corrects the problems with the error terms – and may simultaneously help non-linearity.

• Transforming the X values can only correct non-linearity.


Transformation strategies

• If the form of the relationship between x and y is known, then it may be possible to find a linearizing transformation analytically.

• Fitting a regression model empirically generally requires trial and error – try different transformations to see which does best.


Transformation strategies

Finding a linearizing transformation analytically

When the functional relationship is known to be of the power form

If the relationship between x and y is of the power form:

$y = \alpha x^{\beta}$

taking the log of both sides transforms it into a linear form:

$\log_e y = \log_e \alpha + \beta \log_e x$
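As a quick numerical check of the power-form result, the sketch below builds noise-free data from a power relationship and fits a straight line to the logged values. The constants ALPHA and BETA and the x values are hypothetical, chosen only for illustration. The slope of the log-log fit should recover the exponent, and the exponential of the intercept should recover the multiplier.

```python
import math

# Hypothetical power relationship y = ALPHA * x**BETA (values chosen for illustration).
ALPHA, BETA = 2.0, 1.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [ALPHA * xi ** BETA for xi in x]

# Take natural logs of both variables.
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]

# Least squares line for log_y on log_x.
n = len(x)
mx = sum(log_x) / n
my = sum(log_y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(log_x, log_y)) / sum((a - mx) ** 2 for a in log_x)
intercept = my - slope * mx

beta_hat = slope                 # slope on the log-log scale is the exponent
alpha_hat = math.exp(intercept)  # intercept is the log of the multiplier

print("beta_hat=%.4f  alpha_hat=%.4f" % (beta_hat, alpha_hat))
```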

When the functional relationship is known to be of the exponential form

If the relationship between x and y is of the exponential form:

$y = \alpha e^{\beta x}$

taking the log of both sides transforms it into a linear form:

$\log_e y = \log_e \alpha + \beta x$
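The exponential case can be checked the same way, except that only the response is logged. In the sketch below the constants ALPHA and BETA and the x values are again hypothetical. A straight-line fit of the logged response on x should return the rate as the slope.

```python
import math

# Hypothetical exponential relationship y = ALPHA * exp(BETA * x).
ALPHA, BETA = 3.0, -0.4
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [ALPHA * math.exp(BETA * xi) for xi in x]

# Log the response only; x stays on its original scale.
log_y = [math.log(yi) for yi in y]

# Least squares line for log_y on x.
n = len(x)
mx = sum(x) / n
my = sum(log_y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, log_y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

beta_hat = slope                 # slope of log(y) on x is the rate
alpha_hat = math.exp(intercept)  # intercept is the log of the multiplier

print("beta_hat=%.4f  alpha_hat=%.4f" % (beta_hat, alpha_hat))
```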


Transformation strategies

Finding a transformation by trial and error

Family of power transformations

The most common transformation involves transforming the response by raising it to some power λ. That is:

$y^{*} = y^{\lambda}$

Most commonly, for interpretation reasons, λ is a number between -1 and 2, such as -1, -0.5, 0, 0.5, (1), 1.5, and 2. (λ = 1 leaves the response unchanged.)

When λ = 0, the transformation is taken to be the log transformation. That is:

$y^{*} = \log_e y$
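A minimal sketch of this family as a function, in plain Python. The name power_transform is made up for illustration, and restricting the input to positive values is an assumption made here so that the log and negative powers are always defined.

```python
import math


def power_transform(y, lam):
    """Return the power transformation of each value in y.

    lam == 0 is taken to be the natural log, as in the slides.
    Assumption: all values must be positive.
    """
    if any(value <= 0 for value in y):
        raise ValueError("power_transform expects positive values")
    if lam == 0:
        return [math.log(value) for value in y]
    return [value ** lam for value in y]


response = [1.0, 4.0, 9.0]
for lam in (-1, -0.5, 0, 0.5, 1, 1.5, 2):
    transformed = power_transform(response, lam)
    print("lambda=%4.1f ->" % lam, ["%.4f" % t for t in transformed])
```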

Effect of loge transformation

[Figures: the natural log function f(x) = loge(x), plotted for x from 0 to 1000 and, zoomed in, for x from 0 to 5.]


Some guidelines for specifying λ

• To make smaller values more spread out, use a smaller λ.

• To make larger values more spread out, use a larger λ.
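One way to see these guidelines at work is to track three values, 1, 10 and 100, through several choices of λ and compare the gap between the two smaller values with the gap between the two larger ones. The sketch below does that in plain Python; the helper name gap_ratio and the three values are made up for illustration. A ratio above 1 means the smaller values ended up more spread out than the larger ones.

```python
import math


def power_transform(value, lam):
    """Power transformation of one positive value; lam == 0 means natural log."""
    return math.log(value) if lam == 0 else value ** lam


def gap_ratio(lam, low=1.0, mid=10.0, high=100.0):
    """Gap between the two smaller values divided by gap between the two larger ones."""
    t_low, t_mid, t_high = (power_transform(v, lam) for v in (low, mid, high))
    return abs(t_mid - t_low) / abs(t_high - t_mid)


for lam in (-1, -0.5, 0, 0.5, 1, 1.5, 2):
    print("lambda=%4.1f  lower gap / upper gap = %.4f" % (lam, gap_ratio(lam)))
```

The ratio falls steadily as λ increases, which matches the two guidelines: smaller λ stretches the low end, larger λ stretches the high end.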

Possible transformations

[Figures: four diagrams, each showing a curved relationship f(x) between x and y, annotated with candidate transformations.]

• Diagram 1: labels include x^2, x^3, log y, and 1/y.

• Diagram 2: labels include y^2, y^3, log x, and 1/x.

• Diagram 3: labels include log x, 1/x, log y, and 1/y.

• Diagram 4: labels include x^2, x^3, y^2, and y^3.


Transformation strategies

Variance stabilizing transformations

Common variance stabilizing transformations

If the response is a Poisson count, so that the variance is proportional to the mean, use the square root transformation:

$y^{*} = y^{1/2} = \sqrt{y}$

If the response is a binomial proportion, use the arcsine square root transformation:

$\hat{p}^{*} = \sin^{-1}\sqrt{\hat{p}}$

If the variance is proportional to the mean squared, use the natural log transformation:

$y^{*} = \log_e y$

If the variance is proportional to the mean to the fourth power, use the reciprocal transformation:

$y^{*} = 1/y$
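One way to see where these choices come from is a first-order (delta method) approximation. This is a standard argument offered as a sketch; it does not appear on the slides, and the symbols g, μ and c are notation assumed here. The idea is to pick the transformation g so that the approximate variance of g(Y) no longer depends on the mean.

```latex
% First-order approximation to the variance of a transformed response.
\[
\operatorname{Var}[g(Y)] \approx [g'(\mu)]^{2}\,\operatorname{Var}(Y), \qquad \mu = E(Y).
\]
% Choose g so that the right-hand side is constant in \mu
% (each g is determined only up to a constant multiple).
\begin{align*}
\operatorname{Var}(Y) = c\,\mu &\;\Rightarrow\; g'(\mu) \propto \mu^{-1/2} &&\Rightarrow\; g(y) = \sqrt{y}, \\
\operatorname{Var}(Y) = c\,\mu^{2} &\;\Rightarrow\; g'(\mu) \propto \mu^{-1} &&\Rightarrow\; g(y) = \log_e y, \\
\operatorname{Var}(Y) = c\,\mu^{4} &\;\Rightarrow\; g'(\mu) \propto \mu^{-2} &&\Rightarrow\; g(y) = 1/y, \\
\operatorname{Var}(\hat{p}) = p(1-p)/n &\;\Rightarrow\; g'(p) \propto [p(1-p)]^{-1/2} &&\Rightarrow\; g(\hat{p}) = \sin^{-1}\sqrt{\hat{p}}.
\end{align*}
```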

Transforming data in Minitab

• Select Calc >> Calculator …

• In the box labeled “Store result in variable,” tell Minitab in which column (variable) you want the transformed data stored.

• Type (input) the expression for the desired transformation in the box labeled Expression. (Use the available functions.)

• Select OK. The data will appear in the column of the worksheet that you specified.