Discrete Multivariate Analysis

Analysis of Multivariate Categorical Data

References

1. Fienberg, S. (1980), Analysis of Cross-Classified Data, MIT Press, Cambridge, Mass.

2. Fingleton, B. (1984), Models for Category Counts, Cambridge University Press.

3. Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.

Example 1

Data Set #1 - A two-way frequency table

Serum                  Systolic Blood Pressure
Cholesterol     <127   127-146   147-166    167+   Total
<200             117       121        47      22     307
200-219           85        98        43      20     246
220-259          119       209        68      43     439
260+              67        99        46      33     245
Total            388       527       204     118    1237

In this study we examine n = 1237 individuals, measuring X, Systolic Blood Pressure, and Y, Serum Cholesterol.

Example 2

The following data were taken from a study of parole success involving 5587 parolees in Ohio between 1965 and 1972 (a ten percent sample of all parolees during this period).

The study involved a dichotomous response Y, based on a one-year follow-up:

– Success (no major parole violation), or
– Failure (returned to prison either as a technical violator or with a new conviction).

The predictors of parole success included are:

1. Type of committed offence (Person offense or Other offense),
2. Age (25 or Older or Under 25),
3. Prior Record (No prior sentence or Prior Sentence), and
4. Drug or Alcohol Dependency (No drug or alcohol dependency or Drug and/or alcohol dependency).

• The data were randomly split into two parts. The counts for each part are displayed in the table, with those for the second part in parentheses.

• The second part of the data was set aside for a validation study of the model to be fitted to the first part.

Table (counts for the second part of the data in parentheses)

                 No drug or alcohol dependency        Drug and/or alcohol dependency
                 25 or older       Under 25           25 or older       Under 25
                 Person   Other    Person   Other     Person   Other    Person   Other
                 offense  offense  offense  offense   offense  offense  offense  offense

No Prior Sentence of Any Kind
  Success           48       34       37       49        48       28       35       57
                   (44)     (34)     (29)     (58)      (47)     (38)     (37)     (53)
  Failure            1        5        7       11         3        8        5       18
                    (1)      (7)      (7)      (5)       (1)      (2)      (4)     (24)

Prior Sentence
  Success          117      259      131      319       197      435      107      291
                  (111)    (253)    (131)    (320)     (202)    (392)    (103)    (294)
  Failure           23       61       20       89        38      194       27      101
                   (27)     (55)     (25)     (93)      (46)    (215)     (34)    (102)

Multiway Frequency Tables

• Two-Way: variables A and B

• Three-Way: variables A, B and C

• Four-Way: variables A, B, C and D

[Figure: diagrams of two-way, three-way and four-way tables]

Analysis of a Two-way Frequency Table:

Frequency Distribution (Serum Cholesterol and Systolic Blood Pressure)

Serum                  Systolic Blood Pressure
Cholesterol     <127   127-146   147-166    167+   Total
<200             117       121        47      22     307
200-219           85        98        43      20     246
220-259          119       209        68      43     439
260+              67        99        46      33     245
Total            388       527       204     118    1237

Joint and Marginal Distributions (Serum Cholesterol and Systolic Blood Pressure), in percent

Serum                  Systolic Blood Pressure               Marginal distn
Cholesterol     <127   127-146   147-166    167+             (Serum Chol.)
<200            9.46      9.78      3.80    1.78                 24.82
200-219         6.87      7.92      3.48    1.62                 19.89
220-259         9.62     16.90      5.50    3.48                 35.49
260+            5.42      8.00      3.72    2.67                 19.81
Marginal
distn (BP)     31.37     42.60     16.49    9.54                100.00

The marginal distributions allow you to look at the effect of one variable, ignoring the other. The joint distribution allows you to look at the two variables simultaneously.

Conditional Distributions (Systolic Blood Pressure given Serum Cholesterol), in percent

The conditional distribution allows you to look at the effect of one variable when the other variable is held fixed or known.

Serum                  Systolic Blood Pressure
Cholesterol     <127   127-146   147-166    167+    Total
<200           38.11     39.41     15.31    7.17   100.00
200-219        34.55     39.84     17.48    8.13   100.00
220-259        27.11     47.61     15.49    9.79   100.00
260+           27.35     40.41     18.78   13.47   100.00
Marginal
distn (BP)     31.37     42.60     16.49    9.54   100.00

Conditional Distributions (Serum Cholesterol given Systolic Blood Pressure), in percent

Serum                  Systolic Blood Pressure               Marginal distn
Cholesterol     <127   127-146   147-166    167+             (Serum Chol.)
<200           30.15     22.96     23.04   18.64                 24.82
200-219        21.91     18.60     21.08   16.95                 19.89
220-259        30.67     39.66     33.33   36.44                 35.49
260+           17.27     18.79     22.55   27.97                 19.81
Total         100.00    100.00    100.00  100.00                100.00
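The joint, marginal and conditional percentages tabulated above can be recomputed from the raw counts. The following is a minimal sketch in plain Python; the function name `percentages` and the variable names are illustrative choices, not part of the original notes.

```python
# Observed counts from the two-way table:
# rows = serum cholesterol (<200, 200-219, 220-259, 260+),
# columns = systolic blood pressure (<127, 127-146, 147-166, 167+).
counts = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]


def percentages(table):
    """Return joint, row-marginal, column-marginal and row-conditional percentages."""
    n = sum(sum(row) for row in table)
    joint = [[100.0 * x / n for x in row] for row in table]
    row_marginal = [100.0 * sum(row) / n for row in table]
    col_marginal = [100.0 * sum(col) / n for col in zip(*table)]
    # Distribution of the column variable within each row category.
    cond_given_row = [[100.0 * x / sum(row) for x in row] for row in table]
    return joint, row_marginal, col_marginal, cond_given_row


joint, row_marginal, col_marginal, cond_given_row = percentages(counts)

# Transposing the table gives cholesterol given blood pressure.
transposed = [list(col) for col in zip(*counts)]
cond_given_col = percentages(transposed)[3]

print([round(p, 2) for p in cond_given_row[0]])  # blood pressure given cholesterol < 200
print([round(p, 2) for p in cond_given_col[0]])  # cholesterol given blood pressure < 127
```

Comparing the rows of `cond_given_row` with `col_marginal` is the numerical counterpart of comparing each conditional distribution with the marginal distribution.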

[Graph: conditional distributions of Systolic Blood Pressure (<127, 127-146, 147-166, 167+) given Serum Cholesterol (<200, 200-219, 220-259, 260+), plotted together with the marginal distribution; vertical axis in percent, from 10% to 50%.]

Notation:

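Let $x_{ij}$ denote the frequency (no. of cases) where X (row variable) is i and Y (column variable) is j.

$$x_{i\cdot} = R_i = \sum_{j=1}^{c} x_{ij}$$

$$x_{\cdot j} = C_j = \sum_{i=1}^{r} x_{ij}$$

$$x_{\cdot\cdot} = N = \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij} = \sum_{i=1}^{r} x_{i\cdot} = \sum_{j=1}^{c} x_{\cdot j}$$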

Different Models

Let $\pi_{ij} = P[X = i, Y = j]$.

The Multinomial Model: Here the total number of cases N is fixed and the $x_{ij}$ follow a multinomial distribution with parameters $\pi_{ij}$:

$$f(x_{11}, x_{12}, \dots, x_{rc}) = \frac{N!}{x_{11}!\,x_{12}!\cdots x_{rc}!}\;\pi_{11}^{x_{11}}\,\pi_{12}^{x_{12}}\cdots\pi_{rc}^{x_{rc}}$$

$$E[x_{ij}] = \mu_{ij} = N\pi_{ij}$$

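The Product Multinomial Model: Here the row (or column) totals $R_i$ are fixed and, for a given row i, the $x_{ij}$ follow a multinomial distribution with parameters $\pi_{j|i}$:

$$f(x_{11}, x_{12}, \dots, x_{rc}) = \prod_{i=1}^{r} \frac{R_i!}{x_{i1}!\,x_{i2}!\cdots x_{ic}!}\;\pi_{1|i}^{x_{i1}}\,\pi_{2|i}^{x_{i2}}\cdots\pi_{c|i}^{x_{ic}}$$

$$E[x_{ij}] = \mu_{ij} = R_i\,\pi_{j|i}$$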

The Poisson Model: In this case we observe over a fixed period of time and all counts in the table (including row, column and overall totals) follow a Poisson distribution. Let $\mu_{ij}$ denote the mean of $x_{ij}$:

$$E[x_{ij}] = \mu_{ij}, \qquad f(x_{ij}) = \frac{\mu_{ij}^{x_{ij}}}{x_{ij}!}\,e^{-\mu_{ij}}$$

$$f(x_{11}, x_{12}, \dots, x_{rc}) = \prod_{i=1}^{r}\prod_{j=1}^{c} \frac{\mu_{ij}^{x_{ij}}}{x_{ij}!}\,e^{-\mu_{ij}}$$

Independence

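Multinomial Model: if X and Y are independent,

$$\pi_{ij} = P[X = i, Y = j] = P[X = i]\,P[Y = j] = \pi_{i\cdot}\,\pi_{\cdot j}$$

and

$$\mu_{ij} = N\pi_{ij} = N\,\pi_{i\cdot}\,\pi_{\cdot j}$$

The estimated expected frequency in cell (i,j) in the case of independence is:

$$\hat{m}_{ij} = \hat{\mu}_{ij} = N\,\hat{\pi}_{i\cdot}\,\hat{\pi}_{\cdot j} = N\,\frac{x_{i\cdot}}{N}\,\frac{x_{\cdot j}}{N} = \frac{x_{i\cdot}\,x_{\cdot j}}{N} = \frac{R_i\,C_j}{N}$$

The same can be shown for the other two models, the Product Multinomial model and the Poisson model; namely, the estimated expected frequency in cell (i,j) in the case of independence is:

$$\hat{m}_{ij} = \hat{\mu}_{ij} = \frac{R_i\,C_j}{N} = \frac{x_{i\cdot}\,x_{\cdot j}}{x_{\cdot\cdot}}$$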

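Standardized residuals are defined for each cell:

$$r_{ij} = \frac{x_{ij} - \hat{m}_{ij}}{\sqrt{\hat{m}_{ij}}}$$

The Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$

The Chi-Square test for independence

Reject H0: independence if

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}} > \chi^2_{\alpha}, \qquad df = (r-1)(c-1)$$

where $\chi^2_{\alpha}$ is the upper-tail critical value of the chi-square distribution at significance level $\alpha$.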

Table: Expected frequencies, (observed frequencies) and standardized residuals

Serum                  Systolic Blood Pressure
Cholesterol     <127    127-146   147-166    167+    Total
<200           96.29     130.79     50.63   29.29      307
               (117)      (121)      (47)    (22)
                2.11      -0.86     -0.51   -1.35

200-219        77.16     104.80     40.57   23.47      246
                (85)       (98)      (43)    (20)
                0.89      -0.66      0.38   -0.72

220-259       137.70     187.03     72.40   41.88      439
               (119)      (209)      (68)    (43)
               -1.59       1.61     -0.52    0.17

260+           76.85     104.38     40.40   23.37      245
                (67)       (99)      (46)    (33)
               -1.12      -0.53      0.88    1.99

Total            388        527       204     118     1237

$\chi^2$ = 20.85, d.f. = 9 (p = 0.0133)
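The expected frequencies, standardized residuals and chi-square statistic for the blood pressure by cholesterol table can be reproduced directly from the formulas $\hat{m}_{ij} = R_i C_j / N$ and $r_{ij} = (x_{ij} - \hat{m}_{ij})/\sqrt{\hat{m}_{ij}}$. This is a minimal sketch in plain Python; the function name `independence_fit` is an illustrative choice.

```python
import math

# rows = serum cholesterol, columns = systolic blood pressure
observed = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]


def independence_fit(table):
    """Expected counts, standardized residuals, chi-square and d.f. under independence."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # m_ij = R_i * C_j / N
    expected = [[ri * cj / n for cj in col_tot] for ri in row_tot]
    # r_ij = (x_ij - m_ij) / sqrt(m_ij)
    residuals = [
        [(x - m) / math.sqrt(m) for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(table, expected)
    ]
    chi2 = sum(r * r for row in residuals for r in row)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return expected, residuals, chi2, df


expected, residuals, chi2, df = independence_fit(observed)
print(round(chi2, 2), df)
for row in residuals:
    print([round(r, 2) for r in row])
```

The largest residuals (cholesterol < 200 with blood pressure < 127, and cholesterol 260+ with blood pressure 167+) show where the departure from independence is concentrated.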

Example

In this example, N = 57,407 cases in which individuals were victimized twice by crimes were studied.

The crime of the first victimization (X) and the crime of the second victimization (Y) were noted.

The data are tabulated below.

Table 1: Frequencies

First                         Second Victimization in Pair
Victimization
in pair        Ra       A      Ro   PP/PS      PL       B      HL      MV    Total
Ra             26      50      11       6      82      39      48      11      273
A              65    2997     238      85    2553    1083    1349     216     8586
Ro             12     279     197      36     459     197     221      47     1448
PP/PS           3     102      40      61     243     115     101      38      703
PL             75    2628     413     229   12137    2658    3689     687    22516
B              52    1117     191     102    2649    3210    1973     301     9595
HL             42    1251     206     117    3757    1962    4646     391    12372
MV              3     221      51      24     678     301     367     269     1914
Total         278    8645    1347     660   22558    9565   12394    1960    57407

Table 2: Expected Frequencies (assuming independence)

               Ra        A       Ro    PP/PS       PL        B       HL       MV    Total
Ra           1.32    41.11     6.41     3.14   107.27    45.49    58.94     9.32      273
A           41.58  1292.98   201.46    98.71  3373.86  1430.58  1853.69   293.14     8586
Ro           7.01   218.06    33.98    16.65   568.99   241.26   312.62    49.44     1448
PP/PS        3.40   105.87    16.50     8.08   276.24   117.13   151.78    24.00      703
PL         109.04  3390.72   528.32   258.86  8847.63  3751.56  4861.14   768.75    22516
B           46.46  1444.92   225.14   110.31  3770.34  1598.69  2071.53   327.59     9595
HL          59.91  1863.12   290.30   142.24  4861.56  2061.39  2671.08   422.41    12372
MV           9.27   288.23    44.91    22.00   752.10   318.91   413.23    65.35     1914
Total         278     8645     1347      660    22558     9565    12394     1960    57407

Table 3: Standardized residuals

First                      Second Victimization in Pair
Victimization
in pair        Ra       A      Ro   PP/PS      PL       B      HL      MV
Ra           21.5     1.4     1.8     1.6    -2.4    -1.0    -1.9     0.6
A             3.6    47.4     2.6    -1.4   -14.1    -9.2   -11.7    -4.5
Ro            1.9     4.1    28.0     4.7    -4.6    -2.8    -5.2    -0.3
PP/PS        -0.2    -0.4     5.8    18.6    -2.0    -0.2    -4.1     2.9
PL           -3.3   -13.1    -5.0    -1.9    35.0   -17.9   -16.8    -2.9
B             0.8    -8.6    -2.3    -0.8   -18.3    40.3    -2.2    -1.5
HL           -2.3   -14.2    -4.9    -2.1   -15.8    -2.2    38.2    -1.5
MV           -2.1    -4.0     0.9     0.4    -2.7    -1.0    -2.3    25.2

$\chi^2$ = 11,430 (highly significant)

Table 4: Conditional distribution of second victimization given the first victimization (%)

First                      Second Victimization in Pair
Victimization
in pair        Ra       A      Ro   PP/PS      PL       B      HL      MV    Total
Ra            9.5    18.3     4.0     2.2    30.0    14.3    17.6     4.0    100.0
A             0.8    34.9     2.8     1.0    29.7    12.6    15.7     2.5    100.0
Ro            0.8    19.3    13.6     2.5    31.7    13.6    15.3     3.2    100.0
PP/PS         0.4    14.5     5.7     8.7    34.6    16.4    14.4     5.4    100.0
PL            0.3    11.7     1.8     1.0    53.9    11.8    16.4     3.1    100.0
B             0.5    11.6     2.0     1.1    27.6    33.5    20.6     3.1    100.0
HL            0.3    10.1     1.7     0.9    30.4    15.9    37.6     3.2    100.0
MV            0.2    11.5     2.7     1.3    35.4    15.7    19.2    14.1    100.0
Marginal      0.5    15.1     2.3     1.1    39.3    16.7    21.6     3.4    100.0

Log Linear Model

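Recall, if the two variables, rows (X) and columns (Y), are independent then

$$\mu_{ij} = N\pi_{ij} = N\,\pi_{i\cdot}\,\pi_{\cdot j}$$

and

$$\ln \mu_{ij} = \ln N + \ln \pi_{i\cdot} + \ln \pi_{\cdot j}$$

In general let

$$u = \frac{1}{rc}\sum_{i}\sum_{j} \ln \mu_{ij}$$

$$u_{1(i)} = \frac{1}{c}\sum_{j} \ln \mu_{ij} - u$$

$$u_{2(j)} = \frac{1}{r}\sum_{i} \ln \mu_{ij} - u$$

$$u_{12(i,j)} = \ln \mu_{ij} - u - u_{1(i)} - u_{2(j)}$$

then

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)} \qquad (1)$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)} = 0$$

Equation (1) is called the log-linear model for the frequencies $x_{ij}$.

Note: X and Y are independent if

$$u_{12(i,j)} = 0 \text{ for all } i, j$$

In this case the log-linear model becomes

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$$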
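The definitions of $u$, $u_{1(i)}$, $u_{2(j)}$ and $u_{12(i,j)}$ translate directly into averages of $\ln \mu_{ij}$. The sketch below (plain Python; the function name `loglinear_terms` and the small rank-one table are illustrative assumptions) computes the terms and confirms that the interaction terms vanish when $\mu_{ij}$ factors as a row effect times a column effect.

```python
import math


def loglinear_terms(mu):
    """Decompose ln(mu_ij) into u, u1(i), u2(j) and u12(i,j) with sum-to-zero constraints."""
    r, c = len(mu), len(mu[0])
    log_mu = [[math.log(m) for m in row] for row in mu]
    u = sum(sum(row) for row in log_mu) / (r * c)
    u1 = [sum(row) / c - u for row in log_mu]
    u2 = [sum(log_mu[i][j] for i in range(r)) / r - u for j in range(c)]
    u12 = [
        [log_mu[i][j] - u - u1[i] - u2[j] for j in range(c)]
        for i in range(r)
    ]
    return u, u1, u2, u12


# Saturated case: use the observed blood pressure by cholesterol counts as mu_ij.
counts = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]
u, u1, u2, u12 = loglinear_terms(counts)

# Independence case: a hypothetical table with mu_ij = a_i * b_j.
a, b = [2.0, 5.0, 3.0], [1.0, 4.0]
product_table = [[ai * bj for bj in b] for ai in a]
_, _, _, u12_indep = loglinear_terms(product_table)

print(max(abs(v) for row in u12_indep for v in row))  # numerically zero
```

For the observed counts the `u12` terms are not zero, which is the log-linear expression of the dependence already seen in the chi-square test.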

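Comment: The log-linear model for a two-way frequency table,

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}$$

is similar to the model for a two-factor experiment,

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$ is the mean of $y_{ijk}$ when A = i and B = j.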

Three-way Frequency Tables

Example: Data from the Framingham Longitudinal Study of Coronary Heart Disease (Cornfield [1962])

Variables

1. Systolic Blood Pressure (X)
– <127, 127-146, 147-166, 167+

2. Serum Cholesterol
– <200, 200-219, 220-259, 260+

3. Heart Disease
– Present, Absent

The data are tabulated below.

Three-way Frequency Table

Coronary    Serum
Heart       Cholesterol      Systolic Blood Pressure (mm Hg)
Disease     (mg/100 cc)     <127   127-146   147-166   167+
Present     <200               2         3         3      4
            200-219            3         2         0      3
            220-259            8        11         6      6
            260+               7        12        11     11
Absent      <200             117       121        47     22
            200-219           85        98        43     20
            220-259          119       209        68     43
            260+              67        99        46     33

Log-Linear model for three-way tables

Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table; then in general

$$\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{k} u_{3(k)} = 0$$

$$\sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)} = \sum_{i} u_{13(i,k)} = \sum_{k} u_{13(i,k)} = \sum_{j} u_{23(j,k)} = \sum_{k} u_{23(j,k)} = 0$$

$$\sum_{i} u_{123(i,j,k)} = \sum_{j} u_{123(i,j,k)} = \sum_{k} u_{123(i,j,k)} = 0$$

Hierarchical Log-linear models for categorical Data

For three-way tables

The hierarchical principle: If an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

1. Model (all main effects model): ln μijk = u + u1(i) + u2(j) + u3(k)
i.e. u12(i,j) = u13(i,k) = u23(j,k) = u123(i,j,k) = 0.
Notation: [1][2][3]
Description: Mutual independence between all three variables.

2. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j)
i.e. u13(i,k) = u23(j,k) = u123(i,j,k) = 0.
Notation: [12][3]
Description: Independence of Variable 3 with variables 1 and 2.

3. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u13(i,k)
i.e. u12(i,j) = u23(j,k) = u123(i,j,k) = 0.
Notation: [13][2]
Description: Independence of Variable 2 with variables 1 and 3.

4. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u23(j,k)
i.e. u12(i,j) = u13(i,k) = u123(i,j,k) = 0.
Notation: [23][1]
Description: Independence of Variable 1 with variables 2 and 3.

5. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k)
i.e. u23(j,k) = u123(i,j,k) = 0.
Notation: [12][13]
Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u23(j,k)
i.e. u13(i,k) = u123(i,j,k) = 0.
Notation: [12][23]
Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) + u23(j,k)
i.e. u12(i,j) = u123(i,j,k) = 0.
Notation: [13][23]
Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k)
i.e. u123(i,j,k) = 0.
Notation: [12][13][23]
Description: Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.

9. Model (the saturated model): ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k)
Notation: [123]
Description: No simplifying dependence structure.

Hierarchical Log-linear models for 3 way table

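Model           Description
[1][2][3]       Mutual independence between all three variables.
[1][23]         Independence of Variable 1 with variables 2 and 3.
[2][13]         Independence of Variable 2 with variables 1 and 3.
[3][12]         Independence of Variable 3 with variables 1 and 2.
[12][13]        Conditional independence between variables 2 and 3 given variable 1.
[12][23]        Conditional independence between variables 1 and 3 given variable 2.
[13][23]        Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]    Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]           The saturated model.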

Maximum Likelihood Estimation

Log-Linear Model

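For any model it is possible to determine the maximum likelihood estimators of the parameters.

Example: Two-way table, independence, multinomial model

$$f(x_{11}, \dots, x_{rc}) = \frac{N!}{x_{11}!\cdots x_{rc}!}\,\pi_{11}^{x_{11}}\cdots\pi_{rc}^{x_{rc}} = \frac{N!}{x_{11}!\cdots x_{rc}!}\left(\frac{\mu_{11}}{N}\right)^{x_{11}}\cdots\left(\frac{\mu_{rc}}{N}\right)^{x_{rc}}$$

since $E[x_{ij}] = \mu_{ij} = N\pi_{ij}$, or $\pi_{ij} = \mu_{ij}/N$.

Log-likelihood

$$l(\mu_{11}, \dots, \mu_{rc}) = \ln N! - \sum_i\sum_j \ln x_{ij}! - \ln N \sum_i\sum_j x_{ij} + \sum_i\sum_j x_{ij}\ln\mu_{ij} = K + \sum_i\sum_j x_{ij}\ln\mu_{ij}$$

where

$$K = \ln N! - \sum_i\sum_j \ln x_{ij}! - N\ln N$$

With the model of independence

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}, \qquad \text{with } \sum_i u_{1(i)} = \sum_j u_{2(j)} = 0$$

and

$$l(u, u_{1(1)}, \dots, u_{1(r)}, u_{2(1)}, \dots, u_{2(c)}) = K + \sum_i\sum_j x_{ij}\left(u + u_{1(i)} + u_{2(j)}\right) = K + Nu + \sum_i x_{i\cdot}\,u_{1(i)} + \sum_j x_{\cdot j}\,u_{2(j)}$$

also

$$\sum_i\sum_j \mu_{ij} = \sum_i\sum_j e^{u + u_{1(i)} + u_{2(j)}} = e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = N$$

Let

$$g(u, u_{1(1)}, \dots, u_{1(r)}, u_{2(1)}, \dots, u_{2(c)}, \lambda, \lambda_1, \lambda_2) = K + Nu + \sum_i x_{i\cdot}\,u_{1(i)} + \sum_j x_{\cdot j}\,u_{2(j)} + \lambda_1\sum_i u_{1(i)} + \lambda_2\sum_j u_{2(j)} + \lambda\left(e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} - N\right)$$

Now

$$\frac{\partial g}{\partial u} = N + \lambda\,e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = N + \lambda N = 0, \qquad \text{so } \lambda = -1$$

$$\frac{\partial g}{\partial u_{1(i)}} = x_{i\cdot} + \lambda_1 + \lambda\,e^{u}\,e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = x_{i\cdot} + \lambda_1 - N\,\frac{e^{u_{1(i)}}}{\sum_{i'} e^{u_{1(i')}}} = 0$$

so that

$$\frac{e^{u_{1(i)}}}{\sum_{i'} e^{u_{1(i')}}} = \frac{x_{i\cdot} + \lambda_1}{N} = \frac{x_{i\cdot}}{N} + \frac{\lambda_1}{N}$$

Since the left-hand side sums to 1 over i,

$$1 = \sum_i\left(\frac{x_{i\cdot}}{N} + \frac{\lambda_1}{N}\right) = 1 + \frac{r\lambda_1}{N} \qquad \text{and } \lambda_1 = 0$$

Now $e^{u_{1(i)}} = x_{i\cdot}\,K_1$ for a constant $K_1$ that does not depend on i (distinct from K above), or

$$u_{1(i)} = \ln x_{i\cdot} + \ln K_1$$

and

$$\sum_i u_{1(i)} = \sum_i \ln x_{i\cdot} + r\ln K_1 = 0$$

Hence

$$\ln K_1 = -\frac{1}{r}\sum_i \ln x_{i\cdot} \qquad \text{and} \qquad u_{1(i)} = \ln x_{i\cdot} - \frac{1}{r}\sum_{i'} \ln x_{i'\cdot}$$

Similarly

$$u_{2(j)} = \ln x_{\cdot j} - \frac{1}{c}\sum_{j'} \ln x_{\cdot j'}$$

Finally, from

$$e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = N$$

we have

$$e^{u} = \frac{N}{\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}}}$$

Now

$$e^{u_{1(i)}} = \frac{x_{i\cdot}}{\left(\prod_{i'=1}^{r} x_{i'\cdot}\right)^{1/r}} \qquad \text{and} \qquad e^{u_{2(j)}} = \frac{x_{\cdot j}}{\left(\prod_{j'=1}^{c} x_{\cdot j'}\right)^{1/c}}$$

Hence

$$e^{u} = \frac{N\left(\prod_{i=1}^{r} x_{i\cdot}\right)^{1/r}\left(\prod_{j=1}^{c} x_{\cdot j}\right)^{1/c}}{\sum_i x_{i\cdot}\,\sum_j x_{\cdot j}} = \frac{1}{N}\left(\prod_{i=1}^{r} x_{i\cdot}\right)^{1/r}\left(\prod_{j=1}^{c} x_{\cdot j}\right)^{1/c}$$

and

$$u = \frac{1}{r}\sum_i \ln x_{i\cdot} + \frac{1}{c}\sum_j \ln x_{\cdot j} - \ln N$$

Note

$$\ln \hat{\mu}_{ij} = u + u_{1(i)} + u_{2(j)} = \left[\frac{1}{r}\sum_{i'} \ln x_{i'\cdot} + \frac{1}{c}\sum_{j'} \ln x_{\cdot j'} - \ln N\right] + \left[\ln x_{i\cdot} - \frac{1}{r}\sum_{i'} \ln x_{i'\cdot}\right] + \left[\ln x_{\cdot j} - \frac{1}{c}\sum_{j'} \ln x_{\cdot j'}\right] = \ln x_{i\cdot} + \ln x_{\cdot j} - \ln N$$

or

$$\hat{\mu}_{ij} = \frac{x_{i\cdot}\,x_{\cdot j}}{N}$$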
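The closed-form estimates derived above can be checked numerically: computing $u$, $u_{1(i)}$ and $u_{2(j)}$ from the row and column totals and exponentiating their sum should give back $x_{i\cdot}x_{\cdot j}/N$. A minimal sketch in plain Python follows, using the blood pressure by cholesterol counts; the function name `independence_mle` is an illustrative choice.

```python
import math

counts = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]


def independence_mle(table):
    """Maximum likelihood estimates of u, u1(i), u2(j) under independence."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    mean_log_row = sum(math.log(t) for t in row_tot) / len(row_tot)
    mean_log_col = sum(math.log(t) for t in col_tot) / len(col_tot)
    u = mean_log_row + mean_log_col - math.log(n)
    u1 = [math.log(t) - mean_log_row for t in row_tot]
    u2 = [math.log(t) - mean_log_col for t in col_tot]
    return u, u1, u2


u, u1, u2 = independence_mle(counts)
fitted = [[math.exp(u + a + b) for b in u2] for a in u1]
print(round(fitted[0][0], 2))  # estimated expected count for cell (1,1)
```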

Comments

• Maximum likelihood estimates can be computed for any hierarchical log-linear model (i.e. also with more than 2 variables).

• In certain situations the equations need to be solved numerically.

• For the saturated model (all interactions and main effects), the estimate of μijk… is xijk… .
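One case where the likelihood equations have no closed-form solution is the three-way model [12][13][23]. The notes do not say which numerical method is used; a common choice is iterative proportional fitting (IPF), which repeatedly rescales the fitted table to match each observed two-way margin in turn. The sketch below is an illustration under that assumption, written in plain Python and applied to the Framingham counts; the function name `ipf_no_three_factor` and its defaults are hypothetical, and it assumes every observed two-way margin is positive.

```python
def ipf_no_three_factor(x, tol=1e-8, max_iter=500):
    """Fit the model [12][13][23] to a three-way table x[i][j][k] by IPF."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    m = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    # Each two-way margin is identified by the pair of axes it keeps.
    margins = [(0, 1), (0, 2), (1, 2)]
    for _ in range(max_iter):
        max_change = 0.0
        for a, b in margins:
            obs, fit = {}, {}
            for cell in cells:
                i, j, k = cell
                key = (cell[a], cell[b])
                obs[key] = obs.get(key, 0.0) + x[i][j][k]
                fit[key] = fit.get(key, 0.0) + m[i][j][k]
            for cell in cells:
                i, j, k = cell
                key = (cell[a], cell[b])
                # Rescale so the fitted margin equals the observed margin.
                new = m[i][j][k] * obs[key] / fit[key]
                max_change = max(max_change, abs(new - m[i][j][k]))
                m[i][j][k] = new
        if max_change < tol:
            break
    return m


# x[h][c][b]: heart disease (Present, Absent) x cholesterol x blood pressure
present = [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]]
absent = [[117, 121, 47, 22], [85, 98, 43, 20], [119, 209, 68, 43], [67, 99, 46, 33]]
x = [present, absent]

fitted = ipf_no_three_factor(x)
print([round(v, 2) for v in fitted[0][1]])  # fitted counts: Present, cholesterol 200-219
```

Note that the observed zero cell (Present, 200-219, 147-166) receives a positive fitted value, because only the two-way margins are constrained.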




Log Linear Model

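Two-way table

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)} = 0$$

The multiplicative form:

$$\mu_{ij} = e^{u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}} = e^{u}\,e^{u_{1(i)}}\,e^{u_{2(j)}}\,e^{u_{12(i,j)}} = \tau\,\tau_{1(i)}\,\tau_{2(j)}\,\tau_{12(i,j)}$$

where $\tau = e^{u}$, $\tau_{1(i)} = e^{u_{1(i)}}$, $\tau_{2(j)} = e^{u_{2(j)}}$ and $\tau_{12(i,j)} = e^{u_{12(i,j)}}$.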


Log-Linear model for three-way tables

Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table; then in general

$$\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$$

or, in the multiplicative form,

$$\mu_{ijk} = e^{u}\,e^{u_{1(i)}}\,e^{u_{2(j)}}\,e^{u_{3(k)}}\,e^{u_{12(i,j)}}\,e^{u_{13(i,k)}}\,e^{u_{23(j,k)}}\,e^{u_{123(i,j,k)}} = \tau\,\tau_{1(i)}\,\tau_{2(j)}\,\tau_{3(k)}\,\tau_{12(i,j)}\,\tau_{13(i,k)}\,\tau_{23(j,k)}\,\tau_{123(i,j,k)}$$

where each $\tau$ term is the exponential of the corresponding u term.

Comments

• The log-linear model is similar to the ANOVA models for factorial experiments.

• The ANOVA models are used to understand the effects of categorical independent variables (factors) on a continuous dependent variable (Y).

• The log-linear model is used to understand dependence amongst categorical variables.

• The presence of interactions indicates dependence between the variables present in the interactions.







Goodness of Fit Statistics

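These statistics can be used to check whether a log-linear model fits the observed frequency table.

The Chi-squared statistic:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{i,j,k} \frac{(x_{ijk} - \hat{\mu}_{ijk})^2}{\hat{\mu}_{ijk}}$$

The Likelihood Ratio statistic:

$$G^2 = 2\sum \text{Observed}\,\ln\!\left(\frac{\text{Observed}}{\text{Expected}}\right) = 2\sum_{i,j,k} x_{ijk}\,\ln\!\left(\frac{x_{ijk}}{\hat{\mu}_{ijk}}\right)$$

d.f. = # cells - # parameters fitted

We reject the model if $\chi^2$ or $G^2$ is greater than $\chi^2_{\alpha}$, the upper-tail critical value of the chi-square distribution with these degrees of freedom.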
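Both statistics are simple sums over the cells once the fitted values are available. As an illustration, the sketch below (plain Python) fits the conditional independence model [BH][CH] to the Framingham table, for which the fitted values have the closed form $\hat{\mu}_{ijk} = x_{ij\cdot}\,x_{i\cdot k}/x_{i\cdot\cdot}$ within each heart disease category i, and then evaluates $\chi^2$ and $G^2$. The function names are illustrative; cells with an observed count of zero are taken to contribute nothing to $G^2$.

```python
import math

# x[h][c][b]: heart disease (Present, Absent) x cholesterol x blood pressure
present = [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]]
absent = [[117, 121, 47, 22], [85, 98, 43, 20], [119, 209, 68, 43], [67, 99, 46, 33]]
x = [present, absent]


def fit_bh_ch(table):
    """Fitted counts for [BH][CH]: B and C independent within each level of H."""
    fitted = []
    for layer in table:
        chol_tot = [sum(row) for row in layer]       # cholesterol totals in this layer
        bp_tot = [sum(col) for col in zip(*layer)]   # blood pressure totals in this layer
        n = sum(chol_tot)
        fitted.append([[c * b / n for b in bp_tot] for c in chol_tot])
    return fitted


def flatten(table):
    return [v for layer in table for row in layer for v in row]


def pearson_chi2(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def likelihood_ratio_g2(observed, expected):
    # A cell with observed count 0 contributes 0 to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


obs = flatten(x)
exp = flatten(fit_bh_ch(x))
chi2 = pearson_chi2(obs, exp)
g2 = likelihood_ratio_g2(obs, exp)
print(round(chi2, 2), round(g2, 2))
```

The printed values can be compared with the entries reported for the model BH,CH in the goodness-of-fit listing for this example.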

Example: Variables

1. Systolic Blood Pressure (B)
2. Serum Cholesterol (C)
3. Coronary Heart Disease (H)

The data are the three-way Framingham table of Coronary Heart Disease by Serum Cholesterol by Systolic Blood Pressure given earlier.

                     LIKELIHOOD-               PEARSON
MODEL         DF     RATIO CHISQ    PROB.      CHISQ      PROB.
-----         --     -----------    -------    -------    -------
B,C,H.        24        83.15       0.0000     102.00     0.0000
B,CH.         21        51.23       0.0002      56.89     0.0000
C,BH.         21        59.59       0.0000      60.43     0.0000
H,BC.         15        58.73       0.0000      64.78     0.0000
BC,BH.        12        35.16       0.0004      33.76     0.0007
BH,CH.        18        27.67       0.0673      26.58     0.0872   n.s.
CH,BC.        12        26.80       0.0082      33.18     0.0009
BC,BH,CH.      9         8.08       0.5265       6.56     0.6824   n.s.
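The DF column follows from the rule d.f. = # cells - # parameters fitted: under the hierarchical principle each generator brings in all of its lower-order terms, and a term involving a set of variables contributes the product of (number of categories - 1) over those variables. The sketch below is a plain Python illustration; the function name `degrees_of_freedom` and the convention of writing each generator as a string of single-letter variable names are assumptions made for the example.

```python
from itertools import combinations


def degrees_of_freedom(levels, generators):
    """d.f. = number of cells minus number of independent log-linear parameters."""
    # Collect every term implied by the generators (hierarchical principle).
    terms = set()
    for gen in generators:
        names = sorted(gen)
        for size in range(1, len(names) + 1):
            terms.update(combinations(names, size))
    n_params = 1  # the overall term u
    for term in terms:
        count = 1
        for name in term:
            count *= levels[name] - 1
        n_params += count
    n_cells = 1
    for n_levels in levels.values():
        n_cells *= n_levels
    return n_cells - n_params


# Framingham example: B and C have 4 categories each, H has 2.
levels = {"B": 4, "C": 4, "H": 2}
for model in (["B", "C", "H"], ["B", "CH"], ["H", "BC"], ["BH", "CH"], ["BC", "BH", "CH"]):
    print(",".join(model), degrees_of_freedom(levels, model))
```

This count assumes that all fitted margins are positive; when some fitted cells are structurally zero, the degrees of freedom need adjusting.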

Goodness of fit testing of Models

Possible Models:
1. [BH][CH]: B and C independent given H.
2. [BC][BH][CH]: the all two-factor interaction model.

Model 1: [BH][CH] Log-linear parameters

Log-Linear Model

$$\ln \mu_{ijk} = u + u_{H(i)} + u_{B(j)} + u_{C(k)} + u_{HB(i,j)} + u_{HC(i,k)}$$

$$\mu_{ijk} = e^{u}\,e^{u_{H(i)}}\,e^{u_{B(j)}}\,e^{u_{C(k)}}\,e^{u_{HB(i,j)}}\,e^{u_{HC(i,k)}} = \tau\,\tau_{H(i)}\,\tau_{B(j)}\,\tau_{C(k)}\,\tau_{HB(i,j)}\,\tau_{HC(i,k)}$$

Heart Disease - Blood Pressure Interaction

Log-linear parameters $u_{HB(i,j)}$

           Bp
Hd       <127   127-146   147-166    167+
Pres   -0.256    -0.241     0.066   0.431
Abs     0.256     0.241    -0.066  -0.431

Standardized parameters $z = u_{HB(i,j)} / \sigma_{u_{HB(i,j)}}$

           Bp
Hd       <127   127-146   147-166    167+
Pres   -2.607    -2.733     0.660   4.461
Abs     2.607     2.733    -0.660  -4.461

Multiplicative effect $\tau_{HB(i,j)} = \exp\left(u_{HB(i,j)}\right) = e^{u_{HB(i,j)}}$

           Bp
Hd       <127   127-146   147-166    167+
Pres    0.774     0.786     1.068   1.538
Abs     1.291     1.272     0.936   0.650

Heart Disease - Cholesterol Interaction

Log-linear parameters $u_{HC(i,k)}$

           Chol
Hd       <200   200-219   220-259    260+
Pres   -0.233    -0.325     0.063   0.494
Abs     0.233     0.325    -0.063  -0.494

Standardized parameters $z = u_{HC(i,k)} / \sigma_{u_{HC(i,k)}}$

           Chol
Hd       <200   200-219   220-259    260+
Pres   -1.889    -2.268     0.677   5.558
Abs     1.889     2.268    -0.677  -5.558

Multiplicative effect $\tau_{HC(i,k)} = \exp\left(u_{HC(i,k)}\right) = e^{u_{HC(i,k)}}$

           Chol
Hd       <200   200-219   220-259    260+
Pres    0.792     0.723     1.065   1.640
Abs     1.262     1.384     0.939   0.610
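The multiplicative effects are simply the exponentials of the log-linear parameters, so a value above 1 inflates the expected frequency of a cell and a value below 1 deflates it. A small check in plain Python, using the rounded heart disease by blood pressure parameters of Model 1 (the dictionary layout is an illustrative choice):

```python
import math

# Rounded log-linear interaction parameters, heart disease x blood pressure
# (columns: <127, 127-146, 147-166, 167+).
u_hb = {
    "Pres": [-0.256, -0.241, 0.066, 0.431],
    "Abs": [0.256, 0.241, -0.066, -0.431],
}

multiplicative = {hd: [math.exp(u) for u in row] for hd, row in u_hb.items()}

for hd, row in multiplicative.items():
    print(hd, [round(t, 3) for t in row])
```

Because the inputs are rounded to three decimals, the last digit can differ slightly from tabulated effects computed from unrounded parameters. With only two heart disease categories the sum-to-zero constraint makes the two rows reciprocals of each other.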

Model 2: [BC][BH][CH] Log-linear parameters

Blood Pressure - Cholesterol Interaction

Log-linear parameters $u_{BC(j,k)}$

               Bp
Chol         <127   127-146   147-166    167+
<200        0.222    -0.019    -0.034  -0.169
200-219     0.114    -0.041     0.013  -0.086
220-259    -0.114     0.154    -0.058   0.018
260+       -0.221    -0.094     0.079   0.237

Standardized parameters $z = u_{BC(j,k)} / \sigma_{u_{BC(j,k)}}$

               Bp
Chol         <127   127-146   147-166    167+
<200        2.68     -0.236    -0.326  -1.291
200-219     1.27     -0.472     0.117  -0.626
220-259    -1.502     2.253    -0.636   0.167
260+       -2.487    -1.175     0.785   2.051

Multiplicative effect $\tau_{BC(j,k)} = \exp\left(u_{BC(j,k)}\right) = e^{u_{BC(j,k)}}$

               Bp
Chol         <127   127-146   147-166    167+
<200        1.248     0.981     0.967   0.844
200-219     1.120     0.960     1.013   0.918
220-259     0.892     1.166     0.944   1.018
260+        0.802     0.910     1.082   1.267

Heart Disease - Blood Pressure Interaction

Log-linear parameters $u_{HB(i,j)}$

           Bp
Hd       <127   127-146   147-166    167+
Pres   -0.211    -0.232     0.055   0.389
Abs     0.211     0.232    -0.055  -0.389

Standardized parameters $z = u_{HB(i,j)} / \sigma_{u_{HB(i,j)}}$

           Bp
Hd       <127   127-146   147-166    167+
Pres   -2.125    -2.604     0.542   3.938
Abs     2.125     2.604    -0.542  -3.938

Multiplicative effect $\tau_{HB(i,j)} = \exp\left(u_{HB(i,j)}\right) = e^{u_{HB(i,j)}}$

           Bp
Hd       <127   127-146   147-166    167+
Pres    0.809     0.793     1.056   1.475
Abs     1.235     1.261     0.947   0.678

Heart Disease - Cholesterol Interaction

Log-linear parameters $u_{HC(i,k)}$

           Chol
Hd       <200   200-219   220-259    260+
Pres   -0.212    -0.316     0.069   0.460
Abs     0.212     0.316    -0.069  -0.460

Standardized parameters $z = u_{HC(i,k)} / \sigma_{u_{HC(i,k)}}$

           Chol
Hd       <200   200-219   220-259    260+
Pres   -1.712    -2.199     0.732   5.095
Abs     1.712     2.199    -0.732  -5.095

Multiplicative effect $\tau_{HC(i,k)} = \exp\left(u_{HC(i,k)}\right) = e^{u_{HC(i,k)}}$

           Chol
Hd       <200   200-219   220-259    260+
Pres    0.809     0.729     1.071   1.584
Abs     1.237     1.372     0.933   0.631

Another Example

In this study the following were determined for N = 4353 males:

1. Occupation category
2. Educational Level
3. Academic Aptitude

1. Occupation categories
a. Self-employed Business
b. Teacher\Education
c. Self-employed Professional
d. Salaried Employed

2. Education levels
a. Low
b. Low/Med
c. High/Med
d. High

3. Academic Aptitude
a. Low
b. Low/Med
c. Med
d. High/Med
e. High

Table

Self-employed, Business
                       Education
Aptitude     Low   LMed   HMed   High   Total
Low           42     55     22      3     122
LMed          72     82     60     12     226
Med           90    106     85     25     306
HMed          27     48     47      8     130
High           8     18     19      5      50
Total        239    309    233     53     834

Teacher
                       Education
Aptitude     Low   LMed   HMed   High   Total
Low            0      0      1     19      20
LMed           0      3      3     60      66
Med            1      4      5     86      96
HMed           0      0      2     36      38
High           0      0      1     14      15
Total          1      7     12    215     235

Self-employed, Professional
                       Education
Aptitude     Low   LMed   HMed   High   Total
Low            1      2      8     19      30
LMed           1      2     15     33      51
Med            2      5     25     83     115
HMed           2      2     10     45      59
High           0      0     12     19      31
Total          6     11     70    199     286

Salaried Employed
                       Education
Aptitude     Low   LMed   HMed   High   Total
Low          172    151    107     42     472
LMed         208    198    206     92     704
Med          279    271    331    191    1072
HMed          99    126    179     97     501
High          36     35     99     79     249
Total        794    781    922    501    2998

Two-way Tables (with $\chi^2$)

Education vs Aptitude ($\chi^2$ = 178.6)
                       Education
Aptitude     Low   LMed   HMed   High   Total
Low          215    208    138     83     644
LMed         281    285    284    197    1047
Med          372    386    446    385    1589
HMed         128    176    238    186     728
High          44     53    131    117     345
Total       1040   1108   1237    968    4353

Education vs Occupation ($\chi^2$ = 1254.1)
                       Education
Occupation   Low   LMed   HMed   High   Total
SEB          239    309    233     53     834
SEP            6     11     70    199     286
TCHR           1      7     12    215     235
SEM          794    781    922    501    2998
Total       1040   1108   1237    968    4353

Aptitude vs Occupation ($\chi^2$ = 35.8)
                       Occupation
Aptitude     SEB    SEP   TCHR    SEM   Total
Low          122     30     20    472     644
LMed         226     51     66    704    1047
Med          306    115     96   1072    1589
HMed         130     59     38    501     728
High          50     31     15    249     345
Total        834    286    235   2998    4353

• It is common to handle a multiway table by testing for independence in all two-way tables.

• This is similar to looking at all the bivariate correlations.

• In this example we learn that:

1. Education is related to Aptitude
2. Education is related to Occupational category
3. Aptitude is related to Occupational category

Can we do better than this?

Fitting various log-linear models

Goodness of fit

                                   Likelihood
Model                              Ratio        DF   Sig.      Pearson      DF   Sig.
[Occ][Ed][Apt]                     1356.9702    69   0.0000    1519.802     69   0.0000
[Occ, Ed] [Apt]                     228.2215    60   0.0000     226.6615    60   0.0000
[Apt, Ed][Occ]                     1179.6403    57   0.0000    1336.765     57   0.0000
[Apt, Occ][Ed]                     1319.561     57   0.0000    1424.1488    57   0.0000
[Occ, Ed] [Occ,Apt]                 190.8123    48   0.0000     184.6386    48   0.0000
[Apt, Ed] [Occ,Apt]                1142.2311    45   0.0000    1301.1317    45   0.0000
[Apt, Ed] [Occ, Ed]                  50.8915    48   0.3605      48.0105    48   0.4724
[Apt, Ed] [Occ, Ed] [Occ, Apt]       25.1048    36   0.9134      23.6465    36   0.9436

The simplest model that fits is [Apt,Ed][Occ,Ed]. This model implies conditional independence between Aptitude and Occupation given Education.

Log-linear Parameters

Aptitude - Education Interaction

                        Education
Aptitude        Low   Low-Med   High-Med      High
Low          0.4602    0.3225    -0.2752   -0.5075
Low-Med      0.1857    0.0953    -0.0957   -0.1853
Med          0.0399   -0.0277    -0.0706    0.0584
High-Med    -0.2250   -0.0111     0.1032    0.1329
High        -0.4607   -0.3791     0.3383    0.5015

Aptitude - Education Interaction (Multiplicative)

                        Education
Aptitude        Low   Low-Med   High-Med      High
Low           1.584     1.381      0.759     0.602
Low-Med       1.204     1.100      0.909     0.831
Med           1.041     0.973      0.932     1.060
High-Med      0.799     0.989      1.109     1.142
High          0.631     0.684      1.403     1.651

Occupation - Education Interaction

                        Occupation
Education       SEB         T        SEP       SAL
Low           1.241    -1.528     -0.718     1.005
LowMed        0.800    -0.280     -0.810     0.290
HighMed      -0.050    -0.309      0.472    -0.112
High         -1.991     2.117      1.057    -1.182

Occupation - Education Interaction (Multiplicative)

                        Occupation
Education       SEB         T        SEP       SAL
Low           3.460     0.217      0.488     2.731
LowMed        2.226     0.756      0.445     1.336
HighMed       0.951     0.734      1.603     0.894
High          0.137     8.303      2.877     0.307

Conditional Test Statistics

• Suppose that we are considering two log-linear models and that Model 2 is a special case of Model 1.

• That is, the parameters of Model 2 are a subset of the parameters of Model 1.

• Also assume that Model 1 has been shown to adequately fit the data.

In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation. The likelihood ratio chi-square statistic that achieves this goal is:

$$G^2(2|1) = G^2(2) - G^2(1) = 2\sum \text{Observed}\,\ln\!\left(\frac{\text{Expected}_1}{\text{Expected}_2}\right)$$

$$df = df_2 - df_1$$

Example

Table 1: Cross-Classification of a Sample of 1008 Consumers according to:
(1) The Softness of the Laundry Water Used
(2) The Previous Use of Detergent Brand M
(3) The Temperature of the Laundry Water Used
(4) The Preference of Detergent Brand X over Brand M in a Consumer Blind Trial

                           Previous user of M           Previous nonuser of M
Water       Brand          High          Low            High          Low
Softness    Preference     Temperature   Temperature    Temperature   Temperature
Soft        X                  19            57             29            63
            M                  29            49             27            53
Medium      X                  23            47             33            66
            M                  47            55             23            50
Hard        X                  24            37             42            68
            M                  43            52             30            42

Goodness of Fit test for the all k-factor models

Model                          d.f.    G2      p-value
[1][2][3][4]                    18     42.9    0.00083    G2(1)
[12][13][14][23][24][34]         9      9.9    0.35864    G2(2)
[123][124][134][234]             2      0.7    0.70469    G2(3)
[1234]                           0      0.0               G2(4)

Conditional tests for zero k-factor interactions

Model                          d.f.    G2      p-value
two-factor interactions          9     33.0    0.00013    G2(1|2) = G2(1) - G2(2)
three-factor interactions        7      9.2    0.23861    G2(2|3) = G2(2) - G2(3)
four-factor interaction          2      0.7    0.70469    G2(3|4) = G2(3) - G2(4)
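Each conditional test is obtained by differencing the G2 values and degrees of freedom of two nested fits. The following is a small sketch in plain Python using the all k-factor fits reported for the detergent data; the function name `conditional_tests` and the tuple layout are illustrative choices.

```python
# (model label, d.f., G2) for the all k-factor models, simplest model first
fits = [
    ("all main effects", 18, 42.9),
    ("all two-factor", 9, 9.9),
    ("all three-factor", 2, 0.7),
    ("saturated", 0, 0.0),
]


def conditional_tests(nested_fits):
    """Difference each model with the next, more complex model in the sequence."""
    tests = []
    for (simple, df_s, g2_s), (complex_, df_c, g2_c) in zip(nested_fits, nested_fits[1:]):
        # G2(simple | complex) = G2(simple) - G2(complex), df = df_simple - df_complex
        tests.append((simple, complex_, df_s - df_c, round(g2_s - g2_c, 1)))
    return tests


for simple, complex_, df, g2 in conditional_tests(fits):
    print(f"{simple} vs {complex_}: df = {df}, G2 = {g2}")
```

Each difference is referred to a chi-square distribution with the differenced degrees of freedom, which is how the p-values in the conditional test table are obtained.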

Conclusions

1. The four-factor interaction is not significant: G2(3|4) = 0.7 (p = 0.705).

2. The all three-factor model provides an adequate fit: G2(3) = 0.7 (p = 0.705).

3. The three-factor interactions are not significantly different from 0: G2(2|3) = 9.2 (p = 0.239).

4. The all two-factor model provides an adequate fit: G2(2) = 9.9 (p = 0.359).

5. There are significant two-factor interactions: G2(1|2) = 33.0 (p = 0.00013).

Conclude that the model should contain main effects and some two-factor interactions.

There may also be a natural sequence of progressively more complicated models that one might want to identify. In the laundry detergent example the variables are:

1. Softness of laundry water used
2. Previous use of Brand M
3. Temperature of laundry water used
4. Preference of brand X over brand M

A natural order for increasingly complex models which should be considered might be:

1. [1][2][3][4]: the all-main-effects model (independence amongst all four variables).
2. [1][3][24]: since previous use of brand M may be highly related to preference for brand M, add first the 2-4 interaction.
3. [1][34][24]: brand M is recommended for hot water, so add second the 3-4 interaction.
4. [13][34][24]: brand M is also recommended for soft laundry, so add third the 1-3 interaction.
5. [13][234] and
6. [123][234]: add finally some possible 3-factor interactions.

Likelihood Ratio G2 for various models

Models                   d.f.    G2
[1][3][24]                17     22.4
[1][24][34]               16     18.0
[13][24][34]              14     11.9
[13][23][24][34]          13     11.2
[12][13][23][24][34]      11     10.1
[1][234]                  14     14.5
[134][24]                 10     12.2
[13][234]                 12      8.4
[24][34][123]              9      8.4
[123][234]                 8      5.6

Table 2: A Partitioning of the Likelihood Ratio Chi-Square Statistic for Complete Independence
(Model (a) = [1][2][3][4], Model (b) = [1][3][24], Model (c) = [1][24][34], Model (d) = [13][24][34], Model (e) = [13][234], Model (f) = [123][234])

Model                                      d.f.    G2
Model (a)                                   18     42.9*
Difference between models (b) and (a)        1     20.5*
Model (b)                                   17     22.4
Difference between models (c) and (b)        1      4.4*
Model (c)                                   16     18.0
Difference between models (d) and (c)        2      6.1*
Model (d)                                   14     11.9
Difference between models (e) and (d)        2      3.5
Model (e)                                   12      8.4
Difference between models (f) and (e)        4      2.8
Model (f)                                    8      5.6







Models for three-way tables



Comment: For any model, the parameters (u, u1(i), u2(j), u3(k), and any interaction terms in the model) can be estimated in addition to the expected frequencies (μijk) in each cell.





Stepwise selection procedures

Forward Selection and Backward Elimination

Forward Selection: Starting with a model that underfits the data, log-linear parameters that are not in the model are added step by step until a model that does fit is achieved. At each step the log-linear parameter that is most significant is added to the model.

To determine the significance of a parameter added we use the statistic:

G2(2|1) = G2(2) - G2(1)
Model 1 contains the parameter.
Model 2 does not contain the parameter.

Backward Elimination: Starting with a model that overfits the data, log-linear parameters that are in the model are deleted step by step until a model that continues to fit the data and has the smallest number of significant parameters is achieved. At each step the log-linear parameter that is least significant is deleted from the model.

To determine the significance of a parameter deleted we use the statistic:

G2(2|1) = G2(2) - G2(1)
Model 1 contains the parameter.
Model 2 does not contain the parameter.

Example: Fitting a Log-linear model - Forward Selection

Table: Dyke-Patterson Data - N = 1729 individuals classified according to five variables:
(1) Reading Newspapers
(2) Listen to radio
(3) Do "solid" reading
(4) Attend Lectures
(5) Knowledge regarding cancer (Good or Poor)

                              Radio                                 No Radio
                    Solid Reading   No Solid Reading      Solid Reading   No Solid Reading
                    Good    Poor     Good    Poor          Good    Poor     Good    Poor
Newspaper
  Lectures            23       8        8       4            27      18        7       6
  None               102      67       35      59           201     177       75     156
No Newspaper
  Lectures             1       3        4       3             3       8        2      10
  None                16      16       13      50            67      83       84     393

MODEL D.F. CHI-SQUARE PROB CHI-SQUARE PROB ----- ---- ---------- ---- ---------- ---- K,L,N,S,R. 26 596.84 0.0000 751.31 0.0000 MODELS FORMED BY ADDING TERMS TO MODEL -- K,L,N,S,R. LIKELIHOOD-RATIO PEARSON MODEL D.F. CHI-SQUARE PROB CHI-SQUARE PROB ----- ---- ---------- ---- ---------- ---- KL,N,S,R. 25 579.68 0.0000 691.18 0.0000 DIFF. DUE TO ADDING KL. 1 17.16 0.0000 KN,L,S,R. 25 491.06 0.0000 533.89 0.0000 DIFF. DUE TO ADDING KN. 1 105.78 0.0000 KS,L,N,R. 25 446.39 0.0000 497.12 0.0000 DIFF. DUE TO ADDING KS. 1 150.45 0.0000 KR,L,N,S. 25 572.59 0.0000 674.61 0.0000 DIFF. DUE TO ADDING KR. 1 24.25 0.0000 K,LN,S,R. 25 575.24 0.0000 688.89 0.0000 DIFF. DUE TO ADDING LN. 1 21.60 0.0000 K,LS,N,R. 25 573.09 0.0000 692.25 0.0000 DIFF. DUE TO ADDING LS. 1 23.74 0.0000 K,LR,N,S. 25 577.89 0.0000 698.17 0.0000 DIFF. DUE TO ADDING LR. 1 18.95 0.0000 K,L,NS,R. 25 343.13 0.0000 383.90 0.0000 DIFF. DUE TO ADDING NS. 1 253.71 0.0000 K,L,NR,S. 25 522.61 0.0000 615.20 0.0000 DIFF. DUE TO ADDING NR. 1 74.23 0.0000 K,L,N,SR. 25 575.76 0.0000 680.88 0.0000 DIFF. DUE TO ADDING SR. 1 21.08 0.0000 STEP 1. BEST MODEL FOUND IS -- K,L,NS,R.

K = Knowledge, N = Newspaper, R = Radio, S = Reading, L = Lectures

MODELS FORMED BY ADDING TERMS TO MODEL -- K,L,NS,R.

                          LIKELIHOOD-RATIO         PEARSON                  DIFF. DUE TO ADDING TERM
MODEL             D.F.    CHI-SQUARE    PROB       CHI-SQUARE    PROB       TERM   D.F.   CHI-SQUARE   PROB
KL,NS,R.           24       325.97     0.0000        339.14     0.0000      KL      1       17.16     0.0000
KN,L,NS,R.         24       237.35     0.0000        258.87     0.0000      KN      1      105.78     0.0000
KS,L,NS,R.         24       192.68     0.0000        216.12     0.0000      KS      1      150.45     0.0000
KR,L,NS.           24       318.88     0.0000        329.40     0.0000      KR      1       24.25     0.0000
K,LN,NS,R.         24       321.53     0.0000        341.35     0.0000      LN      1       21.60     0.0000
K,LS,NS,R.         24       319.39     0.0000        348.68     0.0000      LS      1       23.75     0.0000
K,LR,NS.           24       324.18     0.0000        341.62     0.0000      LR      1       18.95     0.0000
K,L,NR,NS.         24       268.90     0.0000        280.86     0.0000      NR      1       74.23     0.0000
K,L,SR,NS.         24       322.05     0.0000        347.33     0.0000      SR      1       21.08     0.0000

STEP 2. BEST MODEL FOUND IS -- KS,L,NS,R.

MODELS FORMED BY ADDING TERMS TO MODEL -- KS,L,NS,R.

                          LIKELIHOOD-RATIO         PEARSON                  DIFF. DUE TO ADDING TERM
MODEL             D.F.    CHI-SQUARE    PROB       CHI-SQUARE    PROB       TERM   D.F.   CHI-SQUARE   PROB
KL,KS,NS,R.        23       175.52     0.0000        182.86     0.0000      KL      1       17.16     0.0000
KN,KS,L,NS,R.      23       152.96     0.0000        163.87     0.0000      KN      1       39.72     0.0000
KR,KS,L,NS.        23       168.43     0.0000        173.32     0.0000      KR      1       24.25     0.0000
KS,LN,NS,R.        23       171.08     0.0000        184.56     0.0000      LN      1       21.60     0.0000
LS,KS,NS,R.        23       168.93     0.0000        202.28     0.0000      LS      1       23.74     0.0000
KS,LR,NS.          23       173.73     0.0000        178.08     0.0000      LR      1       18.95     0.0000
KS,L,NR,NS.        23       118.45     0.0000        128.83     0.0000      NR      1       74.23     0.0000
SR,KS,L,NS.        23       171.60     0.0000        198.23     0.0000      SR      1       21.08     0.0000

STEP 3. BEST MODEL FOUND IS -- KS,L,NR,NS.

                                          LIKELIHOOD-RATIO         PEARSON                  DIFF. DUE TO ADDING TERM
MODEL                             D.F.    CHI-SQUARE    PROB       CHI-SQUARE    PROB       TERM   D.F.   CHI-SQUARE   PROB
LN,KL,SR,KR,KN,LR,LS,KS,NR,NS.     16        19.56     0.2406         21.21     0.1706      SR      1        0.42     0.5147
KLN,KR,LR,LS,KS,NR,NS.             16        18.86     0.2762         21.53     0.1589      KLN     1        1.13     0.2878
LN,KLS,KR,KN,LR,NR,NS.             16        15.99     0.4538         15.63     0.4794      KLS     1        4.00     0.0456
LN,KLR,KN,LS,KS,NR,NS.             16        19.28     0.2543         20.81     0.1860      KLR     1        0.70     0.4015
LN,KL,KR,KNS,LR,LS,NR.             16        16.78     0.4000         18.74     0.2821      KNS     1        3.21     0.0733
LN,KL,KNR,LR,LS,KS,NS.             16        19.90     0.2247         21.27     0.1682      KNR     1        0.09     0.7704
LNS,KL,KR,KN,LR,KS,NR.             16        19.58     0.2397         20.98     0.1794      LNS     1        0.41     0.5239
LNR,KL,KR,KN,LS,KS,NS.             16        18.11     0.3176         18.80     0.2790      LNR     1        1.88     0.1706

STEP 10. BEST MODEL FOUND IS -- LN,KLS,KR,KN,LR,NR,NS.

Continuing after 10 steps

                                          LIKELIHOOD-RATIO         PEARSON                  DIFF. DUE TO ADDING TERM
MODEL                             D.F.    CHI-SQUARE    PROB       CHI-SQUARE    PROB       TERM   D.F.   CHI-SQUARE   PROB
LN,SR,KLS,KR,KN,LR,NR,NS.          15        15.55     0.4127         15.15     0.4406      SR      1        0.44     0.5072
KLN,KLS,KR,LR,NR,NS.               15        12.98     0.6041         13.84     0.5379      KLN     1        3.01     0.0827
LN,KLR,KLS,KN,NR,NS.               15        15.10     0.4446         15.06     0.4471      KLR     1        0.89     0.3446
LN,KNS,KLS,KR,LR,NR.               15        13.21     0.5861         13.19     0.5878      KNS     1        2.78     0.0955
LN,KLS,KNR,LR,NS.                  15        15.93     0.3870         15.48     0.4173      KNR     1        0.06     0.8034
LNS,KLS,KR,KN,LR,NR.               15        15.87     0.3905         15.60     0.4089      LNS     1        0.12     0.7343
LNR,KLS,KR,KN,NS.                  15        14.23     0.5085         13.75     0.5446      LNR     1        1.76     0.1842

STEP 11. BEST MODEL FOUND IS -- KLN,KLS,KR,LR,NR,NS.

The final step

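The best model was found at the previous step (step 10):

• [LN][KLS][KR][KN][LR][NR][NS]

None of the terms considered at step 11 gives a significant improvement in fit (the smallest p-value is 0.0827, for KLN), so no further terms are added.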
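The stopping decision can be checked with a short sketch. It recomputes the 1 d.f. p-values for the G2 differences reported at steps 10 and 11 and keeps only the terms that fall below a threshold. The 0.05 level is an assumption (the slides do not state the level used), and the names are illustrative.

```python
import math

ALPHA = 0.05  # assumed significance level; not stated in the slides

# G^2 differences (1 d.f. each) copied from the step 10 and step 11 output.
STEP10_DIFFS = {"SR": 0.42, "KLN": 1.13, "KLS": 4.00, "KLR": 0.70,
                "KNS": 3.21, "KNR": 0.09, "LNS": 0.41, "LNR": 1.88}
STEP11_DIFFS = {"SR": 0.44, "KLN": 3.01, "KLR": 0.89, "KNS": 2.78,
                "KNR": 0.06, "LNS": 0.12, "LNR": 1.76}

def p_value_1df(g2_diff):
    """Upper-tail chi-square probability with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2_diff / 2.0))

def significant_terms(diffs, alpha=ALPHA):
    """Terms whose addition is significant at the given level."""
    return [term for term, d in diffs.items() if p_value_1df(d) < alpha]

print("step 10:", significant_terms(STEP10_DIFFS))
print("step 11:", significant_terms(STEP11_DIFFS))
```

Under this assumed level, only KLS qualifies at step 10 and nothing qualifies at step 11, which matches the model reported as best.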

Modelling of response variables

Independent → Dependent

Logit Models

To date we have not worried about whether any of the variables were dependent or independent variables. The logit model is used when we have a single binary dependent variable.

Example: Logit Models

Table: The Effect of planting depth on mortality of Pine seedlings

                       Longleaf Seedlings           Slash Seedlings
Depth of Planting      Dead    Alive   Totals       Dead    Alive   Totals
Too High                 41       59      100         12       88      100
Too Low                  11       89      100          5       95      100
Totals                   52      148      200         17      183      200

Table: Loglinear Models Fit to Data in Above Table and their Goodness of Fit Statistics

Model              χ2       G2      df
[12][13][23]      1.37     1.28      1
[13][23]         26.54    27.79      2
[12][13]         24.03    25.03      2
[13][2]          54.70    50.10      3

The variables

1. Type of seedling (T)
   a. Longleaf seedling
   b. Slash seedling
2. Depth of planting (D)
   a. Too low
   b. Too high
3. Mortality (M) (the dependent variable)
   a. Dead
   b. Alive

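The Log-linear Model

$$\ln \mu_{ijk} = u + u_{T(i)} + u_{D(j)} + u_{M(k)} + u_{TD(i,j)} + u_{TM(i,k)} + u_{DM(j,k)} + u_{TDM(i,j,k)}$$

Note: $\mu_{ij1}$ = # dead when T = i and D = j, and $\mu_{ij2}$ = # alive when T = i and D = j, so that

$$\frac{\mu_{ij1}}{\mu_{ij2}} = \frac{\text{dead}}{\text{alive}}$$

= mortality ratio when T = i and D = j.

Hence

$$\ln\frac{\mu_{ij1}}{\mu_{ij2}} = \ln\mu_{ij1} - \ln\mu_{ij2} = \text{log-mortality ratio}$$

$$= \left[u + u_{T(i)} + u_{D(j)} + u_{M(1)} + u_{TD(i,j)} + u_{TM(i,1)} + u_{DM(j,1)} + u_{TDM(i,j,1)}\right]$$

$$\quad - \left[u + u_{T(i)} + u_{D(j)} + u_{M(2)} + u_{TD(i,j)} + u_{TM(i,2)} + u_{DM(j,2)} + u_{TDM(i,j,2)}\right]$$

$$= 2u_{M(1)} + 2u_{TM(i,1)} + 2u_{DM(j,1)} + 2u_{TDM(i,j,1)}$$

since

$$u_{M(2)} = -u_{M(1)}, \quad u_{TM(i,2)} = -u_{TM(i,1)}, \quad u_{DM(j,2)} = -u_{DM(j,1)}, \quad u_{TDM(i,j,2)} = -u_{TDM(i,j,1)}.$$

The logit model:

$$\ln\frac{\mu_{ij1}}{\mu_{ij2}} = \ln\mu_{ij1} - \ln\mu_{ij2} = \text{log-mortality ratio} = v + v_{T(i)} + v_{D(j)} + v_{TD(i,j)}$$

where

$$v = 2u_{M(1)}, \quad v_{T(i)} = 2u_{TM(i,1)}, \quad v_{D(j)} = 2u_{DM(j,1)}, \quad \text{and} \quad v_{TD(i,j)} = 2u_{TDM(i,j,1)}.$$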

Thus, corresponding to a loglinear model there is a logit model predicting the log ratio of the expected frequencies of the two categories of the dependent variable.

Also, (k + 1)-factor interactions with the dependent variable in the loglinear model determine k-factor interactions in the logit model:

k + 1 = 1: constant term in the logit model
k + 1 = 2: main effects in the logit model

Example: Logit Models (the pine seedling data, continued)

In the table of loglinear models fitted to the pine seedling data, the variables are numbered 1 = Depth, 2 = Mort, 3 = Type.

Log-Linear parameters for Model: [TM][TD][DM]

Main Effects:

Mort        Dead      Alive
           -0.946     0.946

Type        Lleaf     Slash
            0.240    -0.240

Depth       low       high
            0.257    -0.257

Two-Factor Interactions:

Type-Mort
                 Mort
Type        Dead      Alive
Lleaf       0.354    -0.354
Slash      -0.354     0.354

Depth-Mort
                 Mort
Depth       Dead      Alive
low         0.376    -0.376
high       -0.376     0.376

Depth-Type
                 Type
Depth       Lleaf     Slash
low        -0.063     0.063
high        0.063    -0.063

Logit Model for predicting the Mortality

$$\ln(MR) = v + v_{D(i)} + v_{T(k)}$$

or

$$MR = \frac{\text{dead}}{\text{alive}} = e^{v}\, e^{v_{D(i)}}\, e^{v_{T(k)}}$$

                   Log-Linear    Logit     Mult
const                -0.946     -1.892    0.151
Depth    High         0.354      0.708    2.030
         Low         -0.354     -0.708    0.493
Type     Long         0.376      0.752    2.121
         Slash       -0.376     -0.752    0.471
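The three columns of this table are linked by two simple rules: each logit parameter is twice the corresponding log-linear parameter, and each multiplicative effect is the exponential of the logit parameter. The sketch below rebuilds the Logit and Mult columns from the Log-Linear column as tabulated, then multiplies effects to get a predicted mortality ratio. The dictionary keys and function names are illustrative.

```python
import math

# Log-linear parameters (interactions with Mortality = Dead), as tabulated.
LOGLINEAR = {
    "const": -0.946,
    "Depth High": 0.354, "Depth Low": -0.354,
    "Type Long": 0.376, "Type Slash": -0.376,
}

def logit_and_multiplier(u):
    """Logit parameter v = 2u and multiplicative effect exp(v)."""
    v = 2.0 * u
    return v, math.exp(v)

TABLE = {name: logit_and_multiplier(u) for name, u in LOGLINEAR.items()}

def predicted_mortality_ratio(depth, seedling):
    """Dead/alive ratio implied by the additive logit model."""
    v = TABLE["const"][0] + TABLE[depth][0] + TABLE[seedling][0]
    return math.exp(v)

for name, (v, mult) in TABLE.items():
    print(f"{name:11s} logit = {v:6.3f}  mult = {mult:5.3f}")
print(predicted_mortality_ratio("Depth High", "Type Long"))
```

Because the model is additive on the log scale, the predicted ratio is equally the product of the three multiplicative effects.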

Example: Logit model for the Dyke-Patterson Data (N = 1729 individuals classified according to Newspapers, Radio, "solid" Reading, Lectures, and Knowledge regarding cancer; the data table is given in the forward-selection example)

The best model found by forward selection was [LN][KLS][KR][KN][LR][NR][NS].

To fit a logit model to predict K (Knowledge) we need to fit a loglinear model containing the important interactions with K (Knowledge), namely

[LNRS][KLS][KR][KN]

The logit model will contain

• Main effects for L (Lectures), N (Newspapers), R (Radio), and S (Reading)
• A two-factor interaction effect for L and S

The Logit Parameters for the Model: LNSR, KLS, KR, KN
(Multiplicative effects are given in brackets; Logit Parameters = 2 × Loglinear parameters)

The Constant term: -0.226 (0.798)

The Main effects on Knowledge:

Lectures     Lect      0.268 (1.307)
             None     -0.268 (0.765)
Newspaper    News      0.324 (1.383)
             None     -0.324 (0.723)
Reading      Solid     0.340 (1.405)
             Not      -0.340 (0.712)
Radio        Radio     0.150 (1.162)
             None     -0.150 (0.861)

The Two-factor interaction Effect of Reading and Lectures on Knowledge:

                        Reading
Lectures     Solid               Not
Lect        -0.180 (0.835)       0.180 (1.197)
None         0.180 (1.197)      -0.180 (0.835)

The ratio being modelled for K (Knowledge) is good/poor.
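To show how these parameters combine, the sketch below adds the constant, the four main effects, and the Lectures-by-Reading interaction to obtain the predicted log of the good/poor knowledge ratio for a given profile. The effect coding (+1 for the first category, -1 for "None"/"Not") follows the signs in the tables; the function and argument names are illustrative.

```python
import math

CONSTANT = -0.226
MAIN = {"lectures": 0.268, "newspaper": 0.324, "reading": 0.340, "radio": 0.150}
LECT_BY_READING = -0.180  # value for (Lect, Solid); sign flips with either factor

def sign(flag):
    """+1 for the first category (Lect, News, Solid, Radio), -1 otherwise."""
    return 1 if flag else -1

def log_odds_good(lectures, newspaper, reading, radio):
    """Predicted log(good/poor) for Knowledge under the fitted logit model."""
    s = {"lectures": sign(lectures), "newspaper": sign(newspaper),
         "reading": sign(reading), "radio": sign(radio)}
    v = CONSTANT + sum(MAIN[name] * s[name] for name in MAIN)
    return v + LECT_BY_READING * s["lectures"] * s["reading"]

none_of_them = log_odds_good(False, False, False, False)
all_of_them = log_odds_good(True, True, True, True)
print(none_of_them, math.exp(none_of_them))
print(all_of_them, math.exp(all_of_them))
```

For comparison, the observed good/poor ratio in the "none of them" cell of the data table is 84/393; the model-based value is a smoothed estimate and need not match it exactly.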

Fitting a Logit Model with a Polytomous Response Variable

Example:

Table: Observed Cross-Classification of 2294 Males Who Failed to Pass the Armed Forces Qualification Test

                     Father's        Respondent's Education
Race      Age        Education       Grammar School    Some HS    HS Graduate
White     < 22       GS                    39             29           8
                     Some HS                4              8           1
                     HS Grad               11              9           6
                     NA                    48             17           8
White     ≥ 22       GS                   231            115          51
                     Some HS               17             21          13
                     HS Grad               18             28          45
                     NA                   197            111          35
Black     < 22       GS                    19             40          19
                     Some HS                5             17           7
                     HS Grad                2             14           3
                     NA                    49             79          24
Black     ≥ 22       GS                   110            133         103
                     Some HS               18             38          25
                     HS Grad               11             25          18
                     NA                   178            206          81

NA – Not available

The variables

1. Race: white, black
2. Age: < 22, ≥ 22
3. Father's education: GS, some HS, HS grad, NA
4. Respondent's education: GS, some HS, HS grad (the response, or dependent, variable)

Table: Various Loglinear Models Fit to the 3 × 4 × 2 × 2 Table above

In the model notation, variable 1 is the response (respondent's education) and [234] is the margin of the three explanatory variables, which is included in every model.

Model                  d.f.      G2      p-value
[234][1]                30     254.8     0.0000
[234][12]               24     162.6     0.0000
[234][13]               28     242.7     0.0000
[234][14]               28     152.8     0.0000
[234][12][13]           22     151.5     0.0000
[234][12][14]           22      46.7     0.0016
[234][13][14]           26     142.5     0.0000
[234][12][13][14]       20      36.9     0.0120
[234][123][14]          14      27.9     0.0147
[234][124][13]          14      18.1     0.2023
[234][134][12]          18      33.2     0.0158
[234][123][124]          8       9.7     0.2867

Techniques for handling a Polytomous Response Variable

Approaches

1. Consider the categories 2 at a time. Do this for all possible pairs of the categories.
2. Look at the continuation ratios
   i. 1 vs 2
   ii. 1,2 vs 3
   iii. 1,2,3 vs 4
   iv. etc.
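Both approaches can be illustrated on a single row of observed counts. The sketch below uses the counts for white respondents under 22 whose father had a grammar school education (39, 29, 8) and computes the pairwise log ratios and the log continuation ratios. It works on raw counts rather than fitted expected frequencies, and the orientation of the continuation ratios (next category over all earlier ones) is an assumption chosen to match the definitions used later in this section.

```python
import math
from itertools import combinations

def pairwise_log_ratios(counts):
    """log(count_a / count_b) for every pair of categories a < b (1-based keys)."""
    return {(a + 1, b + 1): math.log(counts[a] / counts[b])
            for a, b in combinations(range(len(counts)), 2)}

def continuation_log_ratios(counts):
    """log(count_k / (count_1 + ... + count_{k-1})) for k = 2, 3, ..."""
    ratios = []
    running_total = counts[0]
    for c in counts[1:]:
        ratios.append(math.log(c / running_total))
        running_total += c
    return ratios

counts = [39, 29, 8]  # Grammar School, Some HS, HS Graduate
pairs = pairwise_log_ratios(counts)
continuation = continuation_log_ratios(counts)
print(pairs)
print(continuation)
```

With three categories, only two of the three pairwise log ratios are free: the (1, 3) value equals the sum of the (1, 2) and (2, 3) values.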

Table: Estimated Logit Effects for the Three Logit Models Corresponding to the Log Linear Model [234][124][13]

                                      Grammar vs Some HS    Grammar vs HS Grad    Some HS vs HS Grad
                                      log(m1jkl/m2jkl)      log(m1jkl/m3jkl)      log(m2jkl/m3jkl)
Constant                                   -0.289                 0.451                 0.740

Race                   White                0.395                 0.390                -0.005
                       Black               -0.395                -0.390                 0.005

Age                    < 22                -0.120                 0.099                 0.219
                       ≥ 22                 0.120                -0.099                -0.219

Father's Education     Grammar              0.380                 0.406                 0.026
                       Some HS             -0.371                -0.355                 0.016
                       HS Grad             -0.441                -0.918                -0.477
                       NA                   0.432                 0.867                 0.435

Race - Father's Education Interaction
White by               Grammar              0.063                 0.345                 0.282
                       Some HS             -0.128                -0.016                 0.112
                       HS Grad              0.030                -0.429                -0.459
                       NA                   0.035                 0.101                 0.066
Black by               Grammar             -0.063                -0.345                -0.282
                       Some HS              0.128                 0.016                -0.112
                       HS Grad             -0.030                 0.429                 0.459
                       NA                  -0.035                -0.101                -0.066

Table: Multiplicative Logit Effects for the Three Logit Models Corresponding to the Log Linear Model [234][124][13]

                                      Grammar vs Some HS    Grammar vs HS Grad    Some HS vs HS Grad
                                      m1jkl/m2jkl           m1jkl/m3jkl           m2jkl/m3jkl
Constant                                    0.749                 1.570                 2.096

Race                   White                1.484                 1.477                 0.995
                       Black                0.674                 0.677                 1.005

Age                    < 22                 0.887                 1.104                 1.245
                       ≥ 22                 1.127                 0.906                 0.803

Father's Education     Grammar              1.462                 1.501                 1.026
                       Some HS              0.690                 0.701                 1.016
                       HS Grad              0.643                 0.399                 0.621
                       NA                   1.540                 2.380                 1.545

Race - Father's Education Interaction
White by               Grammar              1.065                 1.412                 1.326
                       Some HS              0.880                 0.984                 1.119
                       HS Grad              1.030                 0.651                 0.632
                       NA                   1.036                 1.106                 1.068
Black by               Grammar              0.939                 0.708                 0.754
                       Some HS              1.137                 1.016                 0.894
                       HS Grad              0.970                 1.536                 1.582
                       NA                   0.966                 0.904                 0.936
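The multiplicative effects are the exponentials of the additive logit effects, and within each row the third additive effect (Some HS vs HS Grad) is the second minus the first. The sketch below checks both relations for the Constant and Race (White) rows; the tuples hold the additive effects quoted for this model, and the names are illustrative.

```python
import math

# Additive logit effects: (Grammar vs Some HS, Grammar vs HS Grad, Some HS vs HS Grad)
CONSTANT = (-0.289, 0.451, 0.740)
RACE_WHITE = (0.395, 0.390, -0.005)

def multiplicative(effects):
    """Exponentiate additive logit effects to get multiplicative effects."""
    return tuple(math.exp(e) for e in effects)

def third_is_difference(effects, tol=1e-3):
    """Check log(m2/m3) = log(m1/m3) - log(m1/m2) up to rounding."""
    first, second, third = effects
    return abs((second - first) - third) < tol

print([round(m, 3) for m in multiplicative(CONSTANT)])
print([round(m, 3) for m in multiplicative(RACE_WHITE)])
print(third_is_difference(CONSTANT), third_is_difference(RACE_WHITE))
```

The difference relation holds because m2/m3 = (m1/m3)/(m1/m2), so only two of the three logit models carry independent information.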

Table: Various Logit Models for the Log Continuation Ratios in the first Table

(a) log( m2jk / m1jk )
(b) log( m3jk / (m1jk + m2jk) )

                        (a)               (b)               Combined Fit
Model                   d.f.     G2       d.f.     G2       d.f.     G2
[234][1]                 15    131.5       15    123.3       30    254.8
[234][12]                12     97.9       12     64.7       24    162.6
[234][13]                14    123.3       14    119.4       28    242.7
[234][14]                14     49.0       14    102.8       28    152.8
[234][12][13]            11     91.9       11     60.3       22    152.2
[234][12][14]            11     16.1       11     35.6       22     51.7
[234][13][14]            13     43.7       13     98.7       26    142.4
[234][12][13][14]        10     12.4       10     29.8       20     42.2
[234][123][14]            7      9.3        7     23.2       14     32.5
[234][124][13]            7      9.3        7     23.2       14     18.5
[234][134][12]            9      8.6        9     29.7       18     38.3
[234][123][124]           4      8.5        4      1.2        8      9.7

Causal or Path Analysis for Categorical Data

When the data are continuous, a causal pattern may be assumed to exist amongst the variables.

The path diagram

This is a diagram summarizing causal relationships.

• Straight arrows are drawn from a variable to another variable on which it has a causal effect: X → Y
• Curved double-headed arrows are drawn between variables that are simply correlated: X ↔ Y

Example 1

The variables: Job Stress, Smoking, Heart Disease

[Path diagram linking Job Stress, Smoking, and Heart Disease]

In Path Analysis for continuous variables, one is interested in determining the contribution along each path (the path coefficients).

Example 2

The variables: Job Stress, Alcoholic Drinking, Smoking, Heart Disease

[Path diagram linking Job Stress, Drinking, Smoking, and Heart Disease]

In the analysis of categorical data there are no path coefficients, but path diagrams can point to the appropriate logit analysis.

Example

In this example the data consist of two-wave, two-variable panel data for a sample of n = 3398 schoolboys. The study looks at "membership in" and "attitude towards" the leading crowd.

[Path diagram over the variables A, B, C, D]

This suggests predicting B from A, then C from A and B, and finally D from A, B and C.
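The recursive structure can be written down mechanically: given a causal ordering of the variables, each variable after the first becomes the response of one logit model, with all earlier variables as candidate predictors. The sketch below only enumerates the models; it does not fit them, and the function name is illustrative.

```python
def recursive_logit_system(causal_order):
    """Pair each variable with all variables that precede it in the causal order."""
    return [(causal_order[i], causal_order[:i]) for i in range(1, len(causal_order))]

system = recursive_logit_system(["A", "B", "C", "D"])
for response, predictors in system:
    print(f"predict {response} from {', '.join(predictors)}")
```

Each pair then corresponds to a family of loglinear models in which the joint margin of the predictors is held fixed.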

Examples of Causal Analysis Using Recursive Systems of Logit Models

Example 1: Two-Wave Two-Variable Panel Data for 3398 Schoolboys: Membership in and attitude toward the "Leading Crowd".

                                      Second Interview
                                      Membership     +       +       -       -
                                      Attitude       +       -       +       -
First Interview
Membership     Attitude
    +             +                                 458     140     110      49
    +             -                                 171     182      56      87
    -             +                                 184      75     531     281
    -             -                                  85      97     338     554

A = Membership at first interview, B = Attitude at first interview, C = Membership at second interview, D = Attitude at second interview

Two-way Analysis for determining the effect of A on B

                        Attitude (B)
Membership (A)          +         -
      +                757       496
      -               1071      1074
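The two-way table is the A × B margin of the full panel table, obtained by summing each first-interview row over the four second-interview columns. The sketch below reproduces that margin and, as an illustrative summary not given in the slides, computes the sample odds ratio between membership and attitude at the first interview.

```python
import math

# Rows: (Membership, Attitude) at first interview.
# Columns: second interview (M+,A+), (M+,A-), (M-,A+), (M-,A-).
PANEL = {
    ("+", "+"): (458, 140, 110, 49),
    ("+", "-"): (171, 182, 56, 87),
    ("-", "+"): (184, 75, 531, 281),
    ("-", "-"): (85, 97, 338, 554),
}

def ab_margin(panel):
    """Collapse over the second-interview variables C and D."""
    return {ab: sum(row) for ab, row in panel.items()}

margin = ab_margin(PANEL)
odds_ratio = (margin[("+", "+")] * margin[("-", "-")]) / (
    margin[("+", "-")] * margin[("-", "+")])
print(margin)
print(odds_ratio, math.log(odds_ratio))
```

An odds ratio above 1 indicates that members at the first interview were more likely than non-members to hold a favourable attitude.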

Goodness of Fit Statistics for determining the effect of A, B on C

1. [AB][AC][BC] (1 df; G2 = 0.0)
2. [AB][BC] (2 df; G2 = 1005.1)
3. [AB][AC] (2 df; G2 = 27.2)

Identified Logit Model (Model # 1. [AB][AC][BC])

$$\text{logit}^{AB|C}_{ij} = \log\frac{m^{AB|C}_{ij1}}{m^{AB|C}_{ij2}} = w^{AB|C} + w^{AB|C}_{1(i)} + w^{AB|C}_{2(j)}$$

Goodness of Fit Statistics for determining the effect of A, B, C on D

4. [ABC][AD][BD][CD] (4 df; G2 = 1.2)
5. [ABC][BD][CD] (5 df; G2 = 4.0)
6. [ABC][AD][CD] (5 df; G2 = 262.5)
7. [ABC][AD][BD] (5 df; G2 = 15.7)

Identified Logit Model (Model # 5. [ABC][BD][CD])

$$\text{logit}^{ABC|D}_{ijk} = w^{ABC|D} + w^{ABC|D}_{2(j)} + w^{ABC|D}_{3(k)}$$

Example 2

In this example we are looking at

1. Socioeconomic Status (SES)
2. Sex
3. IQ
4. Parental Encouragement for Higher Education (PE)
5. College Plans (CP)

Social Class, Parental Encouragement, IQ, and Educational Aspirations

                 College    Parental                    SES
Sex     IQ       Plans      Encouragement      L      LM      UM       H
M       L        Yes        Low                4       2       8       4
                            High              13      27      47      39
                 No         Low              349     232     166      48
                            High              64      84      91      57
        LM       Yes        Low                9       7       6       5
                            High              33      64      74     123
                 No         Low              207     201     120      47
                            High              72      95     110      90
        UM       Yes        Low               12      12      17       9
                            High              38      93     148     224
                 No         Low              126     115      92      41
                            High              54      92     100      65
        H        Yes        Low               10      17       6       8
                            High              49     119     198     414
                 No         Low               67      79      42      17
                            High              43      59      73      54
F       L        Yes        Low                5      11       7       6
                            High               9      29      36      36
                 No         Low              454     285     163      50
                            High              44      61      72      58
        LM       Yes        Low                5      19      13       5
                            High              14      47      75     110
                 No         Low              312     236     193      70
                            High              47      88      90      76
        UM       Yes        Low                8      12      12      12
                            High              20      62      91     230
                 No         Low              216     164     174      48
                            High              35      85     100      81
        H        Yes        Low               13      15      20      13
                            High              28      72     142     360
                 No         Low               96     113      81      49
                            High              24      50      77      98

The Path Diagram

[Path diagram over the variables SES, Sex, IQ, PE (Parental Encouragement), and CP (College Plans)]

The path diagram suggests

1. Predicting Parental Encouragement from Sex, Socioeconomic Status, and IQ, then

2. Predicting College Plans from Parental Encouragement, Sex, Socioeconomic Status, and IQ.

Goodness of Fit Statistics for determining the effect of A, B, C on D
(A = Social class, B = IQ, C = Sex, D = Parental Encouragement, E = College Plans)

1. [ABC][AD][BD][CD] (24 df; G2 = 55.81)
2. [ABC][ABD][CD] (15 df; G2 = 34.60)
3. [ABC][BCD][ACD] (18 df; G2 = 31.48)
4. [ABC][ABD][BCD] (12 df; G2 = 22.44)
5. [ABC][ABD][ACD] (12 df; G2 = 22.45)
6. [ABC][ABD][ACD][BCD] (9 df; G2 = 9.22)

Logit Parameters: Model [ABC][ABD][ACD][BCD]

Constant term: $w^{ABC|D} = 0.124$

Main Effects

Social Class, $w^{ABC|D}_{1(i)}$:      L         LM        UM        H
                                    -1.178    -0.384     0.222     1.340

IQ, $w^{ABC|D}_{2(j)}$:                L         LM        UM        H
                                    -0.772    -0.226     0.210     0.788

Sex, $w^{ABC|D}_{3(k)}$:               M         F
                                     0.304    -0.304

Two factor Interactions

IQ by Social Class
                              IQ
Social Class        L         LM        UM        H
L                -0.016    -0.098    -0.058    -0.026
LM                0.066     0.032     0.144    -0.244
UM                0.074    -0.044    -0.138     0.108
H                -0.126    -0.086     0.048     0.164

Social Class by Sex
                        Sex
Social Class        M         F
L                 0.140    -0.140
LM               -0.052     0.052
UM                0.018    -0.018
H                -0.106     0.106

IQ by Sex
                        Sex
IQ                  M         F
L                -0.126     0.126
LM               -0.016     0.016
UM                0.018    -0.018
H                 0.122    -0.122

Goodness of Fit Statistics for determining the effect of A, B, C, D on E
(A = Social class, B = IQ, C = Sex, D = Parental Encouragement, E = College Plans)

7. [ABCD][E][CD] (63 df; G2 = 4497.51)
8. [ABCD][AE][BE][CE][DE] (55 df; G2 = 73.82)
9. [ABCD][BCE][AE][DE] (52 df; G2 = 59.55)

Logit Parameters for Predicting College Plans Using Model 9: [ABCD][BCE][AE][DE]

Constant term: $w^{ABCD|E} = -1.292$

Main Effects

Social Class, $w^{ABCD|E}_{1(i)}$:            L         LM        UM        H
                                           -0.650    -0.200     0.062     0.790

IQ, $w^{ABCD|E}_{2(j)}$:                      L         LM        UM        H
                                           -0.840    -0.300     0.266     0.876

Sex, $w^{ABCD|E}_{3(k)}$:                     M         F
                                            0.082    -0.082

Parental Encouragement, $w^{ABCD|E}_{4(l)}$:  L         H
                                           -1.214     1.214

Two Factor Interactions

IQ by Sex
                        Sex
IQ                  M         F
L                -0.134     0.134
LM               -0.078     0.078
UM                0.094    -0.094
H                 0.118    -0.118
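The fitted logit for College Plans is obtained by adding the constant, the four main effects, and the IQ-by-Sex interaction for the profile of interest. The sketch below does this with the Model 9 parameters. It assumes the logit is oriented as log(plans college / does not plan college), which the slides do not state explicitly; the function and argument names are illustrative.

```python
import math

CONSTANT = -1.292
SES = {"L": -0.650, "LM": -0.200, "UM": 0.062, "H": 0.790}
IQ = {"L": -0.840, "LM": -0.300, "UM": 0.266, "H": 0.876}
SEX = {"M": 0.082, "F": -0.082}
PE = {"L": -1.214, "H": 1.214}
IQ_BY_SEX = {
    ("L", "M"): -0.134, ("L", "F"): 0.134,
    ("LM", "M"): -0.078, ("LM", "F"): 0.078,
    ("UM", "M"): 0.094, ("UM", "F"): -0.094,
    ("H", "M"): 0.118, ("H", "F"): -0.118,
}

def college_plans_logit(ses, iq, sex, pe):
    """Sum of the Model 9 logit effects for one combination of categories."""
    return CONSTANT + SES[ses] + IQ[iq] + SEX[sex] + PE[pe] + IQ_BY_SEX[(iq, sex)]

def to_probability(logit):
    """Convert a logit to a probability (under the assumed orientation)."""
    return 1.0 / (1.0 + math.exp(-logit))

high_profile = college_plans_logit("H", "H", "M", "H")
low_profile = college_plans_logit("L", "L", "F", "L")
print(high_profile, to_probability(high_profile))
print(low_profile, to_probability(low_profile))
```

Parental Encouragement has the largest main effect in this model, so switching it from low to high shifts the logit by 2 × 1.214 regardless of the other variables.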
