
1.1

Analysis of categorical response data

Topics covered in lecture 1:

• What is categorical data
  - Response and explanatory variables
  - Measurement scales for categorical data
• Course coverage
• Tabulated count data and related questions
• Non-tabulated categorical data
• Sampling design for tables
• Links with other methods


1.2

What is categorical data? The measurement scale for the response consists of a number of categories.

Variable        Measurement scale
Farm system     Dairy, Beef, Tillage, etc.
Mortality       Dead, Alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3

Types of data discussed in this course

Response variable(s) is categorical.
Explanatory variable(s) may be categorical or continuous.

Example 1: Does post-operative survival (categorical response) depend on the explanatory variables?
Sex (categorical)
Age (continuous)

Example 2: In a random sample of Irish farmers, is there a relationship between attitude to the EU and farm system?
Farm system (categorical)
Attitude to EU (categorical/ordinal)?
(Two response variables, no explanatory variables.) Could one of these be regarded as explanatory?


1.3

Measurement scales for categorical data

Nominal - no underlying order

Variable        Measurement scale
Farm system     Dairy, Beef, Tillage, etc.
Weed species    Stellaria media, Poa annua, etc.

Ordinal - underlying order in the scale

Variable            Measurement scale
Food texture        Very soft, Soft, Hard, Very hard
Disease diagnosis   Very likely, Likely, Unlikely
Education           Primary, Secondary, Tertiary

Interval - underlying numerical distance between scale points

Variable        Measurement scale
Litter size     0, 1, 2, 3 and >3
Age class       <1, 1-2, 2-3.5, 3.5-5, >5
Education       Years in education


1.4

Tabulated count data and questions

Single-level table

Example 1: A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.

Wild type   Mutant   Total
80          10       90

Is there evidence that the wild type is dominant, giving on average a 3:1 offspring phenotype ratio in its favour?
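One way to look at the 3:1 question is a chi-squared goodness-of-fit test. The Python sketch below is an illustration only, not necessarily the method used later in the course; the function names are made up for this example, and the p-value uses the closed form for a chi-squared distribution with 1 degree of freedom.

```python
import math

def chi2_goodness_of_fit(observed, expected_props):
    """Pearson chi-squared statistic for observed counts against hypothesised proportions."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

observed = [80, 10]  # wild type, mutant (F2 progeny from the table above)
stat, expected = chi2_goodness_of_fit(observed, [0.75, 0.25])  # 3:1 hypothesis
p_value = chi2_sf_df1(stat)

print("expected counts:", expected)
print(f"chi-squared = {stat:.3f}, p = {p_value:.4f}")
```

With two categories and no estimated parameters the statistic has 1 degree of freedom; a small p-value indicates that the observed split departs from 3:1.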

Two-way table

Example 1: A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

                   Outcome
              Dead   Alive   Total   % dead
+ antiserum     19      65      84       23
- antiserum     18      22      40       45
Total           37      87     124

Is there an association between mortality and treatment?
Is the mortality rate the same for both treatments?


1.5

Example 2: Categorical response and categorical explanatory variable. The opinion poll after the Good Friday Agreement, with respondents classified by religion (R: Catholic or Protestant).

              Favour   Oppose   Undec.   Total   % Favour
Catholic         258       32       62     352         73
Protestant       149       91      208     448         33
Total            407      123      270     800         51
% Cath            63       26       23

1. Is there evidence that a majority of decided voters (all voters) support the agreement?
2. Is the support pattern the same for Protestants and Catholics?
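Question 1 can be explored with a one-sided normal-approximation test of a proportion against 0.5. The Python sketch below is one possible approach under that assumption, not necessarily the course's method; the function name is hypothetical.

```python
import math

def one_sided_z_test(successes, n, p0=0.5):
    """Test H0: p = p0 against H1: p > p0 using the normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)              # standard error under H0
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal
    return p_hat, z, p_value

favour, oppose, undecided = 407, 123, 270  # column totals from the table above

decided = one_sided_z_test(favour, favour + oppose)
all_voters = one_sided_z_test(favour, favour + oppose + undecided)

print("decided voters: p_hat=%.3f z=%.2f p=%.3g" % decided)
print("all voters:     p_hat=%.3f z=%.2f p=%.3g" % all_voters)
```

The two denominators give very different answers, which is why the question distinguishes decided voters from all voters.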


1.6

Example 3 (Snedecor and Cochran): Categorical response and interval categorical explanatory variable. The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate. Has the higher concentration given a significantly different percentage kill? Is there a relationship between concentration and mortality?

           Concentration of sodium oleate (%)
           0.65    1.10     1.6     2.1    Total
Dead         55      62     100      72      289
Alive        22      13      12       5       52
Total        77      75     112      77      341
% Dead     71.4    82.7    89.3    93.5     84.8

Is mortality related to sodium oleate concentration?


1.7

Example 4 (Cornfield 1962): Categorical response and interval categorical explanatory variable. Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period. The data were classified by BP (interval categorical variable in 8 classes) and CHD (CHD or No CHD).

BP           CHD   No CHD   Total   % CHD
<117           3      153     156     1.9
117 - 126     17      235     252     6.7
127 - 136     12      272     284     4.2
137 - 146     16      255     271     5.9
147 - 156     12      127     139     8.6
157 - 166      8       77      85     9.4
167 - 186     16       83      99    16.2
>186           8       35      43    18.6
Total         92     1237    1329

1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability of CHD and the level of BP?


1.8

Multiway table - relationship between categorical responses, or between a categorical response and several categorical explanatory variables

Example 1: The NI opinion poll, with respondents further classified by where they lived in Northern Ireland (L) (ARL table).

West: rural and strongly nationalist/Catholic
Belfast: mixed population
North East: industrial and Unionist/Protestant

                          Favour   Oppose   Undecided
West         Catholic         73       20          20
             Protestant       47       34          69
Belfast      Catholic         90        9          21
             Protestant       54       23          66
North East   Catholic         95        3          21
             Protestant       48       34          73
Total                        407      123         270

1. Is there evidence that a majority of decided voters (all voters) support the agreement?
2. Is there a difference in support pattern between Protestants and Catholics?
3. Is the difference in support pattern between Protestants and Catholics consistent over region?
4. Within the Catholic (Protestant) population, does the strength of support change with region? Etc.
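A first descriptive step towards questions 3 and 4 is to tabulate the percentage in favour for each region-by-religion cell. The Python sketch below only summarises the counts above; it is not a formal test, and the variable and function names are made up for this illustration.

```python
# (Favour, Oppose, Undecided) counts copied from the three-way table above.
counts = {
    ("West", "Catholic"): (73, 20, 20),
    ("West", "Protestant"): (47, 34, 69),
    ("Belfast", "Catholic"): (90, 9, 21),
    ("Belfast", "Protestant"): (54, 23, 66),
    ("North East", "Catholic"): (95, 3, 21),
    ("North East", "Protestant"): (48, 34, 73),
}

def percent_favour(cell):
    """Percentage of all respondents in a cell (including undecided) who favour."""
    favour, oppose, undecided = cell
    return 100.0 * favour / (favour + oppose + undecided)

pct = {key: percent_favour(cell) for key, cell in counts.items()}

# Column totals should reproduce the Total row of the table.
column_totals = tuple(sum(col) for col in zip(*counts.values()))

for (region, religion), value in pct.items():
    print(f"{region:10s} {religion:10s} {value:5.1f}% in favour")
print("Totals (Favour, Oppose, Undecided):", column_totals)
```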


1.9

Example 2: Grouped binomial data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60).

Sex   Age group   Psych. case   On drugs   Total
M     1           No                   9     531
M     2           No                  16     500
M     3           No                  38     644
M     4           No                  26     275
M     5           No                   9      90
M     1           Yes                 12     171
M     2           Yes                 16     125
M     3           Yes                 31     121
M     4           Yes                 16      56
M     5           Yes                 10      26
F     1           No                  12     588
F     2           No                  42     596
F     3           No                  96     765
F     4           No                  52     327
F     5           No                  30     179
F     1           Yes                 33     210
F     2           Yes                 47     189
F     3           Yes                 71     242
F     4           Yes                 45      98
F     5           Yes                 21      60

Is psychotropic drug use affected by gender, age or psychological state, and are there interactions among these effects?
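Each row of the table is a binomial observation: the number on drugs out of a group total. As a rough first look, the Python sketch below pools the groups by one factor at a time and reports the proportion on drugs. The function name is hypothetical, and such marginal proportions ignore the other factors, so they cannot reveal the interactions the question asks about.

```python
# (sex, age group, psychiatric case, number on drugs, group total), from the table above.
rows = [
    ("M", 1, "No", 9, 531), ("M", 2, "No", 16, 500), ("M", 3, "No", 38, 644),
    ("M", 4, "No", 26, 275), ("M", 5, "No", 9, 90),
    ("M", 1, "Yes", 12, 171), ("M", 2, "Yes", 16, 125), ("M", 3, "Yes", 31, 121),
    ("M", 4, "Yes", 16, 56), ("M", 5, "Yes", 10, 26),
    ("F", 1, "No", 12, 588), ("F", 2, "No", 42, 596), ("F", 3, "No", 96, 765),
    ("F", 4, "No", 52, 327), ("F", 5, "No", 30, 179),
    ("F", 1, "Yes", 33, 210), ("F", 2, "Yes", 47, 189), ("F", 3, "Yes", 71, 242),
    ("F", 4, "Yes", 45, 98), ("F", 5, "Yes", 21, 60),
]

def pooled_rate(rows, index):
    """Pool groups sharing the value in column `index`.

    Returns {value: (number on drugs, total, proportion on drugs)}.
    """
    pooled = {}
    for row in rows:
        on, total = pooled.get(row[index], (0, 0))
        pooled[row[index]] = (on + row[3], total + row[4])
    return {key: (on, total, on / total) for key, (on, total) in pooled.items()}

by_sex = pooled_rate(rows, 0)
by_age = pooled_rate(rows, 1)
by_case = pooled_rate(rows, 2)

for label, table in (("sex", by_sex), ("age group", by_age), ("psych. case", by_case)):
    for key, (on, total, prop) in table.items():
        print(f"{label:12s} {str(key):4s} {on:4d}/{total:5d} = {prop:.3f}")
```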


1.10

Non-tabulated data and questions

Example 1: Individual plants of Legousia were monitored in an experiment to see whether they survived after 3 months. Survived - yes is scored 1 and Survived - no is scored 0. Also recorded were:

CO2 treatment: 2 levels, low and high
Density of Legousia
Density of companion species
Height of the plant (mm) two weeks after planting

Most individuals will have a unique profile in these three additional variables, so tabulation of the data by them is not feasible. The individual data are presented.

                               Density
Subject   Surv   CO2   Ht   Leg.   Comp
1         0      L     35   20     30
2         1      L     68   22     27
3         1      H     43   16     33
4         0      L     27    4     16
…         …      …     …    …      …

1. Is survival related to the explanatory variables (CO2, height, density-self, density-companions)?
2. Can the probability of survival be predicted from the subject’s profile?


1.11

Example 2: A sample of 62 patients who had angioplasty for coronary artery disease were followed to see if they reblocked (restenosed) after 6 months. RS - yes is scored 1 and RS - no is scored 0 (a binary categorical response variable). Also recorded were:

Age in years - ‘continuous’ variate
Blood pressure (BP) - continuous variate
Sex - nominal categorical (?)
Cholesterol - continuous

Most individuals will have a unique profile in these four additional variables, so tabulation of the data by them is not feasible. The individual data are presented.

Subject   RS   Age   BP    Sex   Cholest.
1         0    35    117   m     1
2         1    68    154   f     5
3         1    43    123   f     2
4         0    27    110   m     3
…         …    …     …     …     …

1. Is RS related to the explanatory variables (Age, BP, Sex and Cholesterol)?
2. Can the probability of RS be predicted from the subject’s profile?


1.12

Sampling designs - two and multiway tables

Single sample (no margin fixed) simultaneously classified by several categorical variables. Used in cross-sectional studies.

Example: A simple random sample of 200 students was classified by gender and attitude to EU integration.

             EU integration
          Favour   Oppose   Total
Male          43       53      96
Female        61       33     104
Total        104       86

This is a snapshot of opinion at a moment in time, hence cross-sectional.


1.13

One margin fixed: Samples of fixed size are selected for one category and individuals are classified by the other category(s).

Example 1 (Clinical trial - a prospective study): Of 400 HIV positive pregnant women, 200 are assigned at random to each of Breast feeding (BF) or Formula feeding (FF). Two years after birth the child’s HIV status is determined.

        Child’s status (???)
        HIV +   HIV -   Total
BF         62     138     200
FF         45     155     200

Example 2 (Cohort study - a prospective study): 400 HIV positive pregnant women are asked to select either Breast feeding (BF) or Formula feeding (FF). Two years after birth the child’s HIV status is determined. Here the sample totals are determined by the mothers’ choices.

Example 3 (Case-control or retrospective study): A sample of 200 HIV+ and another of 200 HIV- two-year-old children are selected and classified by whether they were BF or FF. Here the HIV outcome numbers are controlled, so the % HIV cannot be computed from BF and FF.
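In the clinical trial of Example 1 the row totals (200 per feeding method) are fixed by design, so the proportion of HIV+ children within each feeding group is a meaningful estimate. The Python sketch below, with illustrative variable names, computes those proportions; the same arithmetic would not be valid for the case-control design of Example 3, where the outcome totals are fixed instead.

```python
# (HIV+, HIV-) counts per feeding method, from the clinical trial table above.
trial = {"BF": (62, 138), "FF": (45, 155)}

# Proportion of children HIV+ within each feeding group (valid because row totals are fixed).
risk = {group: pos / (pos + neg) for group, (pos, neg) in trial.items()}
risk_difference = risk["BF"] - risk["FF"]

for group, value in risk.items():
    print(f"{group}: {100 * value:.1f}% HIV+")
print(f"difference (BF - FF): {100 * risk_difference:.1f} percentage points")
```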


1.14

[Figure: time axis (Past, Present, Future) showing where cohort, case-control and cross-sectional studies sit]


1.15

Notes on sampling designs

• In more complex studies more than one margin may be fixed.

Example 1: Any replicated factorial experiment where the response is binary.

Example 2: Physicians health study. NEJM 1988, 262-264. Four treatments:

Treatment   Aspirin   Beta carotene
A           No        No
B           Yes       No
C           No        Yes
D           Yes       Yes

Example 3: 2x2 table with both margins fixed?

• The statistical properties differ considerably between sampling schemes. Nevertheless, the methods to be discussed below apply, with some modifications, to data collected using any of these sampling schemes.


1.16

Relationships with regression methods.

Traditionally, categorical data analysis has been viewed as completely distinct from and unconnected with regression and ANOVA methods. We show that there are many strong links and that many concepts transfer naturally between the methods.


1.17

SAS Analysis of example 1

A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

                   Outcome
              Dead   Alive   Total   % dead
+ antiserum     19      65      84       23
- antiserum     18      22      40       45
Total           37      87     124

Is there an association between mortality and treatment?
Is the mortality rate the same for both treatments?


1.18

SAS program for analysis of example 1 data

PROC FREQ

OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER;

/* One record per cell of the 2x2 table: treatment, outcome, cell count */
DATA ANTISER;
  INPUT ANTISER $ MORTALI $ COUNT;
  CARDS;
A_plus  Dead  19
A_plus  Alive 65
A_minus Dead  18
A_minus Alive 22
;

/* WEIGHT COUNT makes each record stand for COUNT mice */
PROC FREQ;
  TABLES ANTISER*MORTALI / CHISQ EXPECTED DEVIATION CELLCHI2
                           NOROW NOCOL NOPERCENT NOCUM;
  WEIGHT COUNT;
RUN;


1.19

Table of ANTISER by MORTALI

ANTISER          MORTALI
Frequency      |
Expected       |
Deviation      |
Cell Chi-Square| Alive  | Dead   | Total
---------------+--------+--------+
A_minus        |     22 |     18 |    40
               | 28.065 | 11.935 |
               | -6.065 | 6.0645 |
               | 1.3105 | 3.0814 |
---------------+--------+--------+
A_plus         |     65 |     19 |    84
               | 58.935 | 25.065 |
               | 6.0645 | -6.065 |
               |  0.624 | 1.4673 |
---------------+--------+--------+
Total                87       37     124

Statistics for Table of ANTISER by MORTALI

Statistic                      DF    Value     Prob
Chi-Square                      1    6.4833    0.0109
Likelihood Ratio Chi-Square     1    6.2846    0.0122

Sample Size = 124

Observed counts

                   Outcome
              Dead   Alive   Total   % Dead
+ antiserum     19      65      84       23
- antiserum     18      22      40       45
Total           37      87     124       30

1.20

Expected counts if outcome is independent of treatment

                             Outcome
              Dead               Alive              Total   % Dead
+ antiserum   0.3 × 84 = 25.2    0.7 × 84 = 58.8       84       30
- antiserum   0.3 × 40 = 12.0    0.7 × 40 = 28.0       40       30
Total         37                 87                   124       30

Is there a discrepancy between observed and expected?

Chi-squared: $\chi^2 = \sum (\text{Observed} - \text{Expected})^2 / \text{Expected}$, summed over all cells of the table.
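The calculation can be checked outside SAS. The Python sketch below (function names are illustrative) computes exact expected counts from the margins rather than the rounded 0.3 and 0.7 used on the slide, then the Pearson and likelihood-ratio chi-squared statistics for the mice table, with p-values from the closed form for 1 degree of freedom.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def likelihood_ratio_chi2(table):
    """2 * sum over cells of observed * ln(observed / expected)."""
    exp = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, exp)
                     for o, e in zip(obs_row, exp_row) if o > 0)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Rows: + antiserum, - antiserum; columns: dead, alive.
mice = [[19, 65], [18, 22]]
x2 = pearson_chi2(mice)
g2 = likelihood_ratio_chi2(mice)
p_x2 = chi2_sf_df1(x2)
p_g2 = chi2_sf_df1(g2)

print(f"Pearson chi-squared = {x2:.4f}, p = {p_x2:.4f}")
print(f"Likelihood ratio    = {g2:.4f}, p = {p_g2:.4f}")
```

Under these assumptions the values should agree with the Chi-Square and Likelihood Ratio Chi-Square lines of the PROC FREQ output.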


1.21

SAS Analysis of example 3

The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate. Has the higher concentration given a significantly different percentage kill? Is there a relationship between concentration and mortality?

           Concentration of sodium oleate (%)
           0.65    1.10     1.6     2.1    Total
Dead         55      62     100      72      289
Alive        22      13      12       5       52
Total        77      75     112      77      341

Is mortality independent of sodium oleate concentration?


1.22

SAS program for analysis of Insecticide data

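PROC FREQ

OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER;

/* One record per cell: concentration (%), outcome code, cell count. */
/* D_AL = 1 for dead, 2 for alive.                                   */
DATA INSECT;
  INPUT SODOL D_AL COUNT;
  CARDS;
0.65 1 55
1.10 1 62
1.6  1 100
2.1  1 72
0.65 2 22
1.10 2 13
1.6  2 12
2.1  2 5
;

/* WEIGHT COUNT makes each record stand for COUNT aphids */
PROC FREQ;
  TABLES D_AL*SODOL / CHISQ EXPECTED DEVIATION CELLCHI2
                      NOROW NOCOL NOPERCENT NOCUM;
  WEIGHT COUNT;
RUN;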


1.23

Output from SAS PROC FREQ

TABLE OF D_AL BY SODOL

D_AL       SODOL
FREQUENCY|
EXPECTED |
DEVIATION|
CELL CHI2|    0.65|     1.1|     1.6|     2.1|  TOTAL
---------+--------+--------+--------+--------+
       1 |     55 |     62 |    100 |     72 |    289
         |   65.3 |   63.6 |   94.9 |   65.3 |
         |  -10.3 |   -1.6 |    5.1 |    6.7 |
         |1.61249 |.038436 |.271785 |.696522 |
---------+--------+--------+--------+--------+
       2 |     22 |     13 |     12 |      5 |     52
         |   11.7 |   11.4 |   17.1 |   11.7 |
         |   10.3 |    1.6 |   -5.1 |   -6.7 |
         |8.96172 |.213617 | 1.5105 |3.87106 |
---------+--------+--------+--------+--------+
TOTAL          77       75      112       77     341


1.24

STATISTICS FOR TABLE OF D_AL BY SODOL

STATISTIC                      DF    VALUE     PROB
------------------------------------------------------
CHI-SQUARE                      3    17.176    0.001
LIKELIHOOD RATIO CHI-SQUARE     3    16.633    0.001
MANTEL-HAENSZEL CHI-SQUARE      1    16.157    0.000
PHI                                   0.224
CONTINGENCY COEFFICIENT               0.219
CRAMER'S V                            0.224

Conclusion: Insect mortality is not independent of dose. Mortality is not constant as dose changes.

           Sodium oleate (%)
           0.65    1.10     1.6     2.1    Total
Dead         55      62     100      72      289
Alive        22      13      12       5       52
Total        77      75     112      77      341
% Dead     71.4    82.7    89.3    93.5     84.8
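As a cross-check on the SAS output, the Python sketch below recomputes the Pearson chi-squared statistic for the 2 x 4 aphid table. The function names are illustrative, and the p-value uses the closed-form upper tail of a chi-squared distribution with 3 degrees of freedom, (2 - 1)(4 - 1) = 3.

```python
import math

def pearson_chi2(table):
    """Pearson chi-squared for a two-way table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Rows: dead, alive; columns: 0.65, 1.10, 1.6, 2.1 % sodium oleate.
aphids = [[55, 62, 100, 72],
          [22, 13, 12, 5]]

x2 = pearson_chi2(aphids)
p_value = chi2_sf_df3(x2)
pct_dead = [100.0 * d / (d + a) for d, a in zip(*aphids)]

print(f"Pearson chi-squared = {x2:.3f} on 3 df, p = {p_value:.4f}")
print("% dead by concentration:", [round(p, 1) for p in pct_dead])
```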


1.25

Group two lowest and two highest levels
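One reading of this instruction is to collapse the four concentrations into a low group (0.65 and 1.10) and a high group (1.6 and 2.1), giving a 2 x 2 table. The Python sketch below does that under this assumed grouping; the labels and variable names are made up for the illustration.

```python
# Counts by concentration (0.65, 1.10, 1.6, 2.1 % sodium oleate), from the table above.
dead = [55, 62, 100, 72]
alive = [22, 13, 12, 5]

# Assumed grouping: first two columns are "low", last two are "high".
collapsed = {
    "low (0.65, 1.10)": (sum(dead[:2]), sum(alive[:2])),
    "high (1.6, 2.1)": (sum(dead[2:]), sum(alive[2:])),
}

pct_dead = {group: 100.0 * d / (d + a) for group, (d, a) in collapsed.items()}

for group, (d, a) in collapsed.items():
    print(f"{group:18s} dead={d:3d} alive={a:3d} % dead={pct_dead[group]:.1f}")
```

The collapsed table can then be analysed like the 2 x 2 mice example.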


1.26

Analysis of CHD data

Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period. The data were classified by BP (interval categorical variable in 8 classes) and CHD (CHD or No CHD).

BP           CHD   No CHD   Total   % CHD
<117           3      153     156     1.9
117 - 126     17      235     252     6.7
127 - 136     12      272     284     4.2
137 - 146     16      255     271     5.9
147 - 156     12      127     139     8.6
157 - 166      8       77      85     9.4
167 - 186     16       83      99    16.2
>186           8       35      43    18.6
Total         92     1237    1329

1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability of CHD and the level of BP?
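A starting point for question 1 is the same Pearson chi-squared calculation applied to the 8 x 2 table, which has (8 - 1)(2 - 1) = 7 degrees of freedom. The Python sketch below is an illustration with made-up names, not the course's own analysis; it compares the statistic with 14.07, the 5% critical value of the chi-squared distribution on 7 degrees of freedom, and recomputes the % CHD column.

```python
# (BP class, CHD, No CHD) from the table above.
chd_data = [
    ("<117", 3, 153), ("117-126", 17, 235), ("127-136", 12, 272),
    ("137-146", 16, 255), ("147-156", 12, 127), ("157-166", 8, 77),
    ("167-186", 16, 83), (">186", 8, 35),
]

def pearson_chi2(table):
    """Pearson chi-squared for a two-way table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((obs - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))

table = [[chd, no_chd] for _, chd, no_chd in chd_data]
x2 = pearson_chi2(table)
pct_chd = {label: 100.0 * chd / (chd + no_chd) for label, chd, no_chd in chd_data}

CRITICAL_5PCT_DF7 = 14.07  # upper 5% point of chi-squared on 7 df
print(f"Pearson chi-squared = {x2:.2f} on 7 df (5% critical value {CRITICAL_5PCT_DF7})")
for label, value in pct_chd.items():
    print(f"{label:8s} {value:5.1f}% CHD")
```

This test ignores the ordering of the BP classes, so it does not by itself address question 2.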