I've Got a Categorical Variable?! Now What? Andrew Zieffler Department of Educational Psychology Research Methodology and Consulting Center (RMCC): Lunch & Learn October 03, 2018


I've Got a Categorical Variable?!

Now What?

Andrew Zieffler

Department of Educational Psychology

Research Methodology and Consulting Center (RMCC): Lunch & Learn

October 03, 2018


Scales of Measurement

Classification system describing the nature of information within the values assigned to variables. (Stevens, 1946)

Scale     Property          Operations  Examples
Nominal   Class membership  = and ≠     College major, Sex, Political affiliation
Ordinal   Comparison        < and >     Likert data, Rankings, Scoville scale
Interval  Difference        + and –     GRE scores, Temperature (°F)
Ratio     Magnitude         × and ÷     Income, Class size, Years of experience

Categorical data are here: categorical variables are at the nominal scale of measurement. Although in practice we assign numbers to represent the levels of a categorical variable, those numbers carry no more information than group membership.


Dichotomous and Polytomous Variables

When a categorical variable has two levels we refer to it as dichotomous or binary. If it has more than two levels, it is polychotomous or polytomous.

Major (polytomous): Kinesiology, Special Education, Special Education, Child Psychology, Kinesiology, Kinesiology, Child Psychology

STEM Major? (dichotomous): STEM, Non-STEM, Non-STEM, STEM, STEM, STEM, Non-STEM


Contingency Tables

A contingency table is simply a table that lists each level of the categorical variable along with that level's count or percentage.

Data

Major: Kinesiology, Special Education, Special Education, Child Psychology, Kinesiology, Kinesiology, Child Psychology

Contingency Table

Major              Count
Kinesiology        3
Special Education  2
Child Psychology   2
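As a minimal sketch of how such a table can be tallied, the example below uses Python's standard library. Python is an assumption here (the slides do not prescribe any software), and the variable names are illustrative.

```python
from collections import Counter

# The seven majors from the example data.
majors = [
    "Kinesiology", "Special Education", "Special Education",
    "Child Psychology", "Kinesiology", "Kinesiology", "Child Psychology",
]

counts = Counter(majors)  # maps each level to its count
n = len(majors)

# Print each level with its count and percentage.
for level, count in counts.most_common():
    print(f"{level:<18} {count}  ({count / n:.1%})")
```

Dividing each count by the number of cases gives the percentage version of the same table.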


Contingency tables can also be used to show cross-classifications of two (or more) variables.

Data

Major              Sex
Kinesiology        Female
Special Education  Female
Special Education  Male
Child Psychology   Female
Kinesiology        Male
Kinesiology        Male
Child Psychology   Female

Contingency Table

                   Sex
Major              Female  Male  Total
Kinesiology        1       2     3
Special Education  1       1     2
Child Psychology   2       0     2
Total              4       3     7
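The cross-classification can be built the same way by counting pairs of values. This is a sketch using only the Python standard library (an assumed choice of language); the names `records`, `table`, `row_totals`, and `col_totals` are illustrative.

```python
from collections import Counter

# (major, sex) pairs from the example data.
records = [
    ("Kinesiology", "Female"), ("Special Education", "Female"),
    ("Special Education", "Male"), ("Child Psychology", "Female"),
    ("Kinesiology", "Male"), ("Kinesiology", "Male"),
    ("Child Psychology", "Female"),
]

cells = Counter(records)  # (major, sex) -> count
majors = ["Kinesiology", "Special Education", "Child Psychology"]
sexes = ["Female", "Male"]

# Counter returns 0 for combinations that never occur
# (for example, Child Psychology and Male).
table = {m: [cells[(m, s)] for s in sexes] for m in majors}
row_totals = {m: sum(row) for m, row in table.items()}
col_totals = [sum(table[m][j] for m in majors) for j in range(len(sexes))]

for m in majors:
    print(f"{m:<18} {table[m][0]:>6} {table[m][1]:>4} {row_totals[m]:>5}")
print(f"{'Total':<18} {col_totals[0]:>6} {col_totals[1]:>4} {sum(col_totals):>5}")
```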


Bar charts are just graphical summaries of the information in a contingency table.

                   Sex
Major              Female  Male  Total
Kinesiology        1       2     3
Special Education  1       1     2
Child Psychology   2       0     2
Total              4       3     7


Many methods of analyzing categorical data are based on contingency tables:

• Methods of association
    ‣ Chi-squared (χ²) statistics and tests
    ‣ Phi coefficient
    ‣ Tetrachoric correlation
    ‣ Cramér's V
    ‣ Goodman & Kruskal's lambda
    ‣ Goodman & Kruskal's gamma
    ‣ Kendall's tau
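To illustrate how a measure of association is computed from a contingency table, the sketch below calculates the Pearson chi-squared statistic and Cramér's V for the Major by Sex table, using plain Python (an assumed choice of language). With only seven cases the expected counts are far too small for a trustworthy test, so treat this purely as an arithmetic illustration.

```python
import math

# Observed counts from the Major x Sex table (columns: Female, Male).
observed = [
    [1, 2],  # Kinesiology
    [1, 1],  # Special Education
    [2, 0],  # Child Psychology
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-squared: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (obs - expected) ** 2 / expected

# Cramer's V rescales chi-squared to the 0-1 range.
k = min(len(observed), len(observed[0])) - 1
cramers_v = math.sqrt(chi_sq / (n * k))

print(f"chi-squared = {chi_sq:.3f}, Cramer's V = {cramers_v:.3f}")
```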

• Log-linear modeling

• Correspondence analysis


Challenge #1: Many common statistical methods require quantitative variables

To alleviate this problem, we typically re-code (or treat) categorical variables so that they are quantitative. (Remember: The numbers only denote group membership.)

ID  Major              Recoded 1  Recoded 2  Recoded 3
1   Kinesiology        1          1          -1
2   Special Education  2          100        0
3   Special Education  2          100        0
4   Child Psychology   3          3000       1
5   Kinesiology        1          1          -1
6   Kinesiology        1          1          -1
7   Child Psychology   3          3000       1


One common method for coding categorical variables is dummy coding (also known as reference coding).

• Dummy coding uses only the values 0 and 1.
• Typically, 1 denotes membership in a particular level and 0 denotes non-membership in that level.

Dummy/Reference Coding

ID  Graduation Status  Graduated
1   Graduated          1
2   Did not graduate   0
3   Graduated          1
4   Graduated          1
5   Graduated          1
6   Did not graduate   0
7   Graduated          1
8   Did not graduate   0
9   Graduated          1
10  Graduated          1


Dummy coding has several advantages. For example, the mean of a dummy-coded variable is the proportion of cases coded as 1.

The proportion of students who graduated is 0.7.

ID  Graduation Status  Graduated
1   Graduated          1
2   Did not graduate   0
3   Graduated          1
4   Graduated          1
5   Graduated          1
6   Did not graduate   0
7   Graduated          1
8   Did not graduate   0
9   Graduated          1
10  Graduated          1

\[
\text{Mean} = \frac{1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 1 + 1}{10} = \frac{7}{10} = 0.7
\]
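The same calculation can be sketched in a few lines of Python (an assumed choice of language; the variable names are illustrative). The status labels are first dummy coded, and the mean of the resulting 0/1 values is the proportion who graduated.

```python
graduation_status = [
    "Graduated", "Did not graduate", "Graduated", "Graduated", "Graduated",
    "Did not graduate", "Graduated", "Did not graduate", "Graduated", "Graduated",
]

# Dummy code: 1 = graduated, 0 = did not graduate.
graduated = [1 if status == "Graduated" else 0 for status in graduation_status]

# The mean of a 0/1 variable is the proportion of 1s.
mean_graduated = sum(graduated) / len(graduated)
print(mean_graduated)  # 0.7
```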


With polytomous variables we need to use more than one dummy variable to code all of the categories.

To distinctly code all of the categories we need to create a dummy variable for all categories except one.

ID  Major              Kinesiology  Spec_Ed
1   Kinesiology        1            0
2   Special Education  0            1
3   Special Education  0            1
4   Child Psychology   0            0
5   Kinesiology        1            0
6   Kinesiology        1            0
7   Child Psychology   0            0

• Kinesiology majors: Kinesiology = 1 and Spec_Ed = 0
• Special Education majors: Kinesiology = 0 and Spec_Ed = 1
• Child Psychology majors: Kinesiology = 0 and Spec_Ed = 0
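A minimal sketch of this coding scheme in Python (an assumed choice of language) is shown below. The helper `dummy_code` is hypothetical, written for this illustration; it creates one 0/1 indicator for every level except the chosen reference level, which is identified by zeros on all indicators.

```python
def dummy_code(values, reference):
    """Return {level: [0/1, ...]} with one indicator per non-reference level."""
    # dict.fromkeys keeps the levels in order of first appearance.
    levels = [lvl for lvl in dict.fromkeys(values) if lvl != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}


majors = [
    "Kinesiology", "Special Education", "Special Education",
    "Child Psychology", "Kinesiology", "Kinesiology", "Child Psychology",
]

# Child Psychology is the reference level, as in the example table.
dummies = dummy_code(majors, reference="Child Psychology")

for i, major in enumerate(majors):
    row = [dummies[lvl][i] for lvl in dummies]
    print(f"{i + 1} {major:<18} {row}")
```

Which level serves as the reference is a choice made by the analyst; changing it changes which comparisons the dummy variables represent, not the information they carry.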


Because there are multiple dummy variables, we compute multiple means.

\[
\text{Special Education} = \frac{0 + 1 + 1 + 0 + 0 + 0 + 0}{7} = \frac{2}{7} = 0.286
\]

\[
\text{Kinesiology} = \frac{1 + 0 + 0 + 0 + 1 + 1 + 0}{7} = \frac{3}{7} = 0.429
\]

• The proportion of Kinesiology majors is 0.429.
• The proportion of Special Education majors is 0.286.

\[
\text{Child Psychology} = 1 - 0.429 - 0.286 = 0.285
\]

• The proportion of Child Psychology majors is 0.285 (obtained by subtracting the rounded values; the exact proportion is 2/7 ≈ 0.286).
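The proportions can be checked with a short Python sketch (an assumed choice of language; variable names are illustrative). It uses exact fractions so that rounding of intermediate values does not affect the proportion recovered for the reference level.

```python
from fractions import Fraction

# Dummy variables from the example table.
kinesiology = [1, 0, 0, 0, 1, 1, 0]
spec_ed = [0, 1, 1, 0, 0, 0, 0]
n = len(kinesiology)

p_kin = Fraction(sum(kinesiology), n)   # mean of the Kinesiology dummy
p_sped = Fraction(sum(spec_ed), n)      # mean of the Spec_Ed dummy
p_child = 1 - p_kin - p_sped            # reference level: whatever is left over

for name, p in [
    ("Kinesiology", p_kin),
    ("Special Education", p_sped),
    ("Child Psychology", p_child),
]:
    print(f"{name:<18} {p}  (about {float(p):.3f})")
```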


Challenge #2: Categorical variables may have many levels

Imagine if we had a variable state that we wanted to analyze. If we were to code this into a set of dummy variables, we would need to create 49 dummy variables! (Or 51 if we include Puerto Rico and Washington, DC.)

ID  State         MN  NY  CA
1   Minnesota     1   0   0
2   New York      0   1   0
3   California    0   0   1
4   Iowa          0   0   0
5   North Dakota  0   0   0
6   Texas         0   0   0
7   Oregon        0   0   0
⋮   ⋮             ⋮   ⋮   ⋮

Need to add 46 more dummy variables…


If we use a variable with many levels in an analysis (say, to see whether ACT scores differ across states), we will need to adjust our p-values to account for the large number of comparisons (e.g., with a Bonferroni adjustment).
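To give a sense of scale, the sketch below counts the pairwise comparisons among 50 states and applies a Bonferroni adjustment. It assumes that every pair of states is compared and that the overall error rate is 0.05; both are illustrative assumptions, not values from the slides.

```python
import math

n_states = 50
alpha = 0.05  # assumed familywise error rate, for illustration only

# Number of distinct pairs of states: 50 choose 2.
n_comparisons = math.comb(n_states, 2)

# Bonferroni: divide the overall alpha by the number of comparisons.
alpha_per_test = alpha / n_comparisons

print(n_comparisons)            # 1225
print(f"{alpha_per_test:.2e}")  # 4.08e-05
```

Equivalently, each unadjusted p-value can be multiplied by the number of comparisons (capped at 1) and compared with the original alpha.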


Potential Solution: Collapse the variable into fewer categories by combining several categories into a single category.

ID  State         Region
1   Minnesota     Midwest
2   New York      East
3   California    West
4   Iowa          Midwest
5   North Dakota  Midwest
6   Texas         South
7   Oregon        West
⋮   ⋮             ⋮

For our state example we might collapse states into regions. This reduces the number of levels from 50 to 4 or 5 (depending on how many regions we envision).
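Collapsing is usually just a lookup from the original level to the new one. The Python sketch below (an assumed choice of language) covers only the seven states shown in the example; the dictionary `STATE_TO_REGION` is hypothetical, and a real analysis would need an entry for every state under whatever regional scheme the researcher adopts.

```python
# Hypothetical lookup table covering only the states in the example.
STATE_TO_REGION = {
    "Minnesota": "Midwest",
    "New York": "East",
    "California": "West",
    "Iowa": "Midwest",
    "North Dakota": "Midwest",
    "Texas": "South",
    "Oregon": "West",
}

states = [
    "Minnesota", "New York", "California", "Iowa",
    "North Dakota", "Texas", "Oregon",
]

# A missing state raises KeyError, which flags gaps in the lookup table
# instead of silently producing a wrong region.
regions = [STATE_TO_REGION[state] for state in states]

print(regions)
print(sorted(set(regions)))  # the collapsed levels
```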


Challenge #3: One or more categories are very rare

If one or more categories have very few cases relative to others, they will offer little to no information in the analysis (too little variation). In some cases, models may fail to converge.


Potential Solution: Try to collapse these categories into other categories.

ID  Self-Identified Race/Ethnicity  Collapsed Race
1   Hispanic                        Hispanic
2   African Cuban                   Other
3   White                           White
4   African American                African American
5   Hispanic                        Hispanic
6   African American                African American
7   Hispanic                        Hispanic
⋮   ⋮                               ⋮

Surveys might allow respondents to write in information. In this example, Respondent #2 chose to write in her/his/their race/ethnicity. Depending on the research question, this response could be collapsed into an "Other" category along with other write-in responses that cannot be categorized.
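One way to sketch this recoding in Python (an assumed choice of language) is to keep a set of categories that will be analyzed as-is and send everything else to "Other". The set `RETAINED` and the helper `collapse` are hypothetical; which categories to retain is a substantive decision that depends on the research question.

```python
# Hypothetical set of categories retained as their own levels.
RETAINED = {"Hispanic", "White", "African American"}


def collapse(response, retained=RETAINED, other="Other"):
    """Keep a retained category unchanged; map any other response to `other`."""
    return response if response in retained else other


responses = [
    "Hispanic", "African Cuban", "White", "African American",
    "Hispanic", "African American", "Hispanic",
]

collapsed = [collapse(r) for r in responses]

for original, new in zip(responses, collapsed):
    print(f"{original:<18} -> {new}")
```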


Challenge #3: One category almost always occurs

If almost all of the observations fall into a single category, the variable will offer little to no information in the analysis (too little variation). In some cases, models may fail to converge.


Challenge #4: Your outcome is categorical

When your outcome is categorical, linear models (e.g., regression, ANOVA, t-tests) are no longer appropriate for analyzing your data.

Imagine a researcher examining whether ACT score is predictive of whether or not students graduate college. In this analysis the outcome, graduation = Yes/No, is a dichotomous categorical variable.


A plot of the proportion of students who graduate by ACT score illustrates several problems with using methods that are meant for quantitative data:

• The curve that models the proportion of students who graduate is S-shaped, not linear.
• This is even more apparent if we extrapolate to very low or very high ACT scores; the proportion of students who graduate can never go below 0 or above 1 (these are bounds/asymptotes for our curve).
• If we are interested in inference, one of the assumptions of the linear model is conditional normality; proportions are not normally distributed, they are binomially distributed.
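The boundedness problem can be illustrated with a small Python sketch (an assumed choice of language). The intercepts and slopes below are made-up values chosen only to show the shapes; they are not estimates from any data. A straight line eventually predicts proportions below 0 or above 1, whereas the logistic curve stays between its asymptotes.

```python
import math


def linear_prediction(act, intercept=-0.5, slope=0.05):
    """Straight-line prediction; coefficients are hypothetical."""
    return intercept + slope * act


def logistic_prediction(act, intercept=-5.0, slope=0.25):
    """S-shaped (logistic) prediction; coefficients are hypothetical."""
    return 1 / (1 + math.exp(-(intercept + slope * act)))


# ACT composite scores range from 1 to 36.
for act in (1, 10, 20, 30, 36):
    print(
        f"ACT {act:>2}  linear = {linear_prediction(act):5.2f}  "
        f"logistic = {logistic_prediction(act):.3f}"
    )
```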


Potential Solution: Use methods that accommodate categorical outcomes.

• Bar charts
• Mosaic plots
• Biserial or point-biserial correlation coefficients
• Goodman and Kruskal's lambda
• Chi-square tests of association/independence
• Tests of proportion
• Generalized models (e.g., logistic regression)
• ROC analysis
• Survival models
• Classification trees


References and Resources

Agresti, A. (2012). Categorical data analysis (3rd ed.). New York: Wiley.

Agresti, A. (2012). Analysis of ordinal categorical data (2nd ed.). New York: Wiley.

Friendly, M. (2012). Visualizing categorical data: Data, stories, and pictures. Mosaic: A Journal For The Interdisciplinary Study Of Literature, 1–9. http://www.datavis.ca/books/vcd/vcdstory.pdf

Hardy, M. A. (1993). Regression with dummy variables. Thousand Oaks, CA: Sage.

Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). New York: Wiley.

Klein, J. P., & Moeschberger, M. L. (2005). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). New York: Springer.

UCLA Institute for Digital Research and Education. Coding systems for categorical variables in regression analyses. https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/

Wendorf, C. A. (2004). Primer on multiple regression coding: Common forms and the additional case of repeated contrasts. Understanding Statistics, 3(1), 47–57.


Research Methodology Consulting Center (RMCC)

Consulting for UMN faculty and researchers

• Grant proposal consulting
• Funded project consulting and services
• Unfunded project consulting (CEHD only)

Consulting for CEHD graduate students

• General advice about methodology and statistical analysis for dissertation and thesis work

• Four 45-minute consultations are provided each academic year at no cost.

Find out more at

http://www.cehd.umn.edu/research/consulting/