Factor Analysis & Structural Equation Models 1

Sociology 8811, Class 28

Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission

Announcements

• Paper #2 due today!

• Schedule: Structural equation models
– I’ll start with related issues:
– Factor Analysis
– Path Models

• Monday lab:
– Factor analysis
– Whatever else we can squeeze in (Path models, SEM)
– NO graded lab assignment

Factor Analysis

• Factor analysis is an exploratory tool
– Often called “Exploratory Factor Analysis”
– Helps identify simple patterns that underlie complex multivariate data
– Not about hypothesis testing
– Rather, it is more like data mining
– And also helps us understand some principles of SEM

• Note: “Factor analysis” is informally used to refer to two different methods
– Factor analysis (FA)
– Principal component analysis (PCA)
– Differences aren’t critical here
– I will focus on FA, which is most useful in understanding SEM
– Most of the lecture will also apply to PCA.

Factor Analysis

• The basic idea: FA seeks to identify a small number of “underlying variables” that effectively summarize multivariate data

• Ex: Suppose we have many political opinion variables
– Approval of president; environmental views; etc.

• Perhaps one unmeasured “factor” accounts for people’s positions on all those variables…
– Ex: Liberalism vs. conservatism…

• FA seeks to identify common patterns
– But, it is up to the researcher to determine what the underlying pattern really means…

Factor Analysis: ‘Depression’

• Suppose we believe in a theoretical construct such as “depression”.

• There is no single variable that perfectly measures it… but we believe it exists

• Hypothetical questions:
– HAPPY: How happy are you? (1-10)
– WORLDGOOD: How much do you agree with the statement that “The world is a good place”? (1-5)
– HOPELESS: Do you often feel hopeless? (1-5)
– SAD: Do you often feel sad? (1-5)
– TIRED: Do you often feel tired or discouraged? (1-10)

Example: ‘Depression’

• Strategy 1: We could ask many questions & create an index that combines all measures

– Note: we would have to flip signs on some measures
– “Happy” would have to be reversed to effectively measure ‘depression’

• Strategy 2: We could ask many questions and then conduct a factor analysis

• To see if answers to questions exhibit an underlying pattern (which we could label “depression”).
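Strategy 1 can be sketched in a few lines. This is only an illustration in Python (not Stata): the respondent's answers are hypothetical, and the choice to rescale every item to 0-1 before averaging is an assumption made here because the items use different ranges (1-5 vs. 1-10).

```python
# Item ranges follow the hypothetical questions on the earlier slide.
# The reverse flags mark items where a HIGH answer means LESS depression.
SCALES = {  # item: (minimum, maximum, reverse-coded?)
    "happy": (1, 10, True),
    "worldgood": (1, 5, True),
    "hopeless": (1, 5, False),
    "sad": (1, 5, False),
    "tired": (1, 10, False),
}


def depression_index(answers):
    """Average of all items after reversing and rescaling each one to 0-1."""
    rescaled = []
    for item, (low, high, reverse) in SCALES.items():
        value = answers[item]
        if reverse:
            value = low + high - value  # e.g. happy=2 on a 1-10 scale becomes 9
        rescaled.append((value - low) / (high - low))
    return sum(rescaled) / len(rescaled)


# Hypothetical respondent (made-up answers, for illustration only).
respondent = {"happy": 2, "worldgood": 1, "hopeless": 5, "sad": 4, "tired": 8}
index = depression_index(respondent)
print(round(index, 3))
```

An index like this weights every item equally; factor analysis instead lets the data suggest how strongly each item relates to the underlying pattern.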

Factor Analysis: Depression

• Hypothetical results from a factor analysis:

Factor Loadings

              Factor 1   Factor 2
Happy           -.86        …
WorldGood       -.75        …
Hopeless         .92        …
Sad              .95        …
Tired            .71        …

A factor is a variable that explains lots of variance among the variables being analyzed (Happy, sad, hopeless, etc)

Loadings are the correlations between each variable and the unobserved factor…

The loadings tell you a lot about patterns of variation among cases… Notably: People who score high on “sad” & “hopeless” & “tired” tend to score very low on “happy” and “worldgood”, and vice versa…
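Because a loading is a correlation, squaring it gives the share of that variable's variance that is shared with the factor. A minimal sketch in Python, using the hypothetical Factor 1 loadings above and assuming standardized variables:

```python
# Hypothetical Factor 1 loadings from the slide.
loadings = {
    "Happy": -0.86,
    "WorldGood": -0.75,
    "Hopeless": 0.92,
    "Sad": 0.95,
    "Tired": 0.71,
}

# Squared loading = proportion of the variable's variance shared with the factor.
shared_variance = {name: value ** 2 for name, value in loadings.items()}

for name, share in shared_variance.items():
    direction = "positive" if loadings[name] > 0 else "negative"
    print(f"{name:10s} loading {loadings[name]:+.2f} ({direction}), shared variance {share:.2f}")
```

The sign shows which way the variable moves with the factor; the squared value shows how tightly.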

Factor Analysis: Depression

• Issue: It is wholly up to the researcher to interpret the factors
– We are just data mining…
– To ascribe meaning to factors requires much careful thought – and is ideally informed by theory…

              Factor 1
Happy           -.86
WorldGood       -.75
Hopeless         .92
Sad              .95
Tired            .71

What might factor 1 represent?

Does it seem like it captures “Depression”? Might it mean something else?

Factor Analysis: Depression

• Factor analysis is agnostic to direction of factor variables… results might look like this:

              Factor 1
Happy            .86
WorldGood        .75
Hopeless        -.92
Sad             -.95
Tired           -.71

For all intents & purposes, these results are identical… but flipped

The factor is capturing the inverse of depression… (happiness?)
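One way to see why the two sets of results are equivalent: under a single-factor model with standardized variables, the correlation the factor implies between two variables is the product of their loadings, and that product does not change when every sign is flipped. A minimal sketch in Python using the hypothetical loadings above:

```python
from itertools import combinations

# Hypothetical loadings from the two slides: original and sign-flipped.
original = {"Happy": -0.86, "WorldGood": -0.75, "Hopeless": 0.92, "Sad": 0.95, "Tired": 0.71}
flipped = {name: -value for name, value in original.items()}


def implied_correlations(loadings):
    """Correlation implied by one common factor: product of the two loadings."""
    return {(a, b): loadings[a] * loadings[b] for a, b in combinations(loadings, 2)}


same = implied_correlations(original) == implied_correlations(flipped)
print("Implied correlations identical after flipping signs:", same)
print("Happy-Sad implied correlation:", round(implied_correlations(original)[("Happy", "Sad")], 3))
```

Only the label changes ("depression" vs. its inverse); the pattern among the observed variables is the same.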

Factor Analysis

• Things you can do with factor analysis:

• 1. Examine factor loadings
– Use them to interpret factors that are identified in the data

• 2. Plot factor loadings
– Vividly describe which variables “go together” (people who score high on one tend to score high on another, or vice versa)

• 3. Compute factor scores
– Estimate how individual cases score on underlying factors
– How depressed is each case?

• 4. Determine variation explained by factors
– See which factors account for the major patterns in your data

• 5. “Rotate” the factors
– Modify them to enhance interpretability… Will discuss later.

FA Example: Civic Engagement

• How do people participate in politics?
• Do people vary systematically in civic participation?
• Is there such a thing as “civic engagement”?
– A common pattern of behavior that appears in empirical data?

• World Values Survey Data for USA:
– Membership in civic groups
– Volunteering
– Participation in demonstrations
– Participation in strikes
– Participation in boycotts
– Sign petitions.

FA Example: Civic Engagement

• Factor analysis of US civic participation

. factor member volunteer petition boycott demonstrate strike occupybldg

Factor analysis/correlation                 Number of obs    =  1110
Method: principal factors                   Retained factors =     3
Rotation: (unrotated)                       Number of params =    18

--------------------------------------------------------------------------
      Factor |   Eigenvalue   Difference   Proportion   Cumulative
-------------+------------------------------------------------------------
     Factor1 |      1.51105      0.71238       0.8319       0.8319
     Factor2 |      0.79867      0.67994       0.4397       1.2717
     Factor3 |      0.11872      0.20190       0.0654       1.3370
     Factor4 |     -0.08318      0.04249      -0.0458       1.2912
     Factor5 |     -0.12567      0.05446      -0.0692       1.2221
     Factor6 |     -0.18013      0.04305      -0.0992       1.1229
     Factor7 |     -0.22318            .      -0.1229       1.0000
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(21) = 1405.19 Prob>chi2 = 0.0000

Initial output describes process of factor extraction – identifying factors within the data. Stata identifies many factors (all possible patterns until it runs out of variation). But, only factors with large eigenvalues explain a lot…
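The “Proportion” and “Cumulative” columns can be recomputed from the eigenvalues alone: each proportion is the eigenvalue divided by the sum of all seven eigenvalues. A minimal sketch in Python, using the rounded eigenvalues printed in the output (so the last digit may differ slightly from Stata's):

```python
# Eigenvalues as printed in the Stata output (rounded to 5 decimals).
eigenvalues = [1.51105, 0.79867, 0.11872, -0.08318, -0.12567, -0.18013, -0.22318]

total = sum(eigenvalues)
proportions = [value / total for value in eigenvalues]

# Running total of the proportions reproduces the "Cumulative" column.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

for number, (share, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Factor{number}: proportion {share:7.4f}  cumulative {cum:7.4f}")
```

Because principal-factor extraction can produce negative eigenvalues, the cumulative share rises above 1 before falling back to exactly 1 at the last factor.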

FA Example: Civic Engagement

• Output (cont’d)

Factor loadings (pattern matrix) and unique variances

-----------------------------------------------------------
    Variable |  Factor1   Factor2   Factor3 |   Uniqueness
-------------+------------------------------+--------------
      member |   0.7111   -0.5941    0.0984 |      0.1316
   volunteer |   0.6689   -0.6450    0.0939 |      0.1278
    petition |   0.3485    0.2288   -0.6927 |      0.3464
     boycott |   0.6350    0.3756   -0.2149 |      0.4095
 demonstrate |   0.6210    0.4021   -0.1098 |      0.4406
      strike |   0.4035    0.4387    0.4021 |      0.4830
  occupybldg |   0.2698    0.4038    0.5597 |      0.4509
-----------------------------------------------------------

Next, Stata reports the main factors it finds. Factor 1 explains most variation, others less…

Factor 1 correlates with ALL measures of civic participation. In other words, people tend to be high on all measures or low on all.

Is this “civic engagement”?

Factor 2: Some people are LOW on membership & moderately high on demonstrations/strikes. Others are the converse…

Maybe some people are alienated or active in social movements?
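The “Uniqueness” column is tied directly to the loadings: it is 1 minus the sum of the squared loadings on the retained factors, i.e., the share of each variable's variance that the three factors do not account for. A minimal sketch in Python that recomputes it from the table (small last-digit differences come from rounding in the printed loadings):

```python
# (Factor1, Factor2, Factor3, reported uniqueness) from the loadings table.
table = {
    "member":      (0.7111, -0.5941,  0.0984, 0.1316),
    "volunteer":   (0.6689, -0.6450,  0.0939, 0.1278),
    "petition":    (0.3485,  0.2288, -0.6927, 0.3464),
    "boycott":     (0.6350,  0.3756, -0.2149, 0.4095),
    "demonstrate": (0.6210,  0.4021, -0.1098, 0.4406),
    "strike":      (0.4035,  0.4387,  0.4021, 0.4830),
    "occupybldg":  (0.2698,  0.4038,  0.5597, 0.4509),
}

uniqueness = {}
for variable, (f1, f2, f3, reported) in table.items():
    communality = f1 ** 2 + f2 ** 2 + f3 ** 2  # variance explained by the 3 factors
    uniqueness[variable] = 1 - communality
    print(f"{variable:12s} computed {uniqueness[variable]:.4f}  reported {reported:.4f}")
```

Low uniqueness (member, volunteer) means the retained factors capture most of that variable's variation.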

FA Example: Civic Engagement

• Output (cont’d): same factor loadings table as on the previous slide

Factor 3 finds that some people engage in strikes/occupation of buildings but do not sign petitions.

A bit hard to interpret… Focus your energies on first few factors that have big eigenvalues…

FA Example: Civic Engagement

• A visual representation of factor loadings

[Figure: factor loadings plot. x-axis: Factor 1 (0 to .8); y-axis: Factor 2 (-.4 to .4); points: member, volunteer, petition, boycott, demonstrate, strike, occupybldg]

Command: “loadingplot” (run after factor analysis)

Descriptive patterns emerge from the data

Membership & volunteering go together… But are far from strikes, protests, etc.

Factor Rotation

• Factors can be “rotated”
– Rotation = recalculating them to maximize differences between them
– This can improve interpretability of factors

Rotated factor loadings (pattern matrix) and unique variances

-----------------------------------------------------------
    Variable |  Factor1   Factor2   Factor3 |   Uniqueness
-------------+------------------------------+--------------
      member |   0.8061    0.0974    0.0139 |      0.3405
   volunteer |   0.8055    0.0377   -0.0087 |      0.3497
    petition |   0.0615    0.3130   -0.1456 |      0.8771
     boycott |   0.1504    0.5724    0.0165 |      0.6494
 demonstrate |   0.1358    0.5614    0.0671 |      0.6619
      strike |   0.0371    0.3536    0.2421 |      0.8150
  occupybldg |  -0.0030    0.2439    0.2501 |      0.8780
-----------------------------------------------------------

Here, we see a clearer pattern… Factors 1 & 2 are more distinct. Factor 1 = civic membership; factor 2 = protest/social mvmts, etc…
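To get a feel for what rotation does geometrically, the sketch below turns the first two unrotated factors by a single angle. This is not varimax: the angle (-40 degrees) is hand-picked here for illustration, whereas varimax chooses the rotation by optimizing a criterion. The loadings are the Factor 1 and Factor 2 columns of the unrotated table.

```python
import math

# First two unrotated loadings (Factor1, Factor2) from the earlier Stata output.
unrotated = {
    "member":      (0.7111, -0.5941),
    "volunteer":   (0.6689, -0.6450),
    "boycott":     (0.6350,  0.3756),
    "demonstrate": (0.6210,  0.4021),
}


def rotate(loadings, degrees):
    """Orthogonal rotation of two factors: the axes turn, the points stay put."""
    t = math.radians(degrees)
    return {
        name: (a * math.cos(t) + b * math.sin(t), -a * math.sin(t) + b * math.cos(t))
        for name, (a, b) in loadings.items()
    }


rotated = rotate(unrotated, -40)  # hand-picked angle, for illustration only

for name in unrotated:
    a, b = unrotated[name]
    ra, rb = rotated[name]
    print(f"{name:12s} before ({a:+.2f}, {b:+.2f})  after ({ra:+.2f}, {rb:+.2f})")
```

After the turn, member and volunteer load almost entirely on the first axis and the protest items mostly on the second, while each variable's sum of squared loadings across the two factors is unchanged.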

FA Example: Civic Engagement

• Let’s plot the rotated factor loadings:

[Figure: rotated factor loadings plot. x-axis: Factor 1 (0 to .8); y-axis: Factor 2 (0 to .6); points: member, volunteer, petition, boycott, demonstrate, strike, occupybldg. Rotation: orthogonal varimax; Method: principal factors]

Pattern is similar to unrotated… But, rotation moves variables closer to axes

Factor Scores

• Factors = variables…
– We can compute the value of them for a given case…
– Ex: How high do I score on F1 (depression)?
– Stata syntax: “predict f1 f2 f3…”
– If you only want scores from first 2 factors, just list 2 variable names…
– Note: If done after rotation, scores will be based on rotated factor loadings! Results will differ

• This is a powerful way to create index variables…
– Ex: Depression. You could sum several variables to create an index…
– Or do a factor analysis and compute scores for a factor that appeared to reflect depression…

FA Example: Civic Engagement

• Factor scores from some sample cases:

. predict f1 f2 f3
(regression scoring assumed)

Scoring coefficients (method = regression; based on varimax rotated factors)

. list member volunteer f1 f2

     +-------------------------------------------+
     | member   volunt~r          f1          f2 |
     |-------------------------------------------|
  1. |      3          2    .3280279    .4303528 |
  2. |      1          0   -.6338809    -.305814 |
  3. |      3          3     .575327   -.8480528 |
  4. |      5          5     1.52282    .3150256 |
  5. |      7          3    1.450748    .4064942 |
  6. |      4          4    1.044003   -.4640276 |
  8. |      0          0   -.8484179    .5083777 |
  9. |      5          5    1.523822   -.9253936 |
 12. |      2          2    .1134908    1.244545 |
 13. |      1          0   -.6204671    .5076937 |
 14. |      5          4    1.276523     .353012 |
 15. |      7          5    1.956463   -.4956342 |
 16. |      9          1    1.374107   -.3197608 |

Cases that are high on membership & volunteering score very high on factor 1
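That observation can be checked on the listed cases by correlating a simple sum of member + volunteer with the f1 score. A minimal sketch in Python; the simple sum is an assumption made here for illustration, and only the 13 cases shown in the listing are used:

```python
# (member, volunteer, f1) for the cases shown in the Stata listing.
cases = [
    (3, 2, 0.3280279), (1, 0, -0.6338809), (3, 3, 0.575327), (5, 5, 1.52282),
    (7, 3, 1.450748), (4, 4, 1.044003), (0, 0, -0.8484179), (5, 5, 1.523822),
    (2, 2, 0.1134908), (1, 0, -0.6204671), (5, 4, 1.276523), (7, 5, 1.956463),
    (9, 1, 1.374107),
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


simple_index = [member + volunteer for member, volunteer, _ in cases]
f1_scores = [f1 for _, _, f1 in cases]
r = pearson(simple_index, f1_scores)
print(f"Correlation between member + volunteer and f1: {r:.3f}")
```

For these cases the factor score and the simple additive index tell nearly the same story, which is what we would expect when a factor is dominated by two variables.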

FA Example: Civic Engagement

• Factor scores can also be plotted

This is most useful when you have a small number of cases… Ex: countries, which can be labeled on plot

[Figure: score plot. x-axis: Scores for factor 1 (-2 to 6); y-axis: Scores for factor 2 (-1 to 3). Rotation: orthogonal varimax; Method: principal factors]

Stata: Loadingplots & scoreplots

• Notes:

• 1. Plots can be done of all factors…
– I’ve only shown the first two… to keep things simple
– Syntax: loadingplot, factors(3)

• 2. Case labels can be useful on scoreplots
– Syntax: scoreplot, mlabel(countryid)
– Jitter can sometimes be useful, too…

• 3. Some software allows “biplots”
– Plotting loadings & scores together
– Helps uncover patterns in data.

Example: Biplot

• Cross-national data on civic participation

[Figure: Biplot (axes F1 and F2: 74.71 %). x-axis: F1 (58.36 %), -5 to 5; y-axis: F2 (16.35 %), -3 to 4. Variables plotted: doccupy, ddemon, dstrike, dboycott, dpetition, wtot, mtot. Countries plotted: East Germany, West Germany, United States, Great Britain, Ukraine, Turkey, Sweden, Spain, South Africa, Slovakia, Russian Federation, Romania, Portugal, Poland, Philippines, Peru, Netherlands, Mexico, Luxembourg, Japan, Italy, Ireland, Hungary, France, Finland, Denmark, Czech Republic, Chile, Canada, Belarus, Belgium, Austria, Argentina]

Note that France falls near activities like “strikes”

US is nearer to mtot (membership)

Factor Analysis: Methods

• There are MANY algorithms to extract & rotate factors

• A thorough discussion is beyond the scope of this class

• Some defaults (if you don’t choose):
– SPSS: Principal components extraction, varimax rotation
– Stata: Principal factors extraction; varimax rotation

• Results can vary if you use different methods…
– In practice, few people are skilled in choosing among methods… people mainly use defaults
– I recommend trying multiple methods to ensure that results are robust…

Confirmatory Factor Analysis

• Factor analysis is purely exploratory
– It is data mining, not a model
– However, it is based on the idea that factors – which are unobserved – give rise to (i.e., cause) variation on observed variables

[Diagram: latent variable “Depression” with arrows pointing to the observed variables Happy, WGood, Hopeless, Sad, Tired]


• Idea: Let’s imagine that depression is a latent variable

– i.e., a variable we can’t directly measure… but gives rise to observed patterns in things we can observe

• Note: No observed variable perfectly measures the latent variable
– There is error…
– So, observed variables aren’t perfectly correlated with the latent variable (even though they are “caused” by it)…
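The latent-variable idea can be illustrated with a small simulation: each observed variable is generated as loading × latent score + random error. This is a sketch under stated assumptions, not an estimation method: the loadings (.95 for “sad”, .71 for “tired”) are the hypothetical values from the earlier slides, all variables are standardized, and the errors are independent normal draws.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)
LOADING_SAD, LOADING_TIRED = 0.95, 0.71  # hypothetical loadings
N_CASES = 20000

latent, sad, tired = [], [], []
for _ in range(N_CASES):
    depression = random.gauss(0, 1)  # the unobserved latent variable
    latent.append(depression)
    # Error variance is set so each observed variable has variance 1.
    sad.append(LOADING_SAD * depression + random.gauss(0, (1 - LOADING_SAD ** 2) ** 0.5))
    tired.append(LOADING_TIRED * depression + random.gauss(0, (1 - LOADING_TIRED ** 2) ** 0.5))

r_latent_sad = pearson(latent, sad)
r_latent_tired = pearson(latent, tired)
r_sad_tired = pearson(sad, tired)
print(f"corr(latent, sad)   = {r_latent_sad:.3f}  (loading {LOADING_SAD})")
print(f"corr(latent, tired) = {r_latent_tired:.3f}  (loading {LOADING_TIRED})")
print(f"corr(sad, tired)    = {r_sad_tired:.3f}  (product of loadings {LOADING_SAD * LOADING_TIRED:.3f})")
```

Neither observed variable correlates perfectly with the latent variable, and the two observed variables correlate with each other only through it.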


• This forms the basis for a kind of model:

[Diagram: latent variable “Depression” with arrows pointing to the observed variables Happy, WGood, Hopeless, Sad, Tired]

• Idea: We can model real data based on those presumed relationships…

• Estimate slope coefficients for each arrow
– How do latent variables affect observed variables?

• Examine overall model fit
– How well does our theoretically-informed view of the world map onto observed data?
– If the model fits well, our concept of “depression” (and measurement strategy) are likely to be good

• “Confirmatory” implies that we aren’t just “exploring”
– Different from “exploratory factor analysis”…
– Rather than data mining, we’re testing a theoretically-informed model.

SEM

• Next step: Structural Equation Models (SEM) with Latent Variables

• Once we’ve identified latent variables, it makes sense to analyze them!

• We can develop models in which we estimate slopes relating latent variables…

• This is particularly useful when we are interested in latent concepts that are difficult to measure with any single variable.