Factor Analysis & Structural Equation Models 1
Sociology 8811, Class 28
Copyright © 2007 by Evan Schofer. Do not copy or distribute without permission.
Announcements
• Paper #2 due today!
• Schedule: Structural equation models
  • I’ll start with related issues:
    • Factor Analysis
    • Path Models
• Monday lab:
  • Factor analysis
  • Whatever else we can squeeze in (Path models, SEM)
  • NO graded lab assignment
Factor Analysis
• Factor analysis is an exploratory tool
  • Often called “Exploratory Factor Analysis”
  • Helps identify simple patterns that underlie complex multivariate data
    – Not about hypothesis testing
    – Rather, it is more like data mining
  • And also helps us understand some principles of SEM
    – Note: Factor analysis is informally used to refer to two different methods
      • Factor analysis (FA)
      • Principal component analysis (PCA)
      • Differences aren’t critical here
    – I will focus on FA, which is most useful in understanding SEM
    – Most of the lecture will also apply to PCA.
Factor Analysis
• The basic idea: FA seeks to identify a small number of “underlying variables” that effectively summarize multivariate data
• Ex: Suppose we have many political opinion variables
  – Approval of president; environmental views; etc.
• Perhaps one unmeasured “factor” accounts for people’s positions on all those variables…
– Ex: Liberalism vs. conservatism…
• FA seeks to identify common patterns
  – But, it is up to the researcher to determine what the underlying pattern really means…
Factor Analysis: ‘Depression’
• Suppose we believe in a theoretical construct such as “depression”.
• There is no single variable that perfectly measures it… but we believe it exists
• Hypothetical questions:
  • HAPPY: How happy are you? (1-10)
  • WORLDGOOD: How much do you agree with the statement that “The world is a good place”? (1-5)
  • HOPELESS: Do you often feel hopeless? (1-5)
  • SAD: Do you often feel sad? (1-5)
  • TIRED: Do you often feel tired or discouraged? (1-10)
Example: ‘Depression’
• Strategy 1: We could ask many questions & create an index that combines all measures
  • Note: we would have to flip signs on some measures
  • “Happy” would have to be reversed to effectively measure ‘depression’
• Strategy 2: We could ask many questions and then conduct a factor analysis
• To see if answers to questions exhibit an underlying pattern (which we could label “depression”).
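Strategy 1 can be sketched in a few lines. This is a minimal illustration in Python (not Stata), assuming each item is first rescaled to 0..1 and that HAPPY and WORLDGOOD are the items to reverse; the function names and the two example respondents are hypothetical.

```python
# Item names and response ranges come from the hypothetical questions above.
RANGES = {
    "happy": (1, 10),
    "worldgood": (1, 5),
    "hopeless": (1, 5),
    "sad": (1, 5),
    "tired": (1, 10),
}
# Assumption: high values on these items mean LESS depression, so they are flipped.
REVERSED = {"happy", "worldgood"}


def rescale(value, low, high):
    """Map a response onto 0..1 so items with different ranges are comparable."""
    return (value - low) / (high - low)


def depression_index(answers):
    """Average the rescaled items, reversing those that run opposite to depression."""
    parts = []
    for item, value in answers.items():
        low, high = RANGES[item]
        score = rescale(value, low, high)
        if item in REVERSED:
            score = 1.0 - score
        parts.append(score)
    return sum(parts) / len(parts)


# Two made-up respondents at the extremes of every scale.
very_low = depression_index(
    {"happy": 10, "worldgood": 5, "hopeless": 1, "sad": 1, "tired": 1}
)
very_high = depression_index(
    {"happy": 1, "worldgood": 1, "hopeless": 5, "sad": 5, "tired": 10}
)
print(very_low, very_high)
```

An index like this weights every item equally; factor scores (discussed later) instead let the data determine the weights.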
Factor Analysis: Depression
• Hypothetical results from a factor analysis:

Factor Loadings
             Factor 1   Factor 2
Happy          -.86        …
WorldGood      -.75        …
Hopeless        .92        …
Sad             .95        …
Tired           .71        …
A factor is a variable that explains lots of variance among the variables being analyzed (Happy, sad, hopeless, etc)
Loadings are the correlation between each variable and the unobserved factor…
The loadings tell you a lot about patterns of variation among cases…
Notably: People who score high on “sad” & “hopeless” & “tired” tend to score very low on “happy” and “worldgood” and vice versa…
Factor Analysis: Depression
• Issue: It is wholly up to the researcher to interpret the factors
  • We are just data mining…
  • To ascribe meaning to factors requires much careful thought, and is ideally informed by theory…
Factor 1
Happy -.86
WorldGood -.75
Hopeless .92
Sad .95
Tired .71
What might factor 1 represent?
Does it seem like it captures “Depression”? Might it mean something else?
Factor Analysis: Depression
• Factor analysis is agnostic to direction of factor variables… results might look like this:
Factor 1
Happy .86
WorldGood .75
Hopeless -.92
Sad -.95
Tired -.71
For all intents & purposes, these results are identical… but flipped
The factor is capturing the inverse of depression… (happiness?)
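One way to see why the flipped results are equivalent: under a single-factor model with standardized variables and uncorrelated errors, the correlation implied between two variables is the product of their loadings, and a product is unchanged when both signs flip. A small Python sketch (not Stata) using the hypothetical loadings above; the helper name is illustrative.

```python
# Hypothetical Factor 1 loadings from the slide, and the sign-flipped version.
loadings = {"happy": -0.86, "worldgood": -0.75, "hopeless": 0.92, "sad": 0.95, "tired": 0.71}
flipped = {name: -value for name, value in loadings.items()}


def implied_correlations(lam):
    """Correlation implied by a one-factor model for each pair: loading_i * loading_j."""
    names = sorted(lam)
    return {
        (a, b): lam[a] * lam[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


original = implied_correlations(loadings)
mirrored = implied_correlations(flipped)

# Every pairwise implied correlation is the same under both sets of loadings.
same = all(abs(original[pair] - mirrored[pair]) < 1e-12 for pair in original)
print(same)
print(round(original[("happy", "sad")], 3))
```

Either way, “happy” and “sad” are implied to be strongly negatively correlated; only the label we attach to the factor changes.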
Factor Analysis
• Things you can do with factor analysis:
• 1. Examine factor loadings
  – Use them to interpret factors that are identified in the data
• 2. Plot factor loadings
  – Vividly describe which variables “go together” (people who score high on one tend to score high on another or vice versa)
• 3. Compute factor scores
  – Estimate how individual cases score on underlying factors
  – How depressed is each case?
• 4. Determine variation explained by factors
  – See which factors account for the major patterns in your data
• 5. “Rotate” the factors
  – Modify them to enhance interpretability… Will discuss later.
FA Example: Civic Engagement
• How do people participate in politics?
  • Do people vary systematically in civic participation?
  • Is there such a thing as “civic engagement”?
    – A common pattern of behavior that appears in empirical data?
    – World Values Survey Data for USA:
      • Membership in civic groups
      • Volunteering
      • Participation in demonstrations
      • Participation in strikes
      • Participation in boycotts
      • Signing petitions.
FA Example: Civic Engagement
• Factor analysis of US civic participation

. factor member volunteer petition boycott demonstrate strike occupybldg

Factor analysis/correlation                    Number of obs    =   1110
    Method: principal factors                  Retained factors =      3
    Rotation: (unrotated)                      Number of params =     18

    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      1.51105      0.71238            0.8319       0.8319
        Factor2  |      0.79867      0.67994            0.4397       1.2717
        Factor3  |      0.11872      0.20190            0.0654       1.3370
        Factor4  |     -0.08318      0.04249           -0.0458       1.2912
        Factor5  |     -0.12567      0.05446           -0.0692       1.2221
        Factor6  |     -0.18013      0.04305           -0.0992       1.1229
        Factor7  |     -0.22318            .           -0.1229       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(21) = 1405.19 Prob>chi2 = 0.0000
Initial output describes process of factor extraction – identifying factors within the data. Stata identifies many factors (all possible patterns until it runs out of variation). But, only factors with large eigenvalues explain a lot…
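The Proportion and Cumulative columns can look odd because the cumulative share climbs above 1 before returning to 1.0000. A quick check, sketched in Python rather than Stata: each proportion is the eigenvalue divided by the sum of all seven eigenvalues, and since the principal factors method yields some negative eigenvalues, the first few shares add up to more than 1. Variable names are illustrative.

```python
# Eigenvalues copied from the Stata output above.
eigenvalues = [1.51105, 0.79867, 0.11872, -0.08318, -0.12567, -0.18013, -0.22318]

total = sum(eigenvalues)
proportions = [value / total for value in eigenvalues]

# Running sum of the proportions, matching the "Cumulative" column.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

for number, (value, share, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"Factor{number}: eigenvalue={value:8.5f}  proportion={share:7.4f}  cumulative={cum:7.4f}")
```

The practical reading is unchanged: Factor 1 and Factor 2 carry most of the shared variation, and the remaining factors add little.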
FA Example: Civic Engagement
• Output (cont’d)

Factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
          member |   0.7111   -0.5941    0.0984 |      0.1316
       volunteer |   0.6689   -0.6450    0.0939 |      0.1278
        petition |   0.3485    0.2288   -0.6927 |      0.3464
         boycott |   0.6350    0.3756   -0.2149 |      0.4095
     demonstrate |   0.6210    0.4021   -0.1098 |      0.4406
          strike |   0.4035    0.4387    0.4021 |      0.4830
      occupybldg |   0.2698    0.4038    0.5597 |      0.4509
    -----------------------------------------------------------

Next, Stata reports the main factors it finds.
Factor 1 explains most variation, others less…
Factor 1 correlates with ALL measures of civic participation.
In other words, people tend to be high on all measures or low on all.
Is this “civic engagement”?
Factor 2: Some people are LOW on membership & moderately high on demonstrations/strikes.
Others are the converse…
Maybe some people are alienated or active in social movements?
Factor 3 finds that some people engage in strikes/occupation of buildings but do not sign petitions.
A bit hard to interpret… Focus your energies on first few factors that have big eigenvalues…
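The Uniqueness column in the unrotated loadings table can be checked by hand: with uncorrelated factors, it is 1 minus the sum of a variable's squared loadings on the retained factors, i.e. the share of its variance that the factors leave unexplained. A small Python sketch (not Stata) using the printed values; the names are illustrative, and tiny gaps are rounding in the printed output.

```python
# Unrotated loadings on Factor1..Factor3, copied from the output above.
loadings = {
    "member": (0.7111, -0.5941, 0.0984),
    "volunteer": (0.6689, -0.6450, 0.0939),
    "petition": (0.3485, 0.2288, -0.6927),
    "boycott": (0.6350, 0.3756, -0.2149),
    "demonstrate": (0.6210, 0.4021, -0.1098),
    "strike": (0.4035, 0.4387, 0.4021),
    "occupybldg": (0.2698, 0.4038, 0.5597),
}
# Uniqueness values as reported in the same table.
reported = {
    "member": 0.1316,
    "volunteer": 0.1278,
    "petition": 0.3464,
    "boycott": 0.4095,
    "demonstrate": 0.4406,
    "strike": 0.4830,
    "occupybldg": 0.4509,
}


def uniqueness(row):
    """1 minus the communality (sum of squared loadings across retained factors)."""
    return 1.0 - sum(value * value for value in row)


computed = {name: uniqueness(row) for name, row in loadings.items()}
max_gap = max(abs(computed[name] - reported[name]) for name in loadings)

for name in loadings:
    print(f"{name:12s} computed={computed[name]:.4f} reported={reported[name]:.4f}")
```

High uniqueness (strike, occupybldg) means the three factors say relatively little about that variable.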
FA Example: Civic Engagement
• A visual representation of factor loadings

[Figure: “Factor loadings” plot of Factor 2 (vertical axis) against Factor 1 (horizontal axis), with points labeled member, volunteer, petition, boycott, demonstrate, strike, occupybldg]

Command: “loadingplot” (run after factor analysis)

Descriptive patterns emerge from the data
Membership & volunteering go together…
But are far from strikes, protests, etc.
Factor Rotation
• Factors can be “rotated”
  • Rotation = recalculating them to maximize differences between them
  • This can improve interpretability of factors

Rotated factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
          member |   0.8061    0.0974    0.0139 |      0.3405
       volunteer |   0.8055    0.0377   -0.0087 |      0.3497
        petition |   0.0615    0.3130   -0.1456 |      0.8771
         boycott |   0.1504    0.5724    0.0165 |      0.6494
     demonstrate |   0.1358    0.5614    0.0671 |      0.6619
          strike |   0.0371    0.3536    0.2421 |      0.8150
      occupybldg |  -0.0030    0.2439    0.2501 |      0.8780
    -----------------------------------------------------------

Here, we see a clearer pattern… Factors 1 & 2 are more distinct.
Factor 1 = civic membership; factor 2 = protest/social mvmts, etc…
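To make the idea of rotation concrete, here is a rough Python sketch (not Stata, and not Stata's algorithm). It assumes only the first two unrotated factors from the earlier output, uses the raw varimax criterion without Kaiser normalization, and simply tries many angles. It will therefore not reproduce the three-factor rotated table above; it only shows that turning the axes can make the loadings more clear-cut while leaving each variable's communality untouched.

```python
import math

# (Factor1, Factor2) unrotated loadings from the earlier output.
unrotated = [
    (0.7111, -0.5941),  # member
    (0.6689, -0.6450),  # volunteer
    (0.3485, 0.2288),   # petition
    (0.6350, 0.3756),   # boycott
    (0.6210, 0.4021),   # demonstrate
    (0.4035, 0.4387),   # strike
    (0.2698, 0.4038),   # occupybldg
]


def rotate(rows, angle):
    """Orthogonal rotation of two-factor loadings by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(a * c + b * s, -a * s + b * c) for a, b in rows]


def varimax_criterion(rows):
    """Sum over factors of the variance of the squared loadings (raw varimax)."""
    total = 0.0
    for column in zip(*rows):
        squares = [x * x for x in column]
        mean = sum(squares) / len(squares)
        total += sum((sq - mean) ** 2 for sq in squares) / len(squares)
    return total


# The criterion repeats every 90 degrees, so a grid over 0..89.9 degrees suffices.
angles = [math.radians(d / 10) for d in range(0, 900)]
best_angle = max(angles, key=lambda a: varimax_criterion(rotate(unrotated, a)))
rotated = rotate(unrotated, best_angle)

print(f"best angle: {math.degrees(best_angle):.1f} degrees")
for before, after in zip(unrotated, rotated):
    print(f"{before[0]:7.4f} {before[1]:7.4f}  ->  {after[0]:7.4f} {after[1]:7.4f}")
```

The rotated loadings describe exactly the same two-dimensional space; only the axes used to describe it have moved.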
FA Example: Civic Engagement
• Let’s plot the rotated factor loadings:

[Figure: “Factor loadings” plot of Factor 2 (vertical axis) against Factor 1 (horizontal axis), with points labeled member, volunteer, petition, boycott, demonstrate, strike, occupybldg. Rotation: orthogonal varimax; Method: principal factors]

Pattern is similar to unrotated…
But, rotation moves variables closer to axes
Factor Scores
• Factors = variables…
  • We can compute the value of them for a given case…
  • Ex: How high do I score on F1 (depression)?
  • Stata syntax: “predict f1 f2 f3…”
    – If you only want scores from first 2 factors, just list 2 variable names…
    – Note: If done after rotation, scores will be based on rotated factor loadings! Results will differ from scores based on unrotated loadings
    – This is a powerful way to create index variables…
      • Ex: Depression. You could sum several variables to create an index…
      • Or do a factor analysis and compute scores for a factor that appeared to reflect depression…
FA Example: Civic Engagement
• Factor scores from some sample cases:

. predict f1 f2 f3
(regression scoring assumed)

Scoring coefficients (method = regression; based on varimax rotated factors)

. list member volunteer f1 f2

     +-------------------------------------------+
     | member   volunt~r          f1          f2 |
     |-------------------------------------------|
  1. |      3          2    .3280279    .4303528 |
  2. |      1          0   -.6338809    -.305814 |
  3. |      3          3     .575327   -.8480528 |
  4. |      5          5     1.52282    .3150256 |
  5. |      7          3    1.450748    .4064942 |
  6. |      4          4    1.044003   -.4640276 |
  8. |      0          0   -.8484179    .5083777 |
  9. |      5          5    1.523822   -.9253936 |
 12. |      2          2    .1134908    1.244545 |
 13. |      1          0   -.6204671    .5076937 |
 14. |      5          4    1.276523     .353012 |
 15. |      7          5    1.956463   -.4956342 |
 16. |      9          1    1.374107   -.3197608 |
Cases that are high on membership & volunteering score very high on factor 1
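That claim can be checked against the listed rows. The Python sketch below (not Stata) computes the Pearson correlation between the f1 scores and each of member and volunteer for only the 13 cases shown, so the numbers describe this small listing rather than the full sample of 1110. The helper name is illustrative.

```python
import math

# Values copied from the 13 listed cases above.
member = [3, 1, 3, 5, 7, 4, 0, 5, 2, 1, 5, 7, 9]
volunteer = [2, 0, 3, 5, 3, 4, 0, 5, 2, 0, 4, 5, 1]
f1 = [0.3280279, -0.6338809, 0.575327, 1.52282, 1.450748, 1.044003, -0.8484179,
      1.523822, 0.1134908, -0.6204671, 1.276523, 1.956463, 1.374107]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


r_member = pearson(member, f1)
r_volunteer = pearson(volunteer, f1)
print(f"member vs f1:    {r_member:.2f}")
print(f"volunteer vs f1: {r_volunteer:.2f}")
```

Both correlations are strongly positive in these rows, consistent with rotated Factor 1 being a membership/volunteering factor.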
FA Example: Civic Engagement
• Factor scores can also be plotted

[Figure: “Score variables (factor)” plot of scores for factor 2 (vertical axis) against scores for factor 1 (horizontal axis). Rotation: orthogonal varimax; Method: principal factors]

This is most useful when you have a small number of cases…
Ex: countries, which can be labeled on plot
Stata: Loadingplots & scoreplots
• Notes:
• 1. Plots can be done of all factors…
  – I’ve only shown the first two… to keep things simple
  – Syntax: loadingplot, factors(3)
• 2. Case labels can be useful on scoreplots
  – scoreplot, mlabel(countryid)
  – Jitter can sometimes be useful, too…
• 3. Some software allows “biplots”
  – Plotting loadings & scores together
  – Helps uncover patterns in data.
Example: Biplot
• Cross-national data on civic participation

[Figure: Biplot (axes F1 and F2: 74.71 %). Horizontal axis: F1 (58.36 %); vertical axis: F2 (16.35 %). Countries (e.g., France, Sweden, United States) are plotted as points alongside the variables doccupy, ddemon, dstrike, dboycott, dpetition, wtot, mtot]
Note that France falls near activities like “strikes”
The US is nearer to mtot (membership)
Factor Analysis: Methods
• There are MANY algorithms to extract & rotate factors
  • A thorough discussion is beyond the scope of this class
  • Some defaults (if you don’t choose):
    – SPSS: Principal components extraction, varimax rotation
    – Stata: Principal factors extraction; varimax rotation
  • Results can vary if you use different methods…
    – In practice, few people are skilled in choosing among methods… people mainly use defaults
    – I recommend trying multiple methods to ensure that results are robust…
Confirmatory Factor Analysis
• Factor analysis is purely exploratory
  • It is data mining, not a model
  • However, it is based on the idea that factors (which are unobserved) give rise to (i.e., cause) variation on observed variables

[Diagram: unobserved factor “Depression” with arrows pointing to the observed variables Happy, WGood, Hopeless, Sad, Tired]
Confirmatory Factor Analysis
• Idea: Let’s imagine that depression is a latent variable
• i.e., a variable we can’t directly measure… but gives rise to observed patterns in things we can observe
• Note: No observed variable perfectly measures the latent variable
  – There is error…
  – So, observed variables aren’t perfectly correlated with latent variable (even though they are “caused” by it)…
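A small simulation can make “latent variable plus error” concrete. The Python sketch below (not Stata) assumes a standardized latent depression score and builds three observed variables as loading × latent + error, using three of the hypothetical loadings from the earlier depression slide. The seed, sample size, and names are arbitrary choices for illustration.

```python
import math
import random

random.seed(8811)  # arbitrary seed so the run is repeatable

# Hypothetical loadings taken from the earlier depression example.
LOADINGS = {"happy": -0.86, "sad": 0.95, "tired": 0.71}
N = 20000


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# The latent variable: in real data we never get to see this column.
latent = [random.gauss(0, 1) for _ in range(N)]

# Each observed variable = loading * latent + independent error, scaled to variance 1.
observed = {
    name: [lam * f + math.sqrt(1 - lam ** 2) * random.gauss(0, 1) for f in latent]
    for name, lam in LOADINGS.items()
}

recovered = {name: pearson(latent, values) for name, values in observed.items()}
r_happy_sad = pearson(observed["happy"], observed["sad"])

for name, r in recovered.items():
    print(f"{name:6s} correlation with latent: {r:6.3f} (loading {LOADINGS[name]})")
print(f"happy vs sad: {r_happy_sad:6.3f}")
```

None of the observed variables correlates perfectly with the latent variable, yet their correlations with each other are produced entirely by it.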
Confirmatory Factor Analysis
• This forms the basis for a kind of model:
[Diagram: latent variable “Depression” with arrows pointing to the observed variables Happy, WGood, Hopeless, Sad, Tired]
Confirmatory Factor Analysis
• Idea: We can model real data based on those presumed relationships…
  • Estimate slope coefficients for each arrow
    – How do latent variables affect observed variables?
  • Examine overall model fit
    – How much does our theoretically-informed view of the world map onto observed data?
    – If model fits well, our concept of “depression” (and measurement strategy) are likely to be good
  • “Confirmatory” implies that we aren’t just “exploring”
    – Different from “exploratory factor analysis”…
    – Rather than data mining, we’re testing a theoretically-informed model.
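As a rough intuition for “model fit”, one can compare the correlations a one-factor model implies (the product of two loadings) with the correlations actually observed. The Python sketch below is only an illustration: the observed correlations are made up, the loadings are the hypothetical ones from the depression example, and the root mean square residual is a simple stand-in for the fit statistics that SEM software actually reports.

```python
import math

# Hypothetical loadings (from the depression example) for three indicators.
loadings = {"hopeless": 0.92, "sad": 0.95, "tired": 0.71}

# Made-up observed correlations, for illustration only.
observed = {
    ("hopeless", "sad"): 0.88,
    ("hopeless", "tired"): 0.60,
    ("sad", "tired"): 0.70,
}

# Residual = observed correlation minus the correlation the model implies.
residuals = {
    pair: r - loadings[pair[0]] * loadings[pair[1]]
    for pair, r in observed.items()
}

# Root mean square residual: small values mean the model reproduces the data well.
rmsr = math.sqrt(sum(e * e for e in residuals.values()) / len(residuals))

for pair, e in residuals.items():
    print(f"{pair[0]:8s} - {pair[1]:8s} residual = {e:+.4f}")
print(f"RMSR = {rmsr:.4f}")
```

Large residuals for particular pairs would suggest that a single “depression” factor does not account for how those variables relate.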
SEM
• Next step: Structural Equation Models (SEM) with Latent Variables
• Once we’ve identified latent variables, it makes sense to analyze them!
• We can develop models in which we estimate slopes relating latent variables…
• This is particularly useful when we are interested in latent concepts that are difficult to measure with any single variable.