
Multivariate Data Analysis Using SPSS


  • Multivariate Data Analysis Using SPSS
    John Zhang, ARL, IUP

  • Topics
    A Guide to Multivariate Techniques
    Preparation for Statistical Analysis
    Review: ANOVA
    Review: ANCOVA
    MANOVA
    MANCOVA
    Repeated Measure Analysis
    Factor Analysis
    Discriminant Analysis
    Cluster Analysis

  • Guide-1
    Correlation: 1 IV, 1 DV; relationship
    Regression: 1+ IVs, 1 DV; relationship/prediction
    T test: 1 IV (cat.), 1 DV; group differences
    One-way ANOVA: 1 IV (2+ cat.), 1 DV; group differences
    One-way ANCOVA: 1 IV (2+ cat.), 1 DV, 1+ covariates; group differences
    One-way MANOVA: 1 IV (2+ cat.), 2+ DVs; group differences

  • Guide-2
    One-way MANCOVA: 1 IV (2+ cat.), 2+ DVs, 1+ covariates; group differences
    Factorial MANOVA: 2+ IVs (2+ cat.), 2+ DVs; group differences
    Factorial MANCOVA: 2+ IVs (2+ cat.), 2+ DVs, 1+ covariates; group differences
    Discriminant Analysis: 2+ IVs, 1 DV (cat.); group prediction
    Factor Analysis: explore the underlying structure

  • Preparation for Stat. Analysis-1
    Screen data
      SPSS Utility procedures
      Frequency procedure
    Missing data analysis (missing data should be random)
      Check if patterns exist
      Drop data case-wise
      Drop data variable-wise
      Impute missing data

  • Preparation for Stat. Analysis-2
    Outliers (generally, statistical procedures are sensitive to outliers)
      Univariate case: boxplot
      Multivariate case: Mahalanobis distance (a chi-square statistic); a point is an outlier when its p-value is < .001
    Treatment:
      Drop the case
      Report two analyses (one with the outlier, one without)
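The Mahalanobis rule described above can also be checked outside SPSS. A minimal Python sketch (the .001 cutoff and the chi-square reference follow the slide; the data, helper name, and one planted outlier are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Flag cases whose squared Mahalanobis distance has a chi-square
    p-value below alpha (df = number of variables)."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # squared Mahalanobis distance for every case
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    pvals = chi2.sf(d2, df=X.shape[1])
    return pvals < alpha, d2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] = [10.0, 10.0, 10.0]          # plant one gross multivariate outlier
mask, d2 = mahalanobis_outliers(X)  # mask[0] should come back True
```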

  • Preparation for Stat. Analysis-3
    Normality
    Testing univariate normality:
      Q-Q plot
      Skewness and kurtosis: both should be 0 when normal; not normal when the p-value is < .01 or .001
      Kolmogorov-Smirnov statistic: significant means not normal
    Testing multivariate normality:
      Scatterplots should be elliptical
      Each variable must be normal
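The univariate checks listed above have SciPy equivalents. A hedged sketch with artificial data (not the deck's dataset); note that feeding the K-S test the sample's own mean and SD makes its tabled p-values approximate (SPSS applies a Lilliefors correction for this):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(loc=50, scale=10, size=500),
    "skewed": rng.exponential(scale=10, size=500),  # clearly non-normal
}

results = {}
for label, x in samples.items():
    results[label] = {
        "skew": stats.skew(x),          # ~0 for a normal sample
        "kurtosis": stats.kurtosis(x),  # excess kurtosis, ~0 for normal
        # K-S against a normal with the sample's estimated mean/sd
        "ks_p": stats.kstest(x, "norm",
                             args=(x.mean(), x.std(ddof=1))).pvalue,
    }
```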

  • Preparation for Stat. Analysis-4
    Linearity
      A linear combination of the variables makes sense
      Two variables (or combinations of variables) are linearly related
    Checks for linearity:
      Residual plot in regression
      Scatterplots

  • Preparation for Stat. Analysis-5
    Homoscedasticity: the covariance matrices are equal across groups
    Box's M test: tests the equality of the covariance matrices across groups
      Sensitive to normality
    Levene's test: tests the equality of variances across groups
      Not sensitive to normality
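Levene's test is available in SciPy (Box's M is not, so only the univariate check is sketched here). Made-up groups, with one variance deliberately inflated:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)
g1 = rng.normal(0, 1, 80)
g2 = rng.normal(0, 1, 80)
g3 = rng.normal(0, 4, 80)   # deliberately unequal variance

# a small p-value rejects equality of variances across the groups
stat, p = levene(g1, g2, g3)
```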

  • Preparation for Stat. Analysis-Example-1
    Steps in preparation for stat. analysis:
      Check variable coding; recode if necessary
      Examine missing data
      Check for univariate outliers, normality, homogeneity of variances (Explore)
      Test for homogeneity of variances (ANOVA)
      Check for multivariate outliers (Regression > Save > Mahalanobis)
      Check for linearity (scatterplots; residual plots in regression)

  • Preparation for Stat. Analysis-Example-2
    Use dataset gssft.sav
    Objective: investigate group differences (satjob2) in income (rincome91), age (age_2), and education (educ)
    Check coding: recode rincome91 into rincome_2 (22, 98, 99 become system-missing)
      Transform > Recode > Into Different Variables

  • Preparation for Stat. Analysis-Example-3
    Check for missing values
      Use Frequencies for categorical variables
      Use Descriptive Statistics for measurement variables
    For categorical variables:
      If missing values are < 5%, use the listwise option
      If >= 5%, define the missing value as a new category
    For measurement variables:
      If missing values are < 5%, use the listwise option
      If between 5% and 15%, use Transform > Replace Missing Values; replacing less than 15% of the data has little effect on the outcome
      If greater than 15%, consider dropping the variable or subject

  • Preparation for Stat. Analysis-Example-4
    Check missing values for satjob2
      Analyze > Descriptive Statistics > Frequencies
    Check missing values for rincome_2
      Analyze > Descriptive Statistics > Descriptives
    Replace the missing values in rincome_2
      Transform > Replace Missing Values

  • Preparation for Stat. Analysis-Example-5
    Check for univariate outliers, normality, and homogeneity of variances
      Analyze > Descriptive Statistics > Explore
      Put rincome_2, age_2, and educ into the Dependent List box; satjob2 into the Factor List box
    There are outliers in rincome_2; let's change those outliers to the acceptable min or max value
      Transform > Recode > Into Different Variables
      Put rincome_2 into the Original Variable box, type rincome_3 as the new name
      Replace all values
  • Preparation for Stat. Analysis-Example-6
    Explore rincome_3 again: not normal
    Transform rincome_3 into rincome_4 by ln or sqrt
    Explore rincome_4
    Check for multivariate outliers
      Analyze > Regression > Linear
      Put id (a dummy variable) into the Dependent box; put rincome_4, age_2, and educ into the Independent box
      Click Save, then the Mahalanobis box
      Compare the Mahalanobis distances with the chi-square critical value at p = .001 and df = number of independent variables

  • Preparation for Stat. Analysis-Example-7
    Check for multivariate normality:
      Must be univariately normal
      Construct a scatterplot matrix; each scatterplot should have an elliptical shape
    Check for homoscedasticity:
      Univariate (ANOVA, Levene's test)
      Multivariate (MANOVA, Box's M test; use the .01 significance level)

  • Review: ANOVA-1
    One-way ANOVA tests the equality of group means
      Assumptions: independent observations; normality; homogeneity of variance
    Two-way ANOVA tests three hypotheses simultaneously:
      Test the interaction of the levels of the two independent variables
        Interaction occurs when the effect of one factor depends on the levels of the second factor
      Test the two independent variables separately
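The one-way case reviewed above can be reproduced with SciPy's `f_oneway`; a sketch with made-up groups whose means deliberately differ (not the deck's data):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(7)
low  = rng.normal(10, 2, 50)
mid  = rng.normal(12, 2, 50)
high = rng.normal(15, 2, 50)   # group means deliberately differ

# H0: all group means are equal; a large F / small p rejects it
F, p = f_oneway(low, mid, high)
```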

  • Review: ANCOVA-1
    Idea: the difference on a DV often does not depend on just one or two IVs; it may also depend on other measurement variables. ANCOVA takes such dependency into account, i.e. it removes the effect of one or more covariates
    Assumptions: in addition to the regular ANOVA assumptions, we need:
      A linear relationship between the DV and the covariates
      The slope of the regression line is the same for each group
      The covariates are reliable and measured without error

  • Review: ANCOVA-2
    Homogeneity of slopes = homogeneity of regression = there is no interaction between the IVs and the covariate
    If the interaction between the covariate and the IVs is significant, ANCOVA should not be conducted
    Example: determine whether hours worked per week (hrs2) differs by gender (sex) and by job satisfaction (satjob2), after adjusting for income (i.e. with income equalized)

  • Review: ANCOVA-3
    Analyze > GLM > Univariate
    Move hrs2 into the DV box; move sex and satjob2 into the Fixed Factors box; move rincome_2 into the Covariates box
    Click Model > Custom
      Highlight all variables and move them to the Model box
      Make sure the Interaction option is selected
    Click Options
      Move sex and satjob2 into the Display Means box
      Click Descriptive statistics, Estimates of effect size, and Homogeneity tests
    This tests the homogeneity of the regression slopes

  • Review: ANCOVA-4
    If no interaction is found in the previous step, repeat it, but click Model > Full factorial instead of Model > Custom

  • Review: ANOVA-2
    A significant interaction means the two IVs in combination have a significant effect on the DV; it therefore does not make sense to interpret the main effects alone
    Assumptions: the same as one-way ANOVA
    Example: the impact of gender (sex) and age (agecat4) on income (rincome_2)
      Explore (omitted)
      Analyze > GLM > Univariate
      Click Model > click Full factorial > Continue
      Click Options > click Descriptive statistics, Estimates of effect size, Homogeneity tests
      Click Post Hoc > click LSD, Bonferroni, Scheffe > Continue
      Click Plots > put one IV into Horizontal Axis and the other into Separate Lines

  • MANOVA-1
    Characteristics:
      Similar to ANOVA
      Multiple DVs
      The DVs are correlated and a linear combination makes sense
      It tests whether mean differences among k groups on a combination of DVs are likely to have occurred by chance
    The idea of MANOVA is to find the linear combination that separates the groups optimally, then perform an ANOVA on that linear combination

  • MANOVA-2
    Advantages:
      The chance of discovering what actually changed as a result of the different treatments increases
      May reveal differences not shown in separate ANOVAs, without inflation of the type I error rate
      The use of multiple ANOVAs ignores some very important information (the fact that the DVs are correlated)

  • MANOVA-3
    Disadvantages:
      More complicated
      ANOVA is often more powerful
    Assumptions:
      Independent random samples
      Multivariate normal distribution in each group
      Homogeneity of the covariance matrices
      Linear relationships among the DVs

  • MANOVA-4
    Steps in carrying out MANOVA:
      Check the assumptions
      If MANOVA is not significant, stop
      If MANOVA is significant, carry out univariate ANOVAs
      If a univariate ANOVA is significant, do post hoc tests
    If homoscedasticity holds, use Wilks' Lambda; if not, use Pillai's Trace. In general, all 4 statistics should be similar
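Wilks' Lambda and Pillai's Trace (and the other two statistics) all derive from the eigenvalues of W⁻¹B, where W and B are the within- and between-group SSCP matrices. A numpy sketch with made-up 2-DV, 3-group data, not SPSS output:

```python
import numpy as np

rng = np.random.default_rng(3)
# three groups of 40 cases on 2 DVs, with shifted means
groups = [rng.normal(m, 1.0, size=(40, 2))
          for m in ([0, 0], [1, 0.5], [2, 1])]

grand = np.vstack(groups).mean(axis=0)
# within-groups SSCP: sum of each group's centered cross-products
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
# between-groups SSCP: weighted outer products of mean deviations
B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
        for g in groups)

eigvals = np.linalg.eigvals(np.linalg.inv(W) @ B).real
wilks  = np.prod(1.0 / (1.0 + eigvals))   # equals det(W) / det(W + B)
pillai = np.sum(eigvals / (1.0 + eigvals))
```

Smaller Wilks' Lambda (closer to 0) means stronger group separation on the combined DVs.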

  • MANOVA-5
    Example: an experiment looking at the memory effects of different instructions. Three groups of human subjects learned nonsense syllables as they were presented and were administered two memory tests: recall and recognition. The first group was instructed to like or dislike the syllables as they were presented (to generate affect). The second group was instructed that they would be tested (to induce anxiety). The third group was told to count the syllables as they were presented (interference). The objective is to assess group differences in memory

  • MANOVA-6
    How to do it:
      File > Open > Data; open the file As9.por in the Instruct > Zhang Multivariate Short Course folder
      Analyze > GLM > Multivariate
      Move recall and recog into the Dependent Variables box; move group into the Fixed Factors box
      Click Options; move group into the Display Means box (this displays the marginal means predicted by the model; these may differ from the observed means if there are covariates or the model is not factorial). The Compare main effects box tests every pair of estimated marginal means for the selected factors
      Click Estimates of effect size and Homogeneity tests

  • MANOVA-7
    Push buttons:
      Plots: create a profile plot for each DV displaying group means
      Post Hoc: post hoc tests for marginal means
      Save: save predicted values, etc.
      Contrasts: perform planned comparisons
      Model: specify the model
      Options:
        Display Means for: display the estimated means predicted by the model
        Compare main effects: test for significant differences between every pair of estimated marginal means for each of the main effects

  • MANOVA-8
      Observed power: produce a statistical power analysis for your study
      Parameter estimates: check this when you need a predictive model
      Spread vs. level plot: visual display of homogeneity of variance

  • MANOVA-9
    Example 2: check the impact of job satisfaction (satjob) and gender (sex) on income (rincome_2) and education (educ) (in gssft.sav)
      Screen data: transform educ to educ2 to eliminate cases with 6 or less
      Check the assumptions: Explore
      Run MANOVA

  • MANCOVA-1
    Objective: test for mean differences among groups on a linear combination of DVs, after adjusting for the covariates
    Example: test whether there are differences in productivity (measured by income and hours worked) for individuals in different age groups, after adjusting for education level

  • MANCOVA-2
    Assumptions: similar to ANCOVA
    SPSS how-to:
      Analyze > GLM > Multivariate
      Move rincome_2 and educ2 into the DV box; move sex and satjob into the IV box; move age into the Covariates box
      Check for homogeneity of regression: click Model > Custom; highlight all variables and move them to the Model box
      If the covariate-by-IV interaction is not significant, repeat the process but select Full factorial under Model

  • Repeated Measure Analysis-1
    Objective: test for significant differences in means when the same observation appears in multiple levels of a factor
    Examples of repeated measure studies:
      Marketing: compare customers' ratings on 4 different brands
      Medicine: compare test results before, immediately after, and six months after a procedure
      Education: compare performance test scores before and after an intervention program

  • Repeated Measure Analysis-2
    The logic of repeated measures: SPSS performs repeated measure ANOVA by computing contrasts (differences) across the repeated measure factor's levels for each subject, then testing whether the means of the contrasts are significantly different from 0; any between-subjects tests are based on the subjects' means
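That contrast logic can be sketched directly: take a per-subject difference across two factor levels and test whether its mean is zero. This is the simplest 2-level case (equivalent to a paired t test), with made-up data, not SPSS's full repeated-measures GLM:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(5)
pre  = rng.normal(50, 8, size=30)
post = pre + rng.normal(4, 3, size=30)   # built-in improvement of ~4 points

contrast = post - pre                    # one contrast score per subject
t, p = ttest_1samp(contrast, 0.0)        # test: mean contrast = 0
```

With more than two levels, SPSS builds a set of such contrasts (e.g. linear, quadratic) and tests them jointly.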

  • Repeated Measure Analysis-3
    Assumptions:
      Independent observations
      Normality
      Homogeneity of variances
      Sphericity: if two or more contrasts are to be pooled (the test of the main effect is based on this pooling), the contrasts should be equally weighted and uncorrelated (equal variances and uncorrelated contrasts); this is equivalent to the covariance matrix being diagonal with equal diagonal elements

  • Repeated Measure Analysis-4
    Example 1: a study in which 5 subjects were tested in each of 4 drug conditions
    Open the data file: File > Open > Data; select Repmeas1.por
    SPSS repeated measure procedure:
      Analyze > GLM > Repeated Measures
      Within-Subject Factor Name (the name of the repeated measure factor): a repeated measure factor is expressed as a set of variables
        Replace factor1 with drug
      Number of Levels: the number of repeated measurements
        Type 4

  • Repeated Measure Analysis-5
    The Measure pushbutton has two functions:
      For multiple dependent measures (e.g. we recorded 4 measures of physiological stress under each of the drug conditions)
      To label the factor levels
    Click Measure; type memory in the Measure Name box; click Add
    Click Define: here we link the repeated measure factor levels to variable names, and define between-subjects factors and covariates
      Move drug1 to drug4 into the Within-Subjects box
      You can move a selected variable with the up and down buttons

  • Repeated Measure Analysis-6
    Model button: by default, a complete model
    Contrast button: specify particular contrasts
    Plots button: create profile plots that graph factor-level estimated marginal means for up to 3 factors at a time
    Post Hoc: provides post hoc tests for between-subjects factors
    Save button: allows you to save predicted values, residuals, etc.
    Options: similar to MANOVA
      Click Descriptive; click Transformation Matrix (it provides the contrasts)

  • Repeated Measure Analysis-7
    Interpret the results:
      Look at the descriptive statistics
      Look at the test for sphericity
        If sphericity is significant (violated), use the multivariate results (tests on the contrasts); they test whether all of the contrast variables are zero in the population
        If sphericity is not significant, use the Sphericity Assumed results
      Look at the tests of within-subjects contrasts: they test the linear trend, the quadratic trend, etc.
        These may not make sense in some applications, as in this example (but they do make sense in terms of time or dosage)

  • Repeated Measure Analysis-8
    The transformation matrix provides information on what the linear contrast, etc., are
      The first table is for the average across the repeated measure factor (here the coefficients are all .5, meaning each variable is weighted equally; normalization requires that the sum of the squared coefficients equals 1)
      The second table defines the contrasts for the corresponding repeated measure factor
        Linear: increase by a constant, etc.
        The linear and quadratic contrasts are orthogonal, etc.
    Having concluded there are memory differences due to drug condition, we want to know which conditions differ from which others

  • Repeated Measure Analysis-9
    Repeat the analysis, except under the Options button move drug into Display Means, click Compare main effects, and select the Bonferroni adjustment
    Transformation Coefficients (M Matrix): shows how the variables are created for comparison. Here we compare the drug conditions, so the M matrix is an identity matrix
    Suppose we want to test each adjacent pair of means: drug1 vs. drug2; drug2 vs. drug3; drug3 vs. drug4:
      Repeated Measures > Define > Contrast > select Repeated

  • Repeated Measure Analysis-10
    Example 2: a marketing experiment was devised to evaluate whether viewing a commercial produces improved ratings for a specific brand. Ratings on 3 brands were obtained from subjects before and after viewing the commercial. Since the hope was that the commercial would improve ratings of only one brand (A), the researchers expected a significant brand by pre-post commercial interaction. There are two between-subjects factors: sex and the brand used by the subject

  • Repeated Measure Analysis-11
    SPSS how-to:
      Analyze > GLM > Repeated Measures
      Replace factor1 with prepost in the Within-Subject Factor Name box; type 2 in the Number of Levels box; click Add
      Type brand in the Within-Subject Factor Name box; type 3 in the Number of Levels box; click Add
      Click Measure; type measure in the Measure Name box; click Add
    Note: SPSS expects 2 between-subjects factors

  • Repeated Measure Analysis-12
    Click the Define button; move the appropriate variables into place; move sex and user into the Between-Subjects Factors box
    Click the Options button; move sex, user, prepost, and brand into the Display Means box
    Click the Homogeneity tests and Descriptive boxes
    Click Plots; move user into the Horizontal Axis box and brand into the Separate Lines box
    Click Continue; OK

  • Factor Analysis-1
    The main goal of factor analysis is data reduction. A typical use is in survey research, where a researcher wishes to represent a number of questions with a smaller number of factors
    Two questions in factor analysis:
      How many factors are there, and what do they represent (interpretation)?
    Two technical aids:
      Eigenvalues
      Percentage of variance accounted for

  • Factor Analysis-2
    Two types of factor analysis:
      Exploratory: introduced here
      Confirmatory: SPSS AMOS
    Theoretical basis: correlations among variables are explained by underlying factors
    An example of a mathematical 1-factor model for two variables:
      V1 = L1*F1 + E1
      V2 = L2*F1 + E2

  • Factor Analysis-3
    Each variable is composed of a common factor (F1) multiplied by a loading coefficient (L1, L2: the lambdas or factor loadings) plus a random component
    V1 and V2 correlate because of the common factor, and their correlation should relate to the factor loadings; thus, the factor loadings can be estimated from the correlations
    A given set of correlations can yield different factor loadings (i.e. the solutions are not unique)
    One should pick the simplest solution

  • Factor Analysis-4
    A factor solution needs to be confirmed:
      By a different factor method
      By a different sample
    More terminology:
      Factor loading: interpreted as the Pearson correlation between the variable and the factor
      Communality: the proportion of variability in a given variable that is explained by the factors
      Extraction: the process by which the factors are determined from a large set of variables

  • Factor Analysis-5
    Principal components: one of the extraction methods
      A principal component is a linear combination of the observed variables that is independent (orthogonal) of the other components
      The first component accounts for the largest amount of variance in the input data; the second component accounts for the largest amount of the remaining variance
      Components being orthogonal means they are uncorrelated

  • Factor Analysis-6
    A possible application of principal components: in survey research it is common to have many questions addressing one issue (e.g. customer service), and these questions are likely to be highly correlated. Using such variables directly in some statistical procedures (e.g. regression) is problematic. One can instead use factor scores, computed from the factor loadings on each orthogonal component

  • Factor Analysis-7
    Principal components vs. other extraction methods:
      Principal components focus on accounting for the maximum amount of variance (the diagonal of the correlation matrix)
      Other extraction methods (e.g. principal axis factoring) focus more on accounting for the correlations between variables (the off-diagonal correlations)
      A principal component can be defined as a unique combination of variables, but the other factor methods cannot
      Principal components are used for data reduction but are more difficult to interpret

  • Factor Analysis-8
    Number of factors:
      Eigenvalues are often used to determine how many factors to retain
      Take as many factors as there are eigenvalues greater than 1
      An eigenvalue represents the amount of standardized variance in the variables accounted for by a factor
      The amount of standardized variance in each variable is 1, so the eigenvalues sum to the number of variables; an eigenvalue divided by that total gives the proportion of variance accounted for
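The eigenvalue-greater-than-1 rule can be sketched in numpy: compute the eigenvalues of the correlation matrix and keep one factor per eigenvalue above 1. The data here are synthetic (four variables driven by one common factor plus two pure-noise variables), so only the mechanics, not the SPSS output, are illustrated:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 300
f1 = rng.normal(size=n)   # one strong underlying factor
X = np.column_stack(
    [f1 + rng.normal(scale=0.5, size=n) for _ in range(4)]  # 4 loaded vars
    + [rng.normal(size=n) for _ in range(2)])               # 2 noise vars

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # descending
n_factors = int(np.sum(eigvals > 1.0))
# the eigenvalues sum to the number of variables (6); each eigenvalue is
# the standardized variance one component accounts for
```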

  • Factor Analysis-9
    Rotation
      Objective: to facilitate interpretation
    Orthogonal rotation: done when data reduction is the objective and the factors need to be orthogonal
      Varimax: attempts to simplify interpretation by maximizing the variances of the variable loadings on each factor
      Quartimax: simplifies the solution by finding a rotation that produces high and low loadings across factors for each variable
    Oblique rotation: used when there is reason to allow the factors to be correlated
      Oblimin and Promax (Promax runs faster)

  • Factor Analysis-10
    Factor scores: if you are satisfied with a factor solution
      You can request that a new set of variables be created representing the score of each observation on the factors (difficult to interpret)
      You can use the lambda coefficients to judge which variables are highly related to a factor, then compute the sum or the mean of those variables for further analysis (easy to interpret)

  • Factor Analysis-11
    Sample size: the sample size should be about 10 to 15 times the number of variables (as in other multivariate procedures)
    Number of methods: there are 8 factoring methods, including principal components
      Principal axis: accounts for the correlations between the variables
      Unweighted least squares: minimizes the residuals between the observed and the reproduced correlation matrices

  • Factor Analysis-12
      Generalized least squares: similar to unweighted least squares, but gives more weight to the variables with stronger correlations
      Maximum likelihood: generates the solution that is most likely to have produced the correlation matrix
      Alpha factoring: considers the variables as a sample; does not use factor loadings
      Image factoring: decomposes each variable into a common part and a unique part, then works with the common part

  • Factor Analysis-13
    Recommendations:
      Principal components and principal axis are the most commonly used methods
      When there is multicollinearity, use principal components
      Rotations are often done; try Varimax first

  • Factor Analysis-14
    Example 1: whether a small number of athletic skills account for performance in the ten separate decathlon events
      File > Open > Data; select Olymp88.por
    Look at the correlations:
      Analyze > Correlate > Bivariate
    Principal components with orthogonal rotation:
      Analyze > Data Reduction > Factor
      Select all variables except score
      Click the Extraction button > click Scree plot
      Check off the Unrotated factor solution box
      Click Continue

  • Factor Analysis-15
    Click the Rotation button > click Varimax and Loading plots; click Continue
    Click the Options button > click Sorted by size; click the Suppress absolute values box; change .1 to .3; click Continue
    Click Descriptives > Univariate descriptives; KMO and Bartlett's test of sphericity (KMO measures how well the sample data are suited for factor analysis: .9 is great and less than .5 is not acceptable; Bartlett's test tests the sphericity of the correlation matrix); click Continue
    Click OK

  • Factor Analysis-16
    Try to validate the first factor solution using a different method:
      Analyze > Data Reduction > Factor
      Click Extraction > select Principal axis factoring; click Continue
      Click Rotation > select Direct Oblimin (leave the delta value at 0, the most oblique value possible); type 50 in the Maximum Iterations box; click Continue
      Click the Scores button > click Save as variables (this involves solving a system of equations for the factors; regression is one of the methods for solving them); click Continue
      Click OK

  • Factor Analysis-17
    Note: the Pattern matrix gives the standardized linear weights, and the Structure matrix gives the correlations between the variables and the factors (in principal components analysis, the component matrix gives both the factor loadings and the correlations)

  • Discriminant Analysis-1
    Discriminant analysis characterizes the relationship between a set of IVs and a categorical DV with relatively few categories
    It creates a linear combination of the IVs that best characterizes the differences among the groups
    Predictive discriminant analysis focuses on creating a rule to predict group membership
    Descriptive discriminant analysis studies the relationship between the DV and the IVs

  • Discriminant Analysis-2
    Possible applications:
      Should a bank offer a loan to a new customer?
      Which customers are likely to buy?
      Identify patients who may be at high risk for problems after surgery

  • Discriminant Analysis-3
    How does it work?
      Assume the population of interest is composed of distinct populations
      Assume the IVs follow a multivariate normal distribution
      DA seeks the linear combinations of the IVs that best separate the populations
      If we have k groups, we need k-1 discriminant functions
      A discriminant score is computed for each function
      The scores are used to classify cases into one of the categories

  • Discriminant Analysis-4
    There are three methods for classifying group membership:
      Maximum likelihood method: assign a case to group k if its probability of membership is greater in group k than in any other group
      Fisher (linear) classification functions: assign a case to group k if its score on the function for group k is greater than on any other function
      Distance function: assign a case to group k if its distance to the centroid of that group is the minimum
    Note: SPSS uses the maximum likelihood method
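For the two-group case, the Fisher-style discriminant function above reduces to a weight vector w = W⁻¹(m1 − m2), with cases classified by which side of the midpoint between the projected group means they fall on. A numpy sketch with made-up, well-separated groups (this illustrates the distance/Fisher flavor, not SPSS's maximum likelihood rule):

```python
import numpy as np

rng = np.random.default_rng(8)
g1 = rng.normal([0, 0], 1.0, size=(60, 2))
g2 = rng.normal([3, 1], 1.0, size=(60, 2))

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
# within-groups covariance (equal n, so a simple sum is proportional
# to the pooled estimate, which is all the weights need)
W = np.cov(g1, rowvar=False) + np.cov(g2, rowvar=False)
w = np.linalg.solve(W, m1 - m2)          # discriminant weights

cut = ((m1 + m2) @ w) / 2                # midpoint of the projected means
correct1 = (g1 @ w > cut).sum()          # group-1 cases land above the cut
correct2 = (g2 @ w < cut).sum()
accuracy = (correct1 + correct2) / (len(g1) + len(g2))
```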

  • Discriminant Analysis-5
    Basic steps in DA:
      Identify the variables
      Screen the data: look for outliers, variables that may not be good predictors, etc.
      Run DA
      Check the correct prediction rate
      Check the importance of the individual predictors
      Validate the model

  • Discriminant Analysis-6
    Assumptions:
      IVs are either dichotomous or measurement variables
      Normality
      Homogeneity of variances

  • Discriminant Analysis-7
    Example 1: VCR buyers filled out a survey; we want to determine which set of demographic and attitude variables best predicts which customers may buy another VCR
      File > Open > Data > CSM.por
      Explore the data
      Analyze > Classify > Discriminant
      Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, and value into the Independents box
      Move buyyes into the Grouping Variable box
      Click Define Range; type 1 for Minimum and 2 for Maximum
      Click Continue

  • Discriminant Analysis-8
    Click Statistics > click Box's M and Fisher's; Continue
    Click the Classify button > click Summary table and Separate-groups; Continue
    Click the Save button > click Discriminant scores; Continue
    Click OK
    How do the original variables relate to the discriminant score?
      Graphs > Scatter > click Define
      Move pinnovat into X and dis1_1 into Y; move buyyes into the Set Markers by box

  • Discriminant Analysis-9
    Since Box's M test was significant, one can ask SPSS to run DA using the separate-groups covariance option (under Classify) and compare the results
    From the first analysis, we see that age was not important; one can redo the analysis without age and compare the results

  • Discriminant Analysis-10
    Validate the model: leave-one-out classification
      Repeat the analysis; click Classify > click Leave-one-out classification; click Continue
    Example 2: predict smoking and drinking habits
      Analyze > Classify > Discriminant
      Move smkdrnk into the Grouping Variable box; move age, attend, black, class, educ, sex, and white into the Independents list
      Click Statistics > select Fisher's and Box's M; Continue
      Click Classify > Summary table, Combined-groups, Territorial map; Continue
      Click OK

  • Cluster Analysis-1
    Cluster analysis is an exploratory data analysis technique designed to reveal groups
    How? By distance: observations that are close together should be in the same group, and observations in different groups should be far apart
    Applications:
      Grouping plants and animals into ecological groups
      Grouping companies by product usage

  • Cluster Analysis-2
    Two types of methods:
      Hierarchical: requires observations to remain together once they have joined in a cluster
        Complete linkage
        Between-groups average linkage
        Ward's method
      Nonhierarchical: no such requirement
        The researcher must pick the number of clusters to run (K-means algorithm)

  • Cluster Analysis-3
    Recommendations:
      For relatively small samples (less than a few hundred), use hierarchical clustering
      For large samples, use K-means
    Example 1: evaluating 20 types of beer
      File > Open > Data; select beer.por
      Analyze > Descriptive Statistics > Descriptives
      Move cost, calories, sodium, and alcohol into the variable list
      Click Save standardized values; OK
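The K-means algorithm recommended above for large samples is just an assign/update loop. A bare-bones numpy sketch on two synthetic, well-separated blobs (SPSS's QUICK CLUSTER does the same iteration with more refinements):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each case to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its cases (keep it if empty)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([5, 5], 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)   # should recover the two blobs
```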

  • Cluster Analysis-4
    Analyze > Classify > Hierarchical Cluster
    Move cost, calories, sodium, and alcohol into the Variables box
    Move beer into the Label Cases by box
    Click Plots > click Dendrogram; click None in the Icicle area; Continue
    Click Method > select Z-scores from the Standardize drop-down list; Continue
    Click Save > click Range of solutions; range 2 through 5 clusters; Continue
    OK

  • Cluster Analysis-5
    Additional analysis:
      Look at the last 4 columns of the data (clu5_1 to clu2_1); they contain the memberships for each solution between 5 and 2 clusters
      Analyze > Descriptive Statistics > Frequencies
      Move clu2_1 to clu5_1 into the Variables box
      OK
    Obtain mean profiles for the clusters:
      Graphs > Line > Summaries of separate variables
      Click Define > move zcost, zcalorie, zsodium, and zalcohol into the Lines Represent box
      Click clu4_1 and move it into the Category Axis box

  • Path Analysis-1
    Path analysis is a technique based on regression for establishing causal relationships
    Start with a diagram of the causal flow
    Direct causal effects model (regression):
      The direct causal effect of an IV on a DV is the coefficient (the number of units the DV changes for a 1-unit change in X)
      Build on the direct causal effects model
    Two forms of a causal model:
      Diagram
      Equation (structural equation)

  • Path Analysis-2
    An example of a causal model
    Structural equation:
      Z4 = p41*Z1 + p42*Z2 + p43*Z3 + e4
      p: path coefficient
      e: disturbance
      Z4: endogenous variable
      Z1: exogenous variable
    Path diagram
    An indirect effect is the product of the path coefficients along the path
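The path coefficients in a structural equation like the one above are standardized regression weights. A sketch estimating them by regressing the z-scored endogenous variable on the z-scored exogenous ones, with made-up data whose true raw weights are 0.5, 0.3, and 0.2 (the standardized estimates come out larger because Z4's SD is below 1):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
z1, z2, z3 = (rng.normal(size=n) for _ in range(3))
z4 = 0.5 * z1 + 0.3 * z2 + 0.2 * z3 + rng.normal(scale=0.6, size=n)

def standardize(x):
    return (x - x.mean()) / x.std()

# standardized regression: the coefficients are the path coefficients
Z = np.column_stack([standardize(v) for v in (z1, z2, z3)])
coefs, *_ = np.linalg.lstsq(Z, standardize(z4), rcond=None)
# coefs ~ (p41, p42, p43), ordered by the strength of the direct effects
```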

  • Path Analysis-3
    Steps in path analysis:
      Create a path diagram
      Use regression to estimate the structural equation coefficients
      Assess the model:
        Compare the observed and reproduced correlations (the reproduced correlations are computed by hand)

  • Path Analysis-4
    Research questions:
      Is our model, which describes the causal effects among region of the world, status as a developing nation, number of doctors, and male life expectancy, consistent with the observed correlations among these variables?
      If the model is consistent, what are the estimated direct, indirect, and total causal effects among the variables?

  • Path Analysis-5
    Legal paths:
      No path may pass through the same variable more than once
      No path may go backward on an arrow after going forward on another arrow
      No path may include more than one double-headed curved arrow

  • Path Analysis-6
    Component labels:
      D: direct effect (just one straight arrow)
      I: indirect effect (more than one straight arrow)
      S: spurious effect (there is a backward arrow)
      U: uncertain effect (starts with a double-headed curved arrow)

  • Path Analysis-7
    If the model is in question (some of the reproduced correlations differ from the observed correlations by more than .05):
      Test all missing paths (run additional regressions and check the significance of the coefficients)
      Drop existing paths whose coefficients are not significant

  • Logistic regression - Motivations
    When the dependent variable is dichotomous, regular regression is not appropriate
    We want to predict a probability
      OLS regression predictions can be any number, not just numbers between 0 and 1
    When dealing with proportions, the variance depends on the mean, so the equal-variance assumption of OLS is violated

  • Motivations - Continued
    Fit an S curve to the data

  • What is Logistic Regression?
    Regressions of the form:
      ln(Odds) = B0 + B1*X1 + ... + Bk*Xk
    ln(Odds) is called a logit
    Odds = Prob/(1-Prob)
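The logit equation above inverts to a probability via the logistic (S) curve: Prob = 1/(1 + exp(-(B0 + B1*X1 + ... + Bk*Xk))). A quick round-trip check with hypothetical coefficients (the values are made up for illustration):

```python
import math

b0, b1 = -2.0, 0.8           # hypothetical coefficients
x = 3.0

logit = b0 + b1 * x          # ln(odds)
odds = math.exp(logit)       # odds = Prob / (1 - Prob)
prob = odds / (1 + odds)     # equivalently 1 / (1 + exp(-logit))
```

Taking ln(prob / (1 - prob)) recovers the logit, which is why the model is linear on the logit scale even though the probability follows an S curve.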

  • Application of Logistic Regression
    When to use it? When the dependent variable is dichotomous
    Objectives:
      Run a logistic regression
      Apply a stepwise logistic regression
      Use an ROC (receiver operating characteristic) curve to assess the model

  • Assumptions of logistic regression
    The independent variables are interval or dichotomous
    All relevant predictors are included, no irrelevant predictors are included, and the form of the relationship is linear (in the logit)
    The expected value of the error term is zero
    There is no autocorrelation

  • Assumptions of logistic regression (cont.): There is no correlation between the error and the independent variables; there is no perfect multicollinearity among the independent variables; the sample must be large (rule of thumb: n should be greater than 30 times the number of parameters).

  • Note on assumptionsNo need for normality of errorsNo need for equal variance

  • Example: Objective: to predict low birth weight babies. Variables: Low: 1 if birth weight is below 2500 grams; LWT: mother's weight at last menstrual period; Age; Smoke; PTL: # of premature labors; HT: history of hypertension; UI: uterine irritability; FTV: # of physician visits during the first trimester; Race: 1=white, 2=black, 3=other.

  • ExampleFile > Open > Data > Select SPSS Portable type > select Birthwt (in Regression)Analyze > Regression > Binary LogisticMove low to the Dependent list boxMove age, ftv, ht, ptl, race, smoke, and ui into the Covariate list box

  • Example (cont.): Click the Categorical button; place race in the Categorical Covariates box. Click Continue, then click Save; click the Probability and Group Membership check boxes. Click Continue, then the Options button.

  • Example (cont.)Click on the Classification plots and Hosmer-Lemeshow goodness of fit checkboxesClick Continue, then OK

  • Logistic outputs: Initial summary output: info on the dependent and categorical variables. Block 0: based on a model that includes only a constant; provides baseline info. Block 1 (Method = Enter): includes the full model info. The chi-square tests whether all the coefficients are 0 (similar to the F test in regression).

  • Logistic outputs (cont.): The Model chi-square value is the difference between the initial and final -2LL (a small value of -2LL indicates a good fit; -2LL = 0 indicates a perfect fit). The Step and Block rows display the result of the last step and block (they are the same here because we are not using stepwise regression).

  • Logistic outputs (cont.): The goodness-of-fit statistic -2LL is 203.554. Cox & Snell R square is similar to R-square in OLS; Nagelkerke R square is preferred because it can reach 1. Hosmer and Lemeshow test: tests that there is no difference between expected and observed counts, i.e. we prefer a non-significant result.

  • Logistic outputs (cont.): Classification table: can our model predict accurately? Overall accuracy is 73%. We do much better on higher birth weights and a poor job on lower birth weights. A significant model does not necessarily have high predictability.

  • Interpretation of the coefficients: E.g. HT (hypertension): B=1.736, so hypertension in the mother increases the log odds by 1.736. Exp(B)=5.831: hypertension in the mother increases the odds of having a low birth weight baby by a factor of 5.831. What is the probability change? If the original odds are 1:100 (p=.0099), they change to 5.831:100 (p=.0551); if the original odds are 1:1 (p=.5), they change to 5.831:1 (p≈.85).
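The probability changes quoted above follow mechanically from multiplying the odds by Exp(B); a quick Python sketch using the slide's Exp(B)=5.831:

```python
def updated_prob(p_old, odds_ratio):
    """Multiply the current odds by Exp(B) and convert back to a probability."""
    odds_old = p_old / (1 - p_old)
    odds_new = odds_old * odds_ratio
    return odds_new / (1 + odds_new)

# Odds 1:100 (p ~ .0099) times 5.831 -> p ~ .0551
print(round(updated_prob(1 / 101, 5.831), 4))
# Odds 1:1 (p = .5) times 5.831 -> p ~ .8536
print(round(updated_prob(0.5, 5.831), 4))
```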

  • Interpretation of the coefficients (cont.): Categorical variable Race: first an overall effect. Race(1), white: the effect of being white is significant, acting to decrease the odds ratio, compared with the "other" category, by a factor of .4. The effect of being black is not significant compared with "other".

  • Making a prediction: Suppose a mother: age 20; weighs 130 pounds; smokes; no hypertension or premature labor; has uterine irritability; white; two visits to her doctor.

  • Making a prediction (cont.): P(event) = 1/(1 + exp(−(a + b1X1 + … + bkXk))). Here P=.397, so the mother is predicted not to have a low birth weight baby because the probability is less than .5.
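The prediction formula above is easy to sketch in Python. The coefficient values below are placeholders, since the fitted coefficients live in the SPSS output and are not reproduced on the slide:

```python
import math

def predict_prob(a, coefs, xs):
    """P(event) = 1 / (1 + exp(-(a + b1*x1 + ... + bk*xk)))."""
    linear = a + sum(b * x for b, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-linear))

# With a zero linear predictor the probability is exactly .5;
# classify as the event only when the probability exceeds .5.
p = predict_prob(0.0, [0.5, -0.3], [1.0, 2.0])  # placeholder coefficients
print(p, p > 0.5)
```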

  • Checking classification: We need to study the characteristics of mispredicted cases. Transform > Compute > Pred_err=1 if… Analyze > Compare Means (LWT vs Pred_err). The mean LWT for mispredicted cases is much lower than for correctly predicted cases.

  • Residual Analysis: Analyze > Regression > Logistic > click Save > click Cook's, Leverage, Unstandardized, Logit, and Standardized. Examining the data: Cook's and Leverage values should be small (if a case has no influence on the regression result, the values would be 0). Res_1 is the residual on the probability scale (e.g. the 1st case has a predicted probability of .29804 and an actual Low value of 0, so res_1 = 0 − .29804 = −.29804). Zre_1 is the standardized residual of the probability; Lre_1 is the residual in terms of the logit.

  • ROC curve (Receiver Operating Characteristic): Sensitivity: true positive rate. Specificity: true negative rate. Changing the cutoff point (.5) changes both the sensitivity and the specificity; the ROC curve can help us select an optimal cutoff point. Graph > ROC Curve > move pre_1 to Test Variable and low to State Variable, type 1 in the Value of State Variable, and check With diagonal reference line and Coordinate points of the ROC Curve.

  • ROC curve interpretation: Vertical axis: sensitivity (true positive rate). Horizontal axis: false positive rate (1 − specificity). Diagonal: reference line. The curve shows the trade-off between sensitivity and the false positive rate; pay attention to the region where the curve rises rapidly. The 1st column of the curve's coordinates gives the cutoff probabilities.
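The ROC construction described above, sweeping the cutoff and recording sensitivity against the false positive rate, can be sketched from scratch with toy predicted probabilities (an illustration of the idea, not the SPSS procedure):

```python
def roc_points(probs, labels):
    """For each candidate cutoff, return (false positive rate, sensitivity)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for cut in sorted(set(probs), reverse=True):
        # Predict the event whenever the probability reaches the cutoff.
        tp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Toy predictions: higher probabilities should go with label 1.
print(roc_points([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]))
```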

  • Residual Analysis (cont.): Examine the distribution of zre_1: Graph > Interactive > Histogram > drag zre_1 to the X axis, click Histogram, click Normal Curve. Note: this plot need not show normality. Finding influential cases: Graph > Scatterplot > Define > move id to the X axis and coo_1 to the Y axis. Multicollinearity: use OLS regression to check (?).

  • Multinomial Logistic Regression: The dependent variable is categorical with two or more categories. It is an extension of logistic regression. The assumptions are those of logistic regression, plus the dependent variable has a multinomial distribution.

  • Example: Objective: predict credit risk (3 categories) based on financial and demographic variables. Variables: Age; Income; Gender; Marital (single, married, divsepwid); Numkids: # of dependent children.

  • Example (cont.): Numcards: # of credit cards; Howpaid: how often paid (weekly, monthly); Mortgage: have a mortgage (y, n); Storecar: # of store credit cards; Loans: # of other loans; Risk: 1=bad risk (lost), 2=bad risk (profit), 3=good risk.

  • How does it work? Let f(j) be the probability of being in outcome category j: f(1)=P(bad risk-lost), f(2)=P(bad risk-profit), f(3)=P(good risk). Define g(1)=f(1)/f(3), g(2)=f(2)/f(3), g(3)=f(3)/f(3)=1.

  • How does it work? (cont.): Fit the models: ln(g(1)) = A1 + B11X1 + … + B1kXk; ln(g(2)) = A2 + B21X1 + … + B2kXk; ln(g(3)) = ln(1) = 0 = A3 + B31X1 + … + B3kXk.
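The model equations above give log probability ratios relative to the reference category (good risk), whose log ratio is fixed at 0. Converting them to probabilities is a softmax-style normalization; a sketch with hypothetical log-ratio values:

```python
import math

def multinomial_probs(log_ratios):
    """Given ln(g(j)) for the non-reference categories (the reference
    category has ln(g)=0), return the probabilities f(j) for all categories."""
    g = [math.exp(lr) for lr in log_ratios] + [1.0]  # reference: exp(0) = 1
    total = sum(g)
    return [x / total for x in g]

# Hypothetical log ratios for (bad risk-lost) and (bad risk-profit):
probs = multinomial_probs([-1.0, 0.5])
print(probs, sum(probs))  # the three probabilities sum to 1
```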

  • Example (cont.): File > Open > Data > select Risk > Open. Move risk into the Dependent list box; move marital and mortgage into the Factor(s) list box; move income and numkids into the Covariate(s) list box. Click the Model button, then click the Cancel button.

  • Example (cont.): Click the Statistics button; check the Classification table check box; click Continue; click Save. The Multinomial Logistic regression in SPSS version 10 will only save model info in an XML (Extensible Markup Language) format. Click Cancel, then click OK.

  • Multinomial output: Model Fit, Pseudo R-square, and Likelihood ratio tests are similar to logistic regression. The Parameter estimates table is different: there are two sets of parameters, one for the probability ratio (bad risk-lost)/(good risk) and another for the probability ratio (bad risk-profit)/(good risk).

  • Interpretation of coefficients: Income in the bad risk-lost section: it is significant, with Exp(B)=.962: the expected probability ratio decreases slightly (by a factor of .962) for each one-unit increase in income.

  • How to predict? F(1) is the chance of being in the bad risk-lost group, F(2) the chance in the bad risk-profit group, F(3) the chance in the good risk group. F(j) = g(j)/Σ g(i), where g(j) = exp(model j).

  • How to predict? (cont.): Suppose an individual: single, has a mortgage, no children, income of 35,000 pounds. Then g(1)=.218, g(2)=.767, g(3)=1.

  • How to predict? (cont.): F(1)=.218/(.218+.767+1)=.110, F(2)=.386, F(3)=.504. The individual is classified as a good risk.
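The arithmetic on this slide can be checked directly (using the g values given above):

```python
# g values from the slide: g(1)=.218, g(2)=.767, g(3)=1 (reference category).
g = [0.218, 0.767, 1.0]
total = sum(g)                        # .218 + .767 + 1 = 1.985
F = [round(x / total, 3) for x in g]  # normalize to probabilities
print(F)  # [0.11, 0.386, 0.504] -> largest is F(3): classified as good risk
```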

  • Multinomial Logistic Regression with Interaction: Analyze > Regression > Multinomial Logistic > click Model, select Custom > specify your model (all main effects and the interaction between Marital and Mortgage). Interpret the results as usual.

  • Interaction effects in logistic regression: It is similar to OLS regression: add interaction terms to the model as cross-products. In SPSS, highlighting two variables (holding down the Ctrl key) and moving them into the variable box creates the interaction term.