Anova Glm


  • 8/10/2019 Anova Glm




    Analysis of Variance and


    Level : E2




    Key concepts

    Analysis of Variance

    Analysis of Covariance

    GLM Procedure



    Analysis of Variance


    Used to uncover the main and interaction effects of categorical independent variables (called "factors") on one or more interval-scaled dependent variables.

    Example: an experiment may measure weight change (the dependent variable) for men and women who participated in two different weight-loss programs. The 4 cells of the design are formed by the 4 combinations of sex (men, women) and program (A, B).



    What goes wrong if we use the t-test with three or more means?


    Suppose we have three means A, B and C, and we want to test

    H0 : A = B = C

    against H1 : at least one of them is different from the others.

    If we apply the t-test repeatedly, once for each pair of means, we increase the chance of a Type I error somewhere in the analysis.
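
    To see how quickly the error grows, here is a rough worked calculation. It assumes each pairwise t-test (A vs B, A vs C, B vs C) is run at $\alpha = 0.05$ and, as a simplification that is not in the slides, that the three tests are independent:

```latex
P(\text{at least one false rejection})
  = 1 - (1 - \alpha)^{3}
  = 1 - 0.95^{3}
  \approx 0.143
```

    So the chance of at least one false rejection is close to 14%, not the nominal 5%.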




    Assumptions

    The dependent variable is measured on an equal-interval scale.

    The k samples are independently and randomly drawn from the source population(s).

    The source population(s) can reasonably be supposed to have a normal distribution.

    The k samples have approximately equal variances.



    Main Effect

    The effect of a particular factor, averaged over the levels of the other factors.

    Interaction Effect

    The effect of one factor differs according to the level of another factor.
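
    One common way to write these effects for two factors A and B is sketched below. The symbols are not from the slides: $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the main effects, $(\alpha\beta)_{ij}$ is the interaction and $\varepsilon_{ijk}$ is the random error for observation $k$ in cell $(i, j)$.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
```

    If every $(\alpha\beta)_{ij}$ is zero, the effect of A is the same at each level of B, which is the "no interaction" case.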



    The key statistic in ANOVA is the F-test of difference of group means, which tests whether the means of the groups formed by values of the independent variable (or combinations of values for multiple independent variables) are different enough not to have occurred by chance.

    ANOVA focuses on F-tests of significance of differences in group means. If one has a complete enumeration rather than a sample, then any difference of means is "real."

    However, when ANOVA is used for comparing two or more different samples, the real means are unknown. The researcher wants to know if the difference in sample means is enough to conclude that the real means do in fact differ among two or more groups.

    If the group means do not differ significantly, there is no evidence that the independent variable(s) had an effect on the dependent variable.

    If the F test shows that overall the independent variable(s) is (are) related to the dependent variable, then multiple comparison tests of significance are used to explore just which values of the independent variable(s) have the most to do with the relationship.
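
    For the one-way case the F statistic can be written out as follows. The notation is an assumption added here: $k$ groups, $N$ observations in total, $SS_{\text{between}}$ for the variation among group means and $SS_{\text{within}}$ for the variation inside groups.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```

    Under the null hypothesis of equal means, and given the ANOVA assumptions, this ratio follows an F distribution with $k - 1$ and $N - k$ degrees of freedom, so a large value is evidence against equal means.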



    Post-hoc Comparisons

    If the null hypothesis in ANOVA is rejected, proceed to a multiple comparison (post-hoc) test.

    The most common tests are

    Least Significant Difference (LSD)

    Tukey's Honestly Significant Difference (HSD)

    Bonferroni, Scheffé



    Suppose we are testing the null hypothesis that the four means are equal,

    H0 : $\mu_1 = \mu_2 = \mu_3 = \mu_4$

    H1 : at least one $\mu_i$ differs from the others,

    and that this hypothesis is rejected.

    The F test in ANOVA tells us that at least one mean differs from the others, but it does not specify which one.

    One possible way to detect which particular mean is different is to conduct six pairwise tests, one for each pair of means.
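
    With four means there are $\binom{4}{2} = 6$ pairs, which is where the count of six comes from. Written out, with the Bonferroni-adjusted level shown as one possible correction (the overall level $\alpha = 0.05$ is an assumption for illustration):

```latex
H_0^{(12)}: \mu_1 = \mu_2 \qquad H_0^{(13)}: \mu_1 = \mu_3 \qquad H_0^{(14)}: \mu_1 = \mu_4
H_0^{(23)}: \mu_2 = \mu_3 \qquad H_0^{(24)}: \mu_2 = \mu_4 \qquad H_0^{(34)}: \mu_3 = \mu_4

\alpha_{\text{per test}} = \frac{\alpha}{6} = \frac{0.05}{6} \approx 0.0083
```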



    Unbalanced Designs

    A design is unbalanced if the sample sizes for the treatment combinations are not all equal.

    Unbalanced designs cause confounding.

    Confounding is the condition in which the effects of two (or more) explanatory variables cannot be distinguished from each other.



    Types of Sum of Squares

    There are Type I, Type II, Type III and Type IV sums of squares.

    Type II sums of squares are the reduction in the SSE due to adding the effect to a model that contains all other effects except those that contain the effect being tested.

    Type III sums of squares are each adjusted for all other effects in the model.

    If the model does not contain any interaction term, Type II and Type III lead to the same output.

    For the highest-order interaction term, the two methods always give the same result.

    If interaction can safely be ignored, Type II provides a more powerful test of the main effects than Type III.

    If there is not sufficient reason to ignore interactions, we should use Type III. This is the default type in most statistical analysis software.
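
    A minimal sketch of how the different types can be requested side by side in PROC GLM. The data are the unbalanced 2-by-2 values from the PROC GLM example in this deck; the SS1, SS2 and SS3 options on the MODEL statement select which tables are printed.

```sas
/* Unbalanced 2-by-2 data taken from the PROC GLM example in this deck */
data exp;
   input A $ B $ Y @@;
   datalines;
A1 B1 12  A1 B1 14  A1 B2 11  A1 B2 9
A2 B1 20  A2 B1 18  A2 B2 17
;
run;

proc glm data=exp;
   class A B;
   /* SS1, SS2, SS3 request Type I, Type II and Type III sums of squares */
   model Y = A B A*B / ss1 ss2 ss3;
run;
quit;
```

    Comparing the A and B lines across the tables shows how much the choice of type matters for unbalanced data, while the A*B line (the highest-order term) should agree across types.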



    SAS Implementation

    proc anova data = hhh;

    class treat;

    model weight = treat;

    run;


    PROC ANOVA takes into account the special structure of a balanced design, so it is faster and uses less storage than PROC GLM for balanced data, whereas the GLM procedure can analyze both balanced and unbalanced data.

    The classification variable is specified in the CLASS statement.

    The MODEL statement names the dependent variables and independent effects.




    title1 'Nitrogen Content of Red Clover Plants';
    data Clover;
       input Strain $ Nitrogen @@;
       datalines;
    1  19.4   1  32.6   1  27.0   1  32.1   1  33.0
    5  17.7   5  24.8   5  27.9   5  25.2   5  24.3
    4  17.0   4  19.4   4   9.1   4  11.9   4  15.8
    7  20.7   7  21.0   7  20.5   7  18.8   7  18.6
    13 14.3   13 14.4   13 11.8   13 11.6   13 14.2
    15 17.3   15 19.4   15 19.1   15 16.9   15 20.8
    ;

    proc anova data = Clover;
       class Strain;
       model Nitrogen = Strain;
    run;
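
    As a hedged extension of this program, the sketch below adds a MEANS statement to request Tukey's pairwise comparisons and Levene's test for equal variances. It assumes the Clover data set created by the DATA step above; the choice of TUKEY and HOVTEST=LEVENE is an illustration, not part of the original example.

```sas
/* Assumes the Clover data set created by the DATA step above */
proc anova data=Clover;
   class Strain;
   model Nitrogen = Strain;
   /* TUKEY: pairwise comparisons of strain means            */
   /* HOVTEST=LEVENE: check of the equal-variance assumption */
   means Strain / tukey hovtest=levene;
run;
quit;
```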




    Results and interpretation

    Dependent Variable: Nitrogen

    Source    DF    Sum of Squares    Mean Square     F Value    Pr > F
    Model      5       847.046667      169.409333       14.37
    Strain     5       847.0466667     169.4093333      14.37



    The degrees of freedom (DF) column should be used to check the analysis results. The model degrees of freedom for a one-way analysis of variance are the number of levels minus 1; in this case, 6-1=5. The Corrected Total degrees of freedom are always the total number of observations minus one; in this case 30-1=29. The sum of Model and Error degrees of freedom equals the Corrected Total.

    The overall F test is significant (F=14.37,p
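
    The degrees-of-freedom check can be carried one step further using only the numbers quoted for this example. The Error degrees of freedom follow by subtraction, and the Model mean square is the Model sum of squares divided by its degrees of freedom:

```latex
df_{\text{model}} = 6 - 1 = 5, \qquad
df_{\text{total}} = 30 - 1 = 29, \qquad
df_{\text{error}} = 29 - 5 = 24

MS_{\text{model}} = \frac{847.046667}{5} \approx 169.409333, \qquad
F = \frac{MS_{\text{model}}}{MS_{\text{error}}}
```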



    Analysis of Covariance

    A combination of linear regression and ANOVA.

    If a continuous variable can have an impact on the dependent variable and we want to control for it, we use ANCOVA in place of ANOVA. That is, in experimental designs ANCOVA controls for factors which cannot be randomized but which can be measured on an interval scale.

    Example: in some studies the baseline value is a variable we need to control for in order to examine the significance of the categorical predictors.

    When covariate scores are available, we have information about differences between treatment groups that existed before the experiment was performed, and we want to control for those differences.



    As a general rule, a very small number of covariates is best.

    Covariates should be correlated with the dependent variable.

    Covariates should not be correlated with each other (multicollinearity).

    Data on covariates should be gathered before the treatment is administered.

    Failure to do this often means that some portion of the effect of the predictor is removed from the dependent variable when the covariate adjustment is calculated.

    The rules for sums of squares and so on remain as they were for ANOVA.
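
    A minimal ANCOVA sketch in PROC GLM, under stated assumptions: the data set trial, the variables trt, pre and post, and all the values are hypothetical and made up purely for illustration. The baseline measurement pre plays the role of the covariate gathered before treatment.

```sas
/* Hypothetical data: made-up values for illustration only */
data trial;
   input trt $ pre post @@;
   datalines;
A 10 14  A 12 15  A 11 13  A 14 18
B 10 11  B 13 13  B 12 12  B 15 16
;
run;

proc glm data=trial;
   class trt;
   /* pre is not listed in CLASS, so it enters the model as a continuous covariate */
   model post = trt pre / ss3;
   /* treatment means adjusted for the baseline covariate */
   lsmeans trt / stderr pdiff;
run;
quit;
```

    The Type III line for trt then tests the treatment effect after adjusting for the baseline, which is the comparison ANCOVA is meant to provide.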



    GLM Procedures

    The general linear model (GLM) is a statistical linear model. It may be written as

    Y = XB + U

    where Y is a matrix with series of multivariate measurements, X is a matrix that might be a design matrix, B is a matrix containing parameters that are usually to be estimated and U is a matrix containing residuals (i.e., errors or noise). The residual is usually assumed to follow a multivariate normal distribution. If the residual does not follow a multivariate normal distribution, generalized linear models may be used to relax assumptions about Y and U.

    The GLM procedure uses the method of least squares to fit general linear models.

    GLM handles models relating one or several continuous dependent variables to one or several independent variables. The independent variables may be either classification variables, which divide the observations into discrete groups, or continuous variables.
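
    In the notation Y = XB + U, the least-squares fit can be sketched as follows. The hats and the generalized inverse $(X^{\top}X)^{-}$ are notation added here; a generalized inverse is used because a design matrix built from classification variables is usually not of full rank.

```latex
\hat{B} = (X^{\top} X)^{-} X^{\top} Y, \qquad
\hat{U} = Y - X\hat{B}
```

    $\hat{B}$ is a choice of parameters that minimizes the sum of squared residuals, and $\hat{U}$ holds the fitted residuals.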



    Thus, the GLM procedure can be used for many different analyses, including

    simple regression

    multiple regression

    analysis of variance (ANOVA), especially for unbalanced data

    analysis of covariance (ANCOVA)

    response-surface models

    weighted regression

    polynomial regression

    partial correlation

    multivariate analysis of variance (MANOVA)

    repeated measures analysis of variance



    SAS GLM procedure

    PROC GLM DATA = SAS data-set;

    CLASS variables;

    MODEL dependents = independents ;

    MEANS effects ;

    LSMEANS effects ;

    OUTPUT OUT = SAS data-set keyword = variable... ;




    PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables.

    CLASS specifies classification variables for the analysis.

    MODEL specifies dependent and independent variables for the analysis.

    MEANS computes means of the dependent variable for each value of the specified effect.

    LSMEANS produces means for the outcome variable, broken out by the variable specified and adjusting for any other explanatory variables included on the MODEL statement.

    LSMEANS can also be used for multiple comparison tests.

    OUTPUT specifies an output data set that contains all variables from the input data set and variables representing statistics from the analysis.




    title 'Analysis of Unbalanced 2-by-2 Factorial';
    data exp;
       input A $ B $ Y @@;
       datalines;
    A1 B1 12  A1 B1 14  A1 B2 11  A1 B2 9
    A2 B1 20  A2 B1 18  A2 B2 17
    ;

    proc glm;
       class A B;
       model Y=A B A*B;
    run;





    Analysis of Unbalanced 2-by-2 Factorial
    The GLM Procedure
    Dependent Variable: Y

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               3       91.71428571    30.57142857      15.29    0.0253
    Error               3        6.00000000     2.00000000
    Corrected Total     6       97.71428571

    R-Square    Coeff Var    Root MSE    Y Mean
    0.938596     9.801480    1.414214    14.42857

    Source    DF      Type I SS    Mean Square    F Value    Pr > F
    A          1    80.04761905    80.04761905      40.02    0.0080
    B          1    11.26666667    11.26666667       5.63    0.0982
    A*B        1     0.40000000     0.40000000       0.20    0.6850

    Source    DF    Type III SS    Mean Square    F Value    Pr > F
    A          1    67.60000000    67.60000000      33.80    0.0101
    B          1    10.00000000    10.00000000       5.00    0.1114
    A*B        1     0.40000000     0.40000000       0.20    0.6850




    The degrees of freedom may be used to check your data. The Model degrees of freedom for a 2×2 factorial design with interaction are (ab-1), where a is the number of levels of A and b is the number of levels of B; in this case, (2×2-1) = 3. The Corrected Total degrees of freedom are always one less than the number of observations used in the analysis; in this case, 7-1=6.

    The overall F test is significant (F=15.29, p=0.0253), indicating strong evidence that the means for the four different A×B cells are different. You can further analyze this difference by examining the individual tests for each effect.

    Four types of estimable functions of parameters are available for testing hypotheses in PROC GLM. For data with no missing cells, the Type III and Type IV estimable functions are the same and test the same hypotheses that would be tested if the data were balanced. Type I and Type III sums of squares are typically not equal when the data are unbalanced; Type III sums of squares are preferred in testing effects in unbalanced cases because they test a function of the underlying parameters that is independent of the number of observations per treatment combination.

    At the 5% significance level, the A*B interaction is not significant (F=0.20, p=0.6850). This indicates that the effect of A does not depend on the level of B and vice versa. Therefore, the tests for the individual effects are valid, showing a significant A effect (F=33.80, p=0.0101) but no significant B effect (F=5.00, p=0.1114).
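
    The reported statistics can be reproduced by hand from the sums of squares in the output, which is a useful check on how the table fits together. Each F value is the effect mean square divided by the error mean square, and R-Square is the Model sum of squares as a share of the Corrected Total:

```latex
MS_{E} = \frac{6.0}{3} = 2.0

F_{A} = \frac{67.6}{2.0} = 33.80, \qquad
F_{B} = \frac{10.0}{2.0} = 5.00, \qquad
F_{A*B} = \frac{0.4}{2.0} = 0.20

R^{2} = \frac{91.71428571}{97.71428571} \approx 0.938596
```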



    Key Concepts

    Non-Parametric Tests

    Mann - Whitney Test

    Kruskal - Wallis Test

    Friedman Test

    McNemar Test

    Log - Rank Test



    Parametric Vs. Non-Parametric Tests


    Parametric tests

    These methods need an assumption about the distribution from which the samples are drawn.

    They require a sufficiently large sample size.


    Non-parametric tests

    These methods need no assumption about the distribution from which the samples are drawn, i.e. they are distribution-free tests.

    They should be used when the sample size is small.



    Mann-Whitney Wilcoxon Test


    Test for comparing two populations.

    Used to test the null hypothesis that two independent samples have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (mean or median), i.e. it is used to make inferences about the population mean or median without requiring the assumption of normality.

    Used as an alternative to the two-sample t-test when the normality assumption is not satisfied.

    Applied when the observations in a sample are ranks, that is, ordinal data rather than direct measurements.




    Assumptions

    The two samples are randomly and independently drawn.

    The dependent variable is continuous, capable of producing measures carried out to the nth decimal place.

    Measures within the two samples have the properties of at least an ordinal scale of measurement, so that it is meaningful to speak of "greater than," "less than," and "equal to."

    Data can be ranked, including tied rank values wherever appropriate. Ranks help to focus only on the ordinal relationships among the raw measures: "greater than," "less than," and "equal to."

    The two population distributions differ only by a shift in location.



    Proc npar1way wilcoxon

    In general, PROC NPAR1WAY performs an analysis of variance (option

    ANOVA), tests for location differences (options WILCOXON, MEDIAN,

    SAVAGE, and VW), and performs empirical distribution function tests (option

    EDF). The general syntax is

    PROC NPAR1WAY < options > ;

    BY variables ;

    CLASS variable ;

    EXACT < / computation-options > ;

    FREQ variable ;

    OUTPUT < OUT=SAS-data-set > < WILCOXON > ;

    VAR variables ;




    The BY statement requests separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

    The CLASS variable identifies groups (or samples) in the data. The variable can be character or numeric.

    The FREQ statement names a numeric variable that provides a frequency for each observation in the DATA= data set.

    The VAR statement names the response or dependent variables to be analyzed. These variables must be numeric. If the VAR statement is omitted, the procedure analyzes all numeric variables in the data set except for the CLASS variable, the FREQ variable, and the BY variables.

    OUT=SAS-data-set names the output data set.
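
    A minimal sketch of a Mann-Whitney (Wilcoxon rank-sum) run using these statements. To avoid inventing data, it reuses the rows for strains 1 and 5 from the Clover example earlier in this deck; the data set name TwoStrains is hypothetical, and pairing these two strains is only an illustration.

```sas
/* Rows for strains 1 and 5 copied from the Clover example in this deck */
data TwoStrains;
   input Strain $ Nitrogen @@;
   datalines;
1 19.4  1 32.6  1 27.0  1 32.1  1 33.0
5 17.7  5 24.8  5 27.9  5 25.2  5 24.3
;
run;

proc npar1way data=TwoStrains wilcoxon;
   class Strain;      /* the two groups being compared       */
   var Nitrogen;      /* response variable that gets ranked  */
   exact wilcoxon;    /* exact p-value, feasible for 5 + 5 observations */
run;
```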



    Computation-options are:

    Option            Description

    ALPHA=value       specifies the level of the confidence limits for Monte Carlo p-value estimates. The value of the ALPHA= option must be between 0 and 1, and the default is 0.01, which produces 99% confidence limits for the Monte Carlo estimates.

    MAXTIME=value     specifies the maximum clock time (in seconds) that PROC NPAR1WAY can use to compute an exact p-value. If the procedure does not complete the computation within the specified time, the computation terminates.

    MC                requests Monte Carlo estimation of exact p-values, instead of direct exact p-value computation. Monte Carlo estimation can be useful for large problems that require a great amount of time and memory for exact computations.

    N=n               specifies the number of samples for Monte Carlo estimation. The value of the N= option must be a positive integer, and the default is 10,000 samples. Larger values of n produce more precise estimates of exact p-values.

    POINT             requests exact point probabilities for the test statistics.

    SEED=number       specifies the initial seed for random number generation for Monte Carlo estimation. The value of the SEED= option must be an integer.




    Global evaluations of drug A and drug B in back pain: it was found that patients with low back pain experienced a decrease in pain after 6 to 8 weeks of daily treatment. A study was therefore conducted to determine whether this phenomenon is a drug-related response or coincidental. Patients were asked to provide a global rating of their pain, relative to baseline, on the following scale

    To test this phenomenon we use the Mann-Whitney test.



    Kruskal - Wallis Test


    Analogue of one-way ANOVA without the assumption of normality.

    Extension of the Wilcoxon test to more than two groups.

    Used to compare population location parameters among two or more groups based on independent samples.

    Used to test the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the samples differ only with respect to location.


    Assumptions

    Same as for the Wilcoxon test.
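
    A minimal sketch of a Kruskal-Wallis run. It reuses the six-strain Clover data from the ANOVA example in this deck, so no new values are introduced. When the CLASS variable has more than two levels, the WILCOXON option of PROC NPAR1WAY reports the Kruskal-Wallis test.

```sas
/* Clover data copied from the one-way ANOVA example in this deck */
data Clover;
   input Strain $ Nitrogen @@;
   datalines;
1  19.4   1  32.6   1  27.0   1  32.1   1  33.0
5  17.7   5  24.8   5  27.9   5  25.2   5  24.3
4  17.0   4  19.4   4   9.1   4  11.9   4  15.8
7  20.7   7  21.0   7  20.5   7  18.8   7  18.6
13 14.3   13 14.4   13 11.8   13 11.6   13 14.2
15 17.3   15 19.4   15 19.1   15 16.9   15 20.8
;
run;

proc npar1way data=Clover wilcoxon;
   class Strain;     /* six groups, so the Kruskal-Wallis test is reported */
   var Nitrogen;
run;
```

    Running this next to the PROC ANOVA analysis of the same data gives a direct comparison of the parametric and rank-based conclusions.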



    Friedman Test


    Models the ratings of n judges (rows) on k treatments (columns).

    Generalization of the sign test and the Spearman rank correlation test: it reduces to the sign test if there are two columns and to the Spearman rank correlation test if there are two rows.

    Also called two-way analysis on ranks, as it is used for two-way repeated measures analysis of variance by ranks.

    Used to test the null hypothesis that all treatments have identical effects against the alternative hypothesis that at least one treatment is different from at least one other treatment.




    Assumptions

    There are k experimental treatments, k ≥ 2.

    The n rows are mutually independent (i.e. results within one row do not affect the results within other rows).

    Data can be meaningfully ranked.

    SAS Implementation

    PROC FREQ with the CMH2 option in the TABLES statement.



    Friedman Test


    PROC FREQ < options > ;

    BY variables ;

    EXACT statistic-options < / computation-options > ;

    OUTPUT < OUT=SAS-data-set > options ;

    TABLES requests < / options > ;

    TEST options ;

    WEIGHT variable ;


    BY calculates separate frequency or crosstabulation tables for each BY group.

    EXACT requests exact tests for specified statistics.

    OUTPUT creates an output data set that contains specified statistics.

    TABLES specifies frequency or crosstabulation tables and requests tests and measures of association.

    TEST requests asymptotic tests for measures of association and agreement.

    WEIGHT identifies a variable with values that weight each observation.
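
    A minimal sketch of the Friedman test through PROC FREQ, under stated assumptions: the data set ratings, its variables judge, treatment and rating, and all the values are hypothetical and made up for illustration. Each judge is treated as a stratum, and SCORES=RANK makes the CMH statistics work on within-judge ranks.

```sas
/* Hypothetical ratings: 4 judges each rate 3 treatments (made-up values) */
data ratings;
   input judge treatment $ rating @@;
   datalines;
1 A 7  1 B 5  1 C 3
2 A 8  2 B 6  2 C 4
3 A 6  3 B 7  3 C 2
4 A 9  4 B 5  4 C 6
;
run;

proc freq data=ratings;
   /* judge = stratum, treatment = row, rating = column          */
   /* CMH2 with SCORES=RANK gives the rank-based CMH statistics  */
   tables judge*treatment*rating / cmh2 scores=rank noprint;
run;
```

    In the resulting Cochran-Mantel-Haenszel table, the "Row Mean Scores Differ" line is the one that corresponds to the Friedman statistic.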



    Friedman Test

    Statistic-options for the EXACT statement are:

    Option      Description

    AGREE       McNemar's test for 2×2 tables, simple kappa coefficient, and weighted kappa coefficient

    BINOMIAL    binomial proportion test for one-way tables

    CHISQ       chi-square goodness-of-fit test for one-way tables; Pearson chi-square, likelihood-ratio chi-square, and Mantel-Haenszel chi-square tests for two-way tables

    COMOR       confidence limits for the common odds ratio for h×2×2 tables; common odds ratio test

    FISHER      Fisher's exact test

    JT          Jonckheere-Terpstra test

    KAPPA       test for the simple kappa coefficient

    LRCHI       likelihood-ratio chi-square test

    MCNEM       McNemar's test

    MEASURES    tests for the Pearson correlation and the Spearman correlation, and the odds ratio confidence limits for 2×2 tables

    MHCHI       Mantel-Haenszel chi-square test

    OR          confidence limits for the odds ratio for 2×2 tables

    PCHI        Pearson chi-square test

    PCORR       test for the Pearson correlation coefficient

    SCORR       test for the Spearman correlation coefficient

    TREND       Cochran-Armitage test for trend

    WTKAP       test for the weighted kappa coefficient




    McNemar Test


    Determines whether the row and column marginal frequencies are equal.

    Uses matched pairs of labels, say (A,B).

    Tests whether the pair (A,B) is as likely as (B,A).

    Used when dichotomous outcomes are recorded twice for each patient under different conditions (e.g. different treatments or different measurement times).




    Assumptions

    Data consist of paired observations of labels (A,B).

    Applied to 2×2 contingency tables with a dichotomous trait and matched pairs of subjects.

    Used only when the conditions for the normal approximation apply.



    SAS Implementation

    PROC FREQ with the AGREE option in the TABLES statement.

    The output gives a two-tailed chi-square p-value. A one-tailed p-value can be obtained by halving it.


    Example: comparing response rates (e.g. normal and abnormal laboratory results for a group of patients, collected before and after the study) when patients are treated with a particular drug, say A. Here we need to test whether there is a change in the pre- to post-treatment rates of abnormalities.

    Suppose the following program has been run, where the aim is to compare response rates (yes/no) of cases and controls.
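
    A minimal sketch of the kind of program this refers to, written for the pre- and post-treatment laboratory example. The data set pairs, the variables pre, post and count, and all the counts are hypothetical and made up for illustration; only the AGREE option and the paired layout come from the text.

```sas
/* Hypothetical counts of matched pairs (made-up values for illustration) */
data pairs;
   input pre $ post $ count;
   datalines;
abnormal abnormal 12
abnormal normal   15
normal   abnormal  5
normal   normal   28
;
run;

proc freq data=pairs;
   weight count;              /* each row stands for 'count' patients */
   tables pre*post / agree;   /* AGREE requests McNemar's test        */
run;
```

    McNemar's test uses only the two discordant cells (abnormal to normal, and normal to abnormal), so those are the counts that drive the p-value.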



    Log-Rank Test


    Used for comparing distributions of the time until an event of interest occurs (e.g. death, cure, failure, relapse) among independent groups.

    Used to test the null hypothesis that there is no difference between the populations in the probability of an event at any time point.

    Used when the Wilcoxon test fails (i.e. its condition on censoring is not satisfied).

    Most likely to detect a difference between groups when the risk of an event is consistently greater for one group than another.

    Equivalent to applying the CMH test with each event time point as a stratum.




    Assumptions

    Censoring is unrelated to prognosis.

    Survival probabilities are the same for subjects recruited early and late in the study, and the events happened at the times specified.

    Requires no assumption regarding the distribution of event times.



    SAS Implementation

    Proc lifetest

    Output shows Chi-Square p-value.

    PROC LIFETEST < options > ;

    TIME variable < *censor(list) > ;

    BY variables ;

    FREQ variable ;

    ID variables ;

    STRATA variable < (list) > < ... variable < (list) > > ;

    SURVIVAL options ;

    TEST variables ;




    The TIME statement indicates the failure time variable, where variable is the name of the failure time variable. It can optionally be followed by an asterisk, the name of the censoring variable, and a parenthetical list of values that correspond to right censoring. The censoring values should be numeric, nonmissing values.

    The BY statement is used with PROC LIFETEST to obtain separate analyses on observations in groups defined by the BY variables.

    The variable in the FREQ statement identifies a variable containing the frequency of occurrence of each observation.

    The ID variable values are used to label the observations of the product-limit survival function estimates.



    The STRATA statement indicates which variables determine strata levels for the computations. The strata are formed according to the nonmissing values of the designated strata variables.
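
    A minimal sketch of a log-rank comparison with the TIME and STRATA statements, under stated assumptions: the data set surv, the variables trt, weeks and censor, and all the values are hypothetical and made up for illustration. Here censor=1 marks a right-censored time.

```sas
/* Hypothetical follow-up data (made-up values); censor=1 marks censored times */
data surv;
   input trt $ weeks censor @@;
   datalines;
A 6 0  A 7 0  A 10 1  A 13 0  A 16 0  A 22 1
B 1 0  B 2 0  B 3 0   B 5 0   B 8 0   B 11 1
;
run;

proc lifetest data=surv;
   time weeks*censor(1);   /* values listed in parentheses are right-censored */
   strata trt;             /* compares the survival curves across levels of trt */
run;
```

    With a STRATA statement the procedure reports tests of equality over strata, and the log-rank chi-square p-value appears among them.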

    Options available with the STRATA statement

    MISSING           allows missing values as a valid stratum level.

    GROUP=variable    specifies the variable whose formatted values identify the various samples whose underlying survival curves are to be compared.

    NODETAIL          suppresses the display of the rank statistics and the corresponding covariance matrices for various strata.

    NOTEST            suppresses the k-sample tests, stratified tests, and trend tests.

    TREND             computes the trend tests for testing the null hypothesis that the k population hazard rates are the same versus an ordered alternative.

    TEST=(list)       enables you to select the weight functions for the k-sample tests, stratified tests, or trend tests. You can specify a list containing one or more of the following keywords
