Anova Glm

8/10/2019 Anova Glm

COMP-STAT Group

Analysis of Variance and Covariance

Level : E2

Contents

Key concepts

Analysis of Variance

Analysis of Covariance

GLM Procedure

Analysis of Variance

ANOVA

Used to uncover the main and interaction effects of categorical independent variables (called "factors") on one or more interval-scaled dependent variables.

Example:

An experiment may measure weight change (the dependent variable) for men and women who participated in two different weight-loss programs. The 4 cells of the design are formed by the 4 combinations of sex (men, women) and program (A, B).

What goes wrong if we use the t-test with three or more means?

Example:

Suppose we have three means A, B and C, and we want to test

H0 : A = B = C

against H1 : at least one of them differs from the others.

If we apply the t-test repeatedly to each pair of means, the chance of at least one false rejection (Type I error) across the set of tests grows beyond the nominal level of any single test.
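As a rough illustration of how the error rate grows, assume (as a simplification, since pairwise tests on the same data are not truly independent) that m tests are each run at level α and are independent. The probability of at least one false rejection is then:

```latex
\alpha_{\text{family}} = 1 - (1 - \alpha)^{m},
\qquad
m = 3,\ \alpha = 0.05:\quad 1 - 0.95^{3} \approx 0.143
```

Three means give three pairwise comparisons, so under this assumption the nominal 5% level becomes roughly 14%.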

Assumptions

The dependent variable is measured on an equal-interval scale.

The k samples are independently and randomly drawn from the source population(s).

The source population(s) can reasonably be supposed to have a normal distribution.

The k samples have approximately equal variances.

Main Effect

The effect of a particular factor on the response, averaged over the levels of the other factors.

Interaction Effect

The effect of one factor differs according to the level of another factor.

The key statistic in ANOVA is the F-test of difference of group means, testing whether the means of the groups formed by values of the independent variable (or combinations of values for multiple independent variables) are different enough not to have occurred by chance.

ANOVA focuses on F-tests of significance of differences in group means. If one has a complete enumeration rather than a sample, then any difference of means is "real."

However, when ANOVA is used for comparing two or more different samples, the real means are unknown. The researcher wants to know if the difference in sample means is enough to conclude that the real means do in fact differ among two or more groups.

If the group means do not differ significantly, it is inferred that the independent variable(s) did not have an effect on the dependent variable.

If the F test shows that overall the independent variable(s) is (are) related to the dependent variable, then multiple comparison tests of significance are used to explore just which values of the independent variable(s) have the most to do with the relationship.
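For a one-way layout with k groups and N observations in total, the F statistic described here can be written as follows. The notation is a standard choice and not taken from the slides: n_i and \bar{y}_i are the size and mean of group i, and \bar{y} is the grand mean.

```latex
F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \,/\, (k - 1)}
         {\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N - k)}
```

Under H0 and the ANOVA assumptions, F follows an F distribution with k-1 and N-k degrees of freedom, so a large F points to group means that differ by more than chance would explain.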

Post-hoc Comparisons

If the null hypothesis in ANOVA is rejected, proceed to multiple comparison (post-hoc) tests.

The most common tests are

Least significant difference (LSD)

Duncan

Dunnett

Tukey's honestly significant difference (HSD)

Bonferroni, Scheffé

Suppose we are testing the null hypothesis that the four population means are equal,

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$

$H_1$: not all of $\mu_1, \mu_2, \mu_3, \mu_4$ are equal

and that this hypothesis is rejected.

The F test in ANOVA tells us that at least one mean differs from the others, but it does not specify which one.

One possible way to detect which particular mean is different is to conduct the six pairwise tests of $H_0: \mu_i = \mu_j$, one for each pair $i < j$.

Unbalanced Designs

A design is unbalanced if the sample sizes for the treatment combinations are not all equal.

Unbalanced designs can cause confounding.

Confounding is the condition in which the effects of two (or more) explanatory variables cannot be distinguished from each other.

Types of Sum of Squares

There are four types of sums of squares: Type I, Type II, Type III and Type IV.

Type II sums of squares are the reduction in SSE due to adding the effect to a model that contains all other effects except those that contain the effect being tested.

Type III sums of squares are each adjusted for all other effects in the model.

If the model does not contain any interaction term, Type II and Type III lead to the same output.

For the highest-order interaction term, the two methods always give the same result.

If the interaction can safely be ignored, Type II gives a more powerful test of the main effects than Type III.

If there is not sufficient reason to ignore interactions, use Type III. This is the default type in most statistical software.
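To see how the types differ in practice, PROC GLM can print several of them side by side. A minimal sketch, assuming the unbalanced 2-by-2 data used in the GLM example of this deck; the data set name exp_ss is an arbitrary choice for illustration.

```sas
/* Unbalanced 2-by-2 data, copied from the GLM example in this deck */
data exp_ss;
   input A $ B $ Y @@;
   datalines;
A1 B1 12 A1 B1 14 A1 B2 11 A1 B2 9
A2 B1 20 A2 B1 18 A2 B2 17
;
run;

/* SS1, SS2 and SS3 on the MODEL statement request
   Type I, Type II and Type III sums of squares */
proc glm data=exp_ss;
   class A B;
   model Y = A B A*B / ss1 ss2 ss3;
run;
quit;
```

Because A*B is the highest-order term and is entered last, its row should be identical under all three types, while the rows for A and B may differ because the cell sizes are unequal.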

SAS Implementation

proc anova data = hhh;
   class treat;
   model weight = treat;
run;

PROC ANOVA takes into account the special structure of a balanced design, so it is faster and uses less storage than PROC GLM for balanced data, whereas the GLM procedure can analyze both balanced and unbalanced data.

The classification variable is specified in the CLASS statement.

The MODEL statement names the dependent variables and independent effects.

Example

title1 'Nitrogen Content of Red Clover Plants';
data Clover;
   input Strain $ Nitrogen @@;
   datalines;
1 19.4 1 32.6 1 27.0 1 32.1 1 33.0
5 17.7 5 24.8 5 27.9 5 25.2 5 24.3
4 17.0 4 19.4 4 9.1 4 11.9 4 15.8
7 20.7 7 21.0 7 20.5 7 18.8 7 18.6
13 14.3 13 14.4 13 11.8 13 11.6 13 14.2
15 17.3 15 19.4 15 19.1 15 16.9 15 20.8
;

proc anova data = Clover;
   class Strain;
   model Nitrogen = Strain;
run;

Results and interpretation

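Dependent Variable: Nitrogen

Source            DF   Sum of Squares   Mean Square    F Value   Pr > F
Model              5       847.046667    169.409333      14.37   <.0001
Error             24       282.928000     11.788667
Corrected Total   29      1129.974667

Source            DF         Anova SS   Mean Square    F Value   Pr > F
Strain             5      847.0466667   169.4093333      14.37   <.0001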

The degrees of freedom (DF) column should be used to check the analysis results. The Model degrees of freedom for a one-way analysis of variance are the number of levels minus 1; in this case, 6-1=5. The Corrected Total degrees of freedom are always the total number of observations minus one; in this case, 30-1=29. The Model and Error degrees of freedom sum to the Corrected Total degrees of freedom.

The overall F test is significant (F=14.37, p<0.0001).
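A significant overall F test only says that some strain means differ. As a hedged sketch of a follow-up, the MEANS statement of PROC ANOVA can request Tukey's multiple comparisons; the code assumes the Clover data set created in the example above.

```sas
/* Assumes the Clover data set created in the example above */
proc anova data=Clover;
   class Strain;
   model Nitrogen = Strain;
   /* Tukey's HSD comparisons of all pairs of strain means */
   means Strain / tukey;
run;
quit;
```

Other options on the MEANS statement, such as BON or SCHEFFE, request other post-hoc methods in the same way.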

Analysis of Covariance

A combination of linear regression and ANOVA.

If a continuous variable can affect the dependent variable and we want to control for it, we use ANCOVA in place of ANOVA. That is, in experimental designs, ANCOVA controls for factors that cannot be randomized but can be measured on an interval scale.

Example: In some studies, baseline values are a variable we need to control for in order to examine the significance of categorical predictors.

When covariate scores are available, we have information about differences between treatment groups that existed before the experiment was performed, and we want to control for those differences.

As a general rule, a very small number of covariates is best. Covariates should be:

Correlated with the dependent variable.

Not correlated with each other (to avoid multicollinearity).

Data on covariates should be gathered before the treatment is administered.

Failure to do this often means that some portion of the effect of the predictor is removed from the dependent variable when the covariate adjustment is calculated.

The rules for sums of squares and so on remain as they were for ANOVA.
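An ANCOVA of this kind can be fitted with PROC GLM by listing the covariate on the MODEL statement but not on the CLASS statement. The following is a minimal sketch; the data set ancova_demo, its variables and its values are hypothetical and serve only to make the example runnable.

```sas
/* Hypothetical data: treatment group, baseline (pre) and follow-up (post) scores */
data ancova_demo;
   input trt $ pre post @@;
   datalines;
A 10 14 A 12 15 A 11 13 A 14 18
B 10 17 B 13 20 B 12 18 B 15 22
;
run;

proc glm data=ancova_demo;
   class trt;
   /* pre is not in CLASS, so it enters as a continuous covariate */
   model post = trt pre / solution;
   /* treatment means adjusted to the average baseline value */
   lsmeans trt / stderr pdiff;
run;
quit;
```

The test for trt in this model compares the groups after adjusting for the baseline score, which is the control for pre-existing differences described above.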

GLM Procedures

The general linear model (GLM) is a statistical linear model. It may be written as

Y = XB + U

where Y is a matrix with a series of multivariate measurements, X is a matrix that might be a design matrix, B is a matrix containing parameters that are usually to be estimated, and U is a matrix containing residuals (i.e., errors or noise). The residual is usually assumed to follow a multivariate normal distribution. If the residual does not follow a multivariate normal distribution, generalized linear models may be used to relax assumptions about Y and U.

The GLM procedure uses the method of least squares to fit general linear models.

GLM handles models relating one or several continuous dependent variables to one or several independent variables. The independent variables may be either classification variables, which divide the observations into discrete groups, or continuous variables.

Thus, the GLM procedure can be used for many different analyses, including

simple regression

multiple regression

analysis of variance (ANOVA), especially for unbalanced data

analysis of covariance (ANCOVA)

response-surface models

weighted regression

polynomial regression

partial correlation

multivariate analysis of variance (MANOVA)

repeated measures analysis of variance

SAS GLM procedure

PROC GLM DATA = SAS data-set;

CLASS variables;

MODEL dependents = independents ;

MEANS effects ;

LSMEANS effects ;

OUTPUT OUT = SAS data-set keyword = variable... ;

RUN;QUIT;

PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables.

CLASS specifies classification variables for the analysis.

MODEL specifies dependent and independent variables for the analysis.

MEANS computes means of the dependent variable for each value of the specified effect.

LSMEANS produces means for the outcome variable, broken out by the variable specified and adjusted for any other explanatory variables included on the MODEL statement.

LSMEANS can also be used for multiple comparison tests.

OUTPUT specifies an output data set that contains all variables from the input data set and variables representing statistics from the analysis.

Example

title 'Analysis of Unbalanced 2-by-2 Factorial';
data exp;
   input A $ B $ Y @@;
   datalines;
A1 B1 12 A1 B1 14 A1 B2 11 A1 B2 9
A2 B1 20 A2 B1 18 A2 B2 17
;

proc glm;
   class A B;
   model Y = A B A*B;
run;

Result

Analysis of Unbalanced 2-by-2 Factorial
The GLM Procedure
Dependent Variable: Y

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3      91.71428571   30.57142857     15.29   0.0253
Error              3       6.00000000    2.00000000
Corrected Total    6      97.71428571

R-Square   Coeff Var   Root MSE   Y Mean
0.938596   9.801480    1.414214   14.42857

Source   DF     Type I SS   Mean Square   F Value   Pr > F
A         1   80.04761905   80.04761905     40.02   0.0080
B         1   11.26666667   11.26666667      5.63   0.0982
A*B       1    0.40000000    0.40000000      0.20   0.6850

Source   DF   Type III SS   Mean Square   F Value   Pr > F
A         1   67.60000000   67.60000000     33.80   0.0101
B         1   10.00000000   10.00000000      5.00   0.1114
A*B       1    0.40000000    0.40000000      0.20   0.6850

Interpretation

The degrees of freedom may be used to check your data. The Model degrees of freedom for a 2×2 factorial design with interaction are (ab-1), where a is the number of levels of A and b is the number of levels of B; in this case, (2×2-1) = 3. The Corrected Total degrees of freedom are always one less than the number of observations used in the analysis; in this case, 7-1=6.

The overall F test is significant (F=15.29, p=0.0253), indicating strong evidence that the means for the four different A×B cells are different. You can further analyze this difference by examining the individual tests for each effect.

Four types of estimable functions of parameters are available for testing hypotheses in PROC GLM. For data with no missing cells, the Type III and Type IV estimable functions are the same and test the same hypotheses that would be tested if the data were balanced. Type I and Type III sums of squares are typically not equal when the data are unbalanced; Type III sums of squares are preferred in testing effects in unbalanced cases because they test a function of the underlying parameters that is independent of the number of observations per treatment combination.

At a significance level of 5%, the A*B interaction is not significant (F=0.20, p=0.6850). This gives no evidence that the effect of A depends on the level of B, or vice versa. Therefore, the tests for the individual effects are valid, showing a significant A effect (F=33.80, p=0.0101) but no significant B effect (F=5.00, p=0.1114).
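The LSMEANS and OUTPUT statements from the PROC GLM syntax can be added to this analysis to examine the significant A effect and to save diagnostics. A minimal sketch, assuming the exp data set created in the example above; the output data set name diag and the variable names pred and resid are hypothetical.

```sas
/* Assumes the exp data set created in the example above */
proc glm data=exp;
   class A B;
   model Y = A B A*B;
   /* least-squares means for A, with a p-value for their difference */
   lsmeans A / pdiff;
   /* save predicted values and residuals for later checking */
   output out=diag p=pred r=resid;
run;
quit;
```

Because the design is unbalanced, the least-squares means for A can differ from the raw means that a MEANS statement would report.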

QUESTIONS ? ?

NONPARAMETRIC TESTS

Key Concepts

Non-Parametric Tests

Mann-Whitney Test

Kruskal-Wallis Test

Friedman Test

McNemar Test

Log - Rank Test

Parametric Vs. Non-Parametric Tests

Parametric

These methods require an assumption about the distribution from which the samples are drawn.

They require a sufficiently large sample size.

Nonparametric

These methods require no assumption about the distribution from which the samples are drawn; that is, they are distribution-free tests.

They should be used when the sample size is small.

Mann-Whitney Wilcoxon Test

Introduction

Test for comparing two populations.

Used to test the null hypothesis that two independent samples have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (mean or median). That is, it is used to make inferences about the population mean or median without requiring the assumption of normality.

Used as an alternative to the two-sample t-test when the normality assumption is not satisfied.

Applied when the observations in a sample are ranks, that is, ordinal data rather than direct measurements.

Assumptions

The two samples are randomly and independently drawn.

The dependent variable is continuous, capable of producing measures carried out to the nth decimal place.

Measures within the two samples have the properties of at least an ordinal scale of measurement, so that it is meaningful to speak of "greater than," "less than," and "equal to."

Data can be ranked, including tied rank values wherever appropriate. Ranks help to focus only on the ordinal relationships among the raw measures: "greater than," "less than," and "equal to."

The two population distributions differ only by a small shift in location.

Proc npar1way wilcoxon

In general, PROC NPAR1WAY performs an analysis of variance (option ANOVA), tests for location differences (options WILCOXON, MEDIAN, SAVAGE, and VW), and performs empirical distribution function tests (option EDF). The general syntax is

PROC NPAR1WAY < options > ;

BY variables ;

CLASS variable ;

EXACT < / computation-options > ;

FREQ variable ;

OUTPUT < OUT=SAS-data-set > < WILCOXON > ;

VAR variables ;

RUN;

The BY statement requests separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.

The CLASS variable identifies groups (or samples) in the data. The variable can be character or numeric.

The FREQ statement names a numeric variable that provides a frequency for each observation in the DATA= data set.

The VAR statement names the response or dependent variables to be analyzed. These variables must be numeric. If the VAR statement is omitted, the procedure analyzes all numeric variables in the data set except for the CLASS variable, the FREQ variable, and the BY variables.

OUT=SAS-data-set names the output data set.

Computation-options are:

Option: Description

ALPHA=value: specifies the level of the confidence limits for Monte Carlo p-value estimates. The value of the ALPHA= option must be between 0 and 1, and the default is 0.01, which produces 99% confidence limits for the Monte Carlo estimates.

MAXTIME=value: specifies the maximum clock time (in seconds) that PROC NPAR1WAY can use to compute an exact p-value. If the procedure does not complete the computation within the specified time, the computation terminates.

MC: requests Monte Carlo estimation of exact p-values, instead of direct exact p-value computation. Monte Carlo estimation can be useful for large problems that require a great amount of time and memory for exact computations.

N=n: specifies the number of samples for Monte Carlo estimation. The value of the N= option must be a positive integer, and the default is 10,000 samples. Larger values of n produce more precise estimates of exact p-values.

POINT: requests exact point probabilities for the test statistics.

SEED=number: specifies the initial seed for random number generation for Monte Carlo estimation. The value of the SEED= option must be an integer.

Examples

Global evaluations of drug A and drug B in back pain: In a treatment, it was found that patients with low back pain experienced a decrease in pain after 6 to 8 weeks of daily treatment. A study was therefore conducted to determine whether this phenomenon is a drug-related response or coincidental. Patients were asked to provide a global rating of their pain, relative to baseline, on the following scale

To test this, we use the Mann-Whitney test.
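A test of this kind might be run along the following lines. This is only a sketch: the data set backpain, the variable names and the ratings are hypothetical and are not the study data, and the rating is assumed to be coded so that higher values mean more improvement.

```sas
/* Hypothetical global ratings (higher = more improvement); not data from the study */
data backpain;
   input drug $ rating @@;
   datalines;
A 4 A 5 A 3 A 4 A 5 A 2
B 2 B 3 B 1 B 3 B 2 B 4
;
run;

proc npar1way data=backpain wilcoxon;
   class drug;
   var rating;
   /* Monte Carlo estimate of the exact p-value, using the MC, N= and SEED= options */
   exact wilcoxon / mc n=10000 seed=12345;
run;
```

With two CLASS levels, the WILCOXON option gives the Wilcoxon rank-sum test, which is equivalent to the Mann-Whitney test. Fixing SEED= makes the Monte Carlo estimate reproducible.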

Kruskal - Wallis Test

Introduction

Analogue of one-way ANOVA without the assumption of normality.

Extension of the Wilcoxon test to more than two groups.

Used to compare population location parameters among two or more groups based on independent samples.

Used to test the null hypothesis that all populations have identical distribution functions against the alternative hypothesis that at least two of the populations differ only with respect to location.

Assumptions

Same as for the Wilcoxon test.
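In SAS, the Kruskal-Wallis test is obtained from PROC NPAR1WAY with the WILCOXON option when the CLASS variable has more than two levels. A minimal sketch, assuming the Clover data set from the ANOVA example earlier in this deck:

```sas
/* Assumes the Clover data set (Strain, Nitrogen) from the ANOVA example */
proc npar1way data=Clover wilcoxon;
   /* six strains, so the output includes the Kruskal-Wallis chi-square test */
   class Strain;
   var Nitrogen;
run;
```

Running it on the same data as the one-way ANOVA makes it easy to compare the rank-based result with the F test.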

Friedman Test

Introduction

Models the ratings of n judges (rows) on k treatments (columns).

A generalization of the sign test and the Spearman rank correlation test: it reduces to the sign test if there are two columns and to the Spearman rank correlation test if there are two rows.

Also called two-way analysis of variance by ranks, as it is used for two-way repeated measures analysis of variance by ranks.

Used to test the null hypothesis that all treatments have identical effects against the alternative hypothesis that at least one treatment is different from at least one other treatment.

Assumptions

There are k experimental treatments, k ≥ 2.

The n rows are mutually independent (i.e. results within one row do not affect the results within other rows).

Data can be meaningfully ranked.

SAS Implementation

PROC FREQ with the CMH2 option in the TABLES statement.
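As a sketch of this approach, the judges are treated as strata and the scores are ranked within each judge. The data set ratings, its variable names and its values are hypothetical.

```sas
/* Hypothetical ratings: 4 judges (blocks) each rate 3 treatments */
data ratings;
   input judge treatment $ score @@;
   datalines;
1 T1 7 1 T2 9 1 T3 5
2 T1 6 2 T2 8 2 T3 4
3 T1 8 3 T2 9 3 T3 6
4 T1 5 4 T2 7 4 T3 6
;
run;

proc freq data=ratings;
   /* stratify by judge; SCORES=RANK ranks the scores within each judge;
      NOPRINT suppresses the individual crosstabulation tables */
   tables judge*treatment*score / cmh2 scores=rank noprint;
run;
```

In the CMH output, the Friedman chi-square is the statistic labeled "Row Mean Scores Differ".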

Friedman Test

Syntax

PROC FREQ < options > ;
   BY variables ;
   EXACT statistic-options < / computation-options > ;
   OUTPUT < OUT=SAS-data-set > options ;
   TABLES requests < / options > ;
   TEST options ;
   WEIGHT variable ;
RUN;

where

BY calculates separate frequency or crosstabulation tables for each BY group.

EXACT requests exact tests for specified statistics.

OUTPUT creates an output data set that contains specified statistics.

TABLES specifies frequency or crosstabulation tables and requests tests and measures of association.

TEST requests asymptotic tests for measures of association and agreement.

WEIGHT identifies a variable with values that weight each observation.

Friedman Test

Options

AGREE: McNemar's test for 2×2 tables, simple kappa coefficient, and weighted kappa coefficient

BINOMIAL: binomial proportion test for one-way tables

CHISQ: chi-square goodness-of-fit test for one-way tables; Pearson chi-square, likelihood-ratio chi-square, and Mantel-Haenszel chi-square tests for two-way tables

COMOR: confidence limits for the common odds ratio for h×2×2 tables; common odds ratio test

FISHER: Fisher's exact test

JT: Jonckheere-Terpstra test

KAPPA: test for the simple kappa coefficient

LRCHI: likelihood-ratio chi-square test

MCNEM: McNemar's test

MEASURES: tests for the Pearson correlation and the Spearman correlation, and the odds ratio confidence limits for 2×2 tables

MHCHI: Mantel-Haenszel chi-square test

OR: confidence limits for the odds ratio for 2×2 tables

PCHI: Pearson chi-square test

PCORR: test for the Pearson correlation coefficient

SCORR: test for the Spearman correlation coefficient

TREND: Cochran-Armitage test for trend

WTKAP: test for the weighted kappa coefficient

McNemar Test

Introduction

Determines whether the row and column marginal frequencies are equal.

Uses matched pairs of labels, say (A,B).

Tests whether the pair (A,B) is as likely as (B,A).

Used when dichotomous outcomes are recorded twice for each patient under different conditions (e.g. different treatments or different measurement times).

Assumptions

Data consists of paired observations of labels (A,B).

Applied to 2x2 contingency tables with a dichotomous trait with matched pairs of subjects.

Used only when the conditions for the normal approximation apply.

SAS Implementation

PROC FREQ with the AGREE option in the TABLES statement.

The output gives a two-tailed chi-square p-value. A one-tailed p-value can be obtained by halving it.

Example

Comparing response rates (e.g. normal and abnormal) for a group of patients whose pre-study and post-study laboratory results are collected while they are treated with a particular drug, say A. (Here, we need to test whether there is a change in the pre- to post-treatment rates of abnormalities.)

Suppose the following program has been run, where the aim is to compare response rates (yes/no) of cases and controls.
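A minimal sketch of such a program, using hypothetical counts for matched case-control pairs. The data set name pairs, the variable names and the counts are illustrative assumptions, not the original program or data.

```sas
/* Hypothetical counts: each record is one cell of the 2x2 table of matched pairs */
data pairs;
   input case_resp $ control_resp $ count;
   datalines;
Yes Yes 30
Yes No  15
No  Yes  6
No  No  49
;
run;

proc freq data=pairs;
   /* count gives the number of pairs in each cell */
   weight count;
   /* AGREE requests McNemar's test for the 2x2 table */
   tables case_resp*control_resp / agree;
run;
```

McNemar's test uses only the discordant cells (Yes/No and No/Yes), since pairs with the same response in both members carry no information about a difference in rates.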

Log-Rank Test

Introduction

Used for comparing distributions of time until the occurrence of an event of interest (e.g. death, cure, failure, relapse) among independent groups.

Used to test the null hypothesis that there is no difference between the populations in the probability of an event at any time point.

Used when the Wilcoxon test fails (i.e. the censoring condition is not satisfied).

Most likely to detect a difference between groups when the risk of an event is consistently greater for one group than another.

Equivalent to applying the CMH test with each time point as a stratum.

Assumptions

Censoring is unrelated to prognosis.

Survival probabilities are the same for subjects recruited early and late in the study, and the events

happened at the times specified.

Requires no assumption regarding the distribution of event times.

SAS Implementation

Proc lifetest

Output shows Chi-Square p-value.

PROC LIFETEST < options > ;

TIME variable < *censor(list) > ;

BY variables ;

FREQ variable ;

ID variables ;

STRATA variable < (list) > < ... variable < (list) > > ;

SURVIVAL options ;

TEST variables ;

Run;

The TIME statement indicates the failure time variable, where variable is the name of the failure time variable. It can optionally be followed by an asterisk, the name of the censoring variable, and a parenthetical list of values that correspond to right censoring. The censoring values should be numeric, nonmissing values.

The BY statement is used with PROC LIFETEST to obtain separate analyses on observations in groups defined by the BY variables.

The variable in the FREQ statement identifies a variable containing the frequency of occurrence of each observation.

The ID variable values are used to label the observations of the product-limit survival function estimates.

The STRATA statement indicates which variables determine strata levels for the computations. The strata are formed according to the nonmissing values of the designated strata variables.

Options available with the STRATA statement

MISSING: allows missing values as a valid stratum level.

GROUP=variable: specifies the variable whose formatted values identify the various samples whose underlying survival curves are to be compared.

NODETAIL: suppresses the display of the rank statistics and the corresponding covariance matrices for various strata.

NOTEST: suppresses the k-sample tests, stratified tests, and trend tests.

TREND: computes the trend tests for testing the null hypothesis that the k population hazard rates are the same versus an ordered alternative.

TEST=(list): enables you to select the weight functions for the k-sample tests, stratified tests, or trend tests. You can specify a list containing one or more test keywords.
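Putting the TIME and STRATA statements together, a log-rank comparison of two groups might look like the sketch below. The data set surv, its variable names and its values are hypothetical.

```sas
/* Hypothetical survival data: time in months, status 1 = event, 0 = censored */
data surv;
   input group $ months status @@;
   datalines;
A 5 1 A 8 1 A 12 0 A 14 1 A 20 0
B 3 1 B 4 1 B 7 1 B 9 0 B 11 1
;
run;

proc lifetest data=surv;
   /* status = 0 marks right-censored observations */
   time months*status(0);
   /* one stratum per group; the tests of equality across strata
      include the log-rank test */
   strata group;
run;
```

With no TEST= option on the STRATA statement, the default tests are printed, and the log-rank chi-square and its p-value appear in the table that tests equality over strata.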
