38
1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

  • View
    218

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

1

Multivariate Analysis and Discrimination

EPP 245

Statistical Analysis of

Laboratory Data

Page 2: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

2

Cystic Fibrosis Data Set

• The 'cystfibr' data frame has 25 rows and 10 columns. It contains lung function data for cystic fibrosis patients (7-23 years old)

• We will examine the relationships among the various measures of lung function

Page 3: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

3

• age a numeric vector. Age in years.• sex a numeric vector code. 0: male, 1:female.• height a numeric vector. Height (cm).• weight a numeric vector. Weight (kg).• bmp a numeric vector. Body mass (% of

normal).• fev1 a numeric vector. Forced expiratory

volume.• rv a numeric vector. Residual volume.• frc a numeric vector. Functional residual

capacity.• tlc a numeric vector. Total lung capacity.• pemax a numeric vector. Maximum

expiratory pressure.

Page 4: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

4

Scatterplot matrices

• We have five variables and may wish to study the relationships among them

• We could separately plot the (5)(4)/2 = 10 pairwise scatterplots

• The graph matrix function in Stata

.graph matrix fev1 rv frc tlc pemax

Page 5: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

5

fev1

rv

frc

tlc

pemax

20

40

60

20 40 60

0

200

400

0 200 400

100

200

300

100 200 300

80

100

120

140

80 100 120 140

50

100

150

200

50 100 150 200

Page 6: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

6

Principal Components Analysis

• The idea of PCA is to create new variables that are combinations of the original ones.

• If x1, x2, …, xp are the original variables, then a component is a1x1 + a2x2 +…+ apxp

• We pick the first PC as the linear combination that has the largest variance

• The second PC is that linear combination orthogonal to the first one that has the largest variance, and so on

Page 7: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

7

Page 8: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

8

. pca fev1 rv frc tlc pemax

Principal components/correlation Number of obs = 25 Number of comp. = 5 Trace = 5 Rotation: (unrotated = principal) Rho = 1.0000

-------------------------------------------------------------------------- Component | Eigenvalue Difference Proportion Cumulative -------------+------------------------------------------------------------ Comp1 | 3.22412 2.33772 0.6448 0.6448 Comp2 | .886399 .40756 0.1773 0.8221 Comp3 | .478839 .133874 0.0958 0.9179 Comp4 | .344966 .279286 0.0690 0.9869 Comp5 | .0656798 . 0.0131 1.0000 --------------------------------------------------------------------------

Principal components (eigenvectors)

------------------------------------------------------------------------------ Variable | Comp1 Comp2 Comp3 Comp4 Comp5 | Unexplained -------------+--------------------------------------------------+------------- fev1 | -0.4525 0.2140 0.5539 0.6641 -0.0397 | 0 rv | 0.5043 0.1736 -0.2977 0.4993 -0.6145 | 0 frc | 0.5291 0.1324 0.0073 0.3571 0.7582 | 0 tlc | 0.4156 0.4525 0.6474 -0.4134 -0.1806 | 0 pemax | -0.2970 0.8377 -0.4306 -0.1063 0.1152 | 0 ------------------------------------------------------------------------------

Page 9: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

9

01

23

Eig

enva

lues

1 2 3 4 5Number

Scree plot of eigenvalues after pca

Page 10: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

10

-2-1

01

2S

core

s fo

r co

mpo

nent

2

-4 -2 0 2 4Scores for component 1

Score variables (pca)

Page 11: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

11

Fisher’s Iris Data

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are _Iris setosa_, _versicolor_, and _virginica_.

Page 12: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

12

. summarize

Variable | Obs Mean Std. Dev. Min Max-------------+-------------------------------------------------------- index | 150 75.5 43.44537 1 150 sepallength | 150 5.843333 .8280661 4.3 7.9 sepalwidth | 150 3.057333 .4358663 2 4.4 petallength | 150 3.758 1.765298 1 6.9 petalwidth | 150 1.199333 .7622377 .1 2.5-------------+-------------------------------------------------------- species | 0

. graph matrix sepallength sepalwidth petallength petalwidth

Page 13: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

13

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

4

6

8

4 6 8

2

3

4

2 3 4

0

5

10

0 5 10

0

1

2

3

0 1 2 3

Page 14: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

14

. pca sepallength sepalwidth petallength petalwidth

Principal components/correlation Number of obs = 150 Number of comp. = 4 Trace = 4 Rotation: (unrotated = principal) Rho = 1.0000

-------------------------------------------------------------------------- Component | Eigenvalue Difference Proportion Cumulative -------------+------------------------------------------------------------ Comp1 | 2.9185 2.00447 0.7296 0.7296 Comp2 | .91403 .767274 0.2285 0.9581 Comp3 | .146757 .126042 0.0367 0.9948 Comp4 | .0207148 . 0.0052 1.0000 --------------------------------------------------------------------------

Principal components (eigenvectors)

-------------------------------------------------------------------- Variable | Comp1 Comp2 Comp3 Comp4 | Unexplained -------------+----------------------------------------+------------- sepallength | 0.5211 0.3774 -0.7196 -0.2613 | 0 sepalwidth | -0.2693 0.9233 0.2444 0.1235 | 0 petallength | 0.5804 0.0245 0.1421 0.8014 | 0 petalwidth | 0.5649 0.0669 0.6343 -0.5236 | 0 --------------------------------------------------------------------

Page 15: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

15

-4-2

02

4S

core

s fo

r co

mpo

nent

2

-4 -2 0 2 4Scores for component 1

Score variables (pca)

Page 16: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

16

Multinomial Logit

• Logistic regression can only deal with binary categories

• One alternative that can deal with more than two categories is discriminant analysis. This comes in two flavors, (Fisher’s) Linear Discriminant Analysis or LDA and (Fisher’s) Quadratic Discriminant Analysis or QDA

• In each case we model the shape of the groups and provide a dividing line/curve

Page 17: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

17

• Unfortunately, Stata does not provide an LDA or QDA program

• An alternative is multinomial logit, in which logistic regression is used but with more than two categories.

Page 18: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

18

. use "C:\TD\CLASS\K30-2006\iris1.dta"

. encode species, generate(nspecies)

. mlogit nspecies sepallength sepalwidth petallength petalwidth

Iteration 0: log likelihood = -164.79184Iteration 1: log likelihood = -67.780459Iteration 2: log likelihood = -45.612546Iteration 3: log likelihood = -23.256068Iteration 4: log likelihood = -12.950564Iteration 5: log likelihood = -8.8076897Iteration 6: log likelihood = -7.0436566Iteration 7: log likelihood = -6.2945111Iteration 8: log likelihood = -6.0232697…………Iteration 22: log likelihood = -5.9492736Iteration 23: log likelihood = -5.9492736

Page 19: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

19

Multinomial logistic regression Number of obs = 150 LR chi2(8) = 317.69 Prob > chi2 = 0.0000Log likelihood = -5.9492736 Pseudo R2 = 0.9639

------------------------------------------------------------------------------ nspecies | Coef. Std. Err. z P>|z| [95% Conf. Interval]-------------+----------------------------------------------------------------versicolor | sepallength | -8.596623 975.8807 -0.01 0.993 -1921.288 1904.094 sepalwidth | -6.182755 . . . . . petallength | 16.82211 . . . . . petalwidth | 17.8852 9.742607 1.84 0.066 -1.209956 36.98036 _cons | 7.261185 25.70765 0.28 0.778 -43.12489 57.64726-------------+----------------------------------------------------------------virginica | sepallength | -11.06184 975.8824 -0.01 0.991 -1923.756 1901.632 sepalwidth | -12.86364 4.479563 -2.87 0.004 -21.64342 -4.083859 petallength | 26.2515 4.737208 5.54 0.000 16.96674 35.53626 petalwidth | 36.17133 . . . . . _cons | -35.37661 . . . . .------------------------------------------------------------------------------(nspecies==setosa is the base outcome)

.

Page 20: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

20

. test sepallength

( 1) [versicolor]sepallength = 0 ( 2) [virginica]sepallength = 0

chi2( 2) = 1.06 Prob > chi2 = 0.5885

. mlogit nspecies sepalwidth petallength petalwidthMultinomial logistic regression Number of obs = 150 LR chi2(6) = 316.32 Prob > chi2 = 0.0000Log likelihood = -6.6329197 Pseudo R2 = 0.9597

------------------------------------------------------------------------------ nspecies | Coef. Std. Err. z P>|z| [95% Conf. Interval]-------------+----------------------------------------------------------------versicolor | sepalwidth | -11.07503 2941.788 -0.00 0.997 -5776.873 5754.723 petallength | 12.326 3.840663 3.21 0.001 4.798441 19.85356 petalwidth | 23.4702 . . . . . _cons | -16.4182 23.99477 -0.68 0.494 -63.44709 30.61069-------------+----------------------------------------------------------------virginica | sepalwidth | -19.4511 2941.798 -0.01 0.995 -5785.269 5746.367 petallength | 20.20055 . . . . . petalwidth | 44.8998 10.70735 4.19 0.000 23.91378 65.88582 _cons | -66.94504 . . . . .------------------------------------------------------------------------------(nspecies==setosa is the base outcome)

Page 21: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

21

. test sepalwidth

( 1) [versicolor]sepalwidth = 0 ( 2) [virginica]sepalwidth = 0

chi2( 2) = 3.09 Prob > chi2 = 0.2128

. mlogit nspecies petallength petalwidth

Multinomial logistic regression Number of obs = 150 LR chi2(4) = 309.02 Prob > chi2 = 0.0000Log likelihood = -10.281754 Pseudo R2 = 0.9376

------------------------------------------------------------------------------ nspecies | Coef. Std. Err. z P>|z| [95% Conf. Interval]-------------+----------------------------------------------------------------versicolor | petallength | 18.61007 2659.161 0.01 0.994 -5193.25 5230.47 petalwidth | 23.61704 . . . . . _cons | -63.27827 13.61167 -4.65 0.000 -89.95664 -36.5999-------------+----------------------------------------------------------------virginica | petallength | 24.3646 2659.158 0.01 0.993 -5187.489 5236.218 petalwidth | 34.06374 3.755651 9.07 0.000 26.7028 41.42468 _cons | -108.5506 . . . . .------------------------------------------------------------------------------(nspecies==setosa is the base outcome)

Page 22: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

22

. test petallength

( 1) [versicolor]petallength = 0 ( 2) [virginica]petallength = 0

chi2( 2) = 6.23 Prob > chi2 = 0.0444

. test petalwidth

( 1) [versicolor]petalwidth = 0 ( 2) [virginica]petalwidth = 0 Constraint 1 dropped

chi2( 1) = 82.26 Prob > chi2 = 0.0000

. twoway (scatter petalwidth petallength if nspecies==1, mcolor(red)) (scatter petalwidth petallength if nspecies==2, mcolor(blue)) (scatter petalwidth petallength if nspecies==3, mcolor(black)), legend(off)

. graph export petals.wmf

Page 23: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

23

0.5

11.

52

2.5

Pet

al.W

idth

0 2 4 6 8Petal.Length

Page 24: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

24

Lymphoma Data Set

• Alizadeh et al. Nature (2000) “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling”

• We will analyze a subset of the data consisting of 61 arrays on patients with– 45 Diffuse large B-cell lymphoma (DLBCL/DL)– 10 Chronic lymphocytic leukaemia (CLL/CL)– 6 Follicular leukaemia (FL)

Page 25: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

25

Data Available

• The original Nature paper

• The expression data in the form of unbackground corrected log ratios. A common reference was always on Cy3 with the sample on Cy5 (array.data.txt and array.data.zip). 9216 by 61

• A file with array codes and disease status for each of the 61 arrays, ArrayID.txt

Page 26: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

26

Identify Differentially Expressed Genes

• We will assume that the log ratios are on a reasonable enough scale that we can use them as is

• For each gene, we can run a one-way ANOVA and find the p-value, obtaining 9,216 of them.

• Account for multiplicity by using a small p-value threshold

• Identify genes with small adjusted p-values

Page 27: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

27

Develop Classifier

• Reduce dimension with ANOVA gene selection or with PCA

• Use logistic regression of LDA

• Evaluate the four possibilities and their sub-possibilities with cross validation. With 61 arrays one could reasonable omit 10% or 6 at random

• This is difficult with Array Assist or Stata

Page 28: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

28

Differential Expression

• We can locate genes that are differentially expressed; that is, genes whose expression differs systematically by the type of lymphoma.

• To do this, we use lymphoma type to predict expression, and see when this is statistically significant.

• This can be done in Array Assist

Page 29: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

29

• It is almost equivalent to locate genes whose expression can be used to predict lymphoma type, this being the reverse process.

• If there is significant association in one direction there should logically be significant association in the other

• This will not be true exactly, but is true approximately

• We can also easily do the latter analysis using the expression of more than one gene using logistic regresssion, LDA, and QDA

Page 30: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

30

Page 31: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

31

Significant Genes

• There are 3845 out of 9261 genes that have significant p-values from the ANOVA of less than 0.05, compared to 463 expected by chance

• There are 2733 genes with FDR adjusted p-values less than 0.05

• There are only 184 genes with Bonferroni adjusted p-values less than 0.05

Page 32: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

32

Page 33: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

33

Logistic Regression

• We will use logistic regression to distinguish DLBCL from CLL and DLBCL from FL

• We will do this first by choosing the variables with the smallest overall p-values in the ANOVA

• We will then evaluate the results within sample and by cross validation

Page 34: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

34

Page 35: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

35

Number of Variables

DL/CL Errors

DL/FL Errors

1 7 4

2 7 4

3 5 5

4 0 3

5 0 2

6 0 0

Page 36: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

36

Evaluation of performance

• Within sample evaluation of performance like this is unreliable

• This is especially true if we are selecting predictors from a very large set

• One useful yardstick is the performance of random classifiers

Page 37: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

37

Number of Variables

Percent Errors

1 24.0%

2 24.7%

3 23.3%

4 24.7%

5 28.7%

6 24.7%

7 26.3%

8 23.6%

9 25.7%

10 24.3%

Left is CVperformance ofbest k variables

Random =25.4%

Page 38: 1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data

November 30, 2006 EPP 245 Statistical Analysis of Laboratory Data

38

Conclusion

• Logistic regression on the variables with the smallest p-values does not work very well

• This cannot be identified by looking at the within sample statistics

• Cross validation is a requirement to assess performance of classifiers