More On Preprocessing Javier Cabrera

Page 1: More On Preprocessing Javier Cabrera

More On Preprocessing

Javier Cabrera

Page 2: More On Preprocessing Javier Cabrera

Outline

1. Transform the data into a scale suitable for analysis.

2. Remove the effects of systematic and obfuscating sources of variation.

3. Identify discrepant observations.

Page 3: More On Preprocessing Javier Cabrera

Outline

Preprocessing => Quality of downstream analyses

1. Log transformation, $X \to \log(X)$: the variation of logged intensities may be less dependent on magnitude, logs reduce the skewness of highly skewed distributions, and taking logs improves variance estimation.

2. Other transformations: power transformations ($X \to X^{\lambda}$ for some $\lambda$ = 1/2, 1/3 or other). Amaratunga and Cabrera (2000), Tusher et al (2001).

3. Variance stabilizing transformations, $X \to \log(X + c)$: symmetrizing the spot intensity data and stabilizing their variances.
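As a rough illustration of the three transformations listed on this slide, the sketch below applies each one to simulated right-skewed intensities and compares the resulting skewness. The data are hypothetical lognormal draws, not values from the slides, and the function names and the shift `c = 10.0` are assumptions chosen only for the example.

```python
import math
import random


def skewness(values):
    """Skewness as the third standardized moment (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(x):
    """X -> log(X); requires positive intensities."""
    return [math.log(v) for v in x]


def power_transform(x, lam):
    """X -> X**lam, for example lam = 1/2 or 1/3."""
    return [v ** lam for v in x]


def shifted_log_transform(x, c):
    """X -> log(X + c); c is a shift chosen by the analyst."""
    return [math.log(v + c) for v in x]


# Hypothetical right-skewed intensities (assumption: lognormal, fixed seed).
rng = random.Random(0)
intensities = [rng.lognormvariate(5.0, 1.0) for _ in range(2000)]

skew_raw = skewness(intensities)
skew_sqrt = skewness(power_transform(intensities, 0.5))
skew_log = skewness(log_transform(intensities))
skew_shifted = skewness(shifted_log_transform(intensities, 10.0))

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  "
      f"log: {skew_log:.2f}  log(X+c): {skew_shifted:.2f}")
```

On data like these, the square root should reduce the skewness and the log should reduce it further. Real array intensities need not behave as cleanly.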

Page 4: More On Preprocessing Javier Cabrera

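Transformations

4. Rocke and Durbin (2001): arrays with replicate spots. Analogy: models used for estimating the concentration of an analyte:

   $X = \alpha + \mu e^{\eta} + \varepsilon$

   where $\alpha$ is the mean background, $\mu$ is the true expression level, and $\eta \sim N(0, \sigma_\eta^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ are normally distributed errors.

5. Durbin et al (2002): generalized log transformation:

   $X \to \log\left( (X - \alpha) + \sqrt{(X - \alpha)^2 + \sigma_\varepsilon^2 / S_\eta^2} \right)$

   - $\alpha$, $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ must be estimated.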
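A minimal sketch of the generalized log transformation, assuming the background `alpha` and the ratio term under the square root (called `c` here, a name introduced only for this example) have already been estimated. The estimation step is not shown on the slide, so it is not attempted here.

```python
import math


def glog(x, alpha, c):
    """Generalized log: log((x - alpha) + sqrt((x - alpha)**2 + c)).

    alpha: estimated mean background (assumed given).
    c:     positive constant under the square root (assumed given).
    """
    z = x - alpha
    return math.log(z + math.sqrt(z * z + c))


# Unlike log(x - alpha), glog is defined for background-corrected
# values at or below zero.
below_background = glog(90.0, alpha=100.0, c=25.0)

# For intensities far above background it behaves like log(2 * (x - alpha)).
high = glog(1.0e6, alpha=100.0, c=25.0)
approx = math.log(2.0 * (1.0e6 - 100.0))

print(below_background, high, approx)
```

The same quantity can be written as `asinh(z / sqrt(c)) + log(sqrt(c))`, which is a convenient cross-check.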

Page 5: More On Preprocessing Javier Cabrera

Power Transformations

$X \to X^{\lambda}$

- $\lambda$ must be estimated.
- Three criteria:
  - Equal variances: CV(gene variances)
  - Low skewness: mean(skewness)
  - No mean-variance correlation: correlation between mean and variance
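One way to read the three criteria is as diagnostics computed over a grid of candidate exponents. The sketch below is an assumption about how that might look, not the procedure used in the slides: it computes, for each exponent, the coefficient of variation of the gene variances, the mean per-gene skewness, and the correlation between gene means and gene variances. The simulated matrix (300 genes by 10 arrays with multiplicative noise) and all function names are hypothetical.

```python
import math
import random


def skewness(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def power_criteria(matrix, lam):
    """matrix: list of gene rows, each a list of positive intensities."""
    means, variances, skews = [], [], []
    for row in matrix:
        t = [v ** lam for v in row]
        m = sum(t) / len(t)
        means.append(m)
        variances.append(sum((v - m) ** 2 for v in t) / (len(t) - 1))
        skews.append(skewness(t))
    var_mean = sum(variances) / len(variances)
    var_sd = math.sqrt(
        sum((v - var_mean) ** 2 for v in variances) / (len(variances) - 1)
    )
    return {
        "cv_of_variances": var_sd / var_mean,   # equal variances
        "mean_skewness": sum(skews) / len(skews),  # low skewness
        "mean_var_corr": pearson(means, variances),  # no mean-variance link
    }


# Hypothetical data: the noise scales with the expression level.
rng = random.Random(1)
matrix = []
for _ in range(300):
    level = rng.uniform(100.0, 10000.0)
    matrix.append([level * math.exp(rng.gauss(0.0, 0.2)) for _ in range(10)])

criteria = {lam: power_criteria(matrix, lam) for lam in (1.0, 0.5, 1 / 3, 0.1)}
for lam, values in criteria.items():
    print(f"lambda={lam:.2f}", {k: round(v, 3) for k, v in values.items()})
```

How the three numbers are combined into a single choice of exponent is left open here, since the slide does not say.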

[Figure: Histogram of sqrt(xs); x-axis: sqrt(xs), y-axis: Frequency]

Page 6: More On Preprocessing Javier Cabrera

Example 1: Tissue Data

Tissue data: 3 treatments applied to mice tissue (A, B, C).

Arrays: Treatment A: 11, Treatment B: 11, Treatment C: 19

Genes: 3487 genes.

Gene expression matrix X: Dim(X)=100x41

    treatA.1  treatA.2  treatA.3  treatA.4  treatA.5  treatA.6  treatA.7  treatA.8  treatA.9 treatA.10 treatA.11 treatB.12 treatB.13
 1     3.706     3.900     3.877     3.769     3.654     3.805     3.661     3.878     4.213     3.989     3.877     3.797     3.743
 2     3.762     4.034     4.402     3.912     3.889     3.988     4.280     3.901     4.385     3.835     4.051     4.583     4.973
 3     4.140     4.114     4.182     4.200     4.117     4.029     4.200     4.137     4.344     4.122     3.989     4.273     4.368
 4     3.555     3.555     3.555     3.555     3.555     3.555     3.555     3.621     4.181     3.555     3.555     3.555     3.571
 5     4.228     4.152     3.828     4.216     3.889     3.923     3.912     4.102     4.273     3.858     4.031     4.144     3.976
 6     6.622     6.749     6.625     6.883     6.865     6.335     6.241     6.201     5.895     6.548     6.577     6.298     6.546
 7     7.322     7.437     7.523     7.267     7.586     7.562     7.238     7.294     6.812     7.557     7.370     7.497     6.834
 8     3.555     3.555     3.555     3.555     3.555     3.555     3.555     3.591     4.165     3.555     3.555     3.555     3.571
 9     4.756     4.605     4.935     4.295     4.510     4.571     4.396     4.804     4.639     5.239     4.402     4.502     4.248
10     4.468     4.306     4.483     4.396     4.432     4.008     4.475     4.357     4.344     4.208     4.147     4.227     4.436
...

Page 7: More On Preprocessing Javier Cabrera

[Figure panels: Raw Data; Equal 75pctl; Log Transformed; Quantile Normalized; Power Trans $(X - 3.60)^{-0.4}$]
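Two of the panel labels name normalizations rather than transformations. The sketch below shows one common reading of each: quantile normalization replaces every array's sorted values with the mean sorted profile, and "Equal 75pctl" is taken here to mean rescaling each array so that all arrays share the same 75th percentile. That second reading, the multiplicative rescaling, the simple index-based percentile, and the function names are all assumptions. The small example arrays are hypothetical, and ties are not handled.

```python
def percentile(values, q):
    """Simple order-statistic percentile (no interpolation)."""
    s = sorted(values)
    return s[int(q * (len(s) - 1))]


def quantile_normalize(arrays):
    """Give every array the same set of values: the mean sorted profile."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from smallest up
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def equalize_percentile(arrays, q=0.75):
    """Rescale each array so its q-th percentile equals the average one."""
    qs = [percentile(a, q) for a in arrays]
    target = sum(qs) / len(qs)
    return [[v * target / qv for v in a] for a, qv in zip(arrays, qs)]


# Hypothetical arrays (one list per array, one entry per gene).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.5],
    [3.0, 4.2, 6.0, 8.0],
]

qn = quantile_normalize(arrays)
eq = equalize_percentile(arrays)

print([sorted(a) for a in qn])          # identical sorted profiles
print([percentile(a, 0.75) for a in eq])  # matching 75th percentiles
```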

Page 8: More On Preprocessing Javier Cabrera

Gene selection for classification

- Left panel: PC2 vs PC1 plot, log transformation
- Right panel: PC2 vs PC1 plot, power transformation

[Figure: two scatter plots of Comp.2 against Comp.1, one per transformation]

Page 9: More On Preprocessing Javier Cabrera

Example 2: Khan et al (2001):

4 types of small round blue cell tumors (SRBC):

- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)

Training set = 63 (23 EWS, 20 RMS, 12 NB, 8 BL)

Testing set = 25 (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other)

Genes: of 6567 initial genes, 2308 were selected because they showed a minimal level of expression.

Subset A: Cells: 23 EWS and 20 RMS from the training set. The 100 most significant genes after performing a t-test.

Gene expression matrix X: Dim(X)=100x43

    EWS.T1   EWS.T2   EWS.T3   EWS.T4   EWS.T6   EWS.T7   EWS.T9  EWS.T11  EWS.T12  EWS.T13  EWS.T14  EWS.T15  EWS.T19   EWS.C8   EWS.C3   EWS.C2   EWS.C4   EWS.C6   EWS.C9
1    3.203    1.655    3.278    1.006    2.710    2.059    1.848    2.714    2.356    1.929    3.616    2.151    2.312    1.069    0.919    0.925    2.626    1.079    1.099
2    0.068    0.071    0.116    0.191    0.237    0.082    0.123    0.180    0.079    0.252    0.106    0.097    0.160    0.197    0.192    0.089    0.092    0.178    0.166
3    1.046    1.041    0.893    0.430    0.369    0.902    0.998    0.496    0.761    0.574    0.583    0.499    0.579    1.681    0.786    1.511    1.869    2.346    2.019
...

Page 10: More On Preprocessing Javier Cabrera

[Figure panels, each with arrays X44 to X60 on the x-axis: Raw Data; Equal 75pctl; Log Transformed; Quantile Normalized; Power Trans $-(X - 0.66)^{-0.04}$]

Page 11: More On Preprocessing Javier Cabrera

Judging the success of a normalization

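Two arrays: $\{Y_{g1}\}$ and $\{Y_{g2}\}$, $g = 1, \dots, G$.

Successful workflow => Arrays are monotonically related to each other.

Pearson’s correlation coefficient measures linearity rather than agreement:

$$\hat{\rho} = \frac{s_{12}}{s_1 s_2}$$

where, for $c = 1, 2$,

$$\bar{Y}_c = \frac{1}{G} \sum_{g=1}^{G} Y_{gc}, \qquad s_c^2 = \frac{1}{G} \sum_{g=1}^{G} (Y_{gc} - \bar{Y}_c)^2, \qquad s_{12} = \frac{1}{G} \sum_{g=1}^{G} (Y_{g1} - \bar{Y}_1)(Y_{g2} - \bar{Y}_2)$$

Concordance correlation coefficient:

$$\rho_c = \frac{2 s_{12}}{s_1^2 + s_2^2 + (\bar{Y}_1 - \bar{Y}_2)^2}$$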
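To make the difference between linearity and agreement concrete, here is a small sketch of both coefficients using moments with divisor G. The function names and the five-point example are illustrative assumptions. The second array is the first shifted by a constant, so the two arrays are perfectly linearly related but do not agree.

```python
import math


def moments(y1, y2):
    """Means, variances and covariance of two arrays, all with divisor G."""
    G = len(y1)
    m1, m2 = sum(y1) / G, sum(y2) / G
    v1 = sum((a - m1) ** 2 for a in y1) / G
    v2 = sum((b - m2) ** 2 for b in y2) / G
    c12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / G
    return m1, m2, v1, v2, c12


def pearson_correlation(y1, y2):
    _, _, v1, v2, c12 = moments(y1, y2)
    return c12 / math.sqrt(v1 * v2)


def concordance_correlation(y1, y2):
    m1, m2, v1, v2, c12 = moments(y1, y2)
    return 2.0 * c12 / (v1 + v2 + (m1 - m2) ** 2)


# Hypothetical arrays: y2 is y1 shifted upward by 2.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y2 = [v + 2.0 for v in y1]

print(pearson_correlation(y1, y2))      # 1.0: perfectly linear
print(concordance_correlation(y1, y2))  # 0.5: penalized for the shift
```

The shift leaves Pearson's coefficient at 1 but halves the concordance, because the squared difference in means enters the denominator.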

Page 12: More On Preprocessing Javier Cabrera

Judging the success of a normalization

Two arrays: $\{Y_{g1}\}$ and $\{Y_{g2}\}$.

Successful workflow => Arrays are monotonically related to each other.

- Spearman’s rank correlation coefficient:

$$\hat{\rho}_S = \frac{12 \sum_{g=1}^{G} \{R_{g1} - \frac{1}{2}(G+1)\}\{R_{g2} - \frac{1}{2}(G+1)\}}{G(G^2 - 1)}$$

where $R_{gi}$ is the rank of $Y_{gi}$ when the $\{Y_{gi}\}$ are ranked from 1 to G.
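The rank-based coefficient can be computed directly from the ranks centered at (G + 1)/2. The sketch below assumes there are no ties, since the slide does not say how ties would be ranked. The function names and the five-value arrays are made up for illustration.

```python
import math


def ranks(values):
    """Ranks from 1 to G (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman(y1, y2):
    """12 * sum of centered rank products / (G * (G**2 - 1))."""
    G = len(y1)
    r1, r2 = ranks(y1), ranks(y2)
    centre = (G + 1) / 2
    total = sum((a - centre) * (b - centre) for a, b in zip(r1, r2))
    return 12.0 * total / (G * (G * G - 1))


# Hypothetical arrays with distinct values.
y1 = [3.2, 1.1, 4.8, 2.0, 5.5]
y2 = [2.9, 1.5, 3.1, 4.0, 6.2]

print(ranks(y1), ranks(y2))  # [3, 1, 4, 2, 5] [2, 1, 3, 4, 5]
print(spearman(y1, y2))      # 0.7

# Any increasing transformation leaves the ranks, and so the
# coefficient, unchanged.
print(spearman(y1, [math.exp(v) for v in y1]))  # 1.0
```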

Page 13: More On Preprocessing Javier Cabrera

Concordance Map

Image Plot of Concordance Correlations:

       X44    X45    X46    X47    X48    X49    X50
X44  1.000  0.703  0.622  0.706  0.674  0.746  0.694
X45  0.703  1.000  0.702  0.679  0.784  0.710  0.788
X46  0.622  0.702  1.000  0.791  0.683  0.562  0.776
X47  0.706  0.679  0.791  1.000  0.691  0.607  0.760
X48  0.674  0.784  0.683  0.691  1.000  0.770  0.832
X49  0.746  0.710  0.562  0.607  0.770  1.000  0.727
X50  0.694  0.788  0.776  0.760  0.832  0.727  1.000

[Figure: image plot of the matrix above, arrays X44 to X50 on both axes, scale from 0.6 to 1.0]

Page 14: More On Preprocessing Javier Cabrera

Concordance Map

Image Plot of Concordance Correlations:

       X44    X45    X46    X47    X48    X49    X50
X44  1.000  0.756  0.622  0.700  0.695  0.813  0.698
X45  0.756  1.000  0.813  0.722  0.793  0.710  0.803
X46  0.622  0.813  1.000  0.789  0.753  0.655  0.826
X47  0.700  0.722  0.789  1.000  0.714  0.663  0.763
X48  0.695  0.793  0.753  0.714  1.000  0.779  0.834
X49  0.813  0.710  0.655  0.663  0.779  1.000  0.742
X50  0.698  0.803  0.826  0.763  0.834  0.742  1.000

[Figure: image plot of the matrix above, arrays X44 to X50 on both axes, scale from 0.65 to 0.95]

Page 15: More On Preprocessing Javier Cabrera

[Figure: Linear correlation; curves for Standard Normal, t dist (df=6), t dist (df=2); annotated with $\bar{Y}_1$, $\bar{Y}_2$ and $s_1^2$, $s_2^2$]

Page 16: More On Preprocessing Javier Cabrera

correlation

[Annotation: $\bar{Y}_1$, $\bar{Y}_2$ and $s_1^2$, $s_2^2$]

1.  If the distributional properties of the values change substantially during a normalization (e.g., the skewness is decreased), it is possible that the concordance correlation coefficients might increase, but this may only be an artificial improvement.

2. For microarrays that have been normalized by equating all the quantiles, the concordance correlation coefficient will be equal to Pearson’s correlation coefficient. This is because, after such a normalization, the quantiles of both samples are identical, so both means are equal and both variances are equal too.

3. Spearman’s rank correlation coefficient is equal to (a) Pearson’s correlation coefficient calculated on the ranks of the data (b) the concordance correlation coefficient calculated on the ranks of the data.
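Notes 2 and 3 can be checked numerically. The sketch below simulates two hypothetical arrays with different means and variances, quantile normalizes the pair by giving both the averaged sorted profile, and compares the coefficients. The simulated data, the helper names, and the no-ties assumption are all choices made for this illustration.

```python
import math
import random


def moments(y1, y2):
    G = len(y1)
    m1, m2 = sum(y1) / G, sum(y2) / G
    v1 = sum((a - m1) ** 2 for a in y1) / G
    v2 = sum((b - m2) ** 2 for b in y2) / G
    c12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / G
    return m1, m2, v1, v2, c12


def pearson(y1, y2):
    _, _, v1, v2, c12 = moments(y1, y2)
    return c12 / math.sqrt(v1 * v2)


def concordance(y1, y2):
    m1, m2, v1, v2, c12 = moments(y1, y2)
    return 2.0 * c12 / (v1 + v2 + (m1 - m2) ** 2)


def ranks(values):
    """Ranks from 1 to G (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def quantile_normalize_pair(y1, y2):
    """Both arrays receive the averaged sorted profile, placed by rank."""
    ref = [(a + b) / 2 for a, b in zip(sorted(y1), sorted(y2))]
    return [ref[r - 1] for r in ranks(y1)], [ref[r - 1] for r in ranks(y2)]


# Hypothetical arrays with unequal means and variances.
rng = random.Random(42)
y1 = [rng.gauss(8.0, 2.0) for _ in range(200)]
y2 = [0.7 * a + rng.gauss(1.0, 1.5) for a in y1]

q1, q2 = quantile_normalize_pair(y1, y2)
r1, r2 = ranks(y1), ranks(y2)

print("before:", pearson(y1, y2), concordance(y1, y2))  # concordance is lower
print("after: ", pearson(q1, q2), concordance(q1, q2))  # the two coincide
print("ranks: ", pearson(r1, r2), concordance(r1, r2))  # both equal Spearman
```

On the ranks, both arrays are permutations of 1 to G, so their means and variances match exactly. That is why the two coefficients agree there as well.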

Page 17: More On Preprocessing Javier Cabrera