Bioinformatik Ringvorlesung SS2005 - UMG · Abt. Medizinische Statistik - Biostatistik Wednesday. 20. January 2010. 8:30. Bioinformatik/System Biologie ... Tim Beißbarth Bioinformatik

UMG Georg-August-Universität GöttingenAbt. Medizinische Statistik - Biostatistik

Wednesday20. January 20108:30

Bioinformatik/System Biologie Wintersemester 2010

Prof. Dr. Tim Beißbarth

Tim Beißbarth Bioinformatik

Sources of Variation: Bias und Variance

“biased” “unbiased”

Little noise

Much noise

x


Raw data are not mRNA concentrations

• tissue contamination

• RNA degradation

• amplification efficiency

• reverse transcription efficiency

• Hybridization efficiency and specificity

• clone identification and mapping

• PCR yield, contamination

• spotting efficiency

• DNA support binding

• other array manufacturing related issues

• image segmentation

• signal quantification

• “background” correction


Sources of Variation for Microarray-Data

Normalization Error model

Systematic Stochastic

• amount of RNA in biopsy• efficiency of:

• RNA extraction• reverse transcription• labeling• photodetection

• PCR yield• DNA quality• spotting efficiency, spot size• cross-/unspecific hybridization• stray signal

• similar effect on many measurements

• corrections can be estimated from data

• Effects, on single spots• random effects cannot be

estimated, „noise“


Array 2Cy3 Cy5Array 1

Cy3 Cy5

median

Q3=75% Quantil

Q1=25% Quantil

Minimum

Maximum

Displaying Variability of Microarray-Data


Normalization via rescaling

boxplots

“Location and scale of different MA measurements should be (approximately) the same.“

• Location and scale are basic statistical concepts for data description:

Location normalization: corrects for spatial or dye bias

Scale normalization: homogenizes the variability across arrays

Normalized log-intensity ratios are given by

Mnorm = (M-location) / scale


Normalizing the Hybridization-Intensities

• Background Correction• Local Background Image Analysis• Global Background e.g. 5% Quantile

• Robust estimation of a “rescaling” Factor, e.g. Median of Differences

based on

• the majority of genes• Housekeeping genes• Spiked

in control genes

• There are many other normalization methods! Other methods:

• Lowess (aka loess)• Quantile• VSN


Median-centering

Log

Sig

nal,

cent

ered

at 0

• One of the simplest strategies is to bring all „Centers“ of the array data to the same level.

• Assumption, the majority of genes or the center should not change between conditions.

• the Median is used as a robust measure.

divide all expression measurements of each array by the Median.


Problems with Median-Centering

Log Green

Log

Red

Scatterplot

A = (Log Green + Log Red) / 2

M =

Log

Red

-Lo

g G

reen

M-A Plot

Median Centering is a global Method: there are no intensity dependent effects


A = (Log Green + Log Red) / 2

M =

Log

Red

-Lo

g G

reen

Lo(w)ess Normalization

Assumption: There is an intensity- dependent bias of the fold change,

and hence where δj is the “true“ log fold change for gene j. The true fold change distribution is approximately a zero-symmetric normal distribution.

Task: Find f, replace yj

by yj

- f(xj

).

)()(ˆ xfxf x

f

)(AfM

jjj xfy )(

The idea of local regression is that f

can be estimated locally at a point x

by a simple (and easy-to-fit) function fx

. For each point x, we then estimate f

by


Quantile Normalization

The basic idea of Quantile-Normalization is very simple:

„The Histograms of all Slides are made identical“

Tightens the idea of Median-Centering. Not only the 50%-Quantile is adjusted, but all

Quantiles.

Boxplot and QQ-plot after Quantile normalization


Sample A Sample B Sample C

Gen 1 100 200 140

Gen 2 10 40 270

Gen 3 100 120 70

Quantile-normalisation



Gen 1 100 200 140

Gen 2 10 40 270

Gen 3 100 120 70




100 Gen 1

200 Gen 1

140 Gen 1

10 Gen 2

40 Gen 2

270 Gen 2

100 Gen 3

120 Gen3

70 Gen3




10 Gen 2

40 Gen 2

70 Gen 3

100 Gen 1

120 Gen 3

140 Gen 1

100 Gen 3

200 Gen 1

270 Gen 2

Mittelwert

(10+40+70) / 3 =

40

(100+120+140) / 3 =

120

(100+200+270) / 3 =

190




40 Gen 2

40 Gen 2

40 Gen 3

120 Gen 1

120 Gen 3

120 Gen 1

190 Gen 3

190 Gen 1

190 Gen 2

Mittelwert

(10+40+70) / 3 =

40

(100+120+140) / 3 =

120

(100+200+270) / 3 =

190




Gen 1 120 (100)

190 (200)

120 (140)

Gen 2 40 (10)

40 (40)

190 (270)

Gen 3 190 (100)

120 (120)

40 (70)



−4−2

02

4

0.00.10.20.30.4

QQ-plot

Distribution 1

Distribution 2

Quantile Normalization


VSN: model and theory

• Huber et al. (2002) Bioinformatics, 18:S96–S104

• Model for measured probe intensity Rocke DM, Durbin B (2001) Journal of Computational Biology, 8:557–569

• log-transformation is replaced by a transformation (arcsinh) based on theoretical grounds.

• Estimation of transformation parameters (location, scale) based on ML paradigm and numerically solved by a least trimmed sum of squares regression.

• vsn–normalized data behaves close to the normal distribution


0 20000 40000 60000

8.0

8.5

9.0

9.5

10.0

11.0

raw scale

trans

form

ed s

cale

Tran

sfor

med

Sca

le –

f(x)

Original Scale - x

variance stabilizing transformations


variance stabilizing transformations

1( )v ( )

x

f x d uu

1.) constant variance (‘additive’) 2( ) sv u f u

2.) constant CV (‘multiplicative’) 2( ) logv u u f u

4.) additive and multiplicative

2 2 00( ) ( ) arsinh u uv u u u s f

s

3.) offset 20 0( ) ( ) log( )v u u u f u u


The two-component model

raw scale log scale

“additive” noise

“multiplicative” noise

B. Durbin, D. Rocke, JCB 2001


Fitting of an error model

iik ika aai

per-sample offset

ik

~ N(0, bi2s1

2)“additive noise”

bi

per-samplenormalization factor

bk

sequence-wiseprobe efficiency

ik ~ N(0,s22)

“multiplicative noise”

exp( )iik k ikb b b

ik ik ik ky a b x measured intensity = offset + gain

true abundance


2Yarsinh , (0, )ikik ki ki

i

a N cb

o maximum likelihood estimator: straightforward –

but sensitive to deviations from normality

o model holds for genes that are unchanged; differentially transcribed genes act as outliers.

o robust

variant of ML estimator, à la Least Trimmed Sum of Squares

regression.

o

works as long as <50% of genes are differentially transcribed

Parameter-Estimation


Least trimmed sum of squares regression

0 2 4 6 8

02

46

8

x

y

2n/2

( ) ( )i=1

( )i iy f x

minimize

-

least sum of squares-

least trimmed sum of squares

P. Rousseeuw, 1980s


The „glog“-Transformation

intensity-200 0 200 400 600 800 1000

- - - f(x) = log(x)

——— hσ

(x) = asinh(x/σ)

)2log( ))log((asinh(x) lim

)1log(xarsinh(x) 2

x

x

x


The „glog“-Transformation

Additive componentVariance:

multiplicative component P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

W. Huber et al., ISMB 2002


evaluation: effects of different data transformationsevaluation: effects of different data transformationsdiff

eren

ce r

ed-g

reen

rank(average)


Literatur, Links

• Bioconductor vignette for vsn. W.Huber http://www.maths.lth.se/help/R/.R/library/vsn/doc/vsn.pdf• A comparison of normalization methods for high density oligonucleotide array data based on variance and

bias. Bolstad, B., et al. Bioinformatics

19 2:185-193 (2003)

• A model for measurement error for gene expression analysis. D. Rocke, B. Durbin. Journal of Computational Biology, 8:557-569, 2001

• Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104 (ISMB 2002).

• Parameter estimation for the calibration and variance stabilization of microarray data. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Statistical Applications in Genetics and Molecular Biology 2003 Vol. 2: No. 1, Article 3

• Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. to appear in: Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2004).

• Interpretability and Data Transformations for Gene Expression Microarray Data. D. M. Rocke, W. Huber, B. Durbin, A. von Heydebreck, M. Vingron. submitted (2004).

• Statistical methods for identifying differentially expressed genes in microarray experiments. S. Dudoit, Y.H. Yang, T.P. Speed, and M.J. Callow. Statistica Sinica, 12:111-139, 2002.

http://www.maths.lth.se/help/R/.R/library/vsn/doc/vsn.pdf


Finding differential Genes

Treasur eor

Junk?


Cell type 1Cell type 2

points) ( mean

Question / HypothesisGene Variation can be

explained by random noise, or not?

DataExpression of Gene g

in several Samples

points) ( mean

1

2

DecisionReject Hypothesis if

0dd

d

Test StatisticsSubtract Mean of the

samples21 d

Short Introduction: Testing, Teststatistics


Overview

• Finding differential genes• T-test• WiIkoxon Test (signed Rank Test, Man-Whitney test)• Permutation tests (SAM)• Modified T-test• Empirical Bayes• Multiple Testing

• Komplex Experiment-Design• Linear Models


Cell type 1Cell type 2

1

2

d

2

d

1

T-Statistics


Differentielle Gene finden

• Wir haben in n

Experimenten jeweils die differentielle Genexpression zwischen zwei Bedingungen gemessen, z.B. WT/KO.

• Messwerte M-values für Gen g: x1

, …

, xn

• Hypothese H0 - die Gene sind nicht differentiell:

• Annahme – die Meßwerte sind Normalverteilt N(0

, 0

).

( ) 0E x

2221( )

2

x

f x e

- 3 - 2 - 1 0 1 2 3



• Bester Schätzer für : (Mittelwert)

• Bester Schätzer für : (Standardabweichung)

• Erwartete Abweichung von von :

Standard error of the mean

1

1 n

ii

x xn

2

1

( )

1

n

ii

x xs

n

x

SEMn

- 3 - 2 - 1 0 1 2 3



• Wir haben in n

Experimenten jeweils die differentielle Genexpression zwischen zwei Bedingungen gemessen, z.B. WT/KO.

• Messwerte M-values für Gen g: x1

, …

, xn

• Hypothese H0 - die Gene sind nicht differentiell:

• Annahme – die Meßwerte sind Normalverteilt N(0

, 0

).

( ) 0E x

2221( )

2

x

f x e

- 3 - 2 - 1 0 1 2 3


T-test

• Vergleiche beobachtete Abweichung von

mit erwarteter:

• t-Verteilung mit n-1 Freiheitsgraden.

• Berechne p-value.

• Wenn T(x)

außerhalb des Akzeptanzbereichs lehne H0 ab.

0( ) xT xSEM


Beispiel 2 – T-test 2

• Affymetrix Daten von Golub et al, 1998• 38 Tumor Gewebe:

• 27 acute lymphoblastic leukemia (ALL)• 11 acute myeloid leukemia (AML)

• 6817 Gene, 3051 nach filtern• Expressionswerte von Gen g in ALL x1

, …

, xn

und in AML y1

, …

, yn

.• Berechne gemeinsame Varianz:

• 2 Sample T-test:

t-Verteilung mit nx

+ny

-2 Freiheitsgraden.

1 1( , )

x yn n

x yT x ys

2 2 2

1 1

1 ( ) ( )2

yx nn

i ii ix y

s x x y yn n


T-test: Variante von Welch

• Erlaube verschiedene Varianzen in den beiden verschiedenen Stichproben.

• Wir nehmen an x ~ N(1

, 1

)

und y ~ N(2

, 2

).

• Testen Hypothese H0 : 1

= 2

.

22( , )

yx ssn m

x yT x y


Beispiel


Mehr Tests: Wilcoxon-Test (auch Man-Whitney Test)

• Nicht-parametrischer Test (keine Normalverteilungsannahme) zum Vergleichen von zwei empirischen Verteilungen.

• Berechne die Ränge der Werte aus beiden Messreihen:

• Die Teststatistik wird aus der Summe der Ränge berechnet: R1

=6• Für kleine Stichprobengrößen kann die Verteilung der Teststatistik exakt

berechnet werden (i.e. alle Möglichkeiten), für große Stichproben kann eine Approximation durch Normalverteilung benutzt werden.

• Vorteile: Nicht-parametrisch, robust gegen Ausreißer.• Nachteile: weniger Mächtig da keine Verteilungsannahmen.


Permutationstests

• Vorteile: kein statistisches Modell notwendig.• Nachteile: evtl. sehr rechenintensiv.


Problem beim T-test

• Problem:

• Es gibt sehr viele Gene (Tests) meistens aber nur sehr wenige Wiederholungen.

• Als Schätzer für 2

habe ich s2

benutzt evtl. zufällige Fehler bei s2.

• Beispiel: Gen g

wird mit den M-Werten 0.0011, 0.0012 und –0.0009 gemessen T~12, p~0.007.

• Merke: Bei sehr kleiner gemessener Varianz wird der Wert für T sehr groß.

• Fazit: Bei Microarray Experimenten nach Möglichkeit keinen Standard-T- test verwenden.

• Folgerung: Modifizierte T-Statistik nötig.


moderated T-statistics

• Beim T-test schätzen wir die Varianz für jedes Gen sg2

einzeln. Dies ist evtl. unstabil.

• Stattdessen versuchen wir nun die Varianz über alle Gene s02

oder Subgruppen von Genen zu schätzen und damit die Varianzschätzung zu korrigieren.

• Dieser „Fudge Factor“ kann noch durch Faktoren (, ) unterschiedlich gewichtet werden.

• Referenz: Efron/Tibshirani, Genet. Epidemiol., 2000• Software: R/Bioconductor Pakete – limma, siggenes

2 20

( , ) x yT x ys s


Empirical Bayes

• Methode um über viele Gene moderierte Statistiken zu Berechnen unter Annahme von Prior-Distributions.

• Verschiedene Varianten existieren. Siehe Efron et al 2001, Lönsted/Speed 2002, ...

• Beispiel:

Für große n: t

= M.

/ s

B const log

2an

s 2 M2

2an

s2 M2

1 nc


Multiples Testen

• Problem: Tausende von Hypothesen werden gleichzeitig getestet.

• Beispiel: Bei 10000 Genen auf dem Chip und einem Cutoff des p-Wertes von 0.01 erwarte ich, daß 10000×0.01=100 Gene einen signifikanten p-Wert p<0.01 haben.

• Resultat: Ein einzelner p-Wert von 0.01 indiziert nicht mehr unbedingt ein signifikantes Gen. Es gibt eine erhöhte Chance falsch-positive Gene zu finden.

• Lösung: Man muß die p-Werte für multiples Testen korrigieren.

• Methode: Einfachstes Verfahren von Bonferroni Multipliziere alle p-Werte mit der Anzahl der Tests.


Statistik nicht differentieller

Gene Typ 1 = Typ 2

0

0

Gute Statistik

Schlechte Statistik

Statistik hochregulierter

Gene Typ 1 > Typ 2

Wahrscheinlich- keitsdichte

Wahrscheinlich- keitsdichte

Was macht eine gute Teststatistik aus?


Nullhypothese akzeptiert

Nullhypothese abgelehnt

Nullhypothese wahr Echt Negative Typ I Fehler

(Falsch Positive)

Alternativhypo- these wahr

Typ II Fehler (Falsch Negative) Echt Positive

0

d0

Gute Statistik

Power Sensitivität




0

d0

Schlechte Statistik

Nullhypothese akzeptiert

Nullhypothese abgelehnt

Nullhypothese wahr Echt Negative Typ I Fehler

(Falsch Positive)

Alternativhypo- these wahr

Typ II Fehler (Falsch Negative) Echt Positive


Bedeutung des p-Wertes: Typ I Fehler


Das Problem des multiplen Testens

Nicht differentielle Gene

Hochregulierte Gene

Kleiner Typ I Fehler!

Echt Positive

Realität Testergebnis positiv

Falsch Positive

Hohe Power!

>

Das Verhältnis wahrer Nullhypothesen zu wahren Alternativhypothesen spielt beim gleichzeitigen Testen vieler Hypothesen eine entscheidende Rolle!


Typ I Fehlerraten

• Ein p-value oder beobachteter Signifikanzlevel ist die Chance bei wahrer Null-Hypothese eine Teststatistik zu beobachten, welche mindestens genauso extrem ist wie die beobachtete Teststatistik.

1. Family-Wise Error Rate (FWER): FWER ist definiert als die Wahrscheinlichkeit von mindestens einem Typ I Fehler (falsch positiven).

2. False Discovery Rate (FDR): FDR ist definiert als die erwartete Rate von Typ I Fehlern unter den abgelehnten Hypothesen:

Pr( 0)FWER V

( )mit

wenn 00 wenn 0

FDR E Q

V R RQ

R


• Die gleiche Teststatistik wird auf viele hunderte oder tausende von Tests angewendet. Wie oft erwarte ich, dass ich zufällig signifikante Testergebnisse bekomme?

• Verschiedene Methoden um für multiples Testen zu korrigieren: Bonferroni: p-value × Anzahl der Tests Holm: Bonferroni verbessert Benjamini-Hochberg: False Discovery Rate Benjamini-Yekutieli: False Discovery Rate (mit Abhängigkeiten)

Das Problem mit vielen Tests


Beispiel


Lineare Modelle um differentielle Expression zu messen

A B

RefA

B

A B

A B

C

Erlaubt, daß alle Vergleiche simultan ausgewertet werden.


baba

baa

yyyyyyy

2

1

2

1

7

6

5

4

3

2

1

11100100100001001100010010000100100

WT.P11

+ a1

MT.P21

+ (a1 + a2) + b + (a1 + a2)b

MT.P11

+a1+b+a1.b

WT.P21

+ a1 + a2

WT.P1

MT.P1

+ b

1

2

3

4

5

6

7

Ein etwas größeres Beispiel:


Lineare Modelle berechnen

Entwerfe lineares Modell für jedes Gen g

Schätze Modell durch robuste Regression,least squares oder generalized least squares (in R Funktionen lm, rlm, glm) und erhalte

coefficients

standard deviations

standard errors


References

• T. P. Speed and Y. H Yang (2002). Direct versus indirect designs for cDNA microarray experiments. Sankhya

: The Indian Journal of Statistics, Vol. 64, Series A, Pt. 3, pp 706- 720

• Y.H. Yang and T. P. Speed (2003). Design and analysis of comparative microarray Experiments In T. P Speed (ed) Statistical analysis of gene expression microarray data, Chapman & Hall.

• R. Simon, M. D. Radmacher and K. Dobbin (2002). Design of studies using DNA microarrays. Genetic Epidemiology 23:21-36.

• F. Bretz, J. Landgrebe and E. Brunner (2003). Efficient design and analysis of two color factorial microarray experiments. Biostaistics.

• G. Churchill (2003). Fundamentals of experimental design for cDNA microarrays. Nature genetics review 32:490-495.

• G. Smyth, J. Michaud and H. Scott (2003) Use of within-array replicate spots for assessing differential experssion in microarray experiments. Technical Report In WEHI.

• Glonek, G. F. V., and Solomon, P. J. (2002). Factorial and time course designs for cDNA microarray experiments. Technical Report, Department of Applied Mathematics, University of Adelaide. 10/2002


Swirl Data

• Experiment to study early development in zebrafish.

• Swirl mutant vs. wild-type zebrafish.

• Two sets of dye-swap experiments.

• Microarray containing 8448 cDNA probes

• 768 control spots (negative, positive, normalization)

• printed using 4x4 print-tips, each grid contains a 22x24 Spot matrix

RR

R Console

> library(marray)

> data(swirl)

> ll()

member class mode dimension1 swirl marrayRaw list c(8448,4)


24 x 22 spots per print-tip

MT – R WT - G

MT – G WT - R

Hybr. I

Hybr. IItechnical, biological variability

Swirl Data


4 x 4 sectors

Sector: 24 rows 22 columns

8448 spots

Mean signal intensity

Visual inspection

RR

R Console

> image(swirl[,1])

81: image of M1 2 3 4

4

3

2

1

-5

-3.9

-2.8

-1.7

-0.56

0.56

1.7

2.8

3.9

5


81: image of Rb1 2 3 4

4

3

2

1

58

300

540

780

1000

1300

1500

1700

2000

2200

81: image of Rf1 2 3 4

4

3

2

1

160

6800

1300

2000

2700

3300

4000

4700

5300

6000

81: image of Gb1 2 3 4

4

3

2

1

76

100

130

150

180

200

230

250

280

300

81: image of Gf1 2 3 4

4

3

2

1

160

6300

1300

1900

2500

3100

3700

4300

5000

5600

Visual inspection – Foreground and Background intensities

RR

R Console

> Gcol <-

maPalette(

low = "white", high = "green", k = 50)

> Rcol <-

maPalette(

low = "white",

high = "red", k = 50)

> image(swirl[,1]

xvar="maRb",

col=Rcol)

> image(swirl[,1]

xvar="maRf",

col=Rcol)

> image(swirl[,1]

xvar="maGb",

col=Gcol)

> image(swirl[,1]

xvar="maRf",

col=Gcol)


200 500 1000 5000 20000 50000

100

200

500

1000

2000

Foreground (red)

Bac

kgro

und

(Red

)

swirl.1.spot

Foreground versus Background intensities

RR

R Console

>

plot(

maRf(swirl[,1]),

maRb(swirl[,1]),

log="xy")

> abline(0,1)


(1,1

)(1

,2)

(1,3

)(1

,4)

(2,1

)(2

,2)

(2,3

)(2

,4)

(3,1

)(3

,2)

(3,3

)(3

,4)

(4,1

)(4

,2)

(4,3

)(4

,4)

-2-1

01

2

Swirl array 93: pre-norm

PrintTip

M

81 82 93 94

-20

24

Swirl arrays: pre--normalization

M

RR

R Console

> boxplot(swirl[, 3], xvar = "maPrintTip", yvar = "maM")

> boxplot(swirl, yvar = "maM")

marray – Swirl Data: Pre Normalization


(1,1

)(1

,2)

(1,3

)(1

,4)

(2,1

)(2

,2)

(2,3

)(2

,4)

(3,1

)(3

,2)

(3,3

)(3

,4)

(4,1

)(4

,2)

(4,3

)(4

,4)

-10

12

3

Swirl array 93: post-norm

PrintTip

Mmarray – Swirl Data: Post Normalization

RR

R Console

> swirl.norm <-

maNorm(swirl, norm = "p")

> boxplot(swirl.norm[, 3], xvar = "maPrintTip", yvar = "maM")

> boxplot(swirl.norm, yvar = "maM")


Swirl Data – M values, raw versus normalized

81: image of M1 2 3 4

4

3

2

1

-5

-3.9

-2.8

-1.7

-0.56

0.56

1.7

2.8

3.9

581: image of M

1 2 3 4

4

3

2

1

-6

-4.7

-3.3

-2

-0.67

0.67

2

3.3

4.7

6

Normalization procedure was not able to remove scratchRR

R Console

> image(swirl[,1])

> image(swirl.norm[,1])


6 8 10 12 14

-2-1

01

2

Swirl array 93: pre-norm MA-Plot

A

M

(1,1)(2,1)(3,1)(4,1)

(1,2)(2,2)(3,2)(4,2)

(1,3)(2,3)(3,3)(4,3)

(1,4)(2,4)(3,4)(4,4)

6 8 10 12 14

-10

12

3

Swirl array 93: post-norm MA-Plot

AM

(1,1)(2,1)(3,1)(4,1)

(1,2)(2,2)(3,2)(4,2)

(1,3)(2,3)(3,3)(4,3)

(1,4)(2,4)(3,4)(4,4)

Non-parametric smoother: loess, lowess, local regression line, generalizes the concept of moving average.

marray – Swirl Data: Print-tip lowess Normalization

RR

R Console

> plot(swirl[, 3], xvar = "maA", yvar = "maM", zvar = "maPrintTip")

> plot(swirl.norm[, 3], xvar = "maA", yvar = "maM", zvar = "maPrintTip")


Acknowledgements – Slides geborgt von

• Achim Tresch

• Benedikt Brors

• Wolfgang Huber

• Anja von Heydebreck

• Terry Speed

• Jean Yang


Swirl Data: Lowess versus VSN

RR

R Console>

plot(maA(swirl.norm[,3]), maM(swirl.norm[,3]), ylim=c(-3,3))

> library(vsn); library(limma);

>

A.vsn<-log2(exp(exprs(swirl.vsn[,6])+exprs(swirl.vsn[,5])))/2

> M.vsn<-log2(exp(exprs(swirl.vsn[,6])-exprs(swirl.vsn[,5])))

> plot(A.vsn, M.vsn, ylim=c(-3,3)


RR

R Console

> M.lowess<-maM(swirl.norm)

> M.vsn<-log2(exp(exprs(

swirl.vsn[,c(2,4,6,8)])-

exprs(swirl.vsn[,c(1,3,5,7)]

)))

> par(mfrow=c(2,2))

> plot(M.lowess[,1},

M.vsn[,1], pch=20)

> abline(0,1, col="red")


M.vsn[,1], pch=20)



M.vsn[,1], pch=20)



M.vsn[,1], pch=20)


Swirl: LOWESS versus VSN


5‘ 3‘

PM: ATGAGCTGTACCAATGCCAACCTGG MM: ATGAGCTGTACCTATGCCAACCTGG

16-20 probe pairs

per gene

16-20 probe pairs: HG-U95a 11 probe pairs: HG-U133

64 pixels; Signal intensity is upper quartile of the 36 inner pixels

Stored in CEL file

Affymetrix technology


Affymetrix expression measures

• PMijg , MMijg = Intensity for perfect match and mismatch probe j for gene g in chip i.

• i = 1,…, n one to hundreds of chips• j = 1,…, J usually 16 or 20 probe pairs• g = 1,…, G 8…20,000 probe sets.

• Tasks:• calibrate (normalize) the measurements from different chips

(samples)• summarize for each probe set the probe level data, i.e., 20 PM and

MM pairs, into a single expression measure.• compare between chips (samples) for detecting differential

expression.

Documents

Bioinformatik Ringvorlesung SS2005 - UMG · Abt. Medizinische Statistik - Biostatistik Wednesday. 20. January 2010. 8:30. Bioinformatik/System Biologie ... Tim Beißbarth Bioinformatik