


TECHNOLOGY REPORT

The analysis of microarray data

Ramesh Hariharan

Strand Genomics Private Limited, and Indian Institute of Science, Bangalore, India, 560080. Tel: +91 80 361 1349; Fax: +91 80 361 8996; E-mail: ramesh@strandgenomics.com

Ashley Publications Ltd

Keywords: gene expression, image analysis, microarrays, oligonucleotide probe design, statistical data analysis

www.ashley-pub.com

2003 © Ashley Publications Ltd

This article describes issues, techniques and algorithms for analyzing data from microarray experiments. Each such experiment generates a large amount of data, only a fraction of which comprises significant differentially expressed genes. The precise identification of these interesting genes depends not only on the statistical data analysis techniques used but also on the accuracy of the earlier oligonucleotide probe design and image analysis steps. Indeed, wrong decisions in these steps can multiply the number of false positives by many-fold, thus necessitating a careful choice of algorithms in all three steps. These steps are described here and placed in the context of commercial and public tools available for the analysis of microarray data.

Introduction
Gene microarrays constitute a powerful and increasingly popular platform for studying changes in gene expression on a large scale. This platform allows tracking changes in gene expression for the entire transcriptome (several thousands or tens of thousands of genes) simultaneously. A single microarray experiment yields gene expression information not only about individual genes but about joint behavior of collections of genes as well. Unraveling this joint behavior facilitates the study of biological phenomena on a hitherto impossible systemic scale. Indeed, microarrays have become invaluable tools for gene discovery, disease diagnosis, pharmacogenomics and toxicogenomics.

There are of course challenges in using microarrays to study biological systems. The chain of events between biological sample and final outcome is very long and involves several experimental and computational steps:

• sample preparation
• oligonucleotide probe design
• oligonucleotide synthesis
• slide preparation and spotting
• hybridization
• washing
• image analysis
• statistical data analysis

Errors in each step affect the accuracy of the final results considerably. Indeed, data generated from microarrays usually show a large number of false positives (i.e., genes which are not differentially expressed but appear as being so).

This article surveys computational issues which arise in the three main computational steps in the previously mentioned chain, namely, oligonucleotide probe design, image analysis, and statistical data analysis, and studies their effects on the final outcome (i.e., the list of genes declared to be differentially expressed). Choosing the right algorithms in each of these steps is critical, as small changes in algorithm can increase the number of false positives by several-fold. As the title of a popular article on gene expression informatics reads, it’s all in your mine [1].

The oligonucleotide probe design step requires choosing appropriate probes specific to each gene of interest. This is a fairly complex multiparameter problem. A good probe needs to satisfy several properties, some of which are predicted by modeling physical hybridization phenomena which are only partially understood. Therefore, probe behavior in an experiment is not always as predicted or expected, especially at low expression levels, and this can have a significant impact on the final results. Since a microarray image has several tens or hundreds of thousands of spots, image analysis is necessarily an automated step and not always amenable to manual checking or correction. Differences in segmentation and background correction methods in the image analysis step can affect the final outcomes substantially. Finally, the statistical data analysis step performs a variety of statistical analyses on spot quantitated data to assess whether a gene is truly differentially expressed or not. There are several issues which this step needs to address, for example, normalizing arrays to factor out differences due to variations in non-biological conditions,

ISSN 1462-2416 Pharmacogenomics (2003) 4(4) 1


performing the right transformations and statistical tests to identify differentially expressed genes so as to reduce the number of false positives, etc. This is a very vibrant area of research, and it is probably fair to say that there have been tremendous advances in the last couple of years contributing to a substantially better understanding of various microarray platforms and consequently far more accurate results.

One of the key problems in assessing and comparing various algorithms for microarray data analysis is that there are few or no benchmarks or data sets available for the various available microarray platforms on which the true answers are known. Researchers do perform reverse transcriptase polymerase chain reaction (RT-PCR) studies on chosen individual genes to verify their level of differential expression, and some of these studies have been reported in the literature; however, since RT-PCR studies are performed only on a subset of interesting genes, it is not always easy to draw large scale statistical conclusions from these. Of great use here are experiment sets where a certain number of genes are spiked-in in known concentrations on a common background. Since only the spiked-in genes are truly differentially expressed, these data sets are invaluable in assessing and quantifying the relative accuracies of several algorithms. Some of the discussion in this article as regards comparative analysis will revolve around the Affymetrix Latin Square Dataset [101]; the other commonly used Latin Square Dataset is available from GeneLogic [102]. Of course, the fact that an algorithm performs well on these data sets does not mean that the algorithm will perform equally well on all data sets. Therefore, the individual steps of microarray data analysis should be performed in a quality-controlled fashion in order to find the method best suited for the data set. Possible quality control procedures include inspection of pre/post-normalization scatter plots and the observation of genes with known expression patterns.
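The spike-in benchmarking idea above can be sketched in a few lines: given per-gene p-values produced by any analysis method, together with the set of known spike-in genes, compute the ranks the spike-ins achieve when all genes are sorted by significance. The gene names and p-values below are invented purely for illustration.

```python
def spike_in_ranks(p_values, spiked_genes):
    """Rank genes by ascending p-value (rank 0 = most significant) and
    return the sorted ranks achieved by the spiked-in genes.  Ideally the
    spike-ins occupy the top ranks; intervening genes are likely false
    positives."""
    ordered = sorted(p_values, key=p_values.get)
    position = {gene: i for i, gene in enumerate(ordered)}
    return sorted(position[g] for g in spiked_genes)

# Hypothetical p-values for five genes, two of which were spiked in.
p = {"geneA": 0.0001, "geneB": 0.4, "geneC": 0.003, "geneD": 0.02, "geneE": 0.6}
print(spike_in_ranks(p, {"geneA", "geneD"}))  # -> [0, 2]: geneC intervenes
```

Here geneC, not a spike-in, outranks geneD, so a cutoff that captures both spike-ins would also report one false positive.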

Roadmap
The following two sections describe issues in oligonucleotide probe design and image analysis, respectively, and examine the impact of these steps on the final results. This article then describes issues in statistical data analysis. This description is restricted to primary analysis (i.e., analysis aimed at identifying significant genes). There are several important steps beyond this primary analysis, notably identifying co-expressed/co-regulated genes using clustering approaches and mapping genes to pathways, which are not addressed in this article.

Oligonucleotide probe design
Oligonucleotide arrays typically have less variation amongst replicates than cDNA arrays, and many commercial platforms, for example, Affymetrix, Agilent, Amersham etc., use oligonucleotide arrays. However, the design of oligonucleotides poses some challenging computational problems in ensuring the following properties:

• Specificity – a probe should hybridize to mRNA from the corresponding gene but not to mRNA from other genes.

• Availability – a probe should be available to hybridize with mRNA from the corresponding gene. In particular, it should not form secondary structures which prevent this hybridization.

• Uniformity – all probes on an array must have somewhat uniform thermodynamic properties so they behave as required under a common experimental condition.

Specificity is usually enforced by ensuring that there is no non-specific match within certain homology limits, for example, either with homology > 75% or having a 15-mer continuous exact match stretch [2,103]. Often homology may not be a sufficient criterion in avoiding cross-hybridization, because the thermodynamic properties of hybridization depend not only on percentage homology but also on the base composition; therefore, low homology matches with substantial GC content could still make for a stable binding [104]. It may then be more effective to estimate a cross-melting temperature (the melting temperature of the strongest non-specific cross match) and ensure that this temperature is well below the self-melting temperature (the melting temperature of the perfect match) [105]. Prediction of melting temperatures is usually performed using nearest neighbor parameters [3-5]. This approach has two problems: first, nearest neighbor parameters are usually obtained from studies in solution which may not be applicable directly to the array surface, and second, these parameters are available only for perfect match and single mismatch duplexes. Furthermore, there seem to be allied parameters which govern the specificity, for example, the number of non-specific matches with a modest predicted melting temperature, the number of low complexity subsequences in the oligo etc.
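As an illustration, the 15-mer continuous exact match part of the specificity criterion above can be checked with a short sketch; the percentage homology and melting temperature checks are omitted, and the sequences are made up.

```python
def has_long_exact_match(probe, off_target, k=15):
    """Return True if any k-mer of the probe occurs exactly in an
    off-target sequence -- one of the rejection criteria for candidate
    probes described in the text."""
    kmers = {probe[i:i + k] for i in range(len(probe) - k + 1)}
    return any(off_target[i:i + k] in kmers
               for i in range(len(off_target) - k + 1))

def is_specific(probe, off_targets, k=15):
    """Accept a candidate probe only if no off-target transcript shares a
    continuous k-mer with it."""
    return not any(has_long_exact_match(probe, t, k) for t in off_targets)

probe = "ACGTACGTACGTACGTACGTACGTA"      # a 25-mer, as on Affymetrix arrays
decoy = "TTTT" + probe[3:18] + "GGGG"    # shares a 15-mer with the probe
print(is_specific(probe, [decoy]))       # False: candidate rejected
```

A production probe designer would combine this with homology and cross-melting-temperature estimates; the k-mer screen alone is only the cheapest of the filters discussed above.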


Figure 1. Profiles for probeset 1597_at over eight arrays. Each red profile is the profile of one of the 16 probes over the 8 arrays. These 8 arrays appear along the x-axis, with arrays 0,1,2,3 comprising one group of replicates and arrays 4,5,6,7 comprising the other group. The green line shows the average probeset profile.

Availability is typically tested using secondary structure prediction for both the oligo and the gene sequence. Uniformity requires a multiparameter optimization scheme to obtain probes which are optimal along several of the above parameters while keeping the parameters uniform. Finally, probe length plays a key role in determining all the above parameters. While Affymetrix uses 25-mers, Bosch et al. [106] claim that 70-mers offer greater sensitivity. Agilent arrays use both 25-mers and 60-mers, while Amersham is focussing on 30-mers.

Computing with noisy probes
While predictions of probe behavior made from in silico models as described above are faithful at a coarse level and for most probes, there seem to be no conclusive studies published demonstrating accurate specificity predictions for all probes at low expression levels. Indeed, at low expression levels, probe behavior is often very noisy, as illustrated by the example in the next paragraph. However, using several noisy probes and averaging over these probes can attenuate the effect of noise.

As an example, consider probeset 1597_at on the Affymetrix HGU95 chip and consider the expression values obtained from the Affymetrix Latin Square Dataset [101]. The spike-in concentration of this gene goes from 0 in the first set of four replicates to 0.25 pm in the second set of four replicates. The profiles of all probes in this probeset over these eight arrays are shown in Figure 1. Note that the behavior of most probes is quite choppy; indeed, testing the individual probes for differential expression yields only two probes with a p-value smaller than 0.05. On the other hand, the average profile shown in green in Figure 1 seems to be smooth and shows a distinctly higher value in the last four replicates as


compared to the first four replicates; testing this average profile indicated differential expression with a much smaller p-value of 0.014. Thus, averaging over several probes leads to more accurate results, primarily because of cancellation of random noise on addition. Indeed, Affymetrix uses 10–20 probes per probeset to increase accuracy. However, increasing the number of probes per probeset produces uniformity challenges.
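The noise-cancellation effect described above can be reproduced in a small simulation: sixteen probes measured over eight arrays, a true signal that steps up in the second group of four, and heavy probe-level noise. All numbers below are invented; the point is only that the averaged profile yields a much stronger test statistic than a typical individual probe.

```python
import random
import statistics

random.seed(0)

def t_stat(xs, ys):
    """Plain equal-variance two-sample t statistic."""
    nx, ny = len(xs), len(ys)
    sp2 = ((nx - 1) * statistics.variance(xs)
           + (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (statistics.mean(ys) - statistics.mean(xs)) / (sp2 * (1/nx + 1/ny)) ** 0.5

# 16 simulated probes over 8 arrays: the true signal steps from 0 (arrays 0-3)
# to 1 (arrays 4-7), buried in probe-level Gaussian noise.
probes = [[(0 if a < 4 else 1) + random.gauss(0, 1) for a in range(8)]
          for _ in range(16)]

individual = [abs(t_stat(p[:4], p[4:])) for p in probes]
avg_profile = [statistics.mean(p[a] for p in probes) for a in range(8)]
averaged = abs(t_stat(avg_profile[:4], avg_profile[4:]))

# Averaging 16 probes shrinks the noise roughly 4-fold (sqrt(16)), so the
# averaged profile separates the two groups far more cleanly than a typical
# individual probe does.
print(round(averaged, 1), round(statistics.median(individual), 1))
```

This mirrors the 1597_at example: most individual probes look choppy, while the probeset average exposes the spike-in.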

Image analysis
Image analysis needs slightly different treatment for spotted arrays as opposed to synthetic arrays.

Spotted arrays
There are three steps in image analysis of spotted arrays. The first step is called addressing or gridding and involves associating spots on the array with their row and column coordinates. The second step is segmentation, which involves tracing the spot boundary so as to separate the spot foreground from the surrounding region. The final step involves spot quantification and background correction (i.e., computing foreground and background expression values for each spot and performing background correction to obtain a net expression value for each probe). Various algorithms used in the above steps are surveyed in Yang et al. [6].

The gridding problem is usually solved by a semiautomatic process where the user gives some manual tips (usually in the form of clicking at a few key points, e.g., the top left corner of the top left grid) and the machine figures out the rest, allowing the user to perform some final fine-tuning by hand. With the right user interface, the correctness of the gridding process can also be verified very quickly.

As regards segmentation, currently available tools follow one of two approaches. The first approach fits circles onto the spots of either fixed size (e.g., Scanalyse or Quantarray™) or an adaptive size (e.g., GenePix®). The second approach actually traces spot boundaries using more sophisticated algorithms (e.g., Spot [107], which uses the seeded region growing algorithm of Adams and Bischof [7], and Chitraka [108], which uses a clustering approach). The advantage of the second approach is that the foreground and background are separated better, leading to better quantitation. The two approaches can yield substantially different values, as indicated by Figure 2.

Figure 2. Spot segmented by circle fitting on the left and by exact boundary tracing on the right. The associated spot statistics are displayed below the spots: the figures, in order from top to bottom, are Channel 1 and 2 intensities, Channel 1 and 2 backgrounds, and then Channel 1 and 2 background corrected values. The circle fitting approach on the left yields a background corrected value which is much smaller than the one obtained by accurate tracing on the right.
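To give a sense of how boundary-tracing segmentation works, here is a minimal single-region sketch in the spirit of seeded region growing. The published algorithm of Adams and Bischof grows several competing seeded regions simultaneously; the single-seed variant, tolerance threshold and toy image below are simplifications invented for illustration.

```python
import heapq

def seeded_region_grow(img, seed, tol=50):
    """Grow one region from `seed`: repeatedly examine the neighbouring pixel
    whose intensity is closest to the running region mean, admitting it only
    if it lies within `tol` of that mean.  A single-region sketch of the
    seeded-region-growing idea."""
    h, w = len(img), len(img[0])
    region = {seed}
    total = img[seed[0]][seed[1]]
    frontier = []  # (|intensity - region mean at push time|, row, col)

    def push_neighbours(r, c):
        for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in region:
                mean = total / len(region)
                heapq.heappush(frontier, (abs(img[nr][nc] - mean), nr, nc))

    push_neighbours(*seed)
    while frontier:
        _, r, c = heapq.heappop(frontier)
        if (r, c) in region:
            continue
        if abs(img[r][c] - total / len(region)) > tol:
            continue  # too different from the region: treat as background
        region.add((r, c))
        total += img[r][c]
        push_neighbours(r, c)
    return region

# Toy 4x4 "scan": a bright 2x2 spot (~200) on a dark background (~10).
img = [[10, 12, 11, 9],
       [10, 200, 210, 11],
       [12, 205, 198, 10],
       [9, 11, 10, 12]]
print(sorted(seeded_region_grow(img, (1, 1))))  # the four bright pixels
```

The traced region follows the actual bright pixels, which is why such methods handle irregular or doughnut-shaped spots better than fitted circles.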

Furthermore, several spots on a spotted array actually have a doughnut-like shape; fitting circles on such spots will attenuate the average foreground values. The accurate tracing methods can easily handle such spots as well, as shown in Figure 3.

Background correction can be performed either locally or globally. The local correction approach computes a background estimate from the non-spot regions adjacent to a given spot; typically, this estimate would be a robust average or median of the non-spot regions around a spot and should not be corrupted by brighter pixels belonging to the edges of spots. The global approach computes a background taking the whole array into account; this background value could be computed by taking, for example, the third percentile of all spot foreground values, the motivation being to subtract off background noise due to non-specific hybridization. The former approach is clearly better at handling spatial variations in intensity, and the latter could be used in conjunction with the former. Most software packages available implement a local correction approach.
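The two estimators just described can be sketched as follows; the window, mask and foreground values are invented, and real packages differ in the exact robust statistic and percentile used.

```python
import statistics

def local_background(window_pixels, spot_mask):
    """Median of the non-spot pixels in a window around one spot.  The median
    is robust, so stray bright pixels from the edges of neighbouring spots do
    not corrupt the estimate."""
    return statistics.median(v for v, in_spot in zip(window_pixels, spot_mask)
                             if not in_spot)

def global_background(all_foregrounds, pct=3):
    """Global estimate: a low percentile (e.g., the third) of all spot
    foreground values on the array, meant to capture non-specific
    hybridization common to every spot."""
    ordered = sorted(all_foregrounds)
    idx = max(0, int(len(ordered) * pct / 100) - 1)
    return ordered[idx]

# Invented numbers: a window of 8 pixels, 3 of which belong to the spot.
window = [40, 42, 900, 41, 38, 870, 850, 39]
mask = [False, False, True, False, False, True, True, False]
print(local_background(window, mask))   # median of [40, 42, 41, 38, 39] -> 40
fg = [100, 105, 5000, 98, 102, 7000, 110, 95, 101, 99]
print(global_background(fg))            # -> 95
```

In practice the local estimate handles spatial intensity gradients, while the global percentile can be layered on top to remove array-wide non-specific signal.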

Dudoit et al. [8] observe that the choice of background correction method has greater impact on the final log intensity ratios than the choice of segmentation method, and that a robust local correction method based on Morphological Opening [9] seems to produce the most stable estimates of background.

Synthetic arrays
We discuss only Affymetrix GeneChip® arrays here. Image analysis for these arrays is somewhat simpler as compared to spotted arrays because cells (the analogs of spots on a spotted array) have a fixed rectangular shape and are laid out in a fairly regular grid structure.

Once the corners of the grid have been identified, simple linear interpolation can identify the corners of each cell to within 3 pixels, as stated by Zuzan et al. [109], who propose an algorithm for correcting this up-to-3-pixel error. Interestingly, they also show that when the alignment is computed using Affymetrix software, a certain banding pattern occurs when one views an image in which each cell is replaced by a pixel with intensity proportional to the coefficient of variation. They ascribe this banding pattern to faulty alignment and demonstrate that their alignment algorithm corrects this spatial effect.

After alignment, the pixel values in each cell are averaged using a robust measure. Affymetrix software usually reports the 75th percentile of the pixel values within a cell along with the standard deviation. Zuzan et al. [109] mention that pixels near the cell borders need to be ignored in computing cell statistics.
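Combining the two points above (drop border pixels, then take the 75th percentile of the interior) gives a small sketch; the nearest-rank percentile convention and the toy cell values are choices made for this illustration, not the exact Affymetrix computation.

```python
def cell_intensity(cell, border=1, pct=75):
    """Summarize one rectangular cell of pixel values: drop a border of
    pixels (which may straddle neighbouring cells), then take the 75th
    percentile of the interior, as described in the text."""
    interior = [v for row in cell[border:len(cell) - border]
                for v in row[border:len(row) - border]]
    ordered = sorted(interior)
    # nearest-rank percentile: a simple convention chosen for this sketch
    idx = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]

# A made-up 5x5 cell: dim border pixels around a bright 3x3 interior.
cell = [[5, 6, 5, 7, 6],
        [6, 90, 100, 95, 5],
        [5, 92, 110, 98, 6],
        [7, 96, 105, 99, 5],
        [6, 5, 7, 6, 5]]
print(cell_intensity(cell))  # -> 100
```

Note how including the border pixels would have dragged the percentile down sharply, which is exactly why they are excluded.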

Background correction
For Affymetrix high density arrays, the image processing task is fundamentally different because probes are synthesized in situ and, therefore, the addressing and spot segmentation tasks are much simpler. Since the probes are packed very tightly on the chip, the region surrounding the probes cannot be used for background computation. Therefore, background correction for such high density chips needs to be performed using the probe values themselves, as explained later in this article.

Statistical data analysis
Once the microarray image has been analyzed and quantitated, the next set of steps involves analyzing this data using statistical algorithms. This article reviews relevant issues which arise in each of the following steps.

Figure 4. The MM (Y-axis) values for about 12,000 genes from an array in the Affymetrix Latin Square Dataset plotted against their corresponding PM (X-axis) values. A large fraction of the MM values are much bigger than their corresponding PM values. In addition, for a fraction of genes, MM values seem to increase as PM increases. MM: Mismatch value; PM: Perfect match value.

The first of these steps is concerned with performing the right transformations to the raw expression levels measured. There are two issues herein. The first question is whether data obtained from image analysis needs to be analyzed on the linear scale or converted to a different scale for analysis, for example, the logarithmic scale. The second issue is specific to the Affymetrix platform. Probes on an Affymetrix array are paired in PM,MM pairs, where PM refers to the perfect match oligo and MM to an oligo which has one mismatched base relative to the perfect match oligo. In this case, the image analysis phase outputs two values, a PM (or perfect match value) and an MM (or mismatch value). The following question now arises: what is the right way to combine PM and MM to get the expression value for a particular probe?

Once the right transformation is determined, the second step is that of normalizing the data to remove effects of non-biological variation. This issue arises naturally in experiments involving multiple arrays, where observed expression levels across arrays include both biological variation and other sources of uninteresting variation arising from the production and processing of arrays, varying dye efficiencies in multi-dye arrays (i.e., for the same expression levels, different dyes may report different intensity values), and variations of ambient light intensities


between regions of an array and across arrays. See Hartemink et al. [10] for a more detailed description of such sources of variation.
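One widely used scheme for removing such non-biological variation is quantile normalization, which forces all arrays to share a common intensity distribution. A minimal sketch follows (toy values invented; ties are ignored for simplicity):

```python
def quantile_normalize(arrays):
    """Sort each array, average the sorted values across arrays rank by
    rank, then map each original value to the mean value at its rank, so
    that every array ends up with an identical distribution."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    rank_means = [sum(col[i] for col in sorted_cols) / len(arrays)
                  for i in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        out.append(normalized)
    return out

# Two invented 4-gene arrays with a global intensity shift between them.
a1 = [2.0, 8.0, 4.0, 6.0]
a2 = [4.0, 16.0, 8.0, 12.0]  # same gene ordering, twice as bright
print(quantile_normalize([a1, a2]))
```

After normalization the two arrays become identical here, because they ranked the genes the same way; only rank-order differences between arrays survive, which is the intent of the method.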

The third step arises from the fact that several microarray platforms use multiple probes for each gene of interest; for example, the Affymetrix HGU95 chips had 16–20 probes per gene and the HGU133 chips have 10 probes per gene. As mentioned earlier, this seems to be a good way to fight noise. But it leads to another computational task: expression values for these various probes need to be aggregated into a gene expression value.

The final step in the statistical analysis phase is statistical hypothesis testing, which determines the statistical confidence with which each gene can be declared as differentially expressed. Algorithms for hypothesis testing, as well as for determining the trade-off between the number of replicates and the number of false positives, are relevant here.

Data transformations
PM,MM and background correction for Affymetrix arrays
Several tools which analyze microarray data (e.g., Affymetrix MAS5.0 [110], Affymetrix MAS 4.0, and one of the main DChip options [111]) are based on the premise that MM measures noise due to non-specific hybridization while PM measures the actual signal, and therefore, PM-MM is a measure of the signal corrected for noise.

However, as observed by Irizarry et al. [11] and as illustrated in Figure 4, MM seems to pick up a fair amount of signal in addition to the noise, and thus PM-MM could result in an attenuated signal. In addition, for several probes, the MM values seem to be much larger than the corresponding PM values, leading to the problem of negative values (though the Affymetrix MAS5.0 algorithm avoids negative values by damping the MM values whenever they are too large). For these reasons, PM-MM may not be a good measure of expression.

Indeed, as the results in Tables 1 & 2 show, PM-MM based methods (e.g., MAS5.0, the DChip Li-Wong PM-MM option) yield relatively poor results on the Affymetrix Latin Square Dataset when compared to pure PM based methods. This is indicated by the p-value ranks of the 14 actually differentially expressed genes; ideally these should get ranks 1 to 14, or perhaps slightly higher if cross-hybridization, probe design errors etc. are taken into account. Since the various methods in these tables use different normalization methods, it is not obvious that the difference in performance is due solely to the use of PM or PM-MM. This is verified by holding all parameters other than PM/PM-MM constant and running a comparative analysis: comparing ALG1 versus ALG3, or DChip PM-MM versus DChip PM, confirms this hypothesis (though there are a couple of aberrant values in the DChip PM option). Thus, while there seems to be some information in the MM values, it is not clear how this information can be gainfully used to remove noise.

Affymetrix background correction
If MM is not used to remove noise, one needs to find alternative ways of removing the noise in PM values which arises due to non-specific binding. This becomes particularly important at low expression values, where noise terms can drown out the real signal. For example, if the actual signal in two distinct arrays is x and 2x, then differential expression as measured by difference on the log scale (base 2) would be:

log(2x) - log(x) = 1

while an additive noise level of 100 in both arrays would make this difference:

log(100+2x) - log(100+x)

which is close to 0 for small values of x. Note that for Affymetrix arrays, probes are so densely packed that regions adjacent to probes cannot be used to calculate backgrounds as for cDNA arrays.
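The attenuation above is easy to quantify; the signal levels below are arbitrary, chosen only to show how an uncorrected additive background of 100 compresses a true two-fold change at low intensities.

```python
import math

def log2_ratio(x, noise=0.0):
    """Apparent log2 fold change between signals x and 2x when a common
    additive background `noise` is left uncorrected in both arrays."""
    return math.log2(noise + 2 * x) - math.log2(noise + x)

for x in (10, 100, 1000):
    print(x, round(log2_ratio(x), 2), round(log2_ratio(x, noise=100), 2))
# At x = 10 the true fold change of 1.0 shrinks to about 0.13; only at
# signals well above the background does it approach 1.0 again.
```

This is why background correction matters most for weakly expressed genes: without it, genuine low-intensity fold changes are systematically pushed toward zero.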

Irizarry et al. [11] suggest one such background correction method which is based on the distribution of MM values amongst probes on an Affymetrix array. The key observation is that the smoothed histogram of the log(MM) values, as displayed in Figure 5, exhibits a sharp normal-like distribution to the left of the mode (i.e., the peak value) but not on the right. This observation suggests that the MM values are a mixture of non-specific binding and background noise on the one hand and specific binding on the other hand. In particular, the distribution on the right of the mode is possibly influenced by the MM values picking up some amount of signal; if this did not happen, the guess would be that the curve in Figure 5 would look symmetric about the peak value (which would then be the mean) and, therefore, the average non-specific hybridization would

7

Page 8: The analysis of microarray data - Ramesh Hariharan · There are of course challenges in using micro-arrays to study biological systems. The chain of events between biological sample

TECHNOLOGY REPORT

8

Table 1. Ranks of thDataset where one double in concentra

be normally distributed around this mean value. Thus, the mode of the log(MM) distribution is a natural estimate of the average background noise, and this can be subtracted from all PM values to get background corrected PM values. However, the problem of negative values remains.

Irizarry et al. [11] solve the problem of negative values by suggesting a further extension: imposing a positive distribution on the background corrected values. They assume that each observed PM value O is a sum of two components: a signal S which is assumed to be exponentially distributed (and is therefore always positive) and a noise component N which is normally distributed. The background corrected value is obtained by determining the expectation of S conditioned on O, which can be computed using a closed form formula. However, this requires estimating the decay parameter of the exponential distribution and the mean and variance of the normal distribution from the data at hand. This method is used as part of the robust multi-array analysis (RMA) package in the Bioconductor suite [112].

The MAS5.0 algorithm (as reported in [110]) from Affymetrix uses a completely different method of background correction. The entire array is divided into 16 rectangular zones and the second percentile of the probe values in each zone (both PMs and MMs combined) is chosen as the background value for that region. For each probe, the intention now is to reduce the expression level measured for this probe by an amount equal to the background level computed for the zone containing this probe. However, this could result in discontinuities at zone boundaries.

Table 1. Ranks of the 14 spike-in genes in lot 1532 (arrays MNOPQRST) of the Affymetrix Latin Square experiment: one gene goes down from 1024 pM to 0 pM, another from 0 to 0.25 pM, and all others double in concentration.

RMA   ALG1  ALG2  DChipPM  MAS5  DChip  ALG3
0     0     0     0        0     0      0
1     1     1     2        1     2      3
3     2     5     3        2     11     5
4     5     6     8        5     18     7
6     7     7     11       9     32     9
8     10    11    12       24    36     10
9     18    14    13       43    43     13
10    22    20    19       54    58     27
16    24    25    36       69    122    82
19    28    26    45       75    246    111
21    29    27    92       159   267    119
47    30    45    97       271   338    504
99    46    47    442      272   1052   998
162   52    64    9960     2352  3571   9183

These ranks are obtained by performing t-tests using different probe aggregation and normalization methods. Ideally, the spike-ins should get ranks 0–13 in this process; however, all methods give a fair number of false positives. The first four columns are algorithms based on PM alone and perform better on the whole than the last three methods, which use the PM-MM measure. MM: Mismatch value; PM: Perfect match value; RMA: Robust multi-array analysis.

Table 2. ALG1 has the least maximum rank, namely 52, while RMA does best in the upper reaches.

Algorithm  Properties
RMA        [102]
ALG1       Robust log(PM-bg), Quantile Normalization, bg = mode(MM)
ALG2       Robust log(PM-bg), Lowess Normalization, bg = mode(MM)
DChipPM    [105] PM option
MAS5.0     [114]
DChip      [111] PM-MM option
ALG3       Robust log(PM-MM), Quantile Normalization

RMA [112] and MAS5 [110] results have been obtained using implementations in Soochika [116] and could therefore differ slightly from their native implementations. ALG1, ALG2 and ALG3 are native to Soochika. Note that the gene order provided by different methods is not the same.
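For illustration, the closed-form background correction in the exponential-signal, Gaussian-noise model of Irizarry et al. can be sketched as follows. This is a simplified sketch, not the RMA package's code: the function name and the parameter values in the usage note are hypothetical, and the formula is the conditional mean of an exponentially distributed signal observed through normal noise, written with the standard normal density and distribution functions.

```python
import math

def _phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rma_background_correct(o, mu, sigma, alpha):
    """Sketch of E[S | O = o] when O = S + N, S ~ Exp(alpha), N ~ Normal(mu, sigma).

    The conditional expectation stays strictly positive, which is the point
    of imposing a positive distribution on the corrected values.
    """
    a = o - mu - sigma * sigma * alpha
    b = sigma
    num = _phi(a / b) - _phi((o - a) / b)
    den = _Phi(a / b) + _Phi((o - a) / b) - 1.0
    return a + b * num / den
```

Even an observed value below the estimated noise mean maps to a small positive signal estimate, so the negative-value problem disappears by construction.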


To make these transitions smooth, what is actually subtracted from each probe is a weighted combination of the background levels computed above for all the zones. Negative values are avoided by thresholding.

We believe that background correction using the MM values contributes a small but significant improvement to results obtained. For example, running ALG2 from Table 1 without the background correction increases the maximum rank from 64 to 115.

Figure 5. The distribution of log(MM) values on an Affymetrix array.

The logarithmic scale
To motivate the problem involved in determining the right transformation to measure gene expression, consider two conditions being studied, each having several replicates. The goal is to determine which genes are differentially expressed between these two conditions. Typically, this analysis proceeds independently for each gene. For a particular gene, a t-test (or a one-way analysis of variance, ANOVA, if there were more conditions) is typically performed. The numerator in this test is calculated from the means of expression levels within each set of replicates, and the denominator is calculated using the standard deviations of expression values within each set of replicates. A standard assumption in this test is that the expression values in each set of replicates are drawn from a normal distribution; furthermore, the two normal distributions need not have the same mean

ution of log(MM) values on an Affymetrix array.


but must necessarily have the same variance. This requires that the variance at different expression levels must be the same; that is, roughly speaking, genes showing a high expression level should show approximately the same amount of variation across arrays as do genes with low expression levels. This assumption turns out to be not true for microarray data, as illustrated in Figure 6.

Figure 6. These two figures show the mean (X axis) and the standard deviation (Y axis) of expression values for each of about 12,000 genes over four replicates in the Affymetrix Latin Square experiment. Expression values are measured on the linear scale in the figure on the left and on the log scale in the figure on the right. The average standard deviation over all 12,000 genes clearly increases in the figure on the left as the expression value increases. In the figure on the right, the average standard deviation stays somewhat constant or weakly decreases as the mean increases for a good stretch, but decreases for very large expression levels.

There are two options to work with the above variance behavior. The first requires devising new statistical tests to handle this behavior, while the second, and possibly simpler, option is to perform appropriate transformations on the data. Converting the data to a logarithmic scale (using base 2) has various advantages (see the Speed page [113]), one of which is that the effect of expression level on the variance is somewhat mitigated, though not completely. The second advantage is of course for visualization; taking logarithms compresses the scale so that more information is visible within a given area.

The disadvantages of a plain logarithmic transform are listed in Durbin et al. [12]. The key operational irritant with the logarithmic transform is that it does not apply to negative values. While it may seem counterintuitive, negative values often do occur in microarray data, largely due to background correction, which often involves subtracting an estimated background value which is larger than the measured expression value itself. For example, the MM value for a probe on an Affymetrix array supposedly detects background noise due to non-specific hybridization and is often larger than its PM counterpart. As argued by Durbin et al. [12], a further disadvantage of the logarithmic transform is that the variance for very low expression levels shoots up.

There are several approaches to overcoming the above mentioned disadvantages of the logarithmic transform. Typically, the problem of negative values is avoided by thresholding or by performing background correction as described earlier. To avoid the problem of low expression levels, Durbin et al. [12], Huber et al. [13], and Munson [14] independently came up with the following generalized logarithmic transform:

f(x) = log((x + sqrt(x^2 + c^2)) / 2)
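A minimal sketch of this generalized logarithmic transform (the function name glog is illustrative; the natural log is used here, matching the formula, though any base only changes a constant factor):

```python
import math

def glog(x, c):
    """Generalized log: defined for all x (including negatives),
    approximately linear near 0 and close to log(x) for large x."""
    return math.log((x + math.sqrt(x * x + c * c)) / 2.0)
```

For example, glog is finite at negative inputs where a plain log is undefined, and agrees closely with log(x) once x is well above c.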



This transformation behaves linearly near 0 and like the logarithmic function for large expression values, and furthermore, is defined even for negative values. However, the estimation of the parameter c requires error modeling of the data, which takes it beyond the realm of simplicity. In practice, working with the plain log transform achieves a good balance between effectiveness and simplicity.

Normalization
Results with and without normalization can be substantially different; for example, in Tables 1 & 2, running ALG2 without background correction or normalization yields a massive maximum rank of 9313 instead of 64. Thus, normalization of data is essential before further analysis. A variety of normalization methods have been studied and used by researchers, all of which attempt to remove non-biological variation between arrays or between dyes on the same array, based on the premise that most genes are not differentially expressed across arrays. While much of the description below is in terms of normalizing across arrays, the techniques described are applicable to normalizing across multiple dyes on an array as well, as will be described at the end of this section. The need for normalization can be discerned by viewing the so-called MVA plot [8] between expression values on two arrays (the MVA plot shows, for each probe, the difference between the expression levels in the two arrays plotted against the average of these two expression levels). The distribution of points on this plot should be around the horizontal 0 line, assuming that most probes do not show differential expression between two arrays (for example, see Figure 7). However, in actual practice, the MVA plot may not be centered around this zero line.
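The MVA coordinates just described can be computed directly; this sketch assumes log base 2 and strictly positive expression values (the function name is illustrative):

```python
import math

def mva_points(array1, array2):
    """For each probe, return (M, A): the difference of the log expression
    values in the two arrays, and their average."""
    pts = []
    for x, y in zip(array1, array2):
        m = math.log2(x) - math.log2(y)
        a = (math.log2(x) + math.log2(y)) / 2.0
        pts.append((m, a))
    return pts
```

Two identical arrays give M = 0 for every probe; a constant two-fold difference shifts every M by one on the log-2 scale.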

Figure 7. MVA plots before and after Lowess normalization. Non-linear curve fitting is essential in this case. The plots show log(Ch1) - log(Ch2) versus (log(Ch1) + log(Ch2))/2 for a 2-channel array.


Table 3. Quantile normalization.

        array1  array2  array3  array4
gene1   10      1000    10      10
gene2   10      10      10      10
gene3   10      10      10      10
gene4   1000    1000    1000    1000

Table 4. Quantile normalization.

        array1  array2  array3  array4
gene1   10      257.5   10      10
gene2   10      10      10      10
gene3   257.5   10      257.5   257.5
gene4   1000    1000    1000    1000

Mean shifting
This method removes variations across arrays by equalizing the means of the various arrays. To make this robust (i.e., not sensitive to outliers), typically a trimmed mean is used, in which the mean is computed after ignoring a certain fraction of the highest and lowest values. Assuming that expression values are measured on a logarithmic or related scale, mean shifting corresponds to global scaling on the linear scale. This normalization is used as part of the Affymetrix MAS4.0 and MAS5.0 algorithms.
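A minimal sketch of trimmed-mean shifting on the log scale, assuming the first array serves as the baseline (the function names and the 10% trim fraction are illustrative):

```python
import numpy as np

def trimmed_mean(x, frac=0.1):
    """Mean after dropping the lowest and highest `frac` of the values."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * frac)
    return x[k:len(x) - k].mean()

def mean_shift_normalize(arrays, frac=0.1):
    """Shift each log-scale array so its trimmed mean matches the first array's."""
    target = trimmed_mean(arrays[0], frac)
    return [np.asarray(a, float) + (target - trimmed_mean(a, frac))
            for a in arrays]
```

A constant additive offset between two arrays on the log scale (i.e., a global scaling factor on the linear scale) is removed exactly.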

Interpolation methods
This is a pairwise normalization method in essence and is used to normalize one array against another. The mean shifting algorithm is suitable when the same amount of shift can be applied at all expression levels. However, there are instances where the MVA plot appears as in Figure 7. Mean shifting will not normalize the data in this case because the amount of shift required at different expression levels is different. One method used in such situations is to fit a straight line or even a piecewise linear curve to the data and shift the expression values on one of the arrays so that this curve moves to the horizontal zero line, as in Figure 7. The need for a non-linear (or piecewise linear) method arises often in the context of 2-dye arrays because of the dependence of dye efficiency on intensity. In such cases, fitting a non-linear curve and straightening it out to the horizontal zero line as in Schadt et al. [15] does the trick. This method is used in the DChip software [111]. One popular method to fit a non-linear curve is the piecewise linear Lowess (or Loess) smoother [114].

As for the case of mean shifting, robustness is an issue. One way to make these interpolation methods more robust is to perform the normalization in multiple steps: first fit a curve, remove a certain fraction of points which are furthest away from the curve and which, therefore, are likely to be outliers, and finally fit a curve again on the remaining points. Multiple runs of these steps can be performed to further increase robustness.
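The multi-step robust fit just described can be sketched as follows, using a straight-line fit for simplicity in place of a full Lowess smoother (the function name and the 20% drop fraction are illustrative):

```python
import numpy as np

def robust_line_normalize(m, a, drop_frac=0.2):
    """Fit M = f(A) by least squares, drop the points farthest from the fit
    (likely outliers), refit on the rest, and return M with the fitted
    trend subtracted, i.e., straightened to the horizontal zero line."""
    m = np.asarray(m, float)
    a = np.asarray(a, float)
    coef = np.polyfit(a, m, 1)                       # first fit
    resid = np.abs(m - np.polyval(coef, a))
    keep = resid.argsort()[: int(len(m) * (1 - drop_frac))]
    coef = np.polyfit(a[keep], m[keep], 1)           # refit on inliers
    return m - np.polyval(coef, a)
```

On data whose M values depend linearly on A, the returned values sit on the zero line, which is exactly the target state of the MVA plot after normalization.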

Remarks
Both the above methods are based on the premise that most genes are not differentially expressed between arrays and that the total mRNA content of different samples is the same. As stated in Hoffmann et al. [16], this need not be true if cells of different sizes or in different stages of the cell cycle are considered, and even though control of this effect is attempted by loading identical amounts of cRNA onto the arrays, there can still be significant variation in the mean expression level across arrays (see Hill et al. [17]). Furthermore, if the experiment has a dramatic effect, for example, causing a diauxic shift, then a good fraction of the genes can actually be differentially expressed. In such situations, it makes sense to perform the above methods (more specifically, either the trimmed mean calculation or curve fitting) on a subset of the probes (called the invariant set) obtained as follows. The probes on each of the arrays are ranked by expression value and probes with similar ranks comprise the invariant set. The trimmed mean or curve fitting is performed on the


Figure 8. A mixture of two distinct profile types.

invariant set only, and then the actual shift or shifts are performed on all the probes.

Quantile normalization
This normalization method attacks the problem from a different angle. Each array contains a certain distribution of expression values, and this method aims at making the distributions across the various arrays not just similar, but identical. This is done as follows: imagine that the expression values from the various arrays have been loaded into a spreadsheet with genes along rows and arrays along columns. First, each column is sorted in increasing order. Next, the value in each row is replaced with the average of the values in this row. Finally, the columns are unsorted (i.e., the effect of the sorting step is reversed, so items in a column go back to wherever they came from). It is easy to see that the distributions in all arrays become identical in this process. This method is used in the RMA software, which is part of the Bioconductor suite [112].
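The sort, average and unsort procedure can be sketched with numpy as follows; on the artificial data of Tables 3 & 4 it reproduces the 257.5 values shown there (a sketch, not the RMA implementation):

```python
import numpy as np

def quantile_normalize(X):
    """Make the value distribution of every column (array) identical.

    X is a genes x arrays matrix; each column is sorted, row averages of
    the sorted columns are taken, and the sorting is then undone."""
    X = np.asarray(X, float)
    n_genes, n_arrays = X.shape
    order = np.argsort(X, axis=0, kind="stable")   # sort each column
    ranks = np.empty_like(order)
    for j in range(n_arrays):                      # invert the sort per column
        ranks[order[:, j], j] = np.arange(n_genes)
    row_means = np.sort(X, axis=0).mean(axis=1)    # average across sorted rows
    return row_means[ranks]                        # unsort
```

After the call, every column contains exactly the same multiset of values, so the distributions are identical by construction.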

Statistically, this method seems to obtain the sharpest normalizations; however, occasional dangerous side effects could result. For example, consider the artificial situation in Tables 3 & 4, where one aberrant value of 1000 for gene 1 can create incorrect values for gene 3 as a result of quantile normalization. While such side effects are very rare, they could go undetected if they do indeed occur. To guard against these, it is best to do quantile normalization before probe aggregation, so that the process of aggregating probes removes any noise created by such side effects.

Comparative analysis of the above methods
Note that the interpolation methods are pairwise methods (i.e., each array is normalized against a chosen baseline array), while the mean shift and quantile methods do not require a baseline array. Furthermore, the mean shift, quantile and linear interpolation methods are much faster than the non-linear interpolation method. Comparative analysis of these methods and some further



Figure 9. Aberrant probes appear flat: Probeset 36889_at from the Latin Square Dataset.

derivatives of the Lowess method designed to eliminate the need for a baseline array is provided in Bolstad et al. [18]. The key conclusion is that the non-linear method and its derivatives which do not require a baseline, and the quantile method, typically perform better than plain unnormalized data or normalizing through mean shifting.

Normalization on two-dye cDNA arrays
For two-dye cDNA arrays, there are usually non-linear relationships between the red and the green channel intensities, and these need to be normalized using a Lowess or related method within each array so that the log R - log G = log(R/G) values are clustered around the horizontal 0 line (in the MVA plot) on each of the arrays (see, e.g., Dudoit et al. [8]). Further normalization across arrays may not be required. However, as mentioned in Dudoit et al. [8], there could be strong spatial or print-tip effects within an array, and thus it may be useful to perform Lowess normalization within each print-tip group. Yang et al. tackle these issues in further detail [19].

Finally, it is worth mentioning that normalization is a challenging area and all sources of variation are not yet understood well enough for accurate statistical modeling.

Probe aggregation
As mentioned above, using several probes per gene and averaging over these probe-sets can indeed reduce noise substantially. Thus, once intensity values have been background corrected, transformed to a logarithmic scale, and normalized, the next task is to average the individual probe profiles for a gene into a gene profile. This averaging needs to be done robustly to minimize outlier effects, as necessitated by the example in Figure 8. This figure shows profiles for probeset 1032_at for the eight arrays being considered from the Latin Square Dataset. The first five



Table 5. The rough maximum rank amongst the 14 spike-in genes computed for the Affymetrix Latin Square Dataset for various numbers of replicates.

No. of replicates   Rough maximum rank
1 Replicate         9000
2 Replicates        2000
3 Replicates        150
4 Replicates        50

To catch all 14 spike-in genes, the number of false positives is roughly 9000 for the fold-change method with one replicate, but drops to about 50 for a t-test with four replicates. A big drop happens when going from two to three replicates. However, note that 11–12 of the 14 genes can be caught with far fewer false positives.

probes behave very differently from the remaining probes. In this case, it is not immediately clear that the first five probes can be removed as outliers, as they comprise close to a third of all probes (these are actually heavily overlapping probes with low complexity repeats, a fact which is not used by any outlier removal algorithm). Nevertheless, the need for outlier removal is indeed clear. There are broadly two categories of probe averaging methods: those which consider one array at a time, and those which consider multiple arrays together.

Single array methods
The simplest method which works with a single array at a time takes the mean or median probe value on that array and considers only those probes whose values are within a certain number of standard deviations from this value. ALG1 and ALG2 mentioned in Tables 1 & 2 use this approach. The MAS5 algorithm uses a related method called one-step Tukey Biweight [110]. This method involves finding the median and weighting the items based on their distance from the median, so that items further away from the median are downweighted. This could actually be run for multiple steps, with the weights computed in each step used to compute a new weighted estimate, reweighting the items until there is no further change. The Affymetrix MAS5.0 algorithm uses only one step of this procedure.
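A sketch of the one-step Tukey Biweight; the tuning constant c = 5 and the small additive term guarding against a zero scale estimate are chosen for illustration and are not necessarily the values used by MAS5.0:

```python
import numpy as np

def tukey_biweight_one_step(x, c=5.0, eps=1e-4):
    """One-step Tukey Biweight: a weighted mean in which items are
    downweighted by their distance from the median (zero weight beyond
    c robust-scale units), so outliers barely move the estimate."""
    x = np.asarray(x, float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))               # robust scale estimate
    u = (x - med) / (c * mad + eps)
    w = np.where(np.abs(u) < 1.0, (1.0 - u * u) ** 2, 0.0)
    return (w * x).sum() / w.sum()
```

A single wild probe value pulls a plain mean far off but leaves this estimate near the bulk of the data.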

Multiple array methods
The key advantage of working with all arrays together is the following. Occasionally, there are probes on Affymetrix arrays which behave very differently from the remaining probes in their respective probe-sets (see Figure 9). These may not always be identifiable when only one array is considered at a time, but clearly stand out when all arrays are considered together. Furthermore, applying a robust algorithm on one array at a time could cause the removal of a probe which shows a consistent and expected profile across arrays but exhibits rather high or low expression values. Therefore, using multiple arrays together could lead to greater robustness. The two notable methods which work with all arrays together are due to Irizarry et al. [11] and Li and Wong [20,21], respectively.

Irizarry et al. [11] model observed probe behavior on the log scale as the sum of a probe specific term, the actual expression value on the log scale, and an independent identically distributed noise term; they then estimate the actual expression value from this model using a robust procedure called Median Polish. This is a very elegant method used in the RMA package and deserves a brief description on account of its simplicity. It comprises the following steps. Consider probes along the rows of a spreadsheet and arrays along the columns.

Median polish steps
• Compute the median of each row and record the value to the side of the row. Subtract the row median from each point in that particular row.
• Compute the median of the row medians, and record the value as the overall effect. Subtract this overall effect from each of the row medians.
• Take the median of each column and record the value beneath the column. Subtract the column median from each point in that particular column.
• Compute the median of the column medians, and add the value to the current overall effect. Subtract this addition to the overall effect from each of the column medians.
• Repeat steps 1–4 until no changes occur with the row or column medians.



The final vector of column medians serves as the aggregate profile for the gene in question.
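The median polish steps above can be sketched as follows; the function returns the overall effect plus the accumulated column effects as the gene's aggregate per-array profile (a sketch of the procedure, not the RMA implementation):

```python
import numpy as np

def median_polish(X, max_iter=20, tol=1e-8):
    """Median polish on a probes x arrays matrix of log intensities."""
    resid = np.asarray(X, float).copy()
    overall = 0.0
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        rm = np.median(resid, axis=1)              # step 1: row medians
        resid -= rm[:, None]
        row_eff += rm
        m = np.median(row_eff)                     # step 2: median of row medians
        row_eff -= m
        overall += m
        cm = np.median(resid, axis=0)              # step 3: column medians
        resid -= cm[None, :]
        col_eff += cm
        m = np.median(col_eff)                     # step 4: median of col medians
        col_eff -= m
        overall += m
        if np.abs(rm).sum() + np.abs(cm).sum() < tol:  # step 5: until stable
            break
    return overall + col_eff                       # aggregate per-array profile
```

On a purely additive matrix (probe effect plus array effect), the recovered profile reproduces the array effects up to a common shift, which is all that matters for differential expression.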

Li and Wong [20,21] use a slightly different model; they model observed probe behavior on the linear scale as the product of a probe affinity term and an actual expression term, along with an additive normally distributed independent error term. The maximum likelihood estimate of the actual expression level is then determined using an estimation procedure which has rules for outlier removal. The outlier removal happens at multiple levels. At the first level, outlier arrays are determined and removed. At the second level, a probe is removed from all the arrays. At the third level, the expression value for a particular probe on a particular array is rejected. These three levels are performed in various iterative cycles until convergence is achieved. This method is incorporated in the DChip package [111].

When comparing probe aggregation between the PM based methods in Tables 1 & 2, the median polishing of RMA seems to do very well in the upper reaches, but the single array methods of ALG1 and ALG2 do much better at lower levels of expression and lower signal to noise ratio.

Statistical hypothesis testing
Once the data at hand have been background corrected, converted to a logarithmic scale, normalized, and probe aggregated, logarithms of the expression levels of each gene (for single dye arrays) or the log-ratios for each gene (for two dye arrays) will be at hand. In what follows, these log values or ratios will be referred to simply as expression values. The next step is to perform statistical hypothesis testing on these values to determine which gene(s) show significant differential expression across two or more groups of replicates (for the case of two-dye arrays, the analogous goal is to determine genes which are differentially expressed between the samples in the two channels).

To explain the issues in statistical hypothesis testing, consider the simple case of two groups of experiments, typically a control group and a treatment group, each group having several replicates. Issues in dealing with multiple groups will be touched upon later.

Why is fold-change not a good measure?
The fold-change measure computes the difference between the group means for each gene (recall that we are working on a logarithmic scale, so fold-changes translate to differences on this scale). A cutoff on this quantity is then used to determine genes which are differentially expressed. However, as explained in Tusher et al. [22], this gives a very large number of false positives. This stems from the fact that most genes are expressed at low levels, where the signal-to-noise ratio is low and, therefore, fold changes occur at random for a large number of genes. Furthermore, at high expression levels, small but consistent changes in expression across arrays are not detected by fold-change. There are better alternatives to a fold-change test, as described below.

The t-test
The standard test performed in such situations is the so-called t-test, which measures the following t-statistic for each gene g (see [23] for example):

t_g = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)

Here, m1, m2 are the mean expression values for gene g within groups 1 and 2, respectively, s1, s2 are the corresponding standard deviations, and n1, n2 are the numbers of arrays in the two groups. Qualitatively, this t-statistic has a high absolute value for a gene if the means within the two sets of replicates are very different and each set of replicates has a small standard deviation.

Thus, the higher the t-statistic is in absolute value, the greater the confidence with which this gene can be declared as being differentially expressed. Note that this is a more sophisticated measure than the commonly used fold-change measure (which would just be m1 - m2 on the log scale) in that it looks for a large fold-change in conjunction with small variances in each group. The power of this statistic in differentiating between true differential expression and differential expression due to random effects increases as the numbers n1 and n2 increase. To identify all differentially expressed genes, one could just sort the genes by their respective t-statistics and then apply a cutoff. However, determining that cutoff value would be easier if the t-statistic could be converted to a more intuitive p-value, which gives the probability that the gene g appears as differentially expressed purely by chance. So a p-value of 0.01 would mean that there is a 1% chance that the gene is not really differentially expressed but random effects have conspired to make it look so. Clearly, the actual p-value for a particular gene will depend on how expression values within each set of replicates are distributed. These distributions may not always be known.
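The t-statistic above can be computed per gene as follows; a plain sketch using sample standard deviations (with the usual n-1 denominator):

```python
import math

def t_statistic(group1, group2):
    """Difference of group means over the combined standard error."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    s1 = math.sqrt(sum((x - m1) ** 2 for x in group1) / (n1 - 1))
    s2 = math.sqrt(sum((x - m2) ** 2 for x in group2) / (n2 - 1))
    return (m1 - m2) / math.sqrt(s1 * s1 / n1 + s2 * s2 / n2)
```

Swapping the two groups flips the sign but not the magnitude, so genes are ranked by the absolute value of the statistic.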



Obtaining p-values
Under the assumption that the expression values for a gene within each group are normally distributed and that the variances of the normal distributions associated with the two groups are the same (recall the earlier discussion on variance stabilization), the above computed t-statistic for each gene follows a so-called t-distribution, from which p-values can be calculated. However, as stated in Hoffmann et al. [16] and Long et al. [24], these assumptions may not always be met (even though the t-test is known to be reasonably robust to departures from normality). To get around the problem of normality (i.e., to obtain a p-value in the absence of knowledge of the distribution of expression levels within each set of replicates), a permutation testing method is sometimes used, as described below.

p-values via permutation tests
As described in Dudoit et al. [8], this method does not assume that the t-statistic computed follows the t-distribution (which it would, but only under the assumptions above); rather, it attempts to actually estimate this distribution. Implementation-wise, this is a simple method, as described below.

Imagine a spreadsheet with genes along the rows and arrays along the columns, with the first n1 columns belonging to the first group of replicates and the remaining n2 columns belonging to the second group of replicates. The left to right order of the columns is now shuffled several times. In each trial, the first n1 columns are treated as if they comprise the first group and the remaining n2 columns are treated as if they comprise the second group; the t-statistic is now computed for each gene with this new grouping. This procedure is ideally repeated C(n1 + n2, n1) times, once for each way of grouping the columns into two groups of size n1 and n2, respectively. However, if this is too expensive computationally, a large enough number of random permutations are generated instead. p-values for genes are now computed as follows.

Recall that each gene has an actual t-statistic, as computed a little earlier, and several permutation t-statistics, computed above. For a particular gene, its p-value is the fraction of permutations in which the t-statistic computed is larger in absolute value than the actual t-statistic for that gene.
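An exhaustive version of this test for a single gene can be sketched as follows; ties with the actual statistic are counted as at least as extreme, which is a common convention (a sketch, not any package's implementation):

```python
import math
from itertools import combinations

def permutation_p_value(values, n1):
    """Exact permutation p-value for one gene.

    `values` holds group 1 in the first n1 positions and group 2 after;
    every way of choosing n1 of the columns is tried as a grouping."""
    def t_stat(g1, g2):
        m1 = sum(g1) / len(g1)
        m2 = sum(g2) / len(g2)
        v1 = sum((x - m1) ** 2 for x in g1) / (len(g1) - 1)
        v2 = sum((x - m2) ** 2 for x in g2) / (len(g2) - 1)
        return (m1 - m2) / math.sqrt(v1 / len(g1) + v2 / len(g2))

    actual = abs(t_stat(values[:n1], values[n1:]))
    count, total = 0, 0
    for idx in combinations(range(len(values)), n1):
        g1 = [values[i] for i in idx]
        g2 = [values[i] for i in range(len(values)) if i not in idx]
        total += 1
        if abs(t_stat(g1, g2)) >= actual:   # at least as extreme as observed
            count += 1
    return count / total
```

A gene with well-separated groups gets a small p-value (limited below by the number of possible groupings), while an overlapping gene gets a large one.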

Adjusting for multiple comparisons and controlling false discoveries
Microarrays usually have genes running into several thousands and tens of thousands. This leads to the following problem. Suppose p-values for each gene have been computed as above and all genes with a p-value of < 0.01 are considered. Let k be the number of such genes. Each of these genes has a less than 1 in 100 chance of appearing to be differentially expressed by random chance. However, the chance that at least one of these k genes appears differentially expressed by chance is much higher than 1 in 100 (as an analogy, consider fair coin tosses: each toss produces heads with a 1/2 chance, but the chance of getting at least one heads in a hundred tosses is much higher). In fact, this probability could be as high as k * 0.01 (or in fact 1 - (1 - 0.01)^k if the p-values for these genes are assumed to be independently distributed). Thus, a p-value of 0.01 for k genes does not translate to a 99 in 100 chance of all these genes being truly differentially expressed; in fact, assuming so could lead to a large number of false positives. To be able to apply a p-value cutoff of 0.01 and claim that all the genes which pass this cutoff are indeed truly differentially expressed with a 0.99 probability, an adjustment needs to be made to these p-values. However, the actual nature of the adjustment has to take dependencies between the various genes into account for it to be effective.

See Dudoit et al. [8] and the book by Glantz [23] for detailed descriptions of various algorithms for adjusting the p-values. The simplest methods are the Bonferroni method and the Sidak method, which are motivated by the discussion in the previous paragraph. In the former, any dependencies in gene behavior are completely ignored and the p-value of each gene is multiplied by n, the total number of genes. In the latter, the p-value of each gene is replaced by 1 - (1 - p)^n, where p is the original p-value for this gene; this method is applicable only if the p-values of the various genes are completely independent of each other. A slightly more sophisticated method is the Holm step-down method, in which genes are sorted in increasing order of p-value and the p-value of the jth gene in this order is multiplied by n - j + 1 to get the new adjusted p-value (so the multiplier for the gene with the smallest p-value is n and for the gene with the largest p-value is 1); this method too ignores dependencies between genes. In typical use, the above methods of p-value adjustment often turn out to be too conservative (i.e., the



p-values end up too high even for truly differentially expressed genes). Furthermore, methods assuming independence do not apply to situations where gene behavior is highly correlated, as is indeed the case in practice. Dudoit et al. [8] recommend the Westfall and Young procedure as a less conservative procedure which handles dependencies between genes.
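The three simple adjustments described earlier (Bonferroni, Sidak and Holm) can be sketched as follows. The Holm variant applies the n - j + 1 multiplier exactly as described, without the monotonicity enforcement some implementations add, and adjusted values are capped at 1 (a common convention, assumed here):

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of genes (capped at 1)."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

def sidak(pvals):
    """Replace each p-value by 1 - (1 - p)^n; valid only for independent genes."""
    n = len(pvals)
    return [1.0 - (1.0 - p) ** n for p in pvals]

def holm(pvals):
    """Step-down: the jth smallest p-value (1-based) is multiplied by n - j + 1."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    for j, i in enumerate(order):       # j is the 0-based rank
        adj[i] = min(1.0, pvals[i] * (n - j))
    return adj
```

For three genes with p-values 0.01, 0.02 and 0.03, Bonferroni triples every value, Sidak gives 1 - 0.99^3 for the first, and Holm applies multipliers 3, 2 and 1 in rank order.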

The Westfall and Young [25] procedure is a permutation procedure in which genes are first sorted by increasing t-statistic obtained on unpermuted data. Then, for each permutation, the t-statistics obtained for the various genes in this permutation are artificially adjusted so that the following property holds: if gene i has a higher original t-statistic than gene j, then gene i has a higher adjusted t-statistic for this permutation than gene j. The overall corrected p-value for a gene is now defined as the fraction of permutations in which the adjusted t-statistic for that permutation exceeds the t-statistic computed on the unpermuted data. Finally, an artificial adjustment is performed on the p-values so that a gene with a higher unpermuted t-statistic has a lower p-value than a gene with a lower unpermuted t-statistic; this adjustment simply increases the p-value of the latter gene, if necessary, to make it equal to that of the former. Though not explicitly stated, a similar adjustment is usually performed with all other algorithms described here as well.

All the above procedures aim at bounding the probability that even one of the genes declared as significant is not actually differentially expressed (this is called the Family-wise Error Rate). Benjamini and Hochberg [26] argue that requiring control of this error rate may be too conservative and suggest using an alternative measure called the False Discovery Rate, which seeks to bound the fraction of genes amongst those declared as significant which are not actually differentially expressed. This method assumes independence of p-values across genes; it orders genes in increasing order of p-value and multiplies the p-value of the jth gene in the above order by n/j. Finally, if genes with adjusted p-value at most p are considered significant, then the expected fraction of false discoveries among these genes is p.
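The n/j scaling above can be sketched directly. One detail the text leaves implicit is a final pass making the adjusted values monotone (the jth adjusted value is taken as the minimum over all larger raw p-values); the sketch below includes it, since the raw n/j products need not be sorted:

```python
import numpy as np

def benjamini_hochberg(p):
    """Benjamini-Hochberg FDR adjustment - a sketch assuming
    independence of p-values across genes, as in [26]."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # jth smallest p-value multiplied by n/j
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downward
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.minimum(scaled, 1.0)
    return adj

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
# [0.005   0.02    0.05125 0.05125 0.6    ]
```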

Dow [27] has studied the effect of various p-value adjustment techniques and, though the results are not conclusive, the Holm step-down method applied in reverse order (called Reverse Holm in [27]) is recommended as an appropriate method to balance control of false positives and false negatives.

Tusher et al. [22] use a variant of the above-described techniques to determine differentially expressed genes and estimate the number of false discoveries (this is part of the significance analysis of microarrays [SAM] package [115]). They perform permutation tests as described above and compute, for each gene, the difference between the actual t-statistic and the average t-statistic over all permutations. Genes beyond a certain threshold of this difference are considered as being differentially expressed. Next, the number of false discoveries is estimated as follows. The smallest t-statistic in absolute value amongst these genes is noted down; call this δ. Then, for each permutation, the number of genes achieving a t-statistic greater than δ in this permutation is counted. The average of this count over all permutations yields the number of false positives. There are three details which have been suppressed in the above description but which should be noted. First, a slight variant of the t-statistic is used, in which an additive term is applied in the denominator to dampen variations at low expression levels. Second, the permutations generated need to be balanced permutations (i.e., each permutation should mix arrays from the two groups in equal numbers). Finally, the δ value and the false positive counts are actually computed separately for induced and repressed genes to allow for some asymmetry in these situations.
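The SAM recipe can be sketched in simplified form as follows. The function name, the fixed fudge factor s0 and the use of unbalanced permutations are illustrative simplifications: the real package chooses s0 from the data, uses balanced permutations, and treats induced and repressed genes separately, as noted above.

```python
import numpy as np

def sam_sketch(x, y, n_perm=100, s0=0.1, delta=2.0, seed=0):
    """Simplified SAM-style analysis for two groups.

    x, y: (n_genes, n1) and (n_genes, n2) expression matrices.
    Returns a boolean mask of called genes and the estimated
    number of false positives.
    """
    rng = np.random.default_rng(seed)
    data = np.hstack([x, y])
    n1 = x.shape[1]

    def d_stat(m):
        a, b = m[:, :n1], m[:, n1:]
        num = a.mean(axis=1) - b.mean(axis=1)
        s = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                    + b.var(axis=1, ddof=1) / b.shape[1])
        return num / (s + s0)      # additive term dampens low-level noise

    d_obs = d_stat(data)
    d_perm = np.empty((n_perm, len(d_obs)))
    for k in range(n_perm):
        d_perm[k] = d_stat(data[:, rng.permutation(data.shape[1])])

    # genes whose statistic differs from its permutation average by > delta
    called = np.abs(d_obs - d_perm.mean(axis=0)) > delta
    if not called.any():
        return called, 0.0
    # cutoff: smallest |d| among called genes; expected false positives =
    # average number of genes exceeding that cutoff under permutation
    cut = np.abs(d_obs[called]).min()
    false_pos = (np.abs(d_perm) > cut).sum(axis=1).mean()
    return called, false_pos

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 6))
y = rng.normal(size=(50, 6))
x[0] += 10.0                       # one strongly induced gene
called, fp = sam_sketch(x, y)
print(called[0], fp)
```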

Number of replicates and the number of false positives
The one factor which is key to the success of all the statistical methods mentioned above is the number of replicates. Dow [27] concludes that at sample sizes lower than 10, the minimum detectable fold-change was higher than the changes induced by the treatment involved in their experiments. Of course, this number will vary depending upon the nature of the experiment. However, increasing the number of replicates can dramatically bring down the number of false positives, as shown by the analysis on the Latin Square Dataset in Table 5.

The question of how many replicates are needed has barely been explored in the literature. The answer depends on several parameters, including the type of statistical test performed, the difference in expression levels to be detected, the number of false positives desired, the cutoff probability, and potentially other unknown parameters associated with the test. For example,


to compute the number of replicates needed to have ten false positives at a p-value cutoff of 0.01 using a regular t-test, even assuming that the genes behave independently, one will need to know the variances in the expression level of each gene over experiments. For specific tests, there has been some research on identifying the unknown parameters associated with the test and then using those to determine the number of replicates needed; for example, see Pan et al. [28].
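The flavor of such a calculation can be illustrated with a normal approximation to the two-sample t-test. The sketch below assumes the per-gene standard deviation is known (which, as noted, is the crux) and uses hypothetical inputs; it is not the mixture-model approach of Pan et al., just a back-of-the-envelope power computation:

```python
import math

def replicates_needed(diff, sigma, alpha=0.01, power=0.9, n_max=100):
    """Smallest per-group sample size n at which a two-sample z-test
    (a normal approximation to the t-test) detects a mean difference
    `diff` with the requested power at two-sided level `alpha`,
    assuming a known per-gene standard deviation `sigma`."""
    def phi(z):                          # standard normal CDF
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def inv_phi(p, lo=-10.0, hi=10.0):   # inverse CDF by bisection
        for _ in range(80):
            mid = (lo + hi) / 2.0
            if phi(mid) < p:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    z_crit = inv_phi(1.0 - alpha / 2.0)
    for n in range(2, n_max + 1):
        # effect size in standard-error units for n replicates per group
        ncp = diff / (sigma * math.sqrt(2.0 / n))
        if 1.0 - phi(z_crit - ncp) >= power:
            return n
    return None

# e.g., a two-fold change (1 unit on the log2 scale) with sigma = 1:
print(replicates_needed(diff=1.0, sigma=1.0))   # 30
```

The answer drops quickly as the detectable difference grows, which matches the observation that the required replication depends on the experiment.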

Other tests and experiment types
The t-test in its original form is a parametric test (i.e., it relies on the normality assumption). The Mann-Whitney rank-sum test [23] is a non-parametric test (i.e., it does not rely on the normality assumption) applicable to the two-group situation. However, a larger number of replicates is typically desirable [16]. See Pan [29] for a review and comparative analysis of some more methods, including a regression modeling approach, a mixture modeling approach, and the SAM approach described above.
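As a concrete illustration of the non-parametric alternative, the following is a minimal rank-sum test using the normal approximation and midranks for ties (a sketch; production code would use an exact distribution for small samples and a tie correction):

```python
import math

def rank_sum_pvalue(x, y):
    """Two-sided Mann-Whitney rank-sum test via the normal
    approximation (no tie correction). x and y are lists of
    expression values for the two groups."""
    combined = sorted(x + y)
    # midranks: equal values share the average of their 1-based ranks
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + j + 1) / 2.0
        i = j
    r1 = sum(ranks[v] for v in x)          # rank sum of the first group
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2.0          # mean of r1 under the null
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (r1 - mu) / sigma
    # two-sided p-value from the standard normal tail
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

print(rank_sum_pvalue([10, 11, 12, 13], [1, 2, 3, 4]))  # small p-value
print(rank_sum_pvalue([1, 2], [1, 2]))                  # p = 1.0
```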

More complicated experimental designs need more sophisticated tests. For more than two groups, the parametric ANOVA test or the non-parametric Kruskal-Wallis test are applicable. For the case when the groups involve multiple treatments on the same individuals, the paired t-test and the Wilcoxon Sign-Rank test are used for two groups and the Repeated Measures test for multiple groups. See [23] for details on these tests. Finally, a special case of only one group of replicates arises for two-dye arrays. This is like a paired t-test, where the t-statistic computed is simply:

t = m1 / (s1 / √n1)

Here m1 is the mean value within the group, s1 is the corresponding standard deviation, and n1 is the number of replicates. Much of the above discussion on t-tests is applicable here as well, although the use of permutation tests as described above is not obviously applicable.
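For a given gene on a two-dye array, this one-group statistic is a one-liner over the replicated log-ratios (a minimal sketch; the function name is ours):

```python
import math

def one_group_t(log_ratios):
    """t-statistic for a single group of replicated log-ratios, as for
    two-dye arrays: t = m / (s / sqrt(n)), testing whether the mean
    log-ratio differs from zero."""
    n = len(log_ratios)
    m = sum(log_ratios) / n
    # sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((v - m) ** 2 for v in log_ratios) / (n - 1))
    return m / (s / math.sqrt(n))

print(one_group_t([1.1, 0.9, 1.3, 0.7]))  # ≈ 7.746: strongly positive
```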

Remarks
Statistical testing for microarrays is a ripe research area with algorithms getting increasingly sophisticated; however, there do not seem to be any unique winners at this point. The absence of common benchmarks and standards makes it harder to compare these algorithms against each other.

Outlook
The field of microarray analysis has moved beyond the initial simple approaches and is becoming increasingly sophisticated as more powerful algorithms are being used to increase sensitivity and specificity. Unfortunately, several of these algorithms require a fair amount of statistical expertise and are therefore inaccessible to a typical user. This will change with time as these procedures get more accurate and are standardized and incorporated in public and commercial tools, putting more of these techniques in the hands of the person running the experiment. However, there is a pressing need for setting up benchmarks on all standard microarray platforms and keeping these benchmarks current as time passes. Once standardization happens, the emphasis will shift from algorithms for determining differentially expressed genes accurately to algorithms for recreating the biological processes at a systemic level from this information.

Highlights

• Careful attention to analysis methods is needed to analyze microarray data. Wrong choices can increase the number of false positives by several-fold.
• While oligonucleotide arrays show less variability than cDNA arrays, oligonucleotide behavior at low expression levels is often choppy, with a lot of noise. Using multiple probes (as in the case of Affymetrix arrays) and averaging over these probes does seem to attenuate this noise.
• Variations in the image analysis segmentation and background correction algorithm can often change the background-corrected value of a spot substantially; thus, accurate spot segmentation (as opposed to fitting spots with circles) and robust background correction are important.
• Normalization across arrays and dyes is absolutely vital. Simple scaling and linear approaches are not sufficient and non-linear approaches are needed in several cases.
• On Affymetrix arrays, analysis methods based on the PM intensities seem to perform better than those based on subtracting the mismatch intensity from the perfect match intensity.
• Having sufficient replication in arrays is vital in reducing the number of false positives, though methods for predicting the number of replicates required for various array platforms have barely been studied.
• Fold-change is not a good measure of differential expression. More sophisticated statistical tests can give far more accurate results but these typically require a fair amount of familiarity with statistical analysis.
• Further comparative analysis of various algorithms is hampered by the lack of common benchmarks on all standard microarray platforms; establishing benchmark data and spike-in-like data sets for all standard platforms can help in defining and standardizing analysis protocols.



Bibliography
Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.
1. Bassett DE Jr, Eisen NM, Boguski MS: Gene expression informatics – it's all in your mine. Nat. Genet. 21, 51-55 (1999).

2. Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Nucleic Acids Res. 28(22), 4552-4557 (2000).

3. Santalucia J Jr: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA 95, 1460-1465 (1998).

4. Breslauer KJ, Frank R, Blocker H, Marky LA: Predicting DNA duplex stability from the base sequence. Proc. Natl. Acad. Sci. USA 83, 3746-3750 (1986).

5. Owczarzy R, Vallone PM, Gallo FJ, Paner TM, Lane MJ, Benight AS: Predicting sequence-dependent melting stability of short duplex DNA oligomers. Biopolymers 44, 217-239 (1997).

6. Yang YH, Buckley MJ, Dudoit S, Speed TP: Comparison of methods for image analysis on cDNA microarray data. J. Comput. Graphical Stat. 11, 108-136 (2002).

•• Comprehensive survey of image analysis issues for spotted arrays.

7. Adams R, Bischof L: Seeded region growing. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 641-647 (1994).

8. Dudoit S, Yang H, Callow MJ, Speed TP: Statistical methods for identifying genes with differential expression in replicated cDNA experiments. Stat. Sin. 12(1), 111-139 (2002).

9. Soille P: Morphological image analysis: principles and applications. Springer (1999).

10. Hartemink A, Gifford D, Jaakkola T, Young R: Maximum likelihood estimation of optimal scaling factors for expression array normalization In: SPIE BIOS (2001).

11. Irizarry, RA, Hobbs B, Collin F et al.: Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2), 249-264 (2003).

•• A good description of methods for analysing Affymetrix data.

12. Durbin BP, Hardin JS, Hawkins DM: A variance-stabilizing transformation for gene expression microarray data. Bioinformatics 18, 105-110 (2002).

13. Huber W, von Heydebreck A, Sültmann H, Poustka A: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18(Suppl. 1), 96-104 (2002).

14. Munson P: A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance stabilizing transforms. GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

15. Schadt E, Li C, Eliss B, Wong WH: Analysing high-density oligonucleotide gene expression array data. J. Cell Biochem. 84(37), 120-125 (2000).

16. Hoffmann R, Seidl T, Dugas M: Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biol. 3(7) 0033.1-0033.11 (2002).

17. Hill AA, Brown EL, Whitley MZ et al.: Evaluation of normalization procedures for Oligonucleotide array data based on spiked cRNA controls. Genome Biol. 2 0055.1-0055.13 (2001).

18. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 2, 185-193 (2003).

•• Comparison of various normalization methods.

19. Yang YH, Dudoit S, Luu P et al.: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, 4 (2002).

20. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA 98, 31-36 (2000).

•• Description of the popular Li-Wong method for analysing Affymetrix arrays.

21. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2(8) 0032.1-0032.11 (2001).

22. Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116-5121 (2001).

23. Glantz S: Primer of Biostatistics (5th edition). McGraw-Hill (2002).

24. Long AD, Mangalam HJ, Chan BY, Tolleri L, Hatfield GW, Baldi P: Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J. Biol. Chem. 276(23), 19937-19944 (2001).

25. Westfall PH, Young SS: Resampling based multiple testing. John Wiley & Sons, New York (1993).

26. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289-300 (1995).

27. Dow GS: Effect of sample size and p-value filtering techniques on the detection of transcriptional changes induced in rat neuroblastoma (NG108) cells by mefloquine. Malar. J. 2(1), 4 (2003).

28. Pan W, Lin J, Le CT: How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol. 3(5) 0022.1-0022.10 (2002).

29. Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 12, 546-554 (2002).

Websites
101. http://www.affymetrix.com/analysis/download_center2.affx Affymetrix Latin Square Data.

102. http://qolotus02.genelogic.com Gene Logic Latin Square Data.

103. http://www.mwg-biotech.com/html/d_support/d_faq.shtml MWG Biotech.

104. http://www.strandgenomics.com/products/sarani/slast.html Strand Genomics Sarani.

105. http://www.strandgenomics.com/products/sarani/overview.html Strand Genomics Sarani.

106. http://www.operon.com/arrays/poster.php Bosch JT, Seidel S, Batra, Lam H, Tuason N, Saljoughi S, Saul R: Validation of sequence-optimized 70 base oligonucleotides for use on DNA microarrays [Poster].

107. http://experimental.act.cmis.csiro.au/Spot/index.php SPOT.

108. http://www.strandgenomics.com/products/chitraka/overview.html Strand Genomics Chitraka.

109. http://ftp.isds.duke.edu/WorkingPapers/02-05.html Zuzan H, Blanchette C, Dressman H et al.: Estimation of probe cell locations in high-density synthetic-oligonucleotide DNA microarrays [Working Paper]. Institute of Statistics and Decision Sciences.

110. http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf Statistical Algorithms Description Document, Affymetrix, Inc.

111. http://www.biostat.harvard.edu/complab/dchip DChip: The DNA Chip Analyzer.


112. http://www.bioconductor.org The Bioconductor webpage.

113. http://stat-www.berkeley.edu/users/terry/zarray/html/log.html Speed T: Always log spot intensities and ratios, Speed Group Microarray Page.

114. http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm The Lowess method.

115. http://www-stat.stanford.edu/~tibs/SAM/ Significance Analysis of Microarrays.

116. http://www.strandgenomics.com/products/soochika/overview.html Strand Genomics Soochika.
