
Bioinformatics

Statistics 6570: Statistical Bioinformatics

Spring 2011

April 27, 2011

Project work

Binod Manandhar

Utah State University

Preprocessing Two-color Spotted Arrays

Preprocessing of two-color spotted arrays can be broadly divided into two main categories:

• Quality assessment

• Normalization

Exploratory tools to assess data quality:

• MA plots, spatial plots, boxplots

Remove sources of systematic variation:

• Normalization

Dataset: The dataset used here is a subset of a larger dataset described in Rodriguez et al. (2004), “Differential gene expression by integrin β7+ and β7− memory T helper cells”.

The actual data is available from the Gene Expression Omnibus (GEO) maintained by the NCBI, with accession number GSE1039.

References

Robert Gentleman et al., “Bioinformatics and Computational Biology Solutions Using R and Bioconductor”, Chapter 4: “Preprocessing Two-Color Spotted Arrays”, by Y.H. Yang and A.C. Paquet.

Preprocessing / Two color array

Preprocessing: The term preprocessing is often used to refer to the quality assessment and normalization of the microarray data.

Two-color spotted microarray: The relative abundance of complementary target molecules in a pair of samples can be assessed by monitoring the differential hybridization to the array. For mRNA samples, the two samples (target molecules) are reverse transcribed into cDNA, labeled with different fluorescent dyes (usually a red-fluorescent dye, Cyanine 5 or Cy5, and a green-fluorescent dye, Cyanine 3 or Cy3), then mixed in equal proportions and hybridized to the arrayed DNA probes. After this competitive hybridization, the slides are imaged with a scanner, and fluorescence measurements are made separately for each dye at each spot on the array.

Load library and data

Microarrays: In total, 6 microarrays are used in our analysis. The type of microarray used for this experiment is called a cDNA microarray. This type of array is constructed by printing cDNA onto spots on the array.

Access data: The beta7 object in the beta7 package contains all the information from the GenePix files of the integrin data.

library(limma)         # linear models (regression) and normalization
library(arrayQuality)
library(beta7)

rawData <- beta7       # marrayRaw object: 23184 spots x 6 arrays
TargetInfo <- read.marrayInfo("C:/Documents and Settings/Binod/My Documents/Bioinformatics STAT 6570/Project/integrinbeta7/TargetBeta7.txt")
TargetInfo

Getting the Sample Info

FileNames       Subject ID #   Cy3    Cy5    Hyb buffer         Hyb Temp (deg C)
6Hs.195.1.gpr   1              b7 -   b7 +   Ambion Hyb Slide   55
6Hs.168.gpr     3              b7 +   b7 -   Ambion Hyb Slide   55
6Hs.166.gpr     4              b7 +   b7 -   Ambion Hyb Slide   55
6Hs.187.1.gpr   6              b7 -   b7 +   Ambion Hyb Slide   55
6Hs.194.gpr     8              b7 -   b7 +   Ambion Hyb Slide   55
6Hs.243.1.gpr   11             b7 +   b7 -   Ambion Hyb Slide   55

FileNames       Hyb Time (h)   Date of Blood Draw   Amplification   Slide Type    Date of Scan
6Hs.195.1.gpr   40             2002.10.11           R2 aRNA         Aminosilane   2003.07.25
6Hs.168.gpr     40             2003.01.16           R2 aRNA         Aminosilane   2003.08.07
6Hs.166.gpr     40             2003.01.16           R2 aRNA         Aminosilane   2003.08.07
6Hs.187.1.gpr   40             2002.09.16           R2 aRNA         Aminosilane   2003.07.18
6Hs.194.gpr     40             2002.09.18           R2 aRNA         Aminosilane   2003.07.25
6Hs.243.1.gpr   40             2003.01.13           R2 aRNA         Aminosilane   2003.08.06


Summary – marrayLayout Class

mraw <- read.GenePix(path = "C:/Documents and Settings/Binod/My Documents/Bioinformatics STAT 6570/Project/integrinbeta7", targets = TargetInfo)
summary(mraw)

Pre-normalization intensity data: Object of class marrayRaw.

Number of arrays: 6 arrays.

A) Layout of spots on the array: Array layout: Object of class marrayLayout.

Total number of spots: 23184
Dimensions of grid matrix: 12 rows by 4 cols
Dimensions of spot matrices: 23 rows by 21 cols

Currently working with a subset of 23184 spots.

Control spots:

There are 5 types of controls:

Buffer   Empty   Negative   Positive   probes
     3    1328        225        204    21424

Or, using R code:

class(rawData)                         # gives class "marrayRaw"
rawData@maLayout@maNspots              # gives number of spots: 23184
levels(rawData@maLayout@maControls)

[1] "Buffer" "Empty" "Negative" "Positive" "probes"

Annotation

Gene names

> mraw@maGnames[1:2,]
An object of class "marrayInfo"
@maLabels
[1] "H200000297" "H200000303"

@maInfo
                   ID
H200000297 H200000297
H200000303 H200000303

           Name
H200000297 OVGP1 - Oviductal glycoprotein 1, 120kD (mucin 9, oviductin)
H200000303 TAF1 - TAF1 RNA polymerase II, TATA box binding protein (TBP)-associated factor, 250 kD

marrayRaw Class

> slotNames(mraw)

[1] "maRf" "maGf" "maRb" "maGb"
[5] "maW" "maLayout" "maGnames" "maTargets"
[9] "maNotes"

Of these, the first 5 hold the basic quantification information extracted from the GPR files. Each of them is a 23184 by 6 matrix. The others hold the associated layout and annotation information. We will extract these to find out a bit more about them.

Array Summaries

FileNames        Min.   1st Qu.   Median   Mean    3rd Qu.   Max.   NA's
6Hs.195.1.gpr   -6.13   -1.00     -0.52    -0.50   -0.08     5.95   3415
6Hs.168.gpr     -7.08   -0.80     -0.21    -0.23    0.34     5.19   2839
6Hs.166.gpr     -7.07   -1.25     -0.64    -0.62   -0.02     6.15   3440
6Hs.187.1.gpr   -9.81   -0.92     -0.60    -0.55   -0.25     5.00   2942
6Hs.194.gpr     -5.93    0.00      0.44     0.53    0.90     7.74   6090
6Hs.243.1.gpr   -6.38   -1.13     -0.69    -0.64   -0.21     7.05   2227

> summary(mraw)

Summary statistics for log-ratio Cy5/Cy3 distribution
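The NA's column above counts spots whose log-ratio is undefined. The following is a minimal base R sketch on simulated intensities (not the beta7 data). It assumes that a log-ratio becomes NA whenever the background-subtracted signal in either channel is not positive; the names log_ratio_summary, Rf, Gf, Rb and Gb are illustrative only.

```r
# Simulated foreground and background intensities: 1000 spots on 3 arrays.
set.seed(1)
n_spots <- 1000
n_arrays <- 3
Rf <- matrix(rexp(n_spots * n_arrays, rate = 1 / 500), n_spots, n_arrays)
Gf <- matrix(rexp(n_spots * n_arrays, rate = 1 / 500), n_spots, n_arrays)
Rb <- matrix(100, n_spots, n_arrays)  # constant background (assumption)
Gb <- matrix(100, n_spots, n_arrays)

# Per-array summary of M = log2(R / G) after background subtraction.
log_ratio_summary <- function(Rf, Gf, Rb, Gb) {
  R <- Rf - Rb
  G <- Gf - Gb
  R[R <= 0] <- NA  # the log is undefined for non-positive signal
  G[G <= 0] <- NA
  M <- log2(R / G)
  t(apply(M, 2, function(m) c(
    Min    = min(m, na.rm = TRUE),
    Q1     = unname(quantile(m, 0.25, na.rm = TRUE)),
    Median = median(m, na.rm = TRUE),
    Mean   = mean(m, na.rm = TRUE),
    Q3     = unname(quantile(m, 0.75, na.rm = TRUE)),
    Max    = max(m, na.rm = TRUE),
    NAs    = sum(is.na(m))
  )))
}

stats <- log_ratio_summary(Rf, Gf, Rb, Gb)
print(round(stats, 2))
```

A large NA count for one array, relative to the others, is a hint that many spots on that array are at or below background.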

Diagnostic plots of spot statistics

Quality assessment

Before analyzing the data, it is important to check whether the experimental data are of acceptable quality or whether some hybridizations need to be repeated.

Functions in R have been developed to assess the quality of individual microarrays stored in marrayRaw objects.

The purpose of this part is to show some graphical methods in the context of two-color arrays:

• Boxplot

• maPlot

• Color images

1. Spatial inspection

The function image can be used to spatially visualize the arrays and thus may reveal artifacts on the array.

# Visualize the arrays with image
head(rawData@maGf)                  # green foreground variable
image(rawData[, 3], xvar = "maGf")  # green foreground image, array 3
image(rawData[, 3], xvar = "maRf")  # red foreground image, array 3

Spatial inspection (contd…): Low quality spots

# Use the weights to highlight low quality spots
flags <- rawData@maW[, 3] < (-50)   # TRUE if the weight is less than -50 for array 3
# Image overlay of the spots with weight less than -50 for array 3
image(rawData[, 3], xvar = "maGf", overlay = flags)

Spatial inspection (contd…): Empty spots

# Highlight the empty spots
flags <- rawData@maLayout@maControls == "Empty"
image(rawData[, 3], xvar = "maGf", overlay = flags)

Note the large overlap with the quality flags defined in the previous section.
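The overlap between the two sets of flags can also be quantified with a cross-tabulation. This is a small sketch on simulated flags, not the beta7 weights; the assumption that most empty spots receive a weight below -50 is built into the simulation purely for illustration.

```r
# Simulated control types and spot weights for 2000 spots.
set.seed(2)
n <- 2000
controls <- sample(c("Empty", "probes"), n, replace = TRUE, prob = c(0.1, 0.9))
# Assumption for illustration: empty spots are mostly flagged (weight -100).
weights <- ifelse(controls == "Empty",
                  sample(c(-100, 0), n, replace = TRUE, prob = c(0.9, 0.1)),
                  sample(c(-100, 0), n, replace = TRUE, prob = c(0.05, 0.95)))

low_quality <- weights < (-50)
is_empty <- controls == "Empty"

# 2 x 2 table of quality flag versus empty-spot flag.
overlap <- table(low_quality = low_quality, is_empty = is_empty)
print(overlap)

# Fraction of empty spots that also carry the low quality flag.
frac_empty_flagged <- mean(low_quality[is_empty])
print(frac_empty_flagged)
```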

2. maBox plot inspection: distribution of intensities between arrays

Boxplots summarize data with five values: smallest observation, lower quartile (Q1), median, upper quartile (Q3), and largest observation (plus any outliers).

# Box plot: distribution of intensities between arrays
# las = 2   --> axis labels perpendicular to the axis
# log = "y" --> y-axis on a log scale
boxplot(rawData, yvar = "maGf", las = 2,
        log = "y", main = "Boxplot: Intensity between arrays")
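The five values behind each box can be computed directly. The sketch below uses simulated green-channel intensities rather than the beta7 data, and base R's fivenum, which returns Tukey's hinges (close to, but not always identical to, the quartiles).

```r
# Simulated green foreground intensities: 200 spots on 3 arrays.
set.seed(3)
green <- matrix(2^rnorm(600, mean = 9, sd = 1.5), ncol = 3,
                dimnames = list(NULL, paste0("array", 1:3)))

# Five-number summary per array, on the log2 scale.
five <- apply(log2(green), 2, fivenum)
rownames(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(round(five, 2))

# Matching boxplot with a log-scaled y-axis.
boxplot(as.data.frame(green), log = "y", las = 2,
        main = "Simulated green foreground")
```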

Box plot inspection: distribution of intensities between plates

There are 61 plates. To view the differences between plates, type:

# Differences between plates
par(mar = c(5, 3, 3, 3), cex.axis = 0.4)
boxplot(rawData[, 3], xvar = "maPlate", yvar = "maGf", outline = FALSE,
        las = 2, log = "y")

As can be seen in the plots, the last plates have a much lower intensity distribution in this microarray. This is due to the overrepresentation of negative control spots on these plates.

3. MA-plot

MA-plots

We start from a scatter plot with the Cy5 signal on the x-axis and the Cy3 signal on the y-axis: if a spot has equal intensity in red and green, the point representing that spot lies along the line x = y.

# Scatter plot: red (Cy5) foreground vs. green (Cy3) foreground
plot(x = rawData@maRf[, 3], y = rawData@maGf[, 3], pch = ".",
     main = "Array 3: Cy5 vs. Cy3")      # array 3
plot(x = rawData@maRf, y = rawData@maGf, pch = ".",
     main = "All arrays: Cy5 vs. Cy3")   # all arrays

The distribution of this ratio: To see whether a gene (that is, a spot) has higher expression in the green- or red-labeled sample, we can look at ratios. From the histogram of the raw ratios one can see that taking the logarithm would make the distribution more symmetrical:

# Histogram: distribution of the ratio of Cy5 to Cy3
hist(rawData@maRf[, 3] / rawData@maGf[, 3],
     main = "Distribution of the ratio of Cy5 & Cy3",
     xlab = "Cy5/Cy3", ylab = "Frequency")

# Histogram on the log2 scale
hist(log2(rawData@maRf[, 3] / rawData@maGf[, 3]),
     main = "Distribution of the M values [log2 of the Cy5/Cy3 ratio]",
     xlab = "log2[Cy5/Cy3]", ylab = "Frequency")
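Why the logarithm helps can be checked numerically. This is a sketch on simulated intensities, assuming log-normally distributed channels; skewness is a small helper defined here, not a function from any package used in these slides.

```r
# Simulated red and green intensities with no true differential expression.
set.seed(4)
R <- 2^rnorm(5000, mean = 10, sd = 1)
G <- 2^rnorm(5000, mean = 10, sd = 1)
ratio <- R / G

# Sample skewness: 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(ratio)        # strongly right-skewed
skew_log <- skewness(log2(ratio))  # close to 0
print(c(raw = skew_raw, log2 = skew_log))

# Two-fold down and two-fold up are symmetric around 0 only on the log scale.
log2(c(0.5, 1, 2))  # -1 0 1
```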

The MA-plot: a 45-degree rotation of the RG-plot

The log of a ratio is often referred to as the M-value. The letter M is used because the log of R/G is equal to log(R) Minus log(G).

The usual scatter plots are R vs. G, or log2(R) vs. log2(G).

The preferred plot is M vs. A. An MA-plot amounts to a 45° clockwise rotation of the log2(R) vs. log2(G) plot, followed by scaling.

# MA plot
# M = log2(R) - log2(G)
# A = (log2(R) + log2(G)) / 2
plot(y = log2(rawData@maRf[, 3] / rawData@maGf[, 3]),
     x = (log2(rawData@maRf[, 3]) + log2(rawData@maGf[, 3])) / 2,
     pch = ".", main = "MA plot", xlab = "A", ylab = "M")
abline(h = 0)
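The definitions of M and A, and the rotation claim, can be verified on a few hand-picked intensities. The helper ma_values is illustrative and not part of marray or limma.

```r
# M and A from red and green intensities.
ma_values <- function(R, G) {
  list(M = log2(R) - log2(G),
       A = (log2(R) + log2(G)) / 2)
}

R <- c(1024, 256, 4096)
G <- c(256, 256, 1024)
ma <- ma_values(R, G)
ma$M  # 2 0 2
ma$A  # 9 8 11

# Rotate the points (x = log2 G, y = log2 R) clockwise by 45 degrees.
theta <- pi / 4
rot_cw <- matrix(c(cos(theta), -sin(theta),
                   sin(theta),  cos(theta)), nrow = 2)  # filled column-wise
rotated <- rot_cw %*% rbind(log2(G), log2(R))

# After rotation, the x-coordinate is sqrt(2) * A and the y-coordinate
# is M / sqrt(2), which is the "followed by scaling" step.
A_from_rotation <- rotated[1, ] / sqrt(2)
M_from_rotation <- rotated[2, ] * sqrt(2)
```

The rotation places the line R = G on the horizontal axis M = 0, which makes deviations from equal intensity easier to see.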

A common effect in 2-dye microarrays: The intensity distributions of the dyes within an array are not equal, even when the same sample is used for red and green. This often results in a shift of means but also in a nonlinear effect, dependent on the actual intensity. To illustrate this we can plot the M-value on the y-axis and the average log intensity (A-value) on the x-axis. This plot is actually a rotation of 45 degrees of the RG-plot and is known as the MA-plot.

Normalization

Normalization is necessary before any analysis that involves within- or between-slide comparisons of intensities.

The purpose of normalization is to identify and remove systematic technical variation while retaining the biological signal.

Normalization identifies and removes the effects of systematic variation in the measured fluorescence intensities other than differential expression, for example:

– different labeling efficiencies of the dyes

The imbalance in the red and green intensities is usually not constant across the spots within and between arrays, and can vary according to overall spot intensity, location, plate origin etc.

These factors should be considered in the normalization.

Normalization

For normalization we usually use a different class, RGList, which can also store batches of array data and their annotation. This class is used by the package limma, which contains many normalization functions. Luckily, conversion methods are available between objects of class marrayRaw and RGList.

Within-array normalization: Within-array normalization adjusts the within-array contrasts M using values of A as well as other factors such as print-tip and spatial position. It adjusts the center and spread of the distribution of the log-ratios (M), while taking A into account. Most of these methods are based on loess (LOcal regrESSion). The approach can best be visualized in the M versus A plot:

# Within-array normalization
library(limma)    # contains the normalization functions
library(convert)  # conversion between marrayRaw and RGList
marrayRG <- as(rawData, "RGList")

M <- log2(marrayRG$R[, 3] / marrayRG$G[, 3])
A <- (log2(marrayRG$R[, 3]) + log2(marrayRG$G[, 3])) / 2
if (dev.cur() > 1) dev.off()   # close the current graphics device, if one is open
plot(A, M, pch = ".", main = "MA plot")
lines(lowess(A, M), col = "red")
abline(h = 0)
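The idea behind loess normalization is to subtract the fitted red curve from M, so that the remaining log-ratios are centered on zero at every intensity. The following base R sketch illustrates this on simulated data; it uses lowess as a stand-in and is not limma's implementation, and loess_normalize is an illustrative name.

```r
# Simulated MA data with an intensity-dependent dye bias.
set.seed(5)
n <- 3000
A <- runif(n, 6, 14)
dye_bias <- 1.5 - 0.02 * (A - 6)^2       # bias shrinks as intensity grows
M <- dye_bias + rnorm(n, sd = 0.3)

# Subtract the lowess trend of M on A from every spot.
loess_normalize <- function(A, M, span = 0.3) {
  fit <- lowess(A, M, f = span)
  # lowess returns the fit at sorted A; interpolate back to the original order.
  trend <- approx(fit$x, fit$y, xout = A, ties = mean)$y
  M - trend
}

M_norm <- loess_normalize(A, M)

par(mfrow = c(1, 2))
plot(A, M, pch = ".", main = "Before")
lines(lowess(A, M), col = "red"); abline(h = 0)
plot(A, M_norm, pch = ".", main = "After")
lines(lowess(A, M_norm), col = "red"); abline(h = 0)
```

A smaller span follows the trend more closely but also risks removing real signal, so the choice is a trade-off.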

Normalization

For the integrin dataset, we perform a loess normalization for each print-tip group using the built-in function normalizeWithinArrays. To get a new object containing the normalized data:

# Normalize: print-tip loess normalization, no background correction
normLimma <- normalizeWithinArrays(marrayRG, method = "printtiploess",
                                   bc.method = "none", weights = NULL)

# Plot the normalized data
plot(normLimma$A[, 3], normLimma$M[, 3], pch = ".",
     main = "MA plot, within-array normalization")
lines(lowess(normLimma$A[, 3], normLimma$M[, 3]), col = "red")
plotMA3by2(normLimma, zero.weights = TRUE)  # writes MA-plots for all arrays to image files
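Print-tip loess applies the same curve subtraction separately within each print-tip group (the arrays here have a 12 by 4 grid, so 48 groups). The sketch below shows the idea on simulated data with only 4 groups; it uses base R's lowess and is not the limma implementation, and printtip_loess is an illustrative name.

```r
# Simulated MA data with a different bias in each of 4 print-tip groups.
set.seed(6)
n_tips <- 4
spots_per_tip <- 500
tip <- rep(seq_len(n_tips), each = spots_per_tip)
A <- runif(n_tips * spots_per_tip, 6, 14)
tip_offset <- c(-0.6, -0.2, 0.3, 0.7)[tip]
M <- tip_offset + 0.05 * (A - 10) * tip + rnorm(length(A), sd = 0.25)

# Fit and subtract a separate lowess curve within each print-tip group.
printtip_loess <- function(A, M, tip, span = 0.4) {
  M_norm <- numeric(length(M))
  for (g in unique(tip)) {
    idx <- which(tip == g)
    fit <- lowess(A[idx], M[idx], f = span)
    trend <- approx(fit$x, fit$y, xout = A[idx], ties = mean)$y
    M_norm[idx] <- M[idx] - trend
  }
  M_norm
}

M_norm <- printtip_loess(A, M, tip)

medians_before <- tapply(M, tip, median)
medians_after <- tapply(M_norm, tip, median)
print(round(rbind(before = medians_before, after = medians_after), 3))
```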

Normalization

par(mfrow = c(2,3), ps=10)

plot(normLimma$A[, 1], normLimma$M[, 1], pch = ".", main="array-1")






Between-array normalization

Information from the distributions in all arrays is used to normalize each array.

Scale differences can be caused by small changes in scanner settings.

Between-array normalization addresses this effect by improving the comparability of the distributions of log intensities or log ratios between arrays.

• One solution: scale the arrays so that the median absolute deviation of the M-values and A-values is the same across arrays

• Another solution: quantile normalization
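Both solutions can be sketched in a few lines of base R. The code below works on a simulated matrix of M-values with different spreads per array and assumes no missing values; scale_mad and quantile_normalize are illustrative helpers, not the limma functions, whose implementation details may differ.

```r
# Simulated M-values: 1000 spots on 3 arrays with different spreads.
set.seed(7)
M <- cbind(a1 = rnorm(1000, sd = 0.5),
           a2 = rnorm(1000, sd = 1),
           a3 = rnorm(1000, sd = 2))

# Scale normalization: give every array the same median absolute deviation.
scale_mad <- function(x) {
  mads <- apply(x, 2, mad)
  target <- exp(mean(log(mads)))       # geometric mean of the MADs
  sweep(x, 2, mads / target, "/")
}

# Quantile normalization: give every array the same empirical distribution.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)        # mean of the k-th smallest values
  apply(ranks, 2, function(r) reference[r])
}

M_scaled <- scale_mad(M)
M_quant <- quantile_normalize(M)

print(round(apply(M, 2, mad), 3))
print(round(apply(M_scaled, 2, mad), 3))
print(round(apply(M_quant, 2, quantile, probs = c(0.25, 0.5, 0.75)), 3))
```

Scaling only changes the spread of each array, whereas quantile normalization replaces each value by a reference quantile and therefore changes the whole shape of the distribution.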

Between-array normalization: log intensities and log ratios in the marrayRG, normLimma, scaleLimma and quantileLimma objects

# Scale normalization: same median absolute deviation of the M-values and A-values
scaleLimma <- normalizeBetweenArrays(normLimma, method = "scale")

# Quantile normalization
quantileLimma <- normalizeBetweenArrays(normLimma, method = "quantile")

# MA plots for raw, loess, scale and quantile normalization
# MA.RG converts the RGList to an MAList of unnormalized M- and A-values;
# "none" means no background correction.
marrayMA <- MA.RG(marrayRG, "none")

layout(matrix(1:4, 2, 2))
plotMA(marrayMA, array = 6, zero.weights = TRUE, main = "raw")
plotMA(normLimma, array = 6, zero.weights = TRUE, main = "loess")
plotMA(scaleLimma, array = 6, zero.weights = TRUE, main = "scale")
plotMA(quantileLimma, array = 6, zero.weights = TRUE, main = "quantile")


#Box plot for Raw, loess, scale and quantile normalization


M <- log2(marrayRG$R/marrayRG$G)

boxplot(as.data.frame(M), las = 2, main = "Raw")

boxplot(as.data.frame(normLimma$M), las = 2, main = "loess")

boxplot(as.data.frame(scaleLimma$M), las = 2, main = "scale")

boxplot(as.data.frame(quantileLimma$M), las = 2, main = "quantile")

Between-array normalization: density plots


plotDensities(marrayRG)

plotDensities(normLimma)

plotDensities(scaleLimma)

plotDensities(quantileLimma)

Which method to choose?

marrayRG (raw data): intensity-dependent dye effects and shifts in the log-ratio distributions between arrays.

normLimma: via loess, the mean log-ratio is forced to be around zero for all A-values. The amounts of green and red signal at a given intensity have become more or less equal within each array (see the density plots).

scaleLimma: the median absolute deviations of the M-values and of the A-values are made the same across arrays. This also removes differences in scale, so that the distributions for red and green are similar across arrays.

quantileLimma: the distributions of all channels are made equal. After this procedure, red has exactly the same distribution as green on every array.

When statistics like these are used to force the data into certain distributions, noise can be introduced and/or true biological signal might be removed. Experimental design and diagnostic plots are of great importance for choosing the right set of normalization procedures.

Thank you