Lab 5: Unsupervised and supervised clustering

Feb 22nd, 2012
Daniel Fernandez, Alejandro Quiroz

Outline

• Unsupervised
  – Hierarchical clustering
  – Principal component analysis

• Supervised
  – LIMMA package
    • Linear models for microarray data

Before any high-level analysis...

• Download the data set used in lab 4
  – Go to and download GSE10940

• Load the .CEL files and use the custom CDF file annotation used in lab 4: "drosophila2dmrefseqcdf"

• Perform RMA normalization and obtain the expression intensities in a matrix
  – Obtain the genes that are up- and down-regulated with a fold change of 2.
    • Store the gene IDs in: X.top
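The fold-change filter can be sketched in base R as follows. This is only an illustration under stated assumptions: the matrix is simulated rather than read from the GSE10940 .CEL files, and the sample names and the `ctrl`/`mut` group labels are hypothetical. Because X.top is later passed to dist(), the sketch keeps the expression rows of the selected genes, whose row names are the gene IDs.

```r
# Simulated stand-in for the RMA output. In the lab, this matrix would come from
# something like exprs(rma(ReadAffy(cdfname = "drosophila2dmrefseqcdf"))).
set.seed(1)
n.genes <- 1000
X <- matrix(rnorm(n.genes * 6, mean = 8, sd = 0.3), nrow = n.genes,
            dimnames = list(paste0("gene", 1:n.genes),
                            c("ctrl1", "ctrl2", "ctrl3", "mut1", "mut2", "mut3")))
X[1:20, 4:6]  <- X[1:20, 4:6] + 2   # 20 genes simulated as up in the second group
X[21:40, 4:6] <- X[21:40, 4:6] - 2  # 20 genes simulated as down in the second group

# Hypothetical group labels, one per array (column of X)
group <- factor(c("ctrl", "ctrl", "ctrl", "mut", "mut", "mut"))

# RMA values are on the log2 scale, so a fold change of 2 is a difference of 1
logFC <- rowMeans(X[, group == "mut"]) - rowMeans(X[, group == "ctrl"])
up    <- names(logFC)[logFC >=  1]
down  <- names(logFC)[logFC <= -1]

# Keep the expression rows of the selected genes for the clustering steps
X.top <- X[c(up, down), ]
dim(X.top)
```

The threshold of 1 on the log2 scale is the only place where the "fold change of 2" enters; a stricter cutoff would simply change that constant.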

The data set

• Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.

• Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.

• Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.

• Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.

Supervised Method: LIMMA

• Linear Models for MicroArray data
  – A package for differential expression analysis of microarray data.
  – Makes use of linear models to describe the expression of each gene.
  – Uses empirical Bayes and other shrinkage methods to borrow information across genes, making the analyses stable even for experiments with a small number of arrays.

• LIMMA uses linear models to analyze microarray data.
  – The approach requires the definition of 2 matrices
    • Design matrix
      – Describes how the different factors are distributed across the arrays in the data
      – A linear model is assumed: E(yj) = X αj
      – Where yj contains the expression for gene j, X is the design matrix and αj is the vector of coefficients
      – The estimates of αj are provided by lmFit()
    • Contrast matrix
      – Allows the definition of the comparisons between factors of interest
      – If the parameters βj = C^T αj are of interest
        » C is the contrast matrix
      – These parameters are estimated by contrasts.fit()

• Given the large number of linear model fits arising from a microarray experiment, there is a pressing need to take advantage of the parallel structure whereby the same model is fitted to each gene

• Using a hierarchical framework, a moderated t-statistic is computed
  – Standard errors are shrunk towards a common value using a Bayesian model
    • This borrows information across genes for the inference on individual genes
    • The degrees of freedom are increased
      – This reflects the greater reliability of the smoothed standard errors
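The LIMMA steps described above (design matrix, lmFit(), contrast matrix, contrasts.fit(), moderated t-statistics) can be sketched as below. This is a minimal illustration, not the lab's actual analysis: the expression matrix is simulated, the two-group layout and the labels `ctrl` and `mut` are hypothetical, and the limma package from Bioconductor is assumed to be installed. The empirical Bayes shrinkage is carried out by eBayes().

```r
library(limma)  # Bioconductor package; assumed to be installed

# Simulated log2 expression matrix: 500 genes x 6 arrays (illustration only)
set.seed(1)
X <- matrix(rnorm(500 * 6, mean = 8, sd = 0.3), nrow = 500,
            dimnames = list(paste0("gene", 1:500), NULL))
X[1:10, 4:6] <- X[1:10, 4:6] + 2  # 10 genes simulated as differentially expressed

# Hypothetical two-group layout
group <- factor(c("ctrl", "ctrl", "ctrl", "mut", "mut", "mut"))

# Design matrix: one column (coefficient) per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Fit the same linear model to every gene
fit <- lmFit(X, design)

# Contrast matrix: the comparison of interest is mut minus ctrl
contrast.matrix <- makeContrasts(mut - ctrl, levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)

# Empirical Bayes shrinkage of the standard errors gives moderated t-statistics
fit2 <- eBayes(fit2)

# Ten top-ranked genes, with Benjamini-Hochberg adjusted p-values
tab <- topTable(fit2, number = 10, adjust.method = "BH")
tab
```

Using a design without an intercept (`~ 0 + group`) keeps one coefficient per group, so the contrast can be written directly as a difference of group means.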

Unsupervised Method: Hierarchical clustering

• Hierarchical clustering
  – First, calculate all the pairwise distances
    • D=dist(t(X.top))
  – Finally, perform the hierarchical clustering
    • H1=hclust(D,method="single")
    • H2=hclust(D,method="complete")
    • H3=hclust(D,method="average")
    • plot(Hi), for i = 1, 2, 3

• Is there something odd in the clustering?
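The clustering commands above can be tried on a small simulated matrix, as in the sketch below. The data, sample names and group shift are assumptions made for illustration; in the lab, X.top would hold the expression values of the selected genes.

```r
# Simulated stand-in for X.top: 40 genes (rows) x 6 samples (columns)
set.seed(1)
X.top <- matrix(rnorm(40 * 6), nrow = 40,
                dimnames = list(paste0("gene", 1:40),
                                c("ctrl1", "ctrl2", "ctrl3", "mut1", "mut2", "mut3")))
X.top[, 4:6] <- X.top[, 4:6] + 3  # shift the second group so two clusters exist

# dist() works on rows, so transpose to get distances between samples
D <- dist(t(X.top))

# Same distances, three linkage rules
H <- list(single   = hclust(D, method = "single"),
          complete = hclust(D, method = "complete"),
          average  = hclust(D, method = "average"))

# Draw the three dendrograms side by side
par(mfrow = c(1, 3))
for (m in names(H)) plot(H[[m]], main = m)

# Cut the average-linkage tree into two groups of samples
groups <- cutree(H$average, k = 2)
groups
```

Comparing the three dendrograms side by side makes it easier to see whether a grouping depends on the linkage rule or is stable across all of them.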

Unsupervised Method: MDS

• Multidimensional scaling (MDS) is a set of related statistical techniques to explore similarities in data*.

• *Wikipedia.
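The slide does not name an R function for MDS. One option in base R is cmdscale(), which performs classical (metric) MDS on a distance matrix. The sketch below applies it to simulated data; the matrix and sample names are assumptions for illustration.

```r
# Simulated stand-in for X.top: 40 genes (rows) x 6 samples (columns)
set.seed(1)
X.top <- matrix(rnorm(40 * 6), nrow = 40,
                dimnames = list(paste0("gene", 1:40),
                                c("ctrl1", "ctrl2", "ctrl3", "mut1", "mut2", "mut3")))
X.top[, 4:6] <- X.top[, 4:6] + 3  # shift the second group

# Pairwise distances between samples, as for hierarchical clustering
D <- dist(t(X.top))

# Classical MDS: place the samples in 2 dimensions so that
# distances in the plot approximate the distances in D
mds <- cmdscale(D, k = 2)

plot(mds, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds, labels = rownames(mds))
```

Samples that are close in the original distance matrix end up close in the plot, which gives a quick visual check of the groupings seen in the dendrograms.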

Unsupervised Method: Principal component analysis

• In R, the function prcomp performs principal component analysis

• In our context, the idea is to visualize the impact of possible dimension reduction in GENES
  – Important: Remember that in prcomp, the genes have to be columns and the samples rows.
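A minimal prcomp sketch, assuming a simulated genes-by-samples matrix in place of the real X.top, shows where the transpose is needed:

```r
# Simulated stand-in for X.top: 40 genes (rows) x 6 samples (columns)
set.seed(1)
X.top <- matrix(rnorm(40 * 6), nrow = 40,
                dimnames = list(paste0("gene", 1:40),
                                c("ctrl1", "ctrl2", "ctrl3", "mut1", "mut2", "mut3")))
X.top[, 4:6] <- X.top[, 4:6] + 3  # shift the second group

# prcomp treats rows as observations and columns as variables,
# so transpose to put samples in rows and genes in columns
pca <- prcomp(t(X.top))

# Proportion of variance explained by each component
summary(pca)

# Samples plotted on the first two principal components
plot(pca$x[, 1:2], type = "n")
text(pca$x[, 1:2], labels = rownames(pca$x))
```

Forgetting the transpose would instead treat genes as observations, and the plot would show 40 points (genes) rather than 6 (samples).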