Lab 5: Unsupervised and supervised clustering
Feb 22nd, 2012
Daniel Fernandez, Alejandro Quiroz
Outline
• Unsupervised
  – Hierarchical clustering
  – Principal component analysis
• Supervised
  – LIMMA package
    • Linear models for microarray data
Before any high-level analysis…
• Download the data set used in lab 4
  – Go to and download GSE10940
• Load the .CEL files and use the custom CDF file annotation used in lab 4: “drosophila2dmrefseqcdf”
• Perform RMA normalization and obtain the expression intensities in a matrix
  – Obtain the genes that are up- or down-regulated with a fold change of at least 2.
    • Store the gene IDs in: X.top
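The fold-change filter can be sketched in base R. This is a minimal illustration on simulated data, not the lab's actual pipeline: the 3 control + 3 mutant layout, the simulated values, and every object name other than X.top are assumptions. Here X.top holds the expression rows of the selected genes (not just their IDs), on the assumption that it is later passed to dist().

```r
# Simulated log2 expression matrix: genes in rows, samples in columns.
# (Assumption: stands in for the RMA output; the real data has other dimensions.)
set.seed(1)
X <- matrix(rnorm(200 * 6, mean = 8, sd = 0.2), nrow = 200,
            dimnames = list(paste0("gene", 1:200),
                            c(paste0("ctrl", 1:3), paste0("mut", 1:3))))
X[1:10, 4:6]  <- X[1:10, 4:6] + 2    # 10 genes raised in the mutants
X[11:15, 4:6] <- X[11:15, 4:6] - 2   # 5 genes lowered in the mutants

group <- factor(rep(c("ctrl", "mut"), each = 3))

# RMA values are on the log2 scale, so a fold change of 2 means |log2 FC| >= 1.
logFC <- rowMeans(X[, group == "mut"]) - rowMeans(X[, group == "ctrl"])
up.ids   <- names(logFC)[logFC >= 1]
down.ids <- names(logFC)[logFC <= -1]

# Expression matrix restricted to the selected genes
X.top <- X[c(up.ids, down.ids), ]
```

Working on the log2 scale turns the fold-change threshold into a simple difference of group means.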
The data set
• Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.
• Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.
• Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.
• Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.
Supervised Method: LIMMA
• Linear Models for MicroArray data
  – A package for differential expression analysis from microarray data.
  – Makes use of linear models to describe the expression of each gene.
  – Uses empirical Bayes and other shrinkage methods to borrow information across genes, making the analyses stable even for experiments with a small number of arrays.
• LIMMA uses linear models to analyze microarray data.
  – The approach requires the definition of 2 matrices
    • Design matrix
      – Represents how the different factors are distributed in the data
      – A linear model is assumed: $E[y_j] = X\alpha_j$
      – Where $y_j$ contains the expression for gene j and $X$ is the design matrix
      – The estimates of $\alpha_j$ are provided by lmFit()
    • Contrast matrix
      – Allows the definition of the comparison between factors of interest
      – If the parameters $\beta_j = C^T\alpha_j$ are of interest
        » C is the contrast matrix
      – These parameters are estimated by contrasts.fit()
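To make the roles of the two matrices concrete, here is a minimal base-R sketch on simulated data. It assumes the standard limma formulation, E[y_j] = X α_j with contrasts β_j = C^T α_j. The two-group layout and all object names are assumptions. In the lab, the limma calls shown in the comments would be used instead of the hand-written least squares.

```r
set.seed(2)
group <- factor(rep(c("ctrl", "mut"), each = 3))

# Simulated log2 expression: 5 genes x 6 arrays (assumption, for illustration)
Y <- matrix(rnorm(5 * 6, mean = 8, sd = 0.2), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))
Y[1, 4:6] <- Y[1, 4:6] + 2          # gene1 is raised in the mutants

# Design matrix X: one indicator column per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Least-squares estimate of alpha_j for every gene at once (genes x coefficients)
alpha <- t(solve(crossprod(design), crossprod(design, t(Y))))

# Contrast matrix C: mutant minus control
C <- matrix(c(-1, 1), ncol = 1,
            dimnames = list(levels(group), "mut_vs_ctrl"))

# beta_j = C^T alpha_j, one value per gene
beta <- alpha %*% C

# With limma loaded, the corresponding calls would be:
#   fit  <- lmFit(Y, design)
#   fit2 <- contrasts.fit(fit, C)
```

With this design, each row of alpha holds the two group means for a gene, and the contrast is their difference.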
• Given the large number of linear model fits arising from a microarray, there is a pressing need to take advantage of the parallel structure whereby the same model is fitted to each gene
• Using a hierarchical framework, a moderated t-statistic is computed
  – Standard errors are shrunk towards a common value using a Bayesian model
    • This borrows information for the inference of individual genes
    • The degrees of freedom are increased
      – Reflects the greater reliability of the smoothed standard errors
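For reference, the moderated t-statistic can be written out as follows. This follows the standard empirical Bayes formulation behind limma. The symbols $s_j^2$, $d_j$, $s_0^2$, $d_0$ and $v_j$ do not appear in the slides and are introduced here for illustration.

```latex
% s_j^2 : residual variance of gene j, with d_j degrees of freedom
% s_0^2, d_0 : prior variance and prior degrees of freedom, estimated from all genes
% v_j : unscaled variance of the estimated contrast \hat{\beta}_j
\tilde{s}_j^2 = \frac{d_0 s_0^2 + d_j s_j^2}{d_0 + d_j},
\qquad
\tilde{t}_j = \frac{\hat{\beta}_j}{\tilde{s}_j \sqrt{v_j}},
\qquad
\tilde{t}_j \sim t_{d_0 + d_j} \ \text{under } H_0 : \beta_j = 0 .
```

The shrunken variance $\tilde{s}_j^2$ is a weighted average of the gene's own variance and the common prior value, and the degrees of freedom rise from $d_j$ to $d_0 + d_j$.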
Unsupervised Method: Hierarchical clustering
• Hierarchical clustering
  – First, calculate all the pairwise distances
    • D=dist(t(X.top))
  – Finally, perform the hierarchical clustering
    • H1=hclust(D,method="single")
    • H2=hclust(D,method="complete")
    • H3=hclust(D,method="average")
    • plot(Hi), for i = 1, 2, 3
• Is there something odd about the clustering?
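The clustering steps can be run end to end on a small simulated matrix. This is a sketch under assumptions: the simulated X.top (50 genes, 3 control and 3 mutant arrays) stands in for the real top-gene matrix, and the cutree() step is an added convenience for comparing the three linkages.

```r
# Simulated stand-in for X.top: genes in rows, samples in columns (assumption)
set.seed(3)
X.top <- matrix(rnorm(50 * 6, sd = 0.3), nrow = 50,
                dimnames = list(paste0("gene", 1:50),
                                c(paste0("ctrl", 1:3), paste0("mut", 1:3))))
X.top[, 4:6] <- X.top[, 4:6] + 2   # mutants differ from controls on these genes

# dist() works on rows, so transpose to get distances between samples
D <- dist(t(X.top))

# One dendrogram per linkage method
methods <- c("single", "complete", "average")
H <- lapply(methods, function(m) hclust(D, method = m))
names(H) <- methods

op <- par(mfrow = c(1, 3))
for (m in methods) plot(H[[m]], main = m, xlab = "", sub = "")
par(op)

# Cut each tree into 2 clusters to compare the linkages sample by sample
clusters <- sapply(H, cutree, k = 2)
```

On the real data, comparing the columns of clusters with the known sample labels is a quick way to spot an array that clusters with the wrong group.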
Unsupervised Method: MDS
• Multidimensional scaling (MDS) is a set of related statistical techniques to explore similarities in data*.
• *Wikipedia.
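One member of the MDS family, classical (metric) MDS, is available in base R as cmdscale(). The slides do not name a function, so the choice of cmdscale() here is an assumption, as are the simulated X.top and the object names.

```r
# Simulated stand-in for X.top: genes in rows, samples in columns (assumption)
set.seed(4)
X.top <- matrix(rnorm(50 * 6, sd = 0.3), nrow = 50,
                dimnames = list(paste0("gene", 1:50),
                                c(paste0("ctrl", 1:3), paste0("mut", 1:3))))
X.top[, 4:6] <- X.top[, 4:6] + 2

# Pairwise distances between samples, then a 2-dimensional embedding
D <- dist(t(X.top))
mds <- cmdscale(D, k = 2)

# Plot the samples using their names as labels
plot(mds, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds, labels = rownames(mds))
```

Samples that are similar across the selected genes end up close together in the plot.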
Unsupervised Method: Principal component analysis
• In R, the function prcomp performs principal component analysis
• In our context, the idea is to visualize the impact of possible dimension reduction in GENES
  – Important: Remember that in prcomp, the genes have to be columns and the samples rows.
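A minimal sketch of the prcomp call with the required transposition, again on a simulated X.top. The data, the centering/scaling choices, and the object names are assumptions.

```r
# Simulated stand-in for X.top: genes in rows, samples in columns (assumption)
set.seed(5)
X.top <- matrix(rnorm(50 * 6, sd = 0.3), nrow = 50,
                dimnames = list(paste0("gene", 1:50),
                                c(paste0("ctrl", 1:3), paste0("mut", 1:3))))
X.top[, 4:6] <- X.top[, 4:6] + 2

# prcomp treats rows as observations, so transpose: samples in rows, genes in columns
pca <- prcomp(t(X.top), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)

# Samples in the plane of the first two principal components
plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = rownames(pca$x))
```

Forgetting the t() would instead compute components across genes as observations, which answers a different question.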