Kishor Presentation


The presentation lists various approaches for analysing microarray data using R/Bioconductor.

Design and Analysis Strategies for DNA microarray data: hits to targets

Organization of the presentation

DNA microarray data analysis: gene based, gene sets based, functional groups based

Clone ID

Lead toxicity investigation using genetic algorithms

Molecular based discovery

The completion of the Human Genome Project, which used sequencing to characterize and map the entire human genome, turned the attention of researchers to investigating diseases and biological mechanisms at the level of molecules, chiefly DNA, RNA and proteins.

After a few disease-related genes had been pinpointed, comparative genomics, which uses evolutionary principles to find similar genes in model organisms, gave researchers extra degrees of freedom to study the underlying biological mechanisms in depth.

This ultimately drove the discovery approach towards functional genomics to quantitatively elicit the patterns associated with diseases or biological mechanisms.

DNA microarrays became popular and useful functional genomics tools.

The availability of gene sequences for most sequenced organisms made it feasible to design Gene Chips that survey expression genome-wide, with implications for target discovery.

Microarrays

DNA microarrays simultaneously measure the expression levels of thousands of genes using hybridization and sequence complementarity.

They are useful tools for detecting biological mechanisms involved in pathogenesis, disease and other phenotypes using comparative methods.

Two types:
Two-channel (spotted arrays)
Single-channel (oligonucleotide arrays)

Applications of microarrays

Biomarker discovery
Clinical outcome (survival, response to treatment)
Diagnostic and prognostic inferences
Regulatory networks (guilt by association)
Personalized medicine

Microarray Platforms

Agilent, Affymetrix, ABI 1700

Gene based common data analysis methods
fold change
t-test (two groups)
factorial methods (multiple groups)
time course experiments
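The fold change and two-group t-test listed above can be illustrated for a single gene. This is a minimal sketch in plain Python rather than R/Bioconductor; the function names and the log2 expression values are hypothetical, and the t-test shown is the Welch (unequal variance) variant, which is an assumption since the slides do not say which variant was used.

```python
import math
from statistics import mean, variance

def log2_fold_change(treated, control):
    """Difference of mean log2 intensities (treated minus control)."""
    return mean(treated) - mean(control)

def welch_t(treated, control):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    v1 = variance(treated) / len(treated)
    v2 = variance(control) / len(control)
    t = (mean(treated) - mean(control)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (
        v1 ** 2 / (len(treated) - 1) + v2 ** 2 / (len(control) - 1)
    )
    return t, df

# Hypothetical log2 expression values for one gene (not from the presentation)
treated = [8.1, 8.4, 8.0, 8.3]
control = [7.0, 7.2, 6.9, 7.1]

lfc = log2_fold_change(treated, control)
t, df = welch_t(treated, control)
print(f"log2 FC = {lfc:.2f}, fold change = {2 ** lfc:.2f}, t = {t:.2f}, df = {df:.1f}")
```

Because the values are already on the log2 scale, the fold change on the raw scale is 2 raised to the log2 difference.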

Gene sets based analysis
GSEA
Gene Ontology (GOstats, topGO)

Affymetrix

Commonly referred to as Gene Chips

Each gene is represented by 16-20 oligonucleotides, each 25 nucleotides long (A, C, T, G)

probe pair : PM/MM (perfect match / mismatch)
probe set : vector of all probe pairs for a gene
MM indicates non-specific binding.

MA plots can be used to understand probe-specific and intensity-specific non-specific binding.
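The coordinates of an MA plot are simple to compute: M is the log2 ratio of two intensities and A is their average log2 intensity. A minimal sketch in plain Python follows; the function name and the intensity values are hypothetical.

```python
import math

def ma_values(x, y):
    """Return (M, A) pairs for two paired intensity vectors on the raw scale.

    M = log2(x) - log2(y)          (log ratio)
    A = (log2(x) + log2(y)) / 2    (average log intensity)
    """
    pairs = []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        pairs.append((lx - ly, 0.5 * (lx + ly)))
    return pairs

# Hypothetical raw intensities for three probes on two arrays (or two channels)
array1 = [1024.0, 256.0, 64.0]
array2 = [512.0, 256.0, 128.0]

for m, a in ma_values(array1, array2):
    print(f"M = {m:+.2f}, A = {a:.2f}")
```

Plotting M against A shows whether the log ratio drifts with intensity, which is the pattern intensity-dependent normalization is meant to remove.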

Preprocessing methods (BMC Bioinformatics 2006, 7:105)

Preprocessing

Normalization

Global
Mean centering
MA plots (two channel)
Quantile normalization

Local
Loess (intensity dependent)
Lowess (remove dye effects)
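Quantile normalization, listed above among the global methods, forces every array to share the same intensity distribution: each value is replaced by the mean of the values holding the same rank across arrays. The sketch below is plain Python with hypothetical names and data; ties are broken by position rather than averaged, which is a simplification compared with production implementations.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity vectors (one per array)."""
    n = len(arrays[0])
    # Sort each array, then average across arrays at each rank
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from smallest to largest intensity
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities: three arrays, four probes each
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
norm = quantile_normalize(arrays)
for row in norm:
    print([round(v, 3) for v in row])
```

After normalization every array contains the same set of values, and only their assignment to probes differs.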

Common experimental inquiries

gene knock-out
time-series
phenotypic differences
drug effects
disease associated pathways and biological mechanisms

LIMMA: Linear Models for Microarray Data (Smyth, 2004 [8]; Smyth, 2005 [9])

fits a linear model to each gene based on the RNA source and contrasts of interest for testing its differential expression

the underlying statistics borrow information across genes/probes to assess differential expression according to the experimental design

works very efficiently even with experiments with smaller sample sizes.

some contrast comparisons may not require replicates (depending on variability between the sources of comparison).

supports factorial designs

Examples of comparisons

Mock experimental design

Notation (factors: drug treatment, age)

DG 1-10 : treated with drug A
PL 1-10 : placebo
DG.Y 1-4 : young patients treated with drug A
DG.O 5-10 : old patients treated with drug A
PL.Y 1-6 : young placebo
PL.O 7-10 : old placebo

# fit is assumed to be the result of lmFit() on the expression data and design
> cont.matrix <- makeContrasts(
+     PL.YvsO = PL.Y - PL.O,
+     DG.YvsO = DG.Y - DG.O,
+     Diff = (DG.Y - DG.O) - (PL.Y - PL.O),
+     levels = design)
> fit2 <- contrasts.fit(fit, cont.matrix)
> fit2 <- eBayes(fit2)
> topTable(fit2, coef = "Diff")                    # combined (interaction) effect
> topTable(fit2, coef = "PL.YvsO")                 # age effect in normal (placebo)
> topTable(fit2, coef = "DG.YvsO", adjust = "BH")  # age effect in drug treated

Steps involved

construct a design matrix using the targets file
fit a linear model to each gene (lmFit)
specify the contrasts of comparison (makeContrasts) and apply them with contrasts.fit
assess differential expression using the eBayes method

Interpretation of results

Statistics to assess differential expression using LIMMA

Moderated t-statistic: similar to the ordinary t-statistic, except that the standard errors are moderated using information from the expression values of all genes.

B-statistic: log-odds that a gene is differentially expressed.

F-statistic: assesses differential expression of a gene based on the coefficients of all contrasts jointly.
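The moderation step can be made concrete. In Smyth (2004) the gene-wise variance s² (with d residual degrees of freedom) is shrunk towards a prior variance s0² (with d0 prior degrees of freedom): s̃² = (d0·s0² + d·s²) / (d0 + d), and the moderated t-statistic uses s̃ in place of s with d0 + d degrees of freedom. The sketch below is plain Python for a two-group comparison. limma estimates d0 and s0² from all genes by empirical Bayes; here they are passed in as hypothetical fixed values, so this illustrates only the shrinkage formula, not the full method.

```python
import math
from statistics import mean, variance

def moderated_t(group1, group2, d0, s0_sq):
    """Ordinary and moderated t for one gene, given assumed prior d0 and s0_sq."""
    n1, n2 = len(group1), len(group2)
    d = n1 + n2 - 2  # residual degrees of freedom
    # Pooled gene-wise variance
    s_sq = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / d
    # Shrink the gene-wise variance towards the prior variance
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)
    diff = mean(group1) - mean(group2)
    scale = math.sqrt(1 / n1 + 1 / n2)
    t_ordinary = diff / (math.sqrt(s_sq) * scale)
    t_moderated = diff / (math.sqrt(s_tilde_sq) * scale)
    return t_ordinary, t_moderated, d0 + d

# Hypothetical gene with a small difference and an implausibly small variance
g1 = [7.01, 7.00, 7.02]
g2 = [6.90, 6.91, 6.89]
t_ord, t_mod, df = moderated_t(g1, g2, d0=4, s0_sq=0.05)
print(f"ordinary t = {t_ord:.1f}, moderated t = {t_mod:.2f}, df = {df}")
```

A gene whose variance happens to be tiny gets a very large ordinary t, while the moderated version is pulled back towards a more plausible value; this is why the approach behaves well with few replicates.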

Significance Analysis of Microarrays

measures differential expression of the data for time course designed experiments.

assesses significance of differential expression of genes using repeated permutations of the sample labels

supports several experimental designs

works efficiently even for smaller sample sizes
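For a two-class design, SAM's relative difference statistic is d = (mean1 - mean2) / (s + s0), where s is the gene's standard error and s0 is a small positive constant that keeps low-variance genes from dominating. The sketch below, in plain Python, computes d for one gene and a label-permutation p-value by enumerating all relabelings. It is a simplification under stated assumptions: SAM chooses s0 from the data and estimates the FDR from the ordered statistics of all genes, whereas here s0 is a hypothetical fixed value and only a single gene is tested.

```python
import math
from itertools import combinations
from statistics import mean

def sam_d(x, y, s0):
    """Relative difference d = (mean(x) - mean(y)) / (s + s0)."""
    n1, n2 = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (mx - my) / (s + s0)

def permutation_p_value(x, y, s0):
    """Fraction of label assignments with |d| at least as large as observed."""
    pooled = x + y
    observed = abs(sam_d(x, y, s0))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        px = [pooled[i] for i in idx]
        py = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties
        if abs(sam_d(px, py, s0)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical log2 expression values for one gene
x = [8.1, 8.4, 8.0, 8.3]
y = [7.0, 7.2, 6.9, 7.1]
d = sam_d(x, y, s0=0.1)
p = permutation_p_value(x, y, s0=0.1)
print(f"d = {d:.2f}, permutation p = {p:.4f}")
```

With four samples per group there are only 70 distinct relabelings, so the smallest achievable two-sided p-value is 2/70; this lower bound is one reason SAM pools permutation results across genes.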

Experimental designs supported by SAM (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)

Sample input format (Chu et al., 2002)

SAM statistics (Chu et al., 2002)

SAM plot

Limitations of gene based approaches

arbitrary cutoffs
too stringent criteria (effect of multiple hypothesis testing)
speculative selection
lack of ways to efficiently distinguish differential expression due to a true biological signal from that due to experimental noise
incoherence between multiple microarray results
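The multiple hypothesis testing issue named above is commonly handled with the Benjamini-Hochberg (BH) adjustment, the same method selected by adjust="BH" in the topTable call earlier in the presentation. A minimal plain Python sketch follows; the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(v, 4) for v in adj])
```

Selecting genes with adjusted p-values below a threshold such as 0.05 controls the expected proportion of false positives among the selected genes, which is less stringent than controlling the chance of any false positive.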

Gene set enrichment analysis (GSEA)

cross comparison and validation of multiple experiments with relevant biological motives

gene set based interrogation of microarray data

infer pathway enrichment / analysis and gene regulatory networks

biomarker detection refinement or drilling down gene lists

Methodology

1. Choose a ranking metric for sorting genes based on their correlation with the phenotype.

2. Compute a running sum statistic (enrichment score) based on the overrepresentation of the gene set's members at the extremes of the rank ordered list.

3. Estimate the significance of the enrichment score relative to a null distribution (empirical phenotype based permutation test).

4. Adjust for multiple hypothesis testing using the normalized enrichment score (which takes gene set size into account) by controlling the FDR, the estimated probability that a gene set with a given normalized enrichment score represents a false positive finding.
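Step 2 can be sketched as a weighted running sum: walking down the ranked list, the sum increases at each gene in the set (in proportion to its ranking score raised to a power p) and decreases by a constant at each gene outside it; the enrichment score is the largest deviation from zero. The plain Python below uses hypothetical gene names and scores, and omits the permutation test and normalization of steps 3 and 4.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for a gene set.

    ranked_genes must be sorted by decreasing score, with scores aligned to it.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == len(ranked_genes):
        raise ValueError("gene set must contain some but not all ranked genes")
    hit_norm = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    miss_step = 1.0 / (len(ranked_genes) - n_hits)
    running = 0.0
    best = 0.0
    for s, hit in zip(scores, in_set):
        running += (abs(s) ** p / hit_norm) if hit else -miss_step
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero, with its sign
    return best

# Hypothetical ranked list (e.g. by signal-to-noise ratio) and gene sets
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
scores = [2.5, 2.0, 1.5, 0.5, -0.2, -0.8, -1.5, -2.2]

es_top = enrichment_score(genes, scores, {"g1", "g2", "g4"})
es_bottom = enrichment_score(genes, scores, {"g7", "g8"})
print(f"ES (set near top) = {es_top:.2f}, ES (set near bottom) = {es_bottom:.2f}")
```

A positive score indicates that the set is concentrated at the top of the list and a negative score that it is concentrated at the bottom.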

Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)

Leading edge subset

Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)

Novel methodology based on gene set enrichment

gives the option of preserving gene-gene correlations while computing enrichment statistics.

user friendly tool with a programmatic interface (API).

availability of a curated gene set database (MSigDB)

Caveats

availability and requirement of pre-defined gene sets.

more knowledge based than discovery based in terms of inferring biological mechanisms; this is reduced to some extent by the exhaustive collection of gene sets provided through the MSig database.

Enriched gene sets Phenotype (http://www.broad.mit.edu/gsea/resources/gsea_pnas_results/p53_C2.Gsea/gsea_report_for_WT_1130958999391.html)

Enriched gene sets in mutant (http://www.broad.mit.edu/gsea/resources/gsea_pnas_results/p53_C2.Gsea/gsea_report_for_MUT_1130958999391.html)

RNA interference

“RNA interference (RNAi), a form of post-transcriptional gene silencing induced by introduction of double-stranded RNA (dsRNA), has become a powerful experimental tool for studying gene function.” [7]

“For drug developers, RNAi phenotypes can provide clues about what to assay to screen antagonist drug candidates” [7].

Uses the principle of reverse genetics to understand changes in biological pathways by simultaneously knocking down (silencing) multiple genes.

depends on siRNA libraries built to target specific genes and proteins.

A carefully designed RNAi screen is equivalent to performing multiple gene knock-out microarray experiments.

Can be performed using siRNAs (better specificity) and miRNAs.

Endocytotic pathways

Endocytosis is a process in which molecules (cargos) are transported into the cytoplasm using membrane proteins.
cell surface selection
budding and pinching off
recruited to target protein

Pathways can be inferred using high resolution microscopy, which, combined with image processing tools, provides quantitative and qualitative information on endocytosed complexes.

Useful for understanding cell growth, development and pathogenesis.

Gene Ontology (Description)

Since the completion of the Human Genome Project, a major challenge has been the annotation and standardized dissemination of information related to genes and gene products.

The Gene Ontology (GO) Consortium derived an ontology that captures and represents gene features and their relationships using directed acyclic graphs.

Accordingly, gene attributes are broadly classified into 3 categories:
1. Biological Process
2. Molecular Function
3. Cellular Component (subcellular localization)

biomaRt (Bioconductor interface to the BioMart software suite [http://www.biomart.org/])

(The biomaRt user’s guide, Steffen Durinck and Wolfgang Huber)

GO based functional characterization of gene sets using topGO

Biological interpretation of gene lists obtained from microarray or high throughput screening platforms, using Gene Ontology and overlap statistics.

Not only useful for function based characterization of gene lists but can also provide clues about co-expressed genes.

Along with built-in statistical methodologies, it allows user chosen statistics to be incorporated for assessing differential expression and enrichment of GO terms.
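The overlap statistic behind a basic GO enrichment test is the hypergeometric upper tail (equivalent to a one-sided Fisher's exact test): given N genes on the array, K annotated to a GO term, and n differentially expressed genes, it gives the probability of seeing at least k annotated genes among the n by chance. The sketch below is plain Python with hypothetical counts. It treats each GO term independently, which is the classic approach, not the graph-aware elim or weight algorithms of topGO.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are annotated to the term."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical counts: 1000 genes on the array, 40 annotated to the GO term,
# 50 differentially expressed genes, 8 of which carry the annotation
p_value = hypergeom_upper_tail(k=8, K=40, n=50, N=1000)
print(f"enrichment p-value = {p_value:.2e}")
```

Here the expected overlap under the null is 50 × 40 / 1000 = 2 genes, so an observed overlap of 8 yields a small p-value.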

Alexa et al. Bioinformatics, 22(13), 1600-1607, 2006

Elim

reduces overlap by iteratively removing genes from the ancestral nodes of a significantly enriched node (GO term).

more stringent in terms of reducing false positives when compared with the weight algorithm.

works better with small values of k (number of differentially expressed genes)

Weight

the score of a significant node is computed by down-weighting the scores of the genes it shares with its children.

significant nodes and the vector of weights are recursively updated.

A. Alexa et al. Bioinformatics, 22(13), 1600-1607, 2006

Clone ID

Bergey’s vs. Phylotypes

Below is the list of the classification tools we used in our analysis and the classification schema used by each tool.

Classification Tool : Classification Schema
NCBI’s MegaBLAST : NCBI’s taxonomy Hierarchy Browser
RDP II : Bergey’s Manual
RDPquery : Bergey’s Manual
SIMO : Bergey’s Manual
Clone ID : MegaBLAST Phylotypes

Bergey’s Manual is based on polyphasic numerical taxonomy and provides information about multiple phenotypic traits. Classification based on Bergey’s Manual is complicated, expensive, and time consuming. In contrast, classification using 16S rRNA phylotypes is more objective, faster, and less expensive.

Relational Database Development

Normalization: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, BCNF

E-R diagrams
Joins (outer, inner, self)
Aggregate functions (sum, count, min, ...)
Miscellaneous (decode, nvl, instr, ...)

References

1. http://cran.r-project.org/

2. Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550) and Mootha, Lindgren, et al. (2003, Nat Genet 34, 267-273).

3. Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software.

4. Adrian Alexa, Jörg Rahnenführer, Thomas Lengauer. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13), 1600-1607, 2006.

5. http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html

6. http://www.geneontology.org/

7. Caroline Schmitz, Parag Kinge, and Harald Hutter. Axon guidance genes identified in a large-scale RNAi screen using the RNAi-hypersensitive Caenorhabditis elegans strain nre-1(hd20) lin-15b(hd126).

8. Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3.

9. Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420.

Acknowledgements

George Mason University: Glenda Wilson (MS advisor), Dr. Patrick Gillevet (thesis advisor), Prof. James Willett

GSK: Amy Creech (Supervisor) and Workbench team

Vanderbilt University: Prof. Frank Harrell (supervisor), Dr. Christine Konradi, Dr. Jay Snoddy, Dr. Karoly Mirnics, Dr. Lily Wang, Dr. Jeff Franklin

NCBS: Prof. Satyajit Mayor, Dr. Gagan Gupta, Mr. Gautam Dey

BITS, Pilani: Dr. V.S Rao, Dr. N.V.Muralidhara Rao, Dr. A.P.Koley