Design and Analysis Strategies for DNA microarray data: hits to targets

Kishor Presentation

DESCRIPTION

The presentation lists various approaches for analysing microarray data using R/Bioconductor.

Page 1: Kishor Presentation

Design and Analysis Strategies for DNA microarray data: hits to targets

Page 2: Kishor Presentation

Organization of the presentation

DNA microarray data analysis
  gene based
  gene set based
  functional group based
Clone ID
Lead toxicity investigation using genetic algorithms

Page 3: Kishor Presentation

Molecular based discovery

The completion of the Human Genome Project, which used sequencing to characterize and map the entire human genome, turned the attention of researchers to investigating diseases and biological mechanisms at the level of molecules, chiefly DNA, RNA and proteins.

Once a few disease-related genes had been pinpointed, comparative genomics, which uses principles of evolutionary biology to find similar genes in model organisms, gave researchers extra degrees of freedom to study the underlying biological mechanisms and gain insight into them.

This ultimately drove discovery towards functional genomics, which quantitatively elicits the patterns associated with diseases or biological mechanisms.

Page 4: Kishor Presentation

DNA microarrays became popular and useful functional genomics tools.

The availability of gene sequences for most sequenced organisms made it feasible to design Gene Chips for genome-wide surveys and to study their implications for target discovery.

Page 5: Kishor Presentation

Microarrays

DNA microarrays simultaneously measure thousands of gene expression levels using hybridization and sequence complementarity.

They are useful tools for detecting biological mechanisms involved in pathogenesis, disease and other phenotypes using comparative methods.

Two types
  Two-channel (spotted arrays)
  Single-channel (oligonucleotide arrays)

Page 6: Kishor Presentation

Applications of microarrays

Biomarker discovery
Clinical outcome (survival, response to treatment)
Diagnostic and prognostic inferences
Regulatory networks (guilt by association)
Personalized medicine

Page 7: Kishor Presentation

Microarray Platforms

Agilent
Affymetrix
ABI 1700

Gene based common data analysis methods
  fold change
  t-test (two groups)
  factorial methods (multiple groups)
  time course experiments

Gene set based analysis
  GSEA
  Gene Ontology (GOstats, topGO)
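The two simplest gene based methods named on this slide, fold change and the two-group t-test, can be sketched in base R as follows. This is only an illustration on simulated log2 data: the matrix `expr`, the group sizes and the spiked-in shift are all assumptions, and the Benjamini-Hochberg step is added to show how the per-gene p-values are usually adjusted for multiple testing.

```r
# Gene-wise log2 fold change and two-group t-test on simulated data.
set.seed(1)
n_genes <- 100
# Rows are genes, columns are samples: 4 controls followed by 4 treated.
expr <- matrix(rnorm(n_genes * 8, mean = 8, sd = 0.5), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               c(paste0("ctrl", 1:4), paste0("trt", 1:4))))
group <- factor(rep(c("ctrl", "trt"), each = 4))
# Spike a 2-unit log2 shift into the first five genes (illustrative only).
expr[1:5, group == "trt"] <- expr[1:5, group == "trt"] + 2

# Log2 fold change: difference of group means on the log2 scale.
log_fc <- rowMeans(expr[, group == "trt"]) - rowMeans(expr[, group == "ctrl"])

# Welch t-test per gene, then Benjamini-Hochberg adjustment across genes.
p_val <- apply(expr, 1, function(x) {
  t.test(x[group == "trt"], x[group == "ctrl"])$p.value
})
p_adj <- p.adjust(p_val, method = "BH")

result <- data.frame(log_fc = log_fc, p_val = p_val, p_adj = p_adj)
head(result[order(result$p_val), ])
```

With only four samples per group the gene-wise variance estimates are noisy, which is the motivation for the moderated statistics used by LIMMA.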

Page 8: Kishor Presentation

Affymetrix

Commonly referred to as Gene Chips
Each gene is represented by 16-20 oligonucleotides, each 25 nucleotides long (A, C, T, G)
probe pair: PM/MM (perfect match / mismatch)
probe set: vector of all probe pairs for a gene
MM indicates non-specific binding.
MA plots can be used to understand probe-specific and intensity-specific non-specific binding.

Page 9: Kishor Presentation

Preprocessing methods (BMC Bioinformatics 2006, 7:105)

Page 10: Kishor Presentation

Preprocessing

Normalization
  Global
    Mean centering
    MA plots (two channel)
    Quantile normalization
  Local
    Loess (intensity dependent)
    Lowess (remove dye effects)
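As an illustration of the global methods above, quantile normalization can be written in a few lines of base R. This is a minimal sketch rather than the implementation used by any particular package: the function name `quantile_normalize` and the toy matrix are assumptions, and ties are broken by order of appearance instead of being averaged.

```r
# Quantile normalization: force every array (column) onto one distribution.
quantile_normalize <- function(x) {
  # Rank each value within its own column (ties broken by order of appearance).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted columns, rank by rank.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value at its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Toy intensities: rows are probes, columns are arrays (made-up values).
expr <- cbind(array1 = c(5, 2, 3, 4),
              array2 = c(4, 1, 4, 2),
              array3 = c(3, 4, 6, 8))
normalized_expr <- quantile_normalize(expr)
normalized_expr
```

After normalization every column contains the same set of values, so differences between arrays in overall intensity distribution are removed while the within-array ordering of probes is kept.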

Page 11: Kishor Presentation

Common experimental inquiries

gene knock-out
time series
phenotypic differences
drug effects
disease-associated pathways and biological mechanisms

Page 12: Kishor Presentation

LIMMA: Linear Models for Microarray Data (Smyth, 2004; Smyth, 2005)

fits a linear model to each gene, based on the RNA sources and the contrasts of interest, to test for differential expression
the underlying statistics borrow information across genes/probes to assess differential expression according to the experimental design
works efficiently even for experiments with small sample sizes
some contrasts may not require replicates (depending on the variability between the sources being compared)
supports factorial designs

Page 13: Kishor Presentation

Examples of comparisons

Page 14: Kishor Presentation

Mock experimental design

Notation (factors: drug treatment, age)

DG 1-10: treated with drug A
PL 1-10: placebo
DG.Y 1-4: young patients treated with drug A
DG.O 5-10: old patients treated with drug A
PL.Y 1-6: young placebo
PL.O 7-10: old placebo

Page 15: Kishor Presentation

> cont.matrix <- makeContrasts(
+     PL.YvsO = PL.Y - PL.O,                      # age effect in placebo
+     DG.YvsO = DG.Y - DG.O,                      # age effect in drug treated
+     Diff    = (DG.Y - DG.O) - (PL.Y - PL.O),    # drug-by-age interaction
+     levels  = design)
> fit2 <- contrasts.fit(fit, cont.matrix)
> fit2 <- eBayes(fit2)
> topTable(fit2, coef = "Diff")                            # combined effect
> topTable(fit2, coef = "PL.YvsO")                         # age effect in placebo
> topTable(fit2, coef = "DG.YvsO", adjust.method = "BH")   # age effect in drug treated

Page 16: Kishor Presentation

Steps involved

construct a design matrix using the targets file
fit a linear model to each gene (lmFit)
specify the contrasts of interest (makeContrasts, contrasts.fit)
assess differential expression using the eBayes method
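To make the design matrix, model fit and contrast concrete, the sketch below works through them for a single gene in base R. It is not the limma workflow itself: limma fits such a model to every gene and then moderates the variances with eBayes, whereas this example stops at the ordinary t-statistic. The group sizes, the simulated expression values and all variable names are assumptions chosen for illustration.

```r
# One gene, four groups (group sizes are assumed for this example).
set.seed(42)
group <- factor(rep(c("DG.Y", "DG.O", "PL.Y", "PL.O"), times = c(4, 6, 6, 4)),
                levels = c("DG.Y", "DG.O", "PL.Y", "PL.O"))

# Design matrix with one column per group and no intercept.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Simulated log2 expression for a single gene (illustrative values only).
true_means <- c(DG.Y = 9, DG.O = 8, PL.Y = 8, PL.O = 8)
y <- true_means[as.character(group)] + rnorm(length(group), sd = 0.3)

# Least-squares fit; the coefficients are the four group means.
fit <- lm.fit(design, y)

# Contrast vector for the interaction (DG.Y - DG.O) - (PL.Y - PL.O).
contrast <- c(DG.Y = 1, DG.O = -1, PL.Y = -1, PL.O = 1)
estimate <- sum(contrast * fit$coefficients)

# Ordinary (unmoderated) t-statistic for the contrast.
resid_df <- length(y) - ncol(design)
sigma2 <- sum(fit$residuals^2) / resid_df
unscaled_var <- drop(t(contrast) %*% solve(crossprod(design)) %*% contrast)
t_stat <- estimate / sqrt(sigma2 * unscaled_var)
p_value <- 2 * pt(-abs(t_stat), df = resid_df)
c(estimate = estimate, t = t_stat, p = p_value)
```

Parameterizing the design with one column per group keeps each contrast readable as a simple difference of group means, which is why the contrast definitions can be written directly in terms of the group names.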

Page 17: Kishor Presentation

Interpretation of results

Statistics to assess differential expression using LIMMA

Moderated t-statistic
  Similar to the ordinary t-statistic, except that the standard errors are estimated using the expression values of all genes.
B-statistic
  log-odds that a gene is differentially expressed
F-statistic
  assesses differential expression of a gene based on the coefficients of all contrasts
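For reference, the moderated t-statistic can be written out as in Smyth (2004). The notation below is an assumption of this note rather than of the slides: $s_g^2$ is the residual variance of gene $g$ with $d_g$ degrees of freedom, $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated from all genes, $\hat{\beta}_{gj}$ is the estimated contrast $j$ for gene $g$, and $v_{gj}$ is its unscaled variance.

```latex
\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},
\qquad
\tilde{t}_{gj} = \frac{\hat{\beta}_{gj}}{\tilde{s}_g \sqrt{v_{gj}}}
```

Under the null hypothesis $\tilde{t}_{gj}$ follows a t-distribution with $d_0 + d_g$ degrees of freedom, so the information borrowed from other genes shows up as extra degrees of freedom.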

Page 18: Kishor Presentation

Significance Analysis of Microarrays

measures differential expression, including in time course experiments
assesses the significance of differential expression using repeated permutations of the sample labels
supports several experimental designs
works efficiently even for small sample sizes
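The permutation idea can be sketched in base R for a two-class comparison. This is a simplified illustration, not the SAM software: it uses the SAM-style statistic d = (difference of means) / (s + s0) but takes s0 as a fixed assumed constant and reports a pooled permutation p-value, whereas SAM tunes s0 from the data and estimates false discovery rates from ordered statistics. The data and the helper `sam_d` are made up for the example.

```r
# SAM-style statistic d = (mean difference) / (s + s0) with label permutations.
set.seed(7)
n_genes <- 200
expr <- matrix(rnorm(n_genes * 10), nrow = n_genes)
labels <- rep(c(1, 2), each = 5)
expr[1:10, labels == 2] <- expr[1:10, labels == 2] + 3  # spiked-in genes

sam_d <- function(x, lab, s0) {
  n1 <- sum(lab == 1)
  n2 <- sum(lab == 2)
  x1 <- x[, lab == 1, drop = FALSE]
  x2 <- x[, lab == 2, drop = FALSE]
  diff <- rowMeans(x2) - rowMeans(x1)
  # Pooled standard error of the mean difference, gene by gene.
  ss <- rowSums((x1 - rowMeans(x1))^2) + rowSums((x2 - rowMeans(x2))^2)
  s <- sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
  diff / (s + s0)
}

s0 <- 0.1  # assumed constant; SAM itself chooses this value from the data
observed <- sam_d(expr, labels, s0)

# Null distribution: recompute d after shuffling the sample labels.
n_perm <- 200
null_d <- replicate(n_perm, sam_d(expr, sample(labels), s0))

# Permutation p-value per gene: share of null statistics at least as extreme.
p_perm <- sapply(abs(observed), function(d) mean(abs(null_d) >= d))
head(order(p_perm), 10)
```

The constant s0 in the denominator keeps genes with very small variance from producing large statistics by chance.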

Page 19: Kishor Presentation

Experimental designs supported by SAM (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)

Page 20: Kishor Presentation

Sample input format (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)

Page 21: Kishor Presentation

SAM statistics (Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002), Significance analysis of microarrays (SAM) software)

Page 22: Kishor Presentation
Page 23: Kishor Presentation

SAM plot

Page 24: Kishor Presentation

Limitations of gene based approaches

arbitrary cutoffs
overly stringent criteria (an effect of multiple hypothesis testing)
speculative selection
no efficient way to distinguish differential expression caused by experimental noise from a true biological signal
incoherence between the results of multiple microarray experiments

Page 25: Kishor Presentation

Gene set enrichment analysis (GSEA)

cross comparison and validation of multiple experiments with relevant biological motives

gene set based interrogation of microarray data

infer pathway enrichment / analysis and gene regulatory networks

biomarker detection

refinement or drilling down of gene lists

Page 26: Kishor Presentation

Methodology

1. Choose a ranking metric for sorting genes based on their correlation with the phenotype.

2. Compute a running-sum statistic (enrichment score) based on the overrepresentation of the gene set's members at the extremes of the rank-ordered list.

3. Estimate the significance of the enrichment score relative to a null distribution (empirical phenotype-based permutation test).

4. Normalize the enrichment score to take gene set size into account, then correct for multiple hypothesis testing by controlling the FDR, the estimated probability that a gene set with a given normalized enrichment score is a false positive.
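The running-sum statistic of step 2 can be sketched in base R as below. This is an unweighted, Kolmogorov-Smirnov style version written for illustration; the GSEA software additionally weights each step by the ranking metric. The function name, the simulated metric and the gene set are assumptions, and for brevity the null distribution here shuffles gene labels rather than phenotype labels as in step 3.

```r
# Unweighted running-sum enrichment score (Kolmogorov-Smirnov style).
enrichment_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  # Step up at each set member, step down at every other gene.
  steps <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(steps)
  # The score is the running-sum value with the largest deviation from zero.
  running[which.max(abs(running))]
}

set.seed(3)
genes <- paste0("g", 1:50)
# Ranking metric (simulated), sorted from most to least correlated.
metric <- sort(setNames(rnorm(50), genes), decreasing = TRUE)
# A gene set concentrated near the top of the ranked list.
gene_set <- names(metric)[c(1, 2, 4, 5, 8)]

es <- enrichment_score(names(metric), gene_set)

# Simplified null distribution obtained by shuffling the gene order.
null_es <- replicate(1000, enrichment_score(sample(names(metric)), gene_set))
p_value <- mean(null_es >= es)
c(ES = es, p = p_value)
```

Because the five set members sit at ranks 1, 2, 4, 5 and 8 of 50, the running sum peaks at rank 8 with a value of 1 - 3/45, which is the enrichment score returned.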

Page 27: Kishor Presentation

Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)

Page 28: Kishor Presentation

Leading edge subset

Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)

Page 29: Kishor Presentation

Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550)

Page 30: Kishor Presentation

Novel methodology based on gene set enrichment

gives the option of preserving gene-gene correlations while computing enrichment statistics.

user friendly tool with a programmatic interface (API).

availability of a curated gene set database (MSigDB, the Molecular Signatures Database)

Page 31: Kishor Presentation

Caveats

availability and requirement of pre-defined gene sets.

more knowledge based than discovery based when it comes to inferring biological mechanisms; this is mitigated to some extent by the extensive collection of gene sets provided through the MSig database.

Page 32: Kishor Presentation

Enriched gene sets in wild type (http://www.broad.mit.edu/gsea/resources/gsea_pnas_results/p53_C2.Gsea/gsea_report_for_WT_1130958999391.html)

Page 33: Kishor Presentation

Enriched gene sets in mutant (http://www.broad.mit.edu/gsea/resources/gsea_pnas_results/p53_C2.Gsea/gsea_report_for_MUT_1130958999391.html)

Page 34: Kishor Presentation

RNA interference

“RNA interference (RNAi), a form of post-transcriptional gene silencing induced by introduction of double-stranded RNA (dsRNA), has become a powerful experimental tool for studying gene function.” [7]

“For drug developers, RNAi phenotypes can provide clues about what to assay to screen antagonist drug candidates” [7].

Page 35: Kishor Presentation

Uses the principle of reverse genetics to understand changes in biological pathways by simultaneously knocking down (silencing) multiple genes.

depends on siRNA libraries built to target specific genes and proteins.

A carefully designed RNAi screen is equivalent to performing multiple gene knock-out microarray experiments.

Can be performed using siRNAs (better specificity) and miRNAs.

Page 36: Kishor Presentation

Endocytotic pathways

Endocytosis is a process in which molecules (cargos) are transported into the cytoplasm using membrane proteins.
  cell surface selection
  budding and pinching off
  recruited to target protein

Pathways can be inferred using high-resolution microscopy, which, combined with image processing tools, provides quantitative and qualitative information about endocytosed complexes.

Useful for understanding cell growth, development and pathogenesis.

Page 37: Kishor Presentation

Gene Ontology (Description)

Since the completion of the Human Genome Project, a major challenge has been the annotation and standardized dissemination of information related to genes and gene products.

GO is a consortium that derived an ontology by capturing and representing gene features and their relationships using directed acyclic graphs.

Accordingly, gene attributes are broadly classified into 3 categories:
1. Biological Process
2. Molecular Function
3. Cellular Component (subcellular localization)

Page 38: Kishor Presentation
Page 39: Kishor Presentation

biomaRt (Bioconductor interface to the BioMart software suite [http://www.biomart.org/])

(The biomaRt user’s guide, Steffen Durinck and Wolfgang Huber)

Page 40: Kishor Presentation
Page 41: Kishor Presentation
Page 42: Kishor Presentation

GO based functional characterization of gene sets using topGO

Biological interpretation of gene lists obtained from microarray or high-throughput screening platforms, using Gene Ontology and overlap statistics.

Useful not only for functional characterization of gene lists but also for providing clues about co-expressed genes.

Along with built-in statistical methods, it allows user-chosen statistics to be plugged in for assessing differential expression and enrichment of GO terms.
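The simplest overlap statistic for a single GO term is the hypergeometric (one-sided Fisher) test, sketched below in base R. The counts are invented for illustration, and the example does not use topGO itself; it only shows the kind of test that is applied to each GO term before any graph-aware adjustment such as elim or weight.

```r
# Over-representation of one GO term in a gene list (hypergeometric test).
universe_size <- 10000  # genes on the array (assumed)
term_size     <- 200    # genes annotated to the GO term (assumed)
list_size     <- 150    # differentially expressed genes (assumed)
overlap       <- 12     # genes in both the list and the term (assumed)

# P(X >= overlap) when drawing list_size genes without replacement.
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  list_size, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher exact test on a 2x2 table.
tab <- matrix(c(overlap, list_size - overlap,
                term_size - overlap,
                universe_size - term_size - list_size + overlap),
              nrow = 2,
              dimnames = list(in_term = c("yes", "no"),
                              in_list = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
c(hypergeometric = p_hyper, fisher = p_fisher)
```

With these counts only 3 overlapping genes would be expected by chance (150 x 200 / 10000), so an overlap of 12 indicates enrichment of the term in the gene list.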

Page 43: Kishor Presentation

Alexa et al. Bioinformatics, 22(13), 1600-1607, 2006

Page 44: Kishor Presentation

Elim
  reduces overlap by iteratively removing the genes of a significantly enriched node (GO term) from its ancestral nodes
  more stringent in reducing false positives than the weight algorithm
  works better with small values of k (number of differentially expressed genes)

Weight
  the score of a significant node is computed by down-weighting the scores of the genes it shares with its children
  significant nodes and the vector of weights are recursively updated

Page 45: Kishor Presentation

Alexa et al. Bioinformatics, 22(13), 1600-1607, 2006

Page 46: Kishor Presentation
Page 47: Kishor Presentation
Page 48: Kishor Presentation
Page 49: Kishor Presentation
Page 50: Kishor Presentation
Page 51: Kishor Presentation
Page 52: Kishor Presentation

Clone ID

Bergey's vs. Phylotypes

Below is the list of the classification tools we used in our analysis and the classification schema used by each tool.

Classification Tool    Classification Schema
NCBI's MegaBLAST       NCBI's taxonomy Hierarchy Browser
RDP II                 Bergey's Manual
RDPquery               Bergey's Manual
SIMO                   Bergey's Manual
Clone ID               MegaBLAST Phylotypes

Bergey's Manual is based on polyphasic numerical taxonomy and provides information about multiple phenotypic traits. Classification based on Bergey's Manual is complicated, expensive, and time consuming. In contrast, classification using 16S rRNA phylotypes is more objective, faster, and less expensive.

Page 53: Kishor Presentation
Page 54: Kishor Presentation

Relational Database Development

Normalization
  1st Normal Form
  2nd Normal Form
  3rd Normal Form
  BCNF
E-R diagrams
Joins (outer, inner, self)
Aggregate functions (sum, count, min, ...)
Miscellaneous (decode, nvl, instr, ...)

Page 55: Kishor Presentation
Page 56: Kishor Presentation

References

1. http://cran.r-project.org/

2. Subramanian, Tamayo, et al. (2005, PNAS 102, 15545-15550) and Mootha, Lindgren, et al. (2003, Nat Genet 34, 267-273).

3. Chu, G., Narasimhan, B., Tibshirani, R. & Tusher, V. (2002). Significance analysis of microarrays (SAM) software.

4. Adrian Alexa, Jörg Rahnenführer, Thomas Lengauer. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13), 1600-1607, 2006.

5. http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html

6. http://www.geneontology.org/

7. Caroline Schmitz, Parag Kinge, and Harald Hutter. Axon guidance genes identified in a large-scale RNAi screen using the RNAi-hypersensitive Caenorhabditis elegans strain nre-1(hd20) lin-15b(hd126).

8. Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3.

9. Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420.

Page 57: Kishor Presentation

Acknowledgements

George Mason University
  Glenda Wilson (MS advisor)
  Dr. Patrick Gillevet (thesis advisor)
  Prof. James Willett

GSK
  Amy Creech (supervisor) and Workbench team

Vanderbilt University
  Prof. Frank Harrell (supervisor)
  Dr. Christine Konradi
  Dr. Jay Snoddy
  Dr. Karoly Mirnics
  Dr. Lily Wang
  Dr. Jeff Franklin

NCBS
  Prof. Satyajit Mayor
  Dr. Gagan Gupta
  Mr. Gautam Dey

BITS, Pilani
  Dr. V.S Rao
  Dr. N.V.Muralidhara Rao
  Dr. A.P.Koley