
__________________________________________________________________________________________________

10/16/2015 GCBA 815

Tools and Algorithms in Bioinformatics

GCBA815, Fall 2015

Week-8: WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)

Simarjeet K. Negi, Ph.D. candidate

(Guda Lab)

Department of Genetics, Cell Biology and Anatomy

University of Nebraska Medical Center


Why perform enrichment analysis?

• Large gene lists result from high-throughput analysis
• Deciphering the biology
• Organize expression changes into meaningful functional themes
• Gene enrichment analysis increases the likelihood of identifying the molecular processes/functions most pertinent to the study


Principle of Enrichment Analysis

• If a biological process is abnormal in a given study, the co-functioning genes should have a higher (enriched) potential to be selected as a relevant group by high-throughput screening technologies
• The analytic conclusion is based on a group of relevant genes rather than on individual genes, which increases the likelihood of identifying the biological processes most pertinent to the study
• Enrichment tools map a large number of ‘interesting’ genes to biological annotation terms (e.g. GO terms or pathways)
• The enrichment of user genes for each annotation term is examined statistically by comparing the outcome to a control (or reference) background
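The comparison against a reference background can be sketched with a hypergeometric upper-tail probability. This is only an illustration under stated assumptions: the slides do not name the test, the function name `hypergeom_enrichment_p` is made up for this sketch, and the hit counts in the usage note are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(list_hits, list_size, background_hits, background_size):
    """Probability of seeing at least `list_hits` annotated genes when
    `list_size` genes are drawn without replacement from a background of
    `background_size` genes, `background_hits` of which carry the term."""
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    # Sum the hypergeometric probabilities from the observed count upwards.
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total
```

For example, `hypergeom_enrichment_p(40, 487, 300, 11521)` would ask how surprising 40 annotated genes are in a 487-gene list when 300 of 11521 background genes carry the term (the 40 and 300 are invented counts). A small p-value suggests the term is overrepresented in the list.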


Classification of Enrichment Tools

• Based on differences in their algorithms, current enrichment tools can be broadly divided into three classes:
  • Singular enrichment analysis (SEA), e.g. WebGestalt (overrepresentation approach)
  • Gene set enrichment analysis (GSEA), e.g. GSEA (aggregate score approach)
  • Modular enrichment analysis (MEA), e.g. DAVID (overrepresentation approach)
• Note: some tools with diverse capabilities belong to more than one class


WebGestalt : WEB-based Gene SeT AnaLysis Toolkit

(http://bioinfo.vanderbilt.edu/webgestalt/)


WebGestalt: WEB-based Gene SeT AnaLysis Toolkit

• Input: the user’s preselected ‘interesting’ genes (e.g. genes differentially expressed between experimental and control samples)
• Tests the enrichment of each annotation term iteratively, one by one, in a linear mode
• Integrates functional enrichment analysis with information visualization
• Constantly updated
• Efficiently processes large gene lists
• Weakness: the list of output terms can be large, which dilutes the focus and obscures the interrelationships among relevant terms
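The term-by-term testing described above can be sketched as a loop over annotation terms followed by a multiple-testing adjustment. This is a minimal illustration, not WebGestalt's code: the hypergeometric test, the Benjamini-Hochberg adjustment, and all function names are assumptions made for this sketch.

```python
from math import comb

def upper_tail_p(k, n, K, N):
    """Hypergeometric P(X >= k): n genes drawn from N, K of them annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def singular_enrichment(gene_list, background, term_to_genes):
    """Test every term independently; return (term, p, adjusted p) rows."""
    background = set(background)
    gene_list = set(gene_list) & background
    terms, pvals = [], []
    for term, members in term_to_genes.items():
        members = set(members) & background
        hits = len(members & gene_list)
        terms.append(term)
        pvals.append(upper_tail_p(hits, len(gene_list), len(members), len(background)))
    adjusted = benjamini_hochberg(pvals)
    return sorted(zip(terms, pvals, adjusted), key=lambda row: row[1])
```

Because every term is tested on its own, related terms can all pass the threshold at once, which is one way the long, redundant output mentioned in the weakness above arises.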


DAVID: Database for Annotation, Visualization and Integrated

Discovery (https://david.ncifcrf.gov/home.jsp)


DAVID: Database for Annotation, Visualization and Integrated Discovery

• DAVID inherits the basic enrichment calculation found in WebGestalt
• Input: user-defined gene list
• Incorporates extra network discovery algorithms that consider term-to-term relationships
• Improves discovery sensitivity and specificity by considering the interrelationships of GO terms in the enrichment calculations
• Joint terms may carry a unique biological meaning for a given study that individual terms do not
• Weakness: not updated in recent years; the user’s input gene list is limited to 3000 genes
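One way to quantify a term-to-term relationship is Cohen's kappa on the gene memberships of two terms. The slides do not say which measure DAVID uses, so the choice of kappa here is an assumption for illustration, and `term_kappa` is a made-up name.

```python
def term_kappa(genes_a, genes_b, universe):
    """Cohen's kappa between two annotation terms, treating each gene in
    `universe` as one observation (annotated / not annotated)."""
    universe = set(universe)
    a_set = set(genes_a) & universe
    b_set = set(genes_b) & universe
    n = len(universe)
    both = len(a_set & b_set)
    only_a = len(a_set - b_set)
    only_b = len(b_set - a_set)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Agreement expected by chance from the marginal annotation rates.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a placeholder choice
    return (observed - expected) / (1.0 - expected)
```

Terms with kappa close to 1 annotate nearly the same genes and are candidates for grouping into one module; values near 0 indicate no more sharing than expected by chance.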


GSEA: Gene Set Enrichment Analysis

(http://www.broadinstitute.org/gsea/)

• Identifies the enriched pathways/gene sets between two biological states
• The program uses an underlying database (MSigDB) of about 11,000 gene sets, including KEGG and BIOCARTA pathways, curated sets from disease states, etc.


Using MSigDB

• Search
• Browse
• Examine gene sets
• Investigate
• Download

[Figure: Seven Broader Collections of GSEA]


GSEA: Gene Set Enrichment Analysis

• GSEA program (download to your PC)
• Input: expression dataset (covering two conditions); phenotype labels for the two states; gene sets in gmx/gmt format (MSigDB, supplied by GSEA)
• GSEA implements a ‘no-cutoff’ strategy: it takes all genes from a microarray experiment without first selecting significant genes (e.g. genes with P-value ≤ 0.05 and fold change ≥ 2)
• The GSEA method requires a summarized biological value (e.g. fold change) for each gene
• Weaknesses:
  • It can be difficult to summarize many biological aspects of a gene into one meaningful value; examples: SNP arrays, clinical microarray studies
  • GSEA is less powerful at detecting a gene set that mixes genes with positive and negative associations with the phenotype
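The summarized per-gene value can be sketched with a signal-to-noise ratio between the two phenotype classes. The slide only gives fold change as an example, so the metric, the function names, and the toy data layout are assumptions; the safeguards a real tool applies to very small standard deviations are omitted.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b); needs >= 2 samples per class."""
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        raise ValueError("no variation in either class")
    return (mean(values_a) - mean(values_b)) / spread

def rank_genes(expression, labels, class_a, class_b):
    """expression: {gene: [value per sample]}; labels: [class per sample].
    Returns (gene, score) pairs sorted from most A-associated to most
    B-associated, with no significance cutoff applied."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Every gene stays in the ranked list, which is the point of the no-cutoff strategy: the gene set statistic is computed from the whole ranking rather than from a preselected subset.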


Tutorial


WebGestalt: example dataset

• 487 colorectal cancer prognosis genes downloaded from Shi et al. 2012
• 11521 genes as the reference gene set, taken from the protein-protein interaction network used in the same paper
• Genes are from a human study


WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)

Screenshot labels (gene list upload): hsapiens; hsapiens_gene_symbol; Colorectal_cancer_genes


Screenshot labels (reference set): PPI_network; hsapiens_gene_symbol


GO Analysis

Nodes with red labels represent enriched categories; nodes with black labels represent their non-enriched parents


KEGG Analysis

Genes highlighted in red in the pathway map are the genes present in the user input


DAVID: example dataset

• 408 genes involved in the cellular responses to HIV envelope protein infection in resting or suboptimally activated peripheral blood mononuclear cells; Cicala et al. 2002
• Affymetrix U95A microarray chip (genome-wide expression) as the reference gene set


HIV_genes

• When multiple species pop up, click on the species of interest and press ‘Select Species’
• If multiple gene lists are open in the program, select the gene list of interest and click on ‘Use’


Percentage, e.g. 33/398 (involved genes/total genes)


KEGG Pathway

BIOCARTA

Genes from the user’s list are marked with red stars


Table Report is a gene-centric view that lists the genes and their associated annotation terms (selected terms only). No statistics are applied in this report.


• User input genes classified into big gene functional groups
• Measure of the importance of a gene group in the user’s gene list
• Key biology of this gene group
• Check if there are any other genes in the gene list or in the genome functionally similar to this gene group
• How the members share common annotations/biology


GSEA dataset

• Transcriptional profiles from p53+ and p53 mutant cancer cell lines
• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct
  'Collapsed' refers to datasets whose identifiers (i.e. Affymetrix probe set IDs) have been replaced with gene symbols
• Phenotype labels (e.g. tumor vs normal): P53.cls
• Gene set: c1.v2.symbols.gmt
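Replacing probe set IDs with gene symbols means several probe rows can map to one symbol. The sketch below keeps the per-sample maximum across such probes, which is only one possible rule and an assumption of this example; the probe IDs, symbols, and function name are hypothetical.

```python
def collapse_to_symbols(probe_values, probe_to_symbol):
    """probe_values: {probe_id: [value per sample]}.
    Returns {symbol: [value per sample]}, taking the per-sample maximum
    when several probes map to the same symbol. Unmapped probes are dropped."""
    collapsed = {}
    for probe, values in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue
        if symbol in collapsed:
            collapsed[symbol] = [max(old, new)
                                 for old, new in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed
```

With a probe-to-symbol mapping taken from a chip annotation file, the result has one row per gene symbol, which is what lets the rows be matched against gene sets written in symbols.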


http://www.broadinstitute.org/gsea/datasets.jsp

GSEA: Gene Set Enrichment Analysis

(http://www.broadinstitute.org/gsea/)


GCT file format; expression data file
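The slide's screenshot of the file did not survive extraction, so here is a minimal reader based on the commonly documented GCT layout: a `#1.2` version line, a line with the row and column counts, a tab-separated header (`NAME`, `Description`, then sample names), and one tab-separated row per gene or probe. The function name is made up for this sketch and the layout should be checked against the GSEA documentation.

```python
def parse_gct(text):
    """Return (sample_names, {row_name: [float per sample]}) from GCT text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if lines[0].strip() != "#1.2":
        raise ValueError("expected '#1.2' on the first line")
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    samples = lines[2].split("\t")[2:]  # skip NAME and Description
    rows = {}
    for line in lines[3:]:
        fields = line.split("\t")
        rows[fields[0]] = [float(v) for v in fields[2:]]
    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("dimension line does not match the data")
    return samples, rows
```

Reading a file would then be `parse_gct(open("P53_collapsed.gct").read())`, assuming the file follows this layout.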


CLS file format; phenotype file
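As with the GCT slide, the screenshot is missing, so this sketch follows the commonly documented categorical CLS layout: line 1 holds the number of samples, the number of classes, and a constant 1; line 2 starts with `#` and names the classes; line 3 gives one label per sample, with labels matched to class names in order of first appearance. The function name is an assumption of this example.

```python
def parse_cls(text):
    """Return the class name of each sample, in sample order."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    if not lines[1].startswith("#"):
        raise ValueError("second line must start with '#'")
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("header counts do not match the labels")
    # The first distinct label maps to the first class name, and so on.
    seen = []
    for label in labels:
        if label not in seen:
            seen.append(label)
    if len(seen) != n_classes:
        raise ValueError("unexpected number of distinct labels")
    label_to_name = dict(zip(seen, class_names))
    return [label_to_name[label] for label in labels]
```

The returned list lines up with the sample columns of the expression file, so sample i in the GCT header belongs to the class at position i.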


GMT file format; gene sets
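A GMT file is commonly documented as one gene set per line, tab-separated: set name, a description field, then the member genes. The reader below is a sketch under that assumption; the function name and the set names in the usage note are hypothetical.

```python
def parse_gmt(text):
    """Return {set_name: {"description": str, "genes": [str, ...]}}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("a GMT line needs a name and a description")
        name, description = fields[0], fields[1]
        # Rows may have different lengths; drop empty trailing cells.
        genes = [g for g in fields[2:] if g]
        gene_sets[name] = {"description": description, "genes": genes}
    return gene_sets
```

For instance, a line such as `SET_ONE<TAB>na<TAB>TP53<TAB>MDM2` would yield a set named `SET_ONE` with two member genes.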



GSEA: Gene Set Enrichment Analysis


ftp.broad.mit.edu://pub/gsea/annotations/HG_U95Av2.chip



Interpreting GSEA Results

GSEA Statistics

GSEA computes four key statistics for the gene set enrichment analysis report:

● Enrichment Score (ES)

● Normalized Enrichment Score (NES)

● False Discovery Rate (FDR)

● Nominal P Value
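How NES and the nominal p-value relate to the ES can be sketched from a list of ES values obtained by permutation. This follows the usual description of the method (scale by the mean of the same-sign null scores, and take the p-value from the same-sign portion of the null), but the function name and interface are assumptions. FDR needs the null distributions of all gene sets together and is not sketched here.

```python
def nes_and_nominal_p(observed_es, null_es):
    """observed_es: ES of the real data; null_es: ES values from permutations.
    Returns (normalized enrichment score, nominal p-value)."""
    positive = observed_es >= 0
    same_sign = [es for es in null_es if (es >= 0) == positive]
    if not same_sign:
        raise ValueError("no permutation ES with the same sign")
    mean_abs = sum(abs(es) for es in same_sign) / len(same_sign)
    if mean_abs == 0:
        raise ValueError("null scores are all zero")
    nes = observed_es / mean_abs
    # Fraction of same-sign null scores at least as extreme as the observed one.
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed_es))
    return nes, extreme / len(same_sign)
```

Dividing by the null mean is what makes scores comparable across gene sets of different sizes, since larger or smaller sets produce systematically different raw ES ranges.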


Enrichment plot; Enrichment Score (ES)

• The enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes
• GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not
• The magnitude of the increment depends on the correlation of the gene with the phenotype
• The ES is the maximum deviation from zero encountered in walking the list
• A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list
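The walk down the ranked list can be sketched as a weighted running sum. This is a simplified illustration of the description above, not the GSEA program's code; the function name and the `weight` parameter (an exponent on the absolute ranking score) are assumptions of this sketch.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """ranked: [(gene, score), ...] sorted from highest to lowest score.
    Returns (ES, running_sum) where ES is the running-sum value with the
    largest deviation from zero."""
    gene_set = set(gene_set)
    # Hit increments are proportional to |score| ** weight.
    hit_weights = [abs(score) ** weight if gene in gene_set else 0.0
                   for gene, score in ranked]
    n_hits = sum(1 for gene, _ in ranked if gene in gene_set)
    n_miss = len(ranked) - n_hits
    hit_total = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or hit_total == 0:
        raise ValueError("need hits with non-zero scores and at least one miss")
    running, current = [], 0.0
    for (gene, _), w in zip(ranked, hit_weights):
        if gene in gene_set:
            current += w / hit_total   # step up on a gene in the set
        else:
            current -= 1.0 / n_miss    # step down on any other gene
        running.append(current)
    return max(running, key=abs), running
```

Plotting `running` against list position gives the shape of the enrichment plot: an early peak for sets concentrated at the top of the list, a late trough for sets concentrated at the bottom.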


GSEA Report


• Leading edge analysis identifies the subset of genes that actually contribute to the enrichment score (ES)
• The leading edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero
• Outputs heat maps and set-to-set overlaps of leading edge subsets between pairs of enriched gene sets
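Extracting a leading edge subset from a running sum can be sketched as below. The function name and inputs are assumptions of this example. The handling of a negative ES (taking the set members after the trough, at the bottom of the list) is the mirror-image convention and is not stated on the slide.

```python
def leading_edge(ranked_genes, gene_set, running):
    """ranked_genes: genes in ranked order; running: running-sum value after
    each gene. Returns the gene set members that drive the ES."""
    gene_set = set(gene_set)
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    if running[peak] >= 0:
        window = ranked_genes[: peak + 1]   # at or before the maximum
    else:
        window = ranked_genes[peak:]        # from the trough to the end
    return [gene for gene in window if gene in gene_set]
```

Set members outside the returned window still belong to the gene set, but they appear after the running sum has already turned back toward zero and so do not contribute to the ES.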


Interpreting Leading Edge Analysis Results

Panels: Heat Map, Gene in Subsets, Histogram, Set-to-Set


• Heat map: shows the (clustered) genes in the leading edge subsets. Expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest)
• Set-to-Set graph: uses color intensity to show the overlap between subsets; the darker the color, the greater the overlap
• Gene in Subsets graph: shows each gene and the number of subsets in which it appears
• Histogram: the Jaccard is the intersection divided by the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jaccard = 0)
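The overlap measure behind the histogram and the set-to-set graph can be sketched as a Jaccard index computed for every pair of leading edge subsets. The function names and the input layout (a dict from gene set name to its leading edge genes) are assumptions of this example.

```python
from itertools import combinations

def jaccard(subset_a, subset_b):
    """Intersection size divided by union size; 0.0 for two empty subsets."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(leading_edges):
    """leading_edges: {gene_set_name: iterable of genes}.
    Returns {(name_1, name_2): jaccard} for every unordered pair."""
    return {
        (x, y): jaccard(leading_edges[x], leading_edges[y])
        for x, y in combinations(sorted(leading_edges), 2)
    }
```

Binning the values returned by `pairwise_jaccard` gives the histogram described above; a value of 0 means the two leading edge subsets share no genes.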