New Analysis of large datasets - Part1 · 2013. 6. 28. · Part1: Analysis tools for large datasets...


SIB course 4-8 Feb 2008
Statistical analysis applied to genome and proteome analyses

Sven Bergmann
Department of Medical Genetics
University of Lausanne
Rue de Bugnon 27 - DGM 328
CH-1005 Lausanne
Switzerland

work: ++41-21-692-5452
cell: ++41-78-663-4980

http://serverdgm.unil.ch/bergmann

Part 1: Analysis tools for large datasets

• Standard tools: k-means, PCA, SVD

• Modular analysis tools: CTWC, ISA, PPA

Motivations

Why study a large heterogeneous set of expression data?

Large: Better signals from noisy data!

Heterogeneous: Global view of the transcription program!

Supervised vs. unsupervised approaches
Large genome-wide data may contain answers to questions we do not ask! Need for both hypothesis-driven and exploratory analyses!

How to get large-scale expression data?

Pool genome-wide expression measurements from many experiments!

[Figure: expression matrices from sets of specific conditions (stress, cell-cycle) are pooled into one large-scale expression data matrix of genes (about 6000) by diverse conditions (about 1000)]

How to make sense of millions of numbers?

New Analysis and Visualization Tools are needed!

Hundreds of samples

Thousands of genes

K-means Clustering

http://en.wikipedia.org/wiki/K-means_algorithm

“guess” k=3 (# of clusters)

1. Start with random positions of centroids

2. Assign each data point to closest centroid

3. Move centroids to center of assigned points

Iterate steps 2-3 until the cost is minimal:

$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2$

with k clusters $S_i$, i = 1,2,...,k and centroids $\mu_i$ (the mean point of all the points $x_j \in S_i$)

K-means Clustering

Plus:
• visual
• intuitive
• relatively fast

Minus:
• have to “guess” number of clusters
• can give different results for distinct “starting seeds”
• distances computed over all features
• one cluster only per element
• no cluster hierarchy
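
The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the slides: the function name `kmeans`, the toy data, and the choice to initialise the centroids at randomly chosen data points (rather than arbitrary random positions) are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch; returns (labels, centroids, cost)."""
    rng = np.random.default_rng(seed)
    # 1. Start with random centroids (assumption: k distinct data points)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each data point to the closest centroid
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. Move each centroid to the mean of its assigned points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change the centroids
        centroids = new_centroids
    # Cost: sum of squared distances of points to their centroid
    cost = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, cost

# Hypothetical toy data: three blobs of 20 points in 2D
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids, cost = kmeans(blobs, k=3)
print(centroids, cost)
```

Running the function with different `seed` values is a simple way to see the "starting seed" dependence listed under Minus.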

Hierarchical Clustering

Plus:
• Shows (re-ordered) data
• Gives hierarchy

Minus:
• Does not work well for many genes (usually apply cut-off on fold-change)
• Similarity over all genes/conditions
• Clusters do not overlap

Principal Component Analysis

Principal components (PCs) are projections onto the subspace with the largest variation in the data

http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Example: 2 PCs for 3d-data

http://ordination.okstate.edu/PCA.htm

1. Raw data points: {a, …, z}

2. Normalized data points: zero mean (& unit std)!

3. Identification of axes with the most variance: most variance is along PCA1; the direction of most variance perpendicular to PCA1 defines PCA2

4. Cluster?

Reminder: Matrix multiplications

Definition: $(A \cdot B)_{ij} = \sum_{k} A_{ik} B_{kj}$

[Figures: scheme, vectorized form and example]

http://en.wikipedia.org/wiki/Matrix_multiplication

How do we get the PCs?

• The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:

$C = E^T \cdot E / (n-1)$

$C \cdot pc = \lambda \cdot pc$

[Diagram: with E of size 6k × 300, $E^T \cdot E / (n-1)$ gives C of size 300 × 300; each pc is a 300 × 1 vector and λ is its eigenvalue]

PCA: Example deletion mutants

And how to project?

• The projected data is just the product of the original data with the PCs:

$E' = E \cdot PC$

• Principal Component or Transformation Matrix: $PC = [pc_1, pc_2, \ldots, pc_n]$ (where n is the number of PCs used)

[Diagram: E (6k × 300) · PC (300 × n) = E' (6k × n)]

• The original gene expression profiles are over 300 arrays.

• The transformed data contain projections on n “eigen-genes” (linear combinations of the 300 arrays, shown in red)
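
The two steps (eigenvectors of the covariance matrix, then projection) can be sketched with NumPy as follows. The toy matrix, the variable names, and the reading of n in $C = E^T \cdot E / (n-1)$ as the number of rows of E are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 genes (rows) x 6 arrays (columns)
E = rng.normal(size=(50, 6))
E[:, 1] += 2.0 * E[:, 0]        # correlate two arrays to create a dominant direction

E = E - E.mean(axis=0)          # mean-center each array (column)
n = E.shape[0]                  # assumption: n is the number of rows of E
C = E.T @ E / (n - 1)           # covariance matrix, here 6 x 6

# Eigen-decomposition of the symmetric matrix C; eigh returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_pcs = 2
PC = eigvecs[:, :n_pcs]         # transformation matrix [pc1, pc2]
E_proj = E @ PC                 # projected data E' = E . PC, here 50 x 2
print(E_proj.shape, eigvals[:n_pcs])
```

The variance of each projected column equals the corresponding eigenvalue, which is why the leading PCs capture the largest variation.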

PCA: Example deletion mutants

[Figure: scatter plot of PCA2 against PCA1]

The first 2 “eigen-genes” separate data into 3 clusters

[Figure: scatter plot of PCA3 against PCA1]

Third “eigen-gene” (PCA3) reveals little structure!

Singular Value Decomposition

$E = U \cdot D \cdot V^T$

U: PC matrix of “eigen-arrays” (composed of eigenvectors of $C' = E \cdot E^T$)

D: diagonal matrix

V: PC matrix of “eigen-genes” (composed of eigenvectors of $C = E^T \cdot E$)

“SVD = bi-PCA”

http://public.lanl.gov/mewall/kluwer2002.html

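SVD: Matrix representation

$E = U \cdot D \cdot V^T$

[Diagram: E (6k × 300) = U (6k × n, columns $u_1, u_2, \ldots, u_n$) · D (n × n, diagonal entries $\lambda_1, \lambda_2, \ldots, \lambda_n$, zeros elsewhere) · $V^T$ (n × 300, rows $v_1, v_2, \ldots, v_n$)]

$u_i$: eigen-arrays
$v_i$: eigen-genes
$\lambda_i$: singular values (their squares $\lambda_i^2$ are the eigenvalues of $E^T \cdot E$ and $E \cdot E^T$)
i = 1, …, n
n: rank(E) = #(independent arrays)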

Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.

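SVD: What is optimized?

$E = U \cdot D \cdot V^T = \sum_i \lambda_i \cdot u_i \cdot v_i^T$ (full expansion)

$E_1 = \lambda_1 \cdot u_1 \cdot v_1^T$ (rank-1 expansion)

$\Delta = |E - E_1|^2$ (sum of squared residuals)

Minimize Δ for free $u_1$ and $v_1$:

$E \cdot v_1 = \lambda_1 \cdot u_1$ and $E^T \cdot u_1 = \lambda_1 \cdot v_1$

implying:

$E \cdot E^T \cdot u_1 = \lambda_1^2 \cdot u_1$ and $E^T \cdot E \cdot v_1 = \lambda_1^2 \cdot v_1$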

Bergmann et al., Phys. Rev. E 67, 031902 (2003)
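
The relations for the rank-1 expansion can be checked numerically. The following is a small sketch using `numpy.linalg.svd` on a random toy matrix; the matrix size and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(40, 8))            # hypothetical toy data: 40 genes x 8 arrays

# Thin SVD: U is 40 x 8, d holds the singular values, Vt is 8 x 8
U, d, Vt = np.linalg.svd(E, full_matrices=False)

E_full = (U * d) @ Vt                   # full expansion: sum_i lambda_i u_i v_i^T
u1, v1, lam1 = U[:, 0], Vt[0], d[0]
E1 = lam1 * np.outer(u1, v1)            # rank-1 expansion
delta = ((E - E1) ** 2).sum()           # sum of squared residuals

# Conditions satisfied by the optimal u1, v1
print(np.allclose(E @ v1, lam1 * u1))
print(np.allclose(E.T @ u1, lam1 * v1))
print(np.allclose(E.T @ E @ v1, lam1 ** 2 * v1))
```

The residual `delta` equals the sum of the squared remaining singular values, so keeping the largest one first gives the best rank-1 approximation.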

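SVD: Example deletion mutants

$E_1 = \lambda_1 \cdot u_1 \cdot v_1^T$

$$E_1 = \lambda_1 \begin{pmatrix} u_1(1) v_1(1) & \cdots & u_1(1) v_1(300) \\ \vdots & & \vdots \\ u_1(6k) v_1(1) & \cdots & u_1(6k) v_1(300) \end{pmatrix}$$

[Diagram: the outer product of the column vector $u_1$ (6k entries) with the row vector $v_1$ (300 entries), scaled by $\lambda_1$, gives a 6k × 300 matrix with blocks of high and low values]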

SVD: Example deletion mutants

[Figure, shown for n = 1, 2 and 3: four panels: original data (genes × arrays), U (genes × eigen-arrays), $V^T$ (eigen-genes × arrays), and SVD(data) = $U D V^T$ (genes × arrays); color scale from -1 to 1]

Part 1: Analysis tools for large datasets

• Standard tools: k-means, PCA, SVD

• Modular analysis tools: CTWC, ISA, PPA

How to extract biological information from large-scale expression data?

[Figure: large-scale expression data matrix]

Hierarchical clustering and other correlation-based methods may be good for small data sets, but:

Problems with large data:
• Clusters cannot overlap!
• Clustering based on correlations over all conditions:
  - sensitive to noise
  - computation intensive

Search for transcription modules:

Set of genes co-regulated under a certain set of conditions

• context specific

• allow for overlaps

Overview of “modular” analysis tools

• Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)

• Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)

• Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)

• Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)

• Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)

• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003.)

… and many more! http://serverdgm.unil.ch/bergmann/Publications/review.pdf

Coupled two-way Clustering


How to “hear” the relevant genes?

Song A

Song B

Inside CTWC: Iterations

Depth | Genes | Samples
Init | G1 | S1
1 | G1(S1): G2, G3, … G5 | S1(G1): S2, S3
2 | G1(S2), G1(S3): G6, G7, … G13; G14, … G21 | S1(G2) … S1(G5): S4, S5, S6; S10, S11; None
3 | G2(S1) … G2(S3) … G5(S1) … G5(S3): G22 … G97 | S2(G1) … S2(G5), S3(G1) … S3(G5): S12, … S51
4 | G1(S4) … G1(S11): G98,..G105 … G151,..G160 | S1(G6) … S1(G21): S52, ... S67
5 | G2(S4) ... G2(S11) … G5(S4) ... G5(S11): G161 … G216 | S2(G6) ... S2(G21), S3(G6) … S3(G21): S68 … S113

Two-way clustering

One example in more detail:

The (Iterative) Signature Algorithm:

• No need for correlations!

• decomposes data into “transcription modules”

• integrates external information

• allows for interspecies comparative analysis

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

Trip to the “Amazon”: How to find related items?

[Figure: items × customers matrix. Your choice selects customers with similar choice, who in turn select recommended items.]

False Positives:

How to find related genes?

[Figure: genes × conditions matrix. Your guess selects relevant conditions, which in turn select similarly expressed genes.]

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

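Signature Algorithm: Score definitions

initial guesses (genes): $G_I$

condition scores: $s_c = \langle E^{G}_{gc} \rangle_{g \in G_I}$

thresholding: $S_C = \{ c \in C : s_c - \langle s_c \rangle_{c \in C} > t_C \sigma_C \}$

gene scores: $s_g = \langle s_c E^{C}_{gc} \rangle_{c \in S_C}$

thresholding: $S_G = \{ g \in G : s_g - \langle s_g \rangle_{g \in G} > t_G \sigma_G \}$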

How to find related genes? Scores and thresholds!

[Figure: gene scores and condition scores, each followed by thresholding]

Iterative Signature Algorithm

INPUT → OUTPUT; iterate until OUTPUT = INPUT: “Transcription Module”

SB, J Ihmels & N Barkai Physical Review E (2003)

Identification of transcription modules using many random “seeds”

random “seeds” → Transcription modules

Independent identification: Modules may overlap!
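
The scoring, thresholding and iteration from a seed can be sketched roughly as below. This is a simplified illustration and not the published implementation: the function names, the z-score normalisations used for the two versions of the expression matrix, the use of the standard deviation of the scores as σ, the threshold values and the toy data are all assumptions.

```python
import numpy as np

def zscore(x, axis):
    # Normalise to zero mean and unit standard deviation along the given axis
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)

def signature_step(E_G, E_C, genes, t_C=1.0, t_G=1.0):
    """One pass: input genes -> condition scores -> gene scores -> output genes."""
    s_c = E_G[genes].mean(axis=0)                        # condition scores
    conds = np.flatnonzero(s_c - s_c.mean() > t_C * s_c.std())
    if conds.size == 0:
        return np.array([], dtype=int), conds
    s_g = (E_C[:, conds] * s_c[conds]).mean(axis=1)      # gene scores
    new_genes = np.flatnonzero(s_g - s_g.mean() > t_G * s_g.std())
    return new_genes, conds

def isa(E, seed_genes, max_iter=50, **thresholds):
    """Iterate the signature step until OUTPUT = INPUT (or max_iter is reached)."""
    E_G = zscore(E, axis=0)     # assumption: normalised over genes per condition
    E_C = zscore(E, axis=1)     # assumption: normalised over conditions per gene
    genes = np.sort(np.asarray(seed_genes))
    conds = np.array([], dtype=int)
    for _ in range(max_iter):
        new_genes, conds = signature_step(E_G, E_C, genes, **thresholds)
        converged = new_genes.size == 0 or np.array_equal(new_genes, genes)
        genes = new_genes
        if converged:
            break
    return genes, conds

# Hypothetical toy data: 200 genes x 40 conditions with one planted module
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 40))
E[:20, :10] += 5.0              # genes 0-19 are co-regulated in conditions 0-9
seed = np.concatenate([np.arange(10),
                       rng.choice(np.arange(20, 200), 10, replace=False)])
genes, conds = isa(E, seed, t_C=1.0, t_G=1.8)
print(genes, conds)
```

Calling `isa` with many different random seeds and collecting the distinct fixed points corresponds to the independent identification of modules; since each run is independent, the resulting gene sets may overlap.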

New Tools: Module Visualization

http://serverdgm.unil.ch/bergmann/Fibroblasts/visualiser.html

Gene enrichment analysis

The hypergeometric distribution f(M,A,K,T) gives the probability that K out of A genes with a particular annotation match with a module having M genes if there are T genes in total.

http://en.wikipedia.org/wiki/Hypergeometric_distribution
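
As a small worked illustration, the hypergeometric probability can be computed directly from binomial coefficients. The function names and the example numbers below are assumptions; the argument names follow the symbols M, A, K and T used above. The enrichment p-value (probability of an overlap of at least K) is a common companion quantity and is included here as an assumption about how the distribution would be used.

```python
from math import comb

def hypergeom_pmf(M, A, K, T):
    """Probability that exactly K of the A annotated genes fall in a module of
    M genes drawn from T genes in total."""
    if K < 0 or K > min(A, M) or M - K > T - A:
        return 0.0
    return comb(A, K) * comb(T - A, M - K) / comb(T, M)

def enrichment_pvalue(M, A, K, T):
    """Probability of observing an overlap of K or more by chance."""
    return sum(hypergeom_pmf(M, A, k, T) for k in range(K, min(A, M) + 1))

# Small example that can be checked by hand: T=10 genes, A=4 annotated,
# module of M=3 genes, overlap K=2
print(hypergeom_pmf(3, 4, 2, 10))       # 6 * 6 / 120 = 0.3
print(enrichment_pvalue(3, 4, 2, 10))   # (36 + 4) / 120 = 1/3

# Hypothetical genome-scale example: 6000 genes, 100 annotated, module of 50, overlap 10
print(enrichment_pvalue(50, 100, 10, 6000))
```

A small p-value indicates that the overlap between the module and the annotation class is unlikely to arise by chance.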

Decomposing expression data into annotated transcriptional modules

identified >100 transcriptional modules in yeast:

high functional consistency!

many functional links “waiting” to be verified experimentally

J Ihmels, SB & N Barkai Bioinformatics 2005

Module hierarchies and networks

Higher-order structure

[Figure: network of modules with correlated and anti-correlated links]

The challenge of many datasets: How to integrate all the information?

• Organisms

• Conditions: Developmental, Physiological, Environmental, Experimental, Clinical

• Data types:
  – Protein expression
  – Tissue specific expression
  – Interaction data
  – Localization data
  – …?

→ Biological Insight

Mapping Transcription Modules

BLAST

signature algorithm

For distant organisms correlation patterns generally are distinct

SB, J Ihmels & N Barkai PLoS Biology (2004)

What about related organisms?

J Ihmels, SB, J Berman & N Barkai Science (2005)

[Figure: pairwise correlation (over all arrays) between genes]

Promoter analysis: The “Rapid Growth Element” AATTTT

Data Integration: Example NCI60

Our (modular) approach: The model

[Diagram: Gene-modules, Drug-modules and Co-modules; labels G1–G4, D1–D4, C1–C6, F1–F6, [AGF], [BFC], [CDF]]

Modules and Co-modules

[Diagram: Gene-modules (G1–G4), Drug-modules (D1–D4) and Co-modules, with labels C1–C6]

The Ping-Pong algorithm!

Iteratively refine genes, cell-lines and drugs to get co-modules

[Diagram: four numbered steps, 1 to 4]

Co-modules have predictive power for drug-gene associations

Co-modules analysis provides biological focus through data integration

Take-home Messages:

• Analysis of large-scale expression data bears great potential to understand global transcription programs and their evolution

• Innovative analysis tools are needed to extract information from such data

• (Iterative) Signature & Ping-Pong Algorithms:
  – decompose data into “transcription modules”
  – integrate external information
  – allow for interspecies comparative analysis
