New Analysis of large datasets - Part1 · 2013. 6. 28. · Part1: Analysis tools for large datasets...

SIB course, 4-8 Feb 2008
Statistical analysis applied to genome and proteome analyses

Sven Bergmann
Department of Medical Genetics
University of Lausanne
Rue de Bugnon 27 - DGM 328
CH-1005 Lausanne
Switzerland

work: ++41-21-692-5452
cell: ++41-78-663-4980

http://serverdgm.unil.ch/bergmann

Part 1: Analysis tools for large datasets

• Standard tools: k-means, PCA, SVD

• Modular analysis tools: CTWC, ISA, PPA

Why study a large, heterogeneous set of expression data?

Large: Better signals from noisy data!

Heterogeneous: Global view of the transcription program!

Supervised vs. unsupervised approaches
Large genome-wide data may contain answers to questions we do not ask! Need for both hypothesis-driven and exploratory analyses!

Motivations

How to get large-scale expression data?

Pool genome-wide expression measurements from many experiments!

[Figure: expression measurements from sets of specific conditions (e.g. stress, cell-cycle) are pooled into large-scale expression data covering diverse conditions]

How to make sense of millions of numbers?

New Analysis and Visualization Tools are needed!

Hundreds of samples

Thousands of genes

K-means Clustering

http://en.wikipedia.org/wiki/K-means_algorithm

“Guess” k=3 (# of clusters)

1. Start with random positions of centroids
2. Assign each data point to the closest centroid
3. Move centroids to the center of assigned points

Iterate steps 2-3 until the cost is minimal:

$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$

with k clusters $S_i$, i = 1,2,...,k and centroids $\mu_i$ (the mean point of all the points $x_j \in S_i$)
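The three k-means steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the lecture: the function names, the toy data and the choice to start from k randomly picked data points are assumptions made here.

```python
import numpy as np

def assign_to_centroids(points, centroids):
    """Step 2: index of the closest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means on an (n_points, n_features) array."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: random start (assumption: k distinct data points serve as centroids).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_to_centroids(points, centroids)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = assign_to_centroids(points, centroids)
    # Cost: sum of squared distances of all points to their centroid.
    cost = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, cost

# Toy data (hypothetical): two tight groups of three points each.
toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids, cost = kmeans(toy, k=2)
```

On real expression data the result can depend on the seed, so running several seeds and keeping the lowest cost is a common precaution.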

K-means Clustering

Plus:
• visual
• intuitive
• relatively fast

Minus:
• have to “guess” the number of clusters
• can give different results for distinct “starting seeds”
• distances computed over all features
• one cluster only per element
• no cluster hierarchy

Hierarchical Clustering

Plus:
• Shows (re-ordered) data
• Gives hierarchy

Minus:
• Does not work well for many genes (usually apply cut-off on fold-change)
• Similarity over all genes/conditions
• Clusters do not overlap

Principal Component Analysis

Principal components (PCs) are projections onto the subspace with the largest variation in the data

http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Example: 2PCs for 3d-data

http://ordination.okstate.edu/PCA.htm

Raw data points: {a, …, z}

http://ordination.okstate.edu/PCA.htm

Normalized data points: zero mean (& unit std)!

Identification of axes with the most variance

Most variance is along PCA1

The direction of most variance perpendicular to PCA1 defines PCA2

Cluster?

Reminder: Matrix multiplications

http://en.wikipedia.org/wiki/Matrix_multiplication

Definition: $(A \cdot B)_{ij} = \sum_{k} A_{ik} B_{kj}$

[Figures: scheme, vectorized form and worked example]
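As a quick check of the definition, the sketch below multiplies two small matrices once with explicit loops over i, j and k and once with NumPy's `@` operator. The matrices and the helper name `matmul_by_definition` are made up for illustration.

```python
import numpy as np

def matmul_by_definition(A, B):
    """C[i, j] = sum over k of A[i, k] * B[k, j]."""
    n, m = A.shape
    m2, p = B.shape
    if m != m2:
        raise ValueError("inner dimensions must agree")
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(m))
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

C_loop = matmul_by_definition(A, B)  # element-wise definition
C_vec = A @ B                        # vectorized form
print(C_loop)
```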

How do we get the PCs?

• The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:

$C = E^T \cdot E / (n-1)$

$C \cdot pc = \lambda \cdot pc$

[Schematic of matrix shapes: $E^T$ (300 × 6k) · $E$ (6k × 300) / (n-1) = $C$ (300 × 300); each pc has 300 entries]
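A minimal NumPy sketch of this recipe follows. It assumes rows of E are the n observations and columns are the features; the random matrix is only a stand-in for real data, and the function name is hypothetical.

```python
import numpy as np

def principal_components(E):
    """Eigen-decomposition of the covariance matrix of E (rows = observations)."""
    E = np.asarray(E, dtype=float)
    E_centered = E - E.mean(axis=0)           # mean-center every column
    n = E_centered.shape[0]
    C = E_centered.T @ E_centered / (n - 1)   # C = E^T . E / (n-1)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh: C is symmetric; ascending order
    order = np.argsort(eigvals)[::-1]         # largest variance first
    return eigvals[order], eigvecs[:, order], C

rng = np.random.default_rng(0)
E_demo = rng.normal(size=(20, 4))             # toy data: 20 observations, 4 features
eigvals, pcs, C = principal_components(E_demo)
# Each column pc of `pcs` satisfies C @ pc = lambda * pc.
```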

PCA: Example deletion mutants

And how to project?

• The projected data is just the product of the original data with the PCs:

$E' = E \cdot PC$

• Principal Component or Transformation Matrix: $PC = [pc_1, pc_2, \dots, pc_n]$
(where n is the number of PCs used)

[Schematic of matrix shapes: $E$ (rows of length 300) · $PC$ (300 × n) = $E'$ (rows of length n)]

• The original gene expression profiles are over 300 arrays.

• The transformed data contain projections on n “eigen-genes” (linear combinations of the 300 arrays, shown in red)
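The projection step can be sketched as follows, again assuming rows are observations and using random toy data. The function name `project_onto_pcs` is a placeholder; the point is only that the projected data is the mean-centered matrix times the first n PCs.

```python
import numpy as np

def project_onto_pcs(E, n_pcs):
    """Return E' = E . PC for the n_pcs leading principal components."""
    E = np.asarray(E, dtype=float)
    E_centered = E - E.mean(axis=0)
    C = E_centered.T @ E_centered / (len(E_centered) - 1)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:n_pcs]  # indices of the largest eigenvalues
    PC = eigvecs[:, order]                     # transformation matrix [pc1, ..., pcn]
    return E_centered @ PC, PC, eigvals[order]

rng = np.random.default_rng(1)
E_toy = rng.normal(size=(30, 6))               # toy data: 30 profiles over 6 arrays
E_proj, PC, top_eigvals = project_onto_pcs(E_toy, n_pcs=2)
# E_proj has one row per profile and one column per retained PC.
```

The variance of each projected column equals the corresponding eigenvalue, which is one way to read "largest variation in the data".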

[Figure: data projected onto the first two “eigen-genes”]

The first 2 “eigen-genes” separate data into 3 clusters

[Figure: projection including the third “eigen-gene”]

Third “eigen-gene” (PCA3) reveals little structure!

Singular Value Decomposition

V: PC matrix of “eigen-genes” (composed of eigenvectors of $C = E^T \cdot E$)

U: PC matrix of “eigen-arrays” (composed of eigenvectors of $C' = E \cdot E^T$)

D: diagonal matrix

$E = U \cdot D \cdot V^T$

“SVD = bi-PCA”

http://public.lanl.gov/mewall/kluwer2002.html

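SVD: Matrix representation

$E = U \cdot D \cdot V^T$

$E = (u_1 \; u_2 \; \dots \; u_n) \cdot \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) \cdot (v_1 \; v_2 \; \dots \; v_n)^T$

$u_i$: eigen-arrays, $v_i$: eigen-genes, $\lambda_i$: singular values ($\lambda_i^2$ are the eigenvalues of $E^T \cdot E$ and $E \cdot E^T$), i = 1, …, n
n: rank(E) = #(independent arrays)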

Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.
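A small NumPy sketch of the decomposition, on a random toy matrix rather than real expression data: `np.linalg.svd` returns U, the diagonal of D and the transpose of V, and the squared diagonal entries match the eigenvalues of E^T·E.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(8, 5))                  # toy matrix: 8 rows, 5 columns

# Economy-size SVD: U is 8x5, d holds the 5 diagonal entries of D, Vt is 5x5.
U, d, Vt = np.linalg.svd(E, full_matrices=False)

E_rebuilt = U @ np.diag(d) @ Vt              # E = U . D . V^T

# Rows of Vt are eigenvectors of E^T . E, with eigenvalues d**2.
eigvals = np.linalg.eigvalsh(E.T @ E)[::-1]  # sorted largest first
```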

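SVD: What is optimized?

$E = U \cdot D \cdot V^T = \sum_i \lambda_i \, u_i \, v_i^T$ (full expansion)

$E_1 = \lambda_1 \, u_1 \, v_1^T$ (rank-1 expansion)

$\Delta = \lVert E - E_1 \rVert^2$ (sum of squared residuals)

Minimize $\Delta$ for free $u_1$ and $v_1$:

$E \cdot v_1 = \lambda_1 \, u_1$ and $E^T \cdot u_1 = \lambda_1 \, v_1$

implying:

$E \cdot E^T \cdot u_1 = \lambda_1^2 \, u_1$ and $E^T \cdot E \cdot v_1 = \lambda_1^2 \, v_1$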

Bergmann et al., Phys. Rev. E 67, 031902 (2003)
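The rank-1 relations can be checked numerically. The sketch below uses a random toy matrix and verifies that the leading singular triplet satisfies both coupled equations, and that its residual is no larger than that of another rank-1 term of the expansion; it illustrates the statement and is not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.normal(size=(10, 6))                 # toy data matrix

U, d, Vt = np.linalg.svd(E, full_matrices=False)
lam1, u1, v1 = d[0], U[:, 0], Vt[0]          # leading singular triplet

E1 = lam1 * np.outer(u1, v1)                 # rank-1 expansion
delta = np.sum((E - E1) ** 2)                # sum of squared residuals

# For comparison: the rank-1 term built from the second triplet.
E_alt = d[1] * np.outer(U[:, 1], Vt[1])
delta_alt = np.sum((E - E_alt) ** 2)
```

Here `delta` equals the sum of the remaining squared diagonal entries of D, which is the smallest residual any rank-1 matrix can reach.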

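SVD: Example deletion mutants

$E_1 = \lambda_1 \cdot u_1 \cdot v_1^T$, with $u_1$ a column of length 6k and $v_1^T$ a row of length 300:

$u_1 \cdot v_1^T = \begin{pmatrix} u_1(1) \cdot v_1(1) & \cdots & u_1(1) \cdot v_1(300) \\ \vdots & & \vdots \\ u_1(6k) \cdot v_1(1) & \cdots & u_1(6k) \cdot v_1(300) \end{pmatrix}$

[Schematic: the outer product of high and low entries of $u_1$ and $v_1$ gives blocks of high and low values]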

[Figures: original data (300 arrays) shown next to the eigen-arrays U, the matrix $V^T$ and the reconstruction SVD(data) = $U \cdot D \cdot V^T$, for n = 1, 2 and 3 components]

Part 1: Analysis tools for large datasets

• Standard tools: k-means, PCA, SVD

• Modular analysis tools: CTWC, ISA, PPA

How to extract biological information from large-scale expression data?

Hierarchical clustering and other correlation-based methods may be good for small data sets, but:

Problems with large data:
• Clusters cannot overlap!
• Clustering based on correlations over all conditions:
  - sensitive to noise
  - computation intensive

Search for transcription modules:

Set of genes co-regulated under a certain set of conditions

• context specific

• allow for overlaps

Overview of “modular” analysis tools

• Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)

• Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)

• Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)

• Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)

• Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)

• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003.)

… and many more! http://serverdgm.unil.ch/bergmann/Publications/review.pdf

Coupled two-way Clustering

How to “hear” the relevant genes?

Song A

Song B

Inside CTWC: Iterations

Depth | Genes (clustering: clusters found) | Samples (clustering: clusters found)
Init | G1 | S1
1 | G1(S1): G2, G3, … G5 | S1(G1): S2, S3
  | G1(S2), G1(S3): G6, G7, … G13; G14, … G21 | S1(G2) … S1(G5): S4, S5, S6; S10, S11; None
  | G2(S1) … G2(S3), …, G5(S1) … G5(S3): G22 … G97 | S2(G1) … S2(G5), S3(G1) … S3(G5): S12, … S51
  | G1(S4) … G1(S11): G98, .. G105, …, G151, .. G160 | S1(G6) … S1(G21): S52, ...
  | G2(S4) ... G2(S11), …, G5(S4) ... G5(S11): G161 … G216 | S2(G6) ... S2(G21), S3(G6) … S3(G21): S68 … S113

Two-way clustering

One example in more detail:

The (Iterative) Signature Algorithm:

• No need for correlations!

• decomposes data into “transcription modules”

• integrates external information

• allows for interspecies comparative analysis

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

Trip to the “Amazon”:

How to find related items?

[Figure labels: customers; your choice; customers with similar choice; recommended]

False Positives:

How to find related genes?

[Figure labels: conditions; your guess; relevant conditions; similarly expressed]

J Ihmels, G Friedlander, SB, O Sarig, Y Ziv & N Barkai Nature Genetics (2002)

Signature Algorithm: Score definitions

initial guesses (genes)

condition scores: $s_c = \langle E_{gc} \rangle_{G}$ (average over the guessed genes)

thresholding: $S_C = \{ c \in C : s_c - \langle s_c \rangle_{c \in C} > t_C \, \sigma_C \}$

gene scores: $s_g = \langle s_c \, E_{gc} \rangle_{C}$ (average over the selected conditions)

thresholding: $S_G = \{ g \in G : s_g - \langle s_g \rangle_{g \in G} > t_G \, \sigma_G \}$

How to find related genes? Scores and thresholds!

Iterative Signature Algorithm

INPUT → OUTPUT, iterate until OUTPUT = INPUT: “Transcription Module”

SB, J Ihmels & N Barkai Physical Review E (2003)
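The score-and-threshold iteration can be sketched roughly as below. This is a simplified illustration, not the published implementation: it works on a single matrix E (genes × conditions) without the separate gene-wise and condition-wise normalizations of the original papers, and the function names, default thresholds and planted toy module are assumptions made here.

```python
import numpy as np

def signature_step(E, gene_set, t_c=2.0, t_g=2.0):
    """One INPUT -> OUTPUT pass: gene set -> condition set -> refined gene set."""
    # Condition scores: average expression over the genes of the input set.
    s_c = E[gene_set].mean(axis=0)
    # Thresholding: keep conditions scoring more than t_c std above the mean.
    cond_set = np.flatnonzero(s_c - s_c.mean() > t_c * s_c.std())
    if cond_set.size == 0:
        return np.array([], dtype=int), cond_set
    # Gene scores: average over the selected conditions, weighted by their scores.
    s_g = (E[:, cond_set] * s_c[cond_set]).mean(axis=1)
    # Thresholding: keep genes scoring more than t_g std above the mean.
    gene_out = np.flatnonzero(s_g - s_g.mean() > t_g * s_g.std())
    return gene_out, cond_set

def iterative_signature(E, seed_genes, t_c=2.0, t_g=2.0, max_iter=50):
    """Repeat the step until the output gene set equals the input gene set."""
    genes = np.unique(np.asarray(seed_genes, dtype=int))
    conds = np.array([], dtype=int)
    for _ in range(max_iter):
        if genes.size == 0:
            break
        new_genes, conds = signature_step(E, genes, t_c, t_g)
        if np.array_equal(new_genes, genes):
            break                       # fixed point: OUTPUT = INPUT
        genes = new_genes
    return genes, conds

# Toy data (hypothetical): genes 0-9 are induced under conditions 0-4.
E = np.zeros((100, 40))
E[:10, :5] = 1.0
# Imperfect initial guess: three module genes plus two unrelated genes.
module_genes, module_conds = iterative_signature(E, seed_genes=[0, 1, 2, 50, 51])
```

Starting from many random seeds instead of one guess, and collecting the distinct fixed points, would give a set of possibly overlapping modules.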

Identification of transcription modules using many random “seeds”

[Figure: random “seeds” converge to transcription modules]

Independent identification: Modules may overlap!

New Tools: Module Visualization

http://serverdgm.unil.ch/bergmann/Fibroblasts/visualiser.html

Gene enrichment analysis

The hypergeometric distribution f(M,A,K,T) gives the probability that K out of A genes with a particular annotation match with a module having M genes if there are T genes in total.

http://en.wikipedia.org/wiki/Hypergeometric_distribution
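Written out, the standard hypergeometric probability is f = C(A,K)·C(T−A,M−K) / C(T,M), where C(n,k) is the binomial coefficient. The sketch below evaluates it with Python's `math.comb`. The function names and the toy numbers are illustrative, and summing the tail to obtain an enrichment p-value is a common convention assumed here rather than something stated on the slide.

```python
from math import comb

def hypergeom_pmf(K, M, A, T):
    """Probability that exactly K of the M module genes carry the annotation,
    when A of the T genes in total are annotated."""
    return comb(A, K) * comb(T - A, M - K) / comb(T, M)

def enrichment_pvalue(K, M, A, T):
    """Probability of seeing K or more annotated genes in the module by chance."""
    return sum(hypergeom_pmf(k, M, A, T) for k in range(K, min(M, A) + 1))

# Toy numbers (hypothetical): 10 genes in total, 4 annotated, module of 3 genes,
# 2 of which carry the annotation.
p_exact = hypergeom_pmf(K=2, M=3, A=4, T=10)
p_tail = enrichment_pvalue(K=2, M=3, A=4, T=10)
```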

Decomposing expression data into annotated transcriptional modules

identified >100 transcriptional modules in yeast:

high functional consistency!

many functional links “waiting” to be verified experimentally

J Ihmels, SB & N Barkai Bioinformatics 2005

Module hierarchies and networks

Higher-order structure

[Figure legend: correlated, anti-correlated]

The challenge of many datasets: How to integrate all the information?

Organisms

Data types
– Protein expression
– Tissue specific expression
– Interaction data
– Localization data
– …?

Conditions
– Developmental
– Physiological
– Environmental
– Experimental
– Clinical

Biological Insight

Mapping Transcription Modules

[Figure labels: BLAST; signature algorithm]

For distant organisms correlation patterns generally are distinct

SB, J Ihmels & N Barkai PLoS Biology (2004)

What about related organisms?

J Ihmels, SB, J Berman & N Barkai Science (2005)

pairwise correlation (over all arrays)

Promoter analysis: The “Rapid Growth Element” AATTTT

Data Integration: Example NCI60

Our (modular) approach: The model

[Figure: Gene-modules, Drug-modules and Co-modules, with example sets [AGF], [BFC], [CDF]]

Modules and Co-modules

[Figure: Gene-modules, Drug-modules and Co-modules]

Iteratively refine genes, cell-lines and drugs to get co-modules

The Ping-Pong algorithm!

Co-modules have predictive power for drug-gene associations

Co-module analysis provides biological focus through data integration

Take-home Messages:

• Analysis of large-scale expression data bears great potential to understand global transcription programs and their evolution

• Innovative analysis tools are needed to extract information from such data

• (Iterative) Signature & Ping-Pong Algorithms:
– decompose data into “transcription modules”
– integrate external information
– allow for interspecies comparative analysis
