BIOINFORMATICS Datamining #2 Mark Gerstein, Yale University gersteinlab.org/courses/452


Page 1: BIOINFORMATICS Datamining #2

(c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu

BIOINFORMATICS Datamining #2

Mark Gerstein, Yale University gersteinlab.org/courses/452

Page 2: BIOINFORMATICS Datamining #2


Spectral Methods Outline & Papers

•  Simple background on PCA (emphasizing lingo)
•  More abstract run-through on SVD
•  Application to

◊  J Qian et al. (2001). "Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions." J Mol Biol 314: 1053-66.

◊  O Alter et al. (2000). "Singular value decomposition for genome-wide expression data processing and modeling." PNAS vol. 97: 10101-10106

◊  Y Kluger et al. (2003). "Spectral biclustering of microarray data: coclustering genes and conditions." Genome Res 13: 703-16.

Page 3: BIOINFORMATICS Datamining #2


Typical Predictors and Response for Yeast

Page 4: BIOINFORMATICS Datamining #2


Represent predictors in abstract high dimensional space

Core

Page 5: BIOINFORMATICS Datamining #2

“Cluster” predictors

Core

Page 6: BIOINFORMATICS Datamining #2

Use clusters to predict Response

Core

Page 7: BIOINFORMATICS Datamining #2

Clustering the yeast cell cycle to uncover interacting proteins

Microarray timecourse of 1 ribosomal protein

[Figure: mRNA expression level (ratio) vs. Time]

[Brown, Davis]

Extra

Page 8: BIOINFORMATICS Datamining #2

Clustering the yeast cell cycle to uncover interacting proteins

Random relationship from ~18M

[Figure: mRNA expression level (ratio) vs. Time]

Extra

Page 9: BIOINFORMATICS Datamining #2

Clustering the yeast cell cycle to uncover interacting proteins

Close relationship from 18M (2 Interacting Ribosomal Proteins)

[Figure: mRNA expression level (ratio) vs. Time]

[Botstein; Church, Vidal]

Extra

Page 10: BIOINFORMATICS Datamining #2

Clustering the yeast cell cycle to uncover interacting proteins

Predict Functional Interaction of Unknown Member of Cluster

[Figure: mRNA expression level (ratio) vs. Time]

Extra

Page 11: BIOINFORMATICS Datamining #2

Global Network of Relationships

~470K significant relationships from ~18M possible

Core

Page 12: BIOINFORMATICS Datamining #2


Intuition in terms of Adj. Matrix

Page 13: BIOINFORMATICS Datamining #2

Local Clustering algorithm identifies further (reasonable) types of expression relationships

[Figure panel labels: Simultaneous, Traditional Global Correlation, Inverted, Time-Shifted]

[Church]

Core

Page 14: BIOINFORMATICS Datamining #2

Local Alignment

Suppose there are n (1, 2, …, n) time points:

•  The expression ratio is normalized in “Z-score” fashion;

•  Score matrix: $S_{i,j} = S(x_i, y_j) = x_i \cdot y_j$;

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066
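The two steps on this slide (Z-score normalization, then the score matrix of pairwise products) can be sketched in a few lines. This is only an illustration: the toy profiles and the helper names `zscore` and `score_matrix` are hypothetical, and the use of the population standard deviation is an assumption rather than something stated on the slide.

```python
import numpy as np

def zscore(profile):
    """Normalize one expression profile to zero mean and unit variance.

    Assumption: population standard deviation (ddof=0).
    """
    profile = np.asarray(profile, dtype=float)
    return (profile - profile.mean()) / profile.std()

def score_matrix(x, y):
    """Return S with S[i, j] = x_i * y_j for two Z-scored profiles."""
    return np.outer(zscore(x), zscore(y))

# Hypothetical toy profiles over n = 5 time points (not data from the paper).
x = [1.0, 2.0, 3.0, 2.0, 1.0]
y = [2.0, 4.0, 6.0, 4.0, 2.0]
S = score_matrix(x, y)
```

Because `y` is just a rescaled copy of `x`, both profiles have the same Z-scores, so `S` is symmetric and its diagonal holds the squared Z-scores.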

Page 15: BIOINFORMATICS Datamining #2

Local Alignment

Suppose there are n (1, 2, …, n) time points:

•  Sum matrices $E_{i,j}$ and $D_{i,j}$:

$E_{i,j} = \max(E_{i-1,j-1} + S_{i,j},\ 0)$;

$D_{i,j} = \max(D_{i-1,j-1} - S_{i,j},\ 0)$;

•  Match Score = $\max(E_{i,j}, D_{i,j})$

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066
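A minimal sketch of the recurrences above, assuming Z-scored input profiles and a zero boundary row and column. The function name `local_match` and the rule used to label a pair (E wins on the main diagonal: simultaneous; E wins off the diagonal: time-shifted; D wins: inverted) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def local_match(x, y):
    """Return (match_score, kind, shift) for two profiles of equal length n."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.mean()) / x.std()          # Z-score normalization
    y = (y - y.mean()) / y.std()
    n = len(x)
    S = np.outer(x, y)                    # S[i, j] = x_i * y_j
    E = np.zeros((n + 1, n + 1))          # row/column 0 is the zero boundary
    D = np.zeros((n + 1, n + 1))
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # E accumulates positively correlated stretches along a diagonal,
            # D accumulates negatively correlated ones; both reset at zero.
            E[i, j] = max(E[i - 1, j - 1] + S[i - 1, j - 1], 0.0)
            D[i, j] = max(D[i - 1, j - 1] - S[i - 1, j - 1], 0.0)
    if E.max() >= D.max():
        i, j = np.unravel_index(E.argmax(), E.shape)
        kind = "simultaneous" if i == j else "time-shifted"
        return float(E.max()), kind, int(j - i)
    i, j = np.unravel_index(D.argmax(), D.shape)
    return float(D.max()), "inverted", int(j - i)

# Hypothetical toy profiles.
base = [1, 3, 2, 5, 4, 0, -2, 1]
same = local_match(base, base)
flipped = local_match(base, [-v for v in base])
shifted = local_match([0, 0, 1, 3, 1, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 1, 3, 1, 0, 0, 0])
```

The returned `shift` is the diagonal offset `j - i` of the best-scoring cell, so a peak that appears two time points later in `y` than in `x` gives a shift of 2.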

Page 16: BIOINFORMATICS Datamining #2

Local Alignment

Simultaneous

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Page 17: BIOINFORMATICS Datamining #2

Local Alignment

Simultaneous

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Page 18: BIOINFORMATICS Datamining #2

Local Alignment

Time-Shifted

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Page 19: BIOINFORMATICS Datamining #2

Local Alignment

Inverted

Qian J. et al. Beyond Synexpression Relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. J. Mol. Biol. (2001) 314, 1053-1066

Page 20: BIOINFORMATICS Datamining #2

Global (NW) vs Local (SW) Alignments

TTGACACCCTCCCAATTGTA...
     ||||      ||  |
.....ACCCCAGGCTTTACACAT
     123444444456667

T T G A C A C C...
| | - | | | | -
T T T A C A C A...
1 2 1 2 3 4 5 4 0 0 4 4 4 4 4 8

Match Score = +1, Gap-Opening = -1.2, Gap-Extension = -0.03; for local alignment, Mismatch = -0.6

Adapted from D J States & M S Boguski, "Similarity and Homology," Chapter 3 from Gribskov, M. and Devereux, J. (1992). Sequence Analysis Primer. New York, Oxford University Press. (Page 133)
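To make the global (Needleman-Wunsch) versus local (Smith-Waterman) distinction concrete, here is a small dynamic-programming sketch. It is a simplification under stated assumptions: it uses a single linear gap penalty equal to the slide's gap-opening value (-1.2) and ignores the separate gap-extension term, and the function name `align_score` and the test sequences are hypothetical.

```python
def align_score(a, b, local, match=1.0, mismatch=-0.6, gap=-1.2):
    """Score the best alignment of strings a and b.

    local=False: global alignment (both sequences aligned end to end).
    local=True:  local alignment (best-scoring pair of substrings).
    Assumption: linear gap cost; no separate gap-extension penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0.0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h = max(H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                    H[i - 1][j] + gap,       # gap in b
                    H[i][j - 1] + gap)       # gap in a
            if local:
                h = max(h, 0.0)              # a local alignment may restart
                best = max(best, h)
            H[i][j] = h
    return best if local else H[-1][-1]

# Hypothetical sequences sharing only the core "ACCCC".
seq_a = "GGGACCCCTTT"
seq_b = "AAAACCCCAAA"
local_score = align_score(seq_a, seq_b, local=True)
global_score = align_score(seq_a, seq_b, local=False)
```

The local score counts only the shared core, while the global score is dragged down by the mismatched flanks, which is the contrast the slide illustrates.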

Page 21: BIOINFORMATICS Datamining #2


Statistical Scoring

Page 22: BIOINFORMATICS Datamining #2

Examples of time-shifted relationships

Suggestive

ARP3: in actin remodelling cplx.

ARC35: in same cplx. (required late in cell cycle)

Predicted

J0544: unknown function

MRPL19: mito. ribosome

[Figure axes: Expr. Ratio vs. Time]

Extra

Page 23: BIOINFORMATICS Datamining #2

Examples of time-shifted relationships

Suggestive

ARP3: in actin remodelling cplx.

ARC35: in same cplx. (required late in cell cycle)

Predicted

J0544: unknown function

MRPL19: mito. ribosome

[Figure axes: Expr. Ratio vs. Time]

Extra

Page 24: BIOINFORMATICS Datamining #2

Global Network of 3 Different Types of Relationships

Simultaneous, Inverted, Shifted

~470K significant relationships from ~18M possible

Extra

Page 25: BIOINFORMATICS Datamining #2

Large-scale Datamining

•  Gene Expression
◊  Representing Data in a Grid
◊  Description of function prediction in abstract context

•  Unsupervised Learning
◊  clustering & k-means
◊  Local clustering

•  Supervised Learning
◊  Discriminants & Decision Tree
◊  Bayesian Nets

•  Function Prediction EX
◊  Simple Bayesian Approach for Localization Prediction

Page 26: BIOINFORMATICS Datamining #2

Intuition on interpretation of SVD in terms of genes and conditions

Page 27: BIOINFORMATICS Datamining #2


SVD for microarray data (Alter et al, PNAS 2000)

Page 28: BIOINFORMATICS Datamining #2


Page 29: BIOINFORMATICS Datamining #2

Notation

•  m=1000 genes
–  row-vectors
–  10 eigengenes ($v_i$) of dimension 10 conditions

•  n=10 conditions (assays)
–  column vectors
–  10 eigenconditions ($u_i$) of dimension 1000 genes
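The dimensions in this notation can be checked directly with NumPy's SVD. The sketch below uses random numbers in place of real expression data (an assumption for illustration only) and follows the slide's convention that the $u_i$ have one entry per gene and the $v_i$ one entry per condition.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10                       # m genes (rows), n conditions (columns)
A = rng.normal(size=(m, n))           # stand-in for an expression matrix

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigenconditions = U                   # 10 columns u_i, each of dimension 1000
eigengenes = Vt                       # 10 rows v_i, each of dimension 10
reconstructed = U @ np.diag(s) @ Vt   # recovers A up to rounding error
```

With `full_matrices=False`, `U` is 1000 x 10 and `Vt` is 10 x 10, matching the "10 eigengenes" and "10 eigenconditions" counts on the slide.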

Page 30: BIOINFORMATICS Datamining #2

Understanding Eigengenes ($v_i$) in terms of PCA on (large) gene-gene correlation matrix

Page 31: BIOINFORMATICS Datamining #2

Understanding Eigenconditions ($u_i$) in terms of PCA on (small) condition-condition correlation matrix

Bra-ket notation
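The link between SVD and PCA rests on a standard linear-algebra fact: the right singular vectors of A are eigenvectors of $A^T A$, the left singular vectors are eigenvectors of $A A^T$, and in both cases the eigenvalues are the squared singular values. A minimal numerical check, using a small random matrix as a stand-in for expression data (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 6
A = rng.normal(size=(m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigen-decomposition of the small n x n matrix A^T A.
small_vals, small_vecs = np.linalg.eigh(A.T @ A)   # eigenvalues ascending
small_vals = small_vals[::-1]                      # reorder to descending
small_vecs = small_vecs[:, ::-1]

# Eigen-decomposition of the large m x m matrix A A^T.
large_vals, large_vecs = np.linalg.eigh(A @ A.T)
large_vals = large_vals[::-1]
large_vecs = large_vecs[:, ::-1]

# Overlaps are +/-1 on the diagonal because eigenvectors are defined up to sign.
overlap_small = np.abs(Vt @ small_vecs)               # n x n
overlap_large = np.abs(U.T @ large_vecs[:, :n])       # n x n
```

Only the first n eigenvalues of the m x m matrix are nonzero, which is why the two decompositions carry the same information.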

Page 32: BIOINFORMATICS Datamining #2


Plotting Experiments in Low Dimension Subspace

Page 33: BIOINFORMATICS Datamining #2


Close up on Eigengenes

Page 34: BIOINFORMATICS Datamining #2

Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Genes sorted by correlation with top 2 eigengenes

Page 35: BIOINFORMATICS Datamining #2

Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Same thing different experiment: Genes sorted by relative correlation with first two eigengenes for alpha-factor experiment

Page 36: BIOINFORMATICS Datamining #2

Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Normalized elutriation expression in the subspace associated with the cell cycle

Page 37: BIOINFORMATICS Datamining #2


Biplot Applied to Genes and Conditions

See grouping of arrays and genes on same plot

Page 38: BIOINFORMATICS Datamining #2


Biplot Ex

Page 39: BIOINFORMATICS Datamining #2


Biplot Ex #2

Page 40: BIOINFORMATICS Datamining #2

Biplot Ex #3

Assuming s=1, $Av = u$ and $A^T u = v$
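For a general singular value s the relations read $A v_i = s_i u_i$ and $A^T u_i = s_i v_i$; setting s=1 gives the form on the slide. The sketch below checks both identities on a random matrix (a stand-in, not real expression data).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Column i of A @ V is A v_i; column i of U * s is s_i u_i.
Av = A @ V
s_times_u = U * s

# Column i of A.T @ U is A^T u_i; column i of V * s is s_i v_i.
Atu = A.T @ U
s_times_v = V * s
```

These paired relations are what allow rows and columns of A to be drawn in the same low-dimensional coordinates.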

Page 41: BIOINFORMATICS Datamining #2

(c) M Gerstein '06, gerstein.info/talks

Spectral Biclustering

Page 42: BIOINFORMATICS Datamining #2

Biclustering to associate particular genes with certain phenotypes

[Figure: Matrix of raw data (Genes × Conditions) → ? → Shuffled Matrix (containing checkerboard “biclusters” of conditions with marker genes)]

Reordered Genes (Sorted according to a classification vector)

Reordered Conditions (Sorted according to a classification vector)

Page 43: BIOINFORMATICS Datamining #2

Pomeroy et al., Nature 415 (2002) 436. Prediction of central nervous system embryonal tumor outcome based on gene expression

5 types of brain tumors

Page 44: BIOINFORMATICS Datamining #2

Intuition on Identification of Blocky Matrices

Page 45: BIOINFORMATICS Datamining #2

Gene partition vector

[Figure: rows Gene cluster 1, Gene cluster 2; columns tumor 1, tumor 2, tumor 3]

Page 46: BIOINFORMATICS Datamining #2


Tissue partition vector

Page 47: BIOINFORMATICS Datamining #2


Biclustering by SVD

Page 48: BIOINFORMATICS Datamining #2

Identify checkerboard matrices by their action on classification vectors: Formulation as “eigenproblem”

Checkerboard Matrix A (Genes × Conditions)

Condition Classification Vect. x

Gene Classification Vector y

$A^T A x = x'$

$A A^T y = y'$

[Figure: $A^T$ (Conditions × Genes) applied to y gives x’]
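The idea that a checkerboard matrix maps step-like classification vectors to step-like vectors can be checked on a toy example. Everything below (cluster sizes, block means, the helper `is_piecewise_constant`) is hypothetical and noise-free, chosen only to illustrate the eigenproblem formulation.

```python
import numpy as np

gene_labels = np.array([0] * 4 + [1] * 6)            # 10 genes in 2 clusters
cond_labels = np.array([0] * 3 + [1] * 2 + [2] * 4)  # 9 conditions in 3 clusters
block_means = np.array([[5.0, 1.0, 3.0],
                        [1.0, 4.0, 2.0]])
A = block_means[gene_labels][:, cond_labels]          # 10 x 9 checkerboard

def is_piecewise_constant(vec, labels):
    """True if vec takes a single value within every cluster of labels."""
    return all(np.allclose(vec[labels == k], vec[labels == k][0])
               for k in np.unique(labels))

# A step-like condition classification vector x ...
x = np.array([2.0, -1.0, 0.5])[cond_labels]
y = A @ x            # ... maps to a step-like gene vector y
x_new = A.T @ y      # ... and back to a step-like condition vector x'

# The singular vectors with nonzero singular value share the block structure.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
```

In this noise-free case the matrix has rank 2, so only the first two singular vector pairs carry the checkerboard pattern.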

Page 49: BIOINFORMATICS Datamining #2


SVD to Solve Eigenproblem

[Botstein]

Page 50: BIOINFORMATICS Datamining #2


Yuval Kluger et al. Genome Res. 2003; 13: 703-716

Figure 1. Overview of important parts of the biclustering process

Page 51: BIOINFORMATICS Datamining #2


Gene partition with noisy data

Page 52: BIOINFORMATICS Datamining #2


Normalization Rescales Rows and Columns to Same Means

Page 53: BIOINFORMATICS Datamining #2


Rescale columns
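One way to rescale rows and columns to the same means is to alternate row and column normalization until both are (nearly) satisfied. The sketch below is an illustrative iteration of that kind, assuming a strictly positive matrix; the function name `bistochastize`, the target mean of 1, and the stopping rule are assumptions, not the exact procedure used in the lecture or in Kluger et al.

```python
import numpy as np

def bistochastize(A, max_iter=1000, tol=1e-9):
    """Alternately rescale rows and columns until all means are close to 1.

    Assumption: every entry of A is strictly positive.
    """
    B = np.array(A, dtype=float)
    for _ in range(max_iter):
        B /= B.mean(axis=1, keepdims=True)   # rescale rows to mean 1
        B /= B.mean(axis=0, keepdims=True)   # rescale columns to mean 1
        # Column means are exact after the last step; check the row means.
        if np.abs(B.mean(axis=1) - 1.0).max() < tol:
            break
    return B

# Hypothetical positive data: 6 genes x 4 conditions.
rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 10.0, size=(6, 4))
normalized = bistochastize(raw)
```

Each pass only multiplies whole rows or whole columns by a constant, so the relative pattern inside the matrix is preserved while overall row and column scale differences are removed.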

Page 54: BIOINFORMATICS Datamining #2


Representative Cancer Data set

•  Lymphoma Data from Dalla-Favera et al. at Columbia

•  Informatics from Stolovitzky & Califano at IBM

•  Supervised learning identified some characteristic genes associated with different types of lymphoma

Page 55: BIOINFORMATICS Datamining #2

Results on Representative Cancer Data set

Patients (samples) sorted according to projection onto blocky classification eigenvector (u2)

Genes sorted according to projection onto blocky classification eigenvector (v2)

Matrix values represent outer products of two blocky classification eigenvectors
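Sorting by the second singular vectors can be illustrated on a planted checkerboard whose rows and columns have been shuffled. The data below are synthetic and hypothetical (two row clusters, two column clusters, small Gaussian noise), and no normalization step is applied; which of u2 and v2 belongs to genes versus samples depends on how the matrix is oriented.

```python
import numpy as np

rng = np.random.default_rng(1)
row_labels = rng.permutation(np.repeat([0, 1], 10))   # 20 rows, hidden clusters
col_labels = rng.permutation(np.repeat([0, 1], 6))    # 12 columns, hidden clusters
block_means = np.array([[5.0, 1.0],
                        [1.0, 5.0]])
A = block_means[row_labels][:, col_labels] + 0.1 * rng.normal(size=(20, 12))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sort rows and columns by their entries in the second singular vectors.
row_order = np.argsort(U[:, 1])
col_order = np.argsort(Vt[1])
A_sorted = A[row_order][:, col_order]

# Signal carried by the second pair alone: an outer product of two blocky vectors.
rank1_signal = s[1] * np.outer(U[:, 1], Vt[1])[row_order][:, col_order]
```

In this balanced toy case the first singular pair mostly captures the overall mean, and the second pair separates the two clusters, so after sorting each cluster occupies a contiguous block.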

Page 56: BIOINFORMATICS Datamining #2


Actual Data with Normalization and Sorting

Page 57: BIOINFORMATICS Datamining #2

Actual Data just with Sorting (no normalization)

Page 58: BIOINFORMATICS Datamining #2

Actual Data (no normalization or sorting)

Page 59: BIOINFORMATICS Datamining #2

Actual Data just with Sorting (no normalization)

Page 60: BIOINFORMATICS Datamining #2


Actual Data with Normalization and Sorting

Page 61: BIOINFORMATICS Datamining #2

Just signal from top classification eigenvectors

Patients (samples) sorted according to projection onto blocky classification eigenvector (u2)

Genes sorted according to projection onto blocky classification eigenvector (v2)

Matrix values represent outer products of two blocky classification eigenvectors

Page 62: BIOINFORMATICS Datamining #2


Low Dimension Representation

Page 63: BIOINFORMATICS Datamining #2

Actual Values of Projections onto Classification Eigenvectors

Patients (samples) sorted according to projection onto blocky classification eigenvector (u2)

Genes sorted according to projection onto blocky classification eigenvector (v2)

Page 64: BIOINFORMATICS Datamining #2

Classification of Cancers Based on Projection onto two top classification eigenvectors: Better with Normalization

Four types of Cancer in Dalla-Favera dataset

[Figure panels: Normalized (“bistochastization”); Straight SVD. Legend: CLL, DLCL, FL, DLCL]

Page 65: BIOINFORMATICS Datamining #2

Golub, TR et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286

[Figure labels: biclustering; bistochastization; SVD; bi-normalization; Normalized cuts. Legend: ALL (B), ALL (T), AML]