BASIC METHODOLOGIES OF ANALYSIS:

SUPERVISED ANALYSIS: HYPOTHESIS TESTING USING CLINICAL INFORMATION (MLL VS NO TRANS.)

IDENTIFY DIFFERENTIATING GENES

SUPERVISED METHODS CAN ONLY VALIDATE OR REJECT HYPOTHESES. THEY CANNOT LEAD TO DISCOVERY OF UNEXPECTED PARTITIONS.

UNSUPERVISED: EXPLORATORY ANALYSIS

• NO PRIOR KNOWLEDGE IS USED

• EXPLORE STRUCTURE OF DATA ON THE BASIS OF CORRELATIONS AND SIMILARITIES



[Figure: expression matrix; color scale "Expression 1-99%"; one axis with ticks 20 to 140, the other with ticks 100 to 1000]

AIMS: ASSIGN PATIENTS TO GROUPS ON THE BASIS OF THEIR EXPRESSION PROFILES

IDENTIFY DIFFERENCES BETWEEN TUMORS AT DIFFERENT STAGES

IDENTIFY GENES THAT PLAY CENTRAL ROLES IN DISEASE PROGRESSION

EACH PATIENT IS DESCRIBED BY 30,000 NUMBERS: ITS EXPRESSION PROFILE


UNSUPERVISED ANALYSIS

• GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.

• GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.

CLUSTERING, SORTING
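
The two goals use the same data from two directions. A minimal sketch, assuming the expression data is stored as a matrix with genes as rows and tissues as columns (the matrix `expr` and its values are made up for illustration):

```python
import numpy as np

# Hypothetical toy expression matrix: rows are genes, columns are tissue samples.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # gene 0
    [2.0, 4.0, 6.0, 8.0],   # gene 1: proportional to gene 0
    [4.0, 3.0, 2.0, 1.0],   # gene 2: anti-correlated with gene 0
])

# Goal A: similarity between genes = correlation between rows.
gene_corr = np.corrcoef(expr)

# Goal B: similarity between tissues = correlation between columns.
sample_corr = np.corrcoef(expr.T)

print(np.round(gene_corr, 2))
print(np.round(sample_corr, 2))
```

Either correlation matrix can then be handed to a clustering method; which similarity measure is appropriate is a separate choice that this sketch does not address.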



DEFINITION OF THE CLUSTERING PROBLEM

[Figure: giraffe]


CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram; axis label T (RESOLUTION); the leaves give a LINEAR ORDERING OF DATA]


BUT WHAT ABOUT THE OKAPI?

[Figure: giraffe and okapi]


STATEMENT OF THE PROBLEM

GIVEN DATA POINTS Xi, i=1,2,...,N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.

AIMS:

• PARTITION THE DATA INTO M CLUSTERS; POINTS OF THE SAME CLUSTER ARE "MORE SIMILAR". M IS ALSO TO BE DETERMINED!

• GENERATE A DENDROGRAM AND IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS.

"ILL POSED": WHAT IS "MORE SIMILAR"? AT WHAT RESOLUTION?


CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram; labels: T, STABILITY, LINEAR ORDERING OF DATA (YOUNG to OLD)]


CLUSTERING METHODS

• CENTROID (REPRESENTATIVE)

– SELF ORGANIZED MAPS (KOHONEN 1997; GENES: GOLUB ET AL., SCIENCE 1999)

– K-MEANS (GENES: TAMAYO ET AL., PNAS 1999)

• AGGLOMERATIVE HIERARCHICAL

– AVERAGE LINKAGE (GENES: EISEN ET AL., PNAS 1998)

• PHYSICALLY MOTIVATED

– DETERMINISTIC ANNEALING (ROSE ET AL., PRL 1990; GENES: ALON ET AL., PNAS 1999)

– SUPER-PARAMAGNETIC CLUSTERING (SPC) (BLATT ET AL.; GENES: GETZ ET AL., PHYSICA 2000, PNAS 2000)

– COUPLED MAPS (ANGELINI ET AL., PRL 2000)


CLUSTERING METHODS (Cont)

• INFORMATION THEORY

– AGGLOMERATIVE INFORMATION BOTTLENECK (TISHBY ET AL.)

• LINEAR ALGEBRA

– SPECTRAL METHODS (MALIK ET AL.)

• MULTIGRID BASED METHODS (BRANDT ET AL.)


Centroid methods – K-means

i = 1,...,N DATA POINTS, AT X_i

μ = 1,...,K CENTROIDS, AT Y_μ

ASSIGN DATA POINT i TO CENTROID μ: S_i = μ

COST E:

$$E(S_1, S_2, \ldots, S_N; Y_1, \ldots, Y_K) = \sum_{i=1}^{N} \sum_{\mu=1}^{K} \delta(S_i, \mu)\,(X_i - Y_\mu)^2$$

MINIMIZE E OVER S_i, Y_μ
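
The cost E on this slide is the sum, over all data points, of the squared distance between each point and the centroid it is assigned to. A minimal sketch of that sum, assuming NumPy arrays; the function name `kmeans_cost` and the toy data are illustrative only:

```python
import numpy as np

def kmeans_cost(X, Y, S):
    """E = sum over i of |X_i - Y_{S_i}|^2 for points X, centroids Y, assignments S."""
    diffs = X - Y[S]          # each point minus its assigned centroid
    return float(np.sum(diffs ** 2))

# Hypothetical data: two pairs of points and two centroids.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
Y = np.array([[0.0, 1.0], [10.0, 1.0]])

E_good = kmeans_cost(X, Y, np.array([0, 0, 1, 1]))   # each point with its nearby centroid
E_bad = kmeans_cost(X, Y, np.array([1, 1, 0, 0]))    # assignments swapped
print(E_good, E_bad)
```

Indexing `Y[S]` plays the role of the selection over centroids: for each point only the term of its own centroid contributes.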


K-means

“GUESS” K=3


K-means; algorithm to find minima

• Start with random positions of centroids (iteration = 0).

• Assign each data point to the closest centroid.

• Move centroids to the center of assigned points.

• Iterate until the cost stops decreasing (a local minimum of E).

[Figures: centroid positions and assignments at iteration = 0, 1 and 3]
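
The steps listed on these slides can be sketched as follows. This is a minimal illustration, not the presenter's code; it assumes the random start picks K distinct data points as initial centroids, and the names `kmeans`, `n_iter` and `seed` are hypothetical:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with random positions: here, K distinct data points (an assumption).
    Y = X[rng.choice(len(X), size=K, replace=False)].copy()
    S = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign each data point to the closest centroid.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        S_new = d2.argmin(axis=1)
        if np.array_equal(S_new, S):
            break  # assignments stopped changing: a local minimum of E
        S = S_new
        # Move each centroid to the center of its assigned points.
        for k in range(K):
            if np.any(S == k):
                Y[k] = X[S == k].mean(axis=0)
    E = float(((X - Y[S]) ** 2).sum())
    return S, Y, E

# Hypothetical data: two compact, well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
S, Y, E = kmeans(X, K=2)
print(S, E)
```

On less cleanly separated data the result can change with `seed`, which is why the procedure is usually repeated from several random starts.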


[Figure: E = total sum of squares vs K]


K-means - Summary

• Result depends on initial centroids' position

• Fast algorithm: compute distances from data points to centroids

• O(N) operations (vs O(N^2))

• Must preset K

• Fails for non-spherical distributions


Agglomerative Hierarchical Clustering

Initially, each point is a cluster. At each step, merge the pair of nearest clusters.

[Figure: five data points labeled 1 to 5 and the resulting dendrogram (leaf labels 5 2 4 1 3); axis: distance between joined clusters]

The dendrogram induces a linear ordering of the data points.

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.
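
The merge rule and the three linkage definitions can be compared directly on a small example. The sketch below is a naive implementation written for clarity, not speed; the function name `agglomerative` and the one-dimensional test points are illustrative assumptions, and "average" here means the average over all pairs, not the distance between cluster centers:

```python
import numpy as np

def agglomerative(D, linkage="average"):
    """Naive agglomerative clustering on a distance matrix D.

    Returns the list of merges as (members_a, members_b, distance).
    """
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(D))]   # initially each point is a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical data: five points on a line.
points = np.array([0.0, 1.0, 5.0, 6.0, 20.0])
D = np.abs(points[:, None] - points[None, :])
heights = {name: [m[2] for m in agglomerative(D, name)]
           for name in ("single", "complete", "average")}
print(heights)
```

The merge distances are the heights at which branches join in the dendrogram; the three linkages agree on the first merges here and differ only once multi-point clusters are compared.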


COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS



2 FLAT CLOUDS - SINGLE LINKAGE WORKS


[Figures: average linkage (two slides)]

SINGLE LINKAGE SENSITIVE TO NOISE

[Figures: filament (two slides); filament with one point removed]


Hierarchical Clustering - Summary

• Results depend on distance update method

• Greedy iterative process

• NOT robust against noise

• No inherent measure to identify stable clusters

Average Linkage – the most widely used clustering method in gene expression analysis


[Figure: breast cancer data, Nature 2002]


Super-Paramagnetic Clustering (SPC): toy problem

How many clusters? 3 LARGE, MANY small


other methods


Graph based clustering

Undirected graph: Vertices (nodes). Edges. A cut.

[Figure: graph with vertices i and j joined by an edge of weight J_ij, and a cut]


Graph based clustering (cont.)

i = 1,2,...,N data points = vertices (nodes) of graph

J_ij – weight associated with edge i,j

J_ij depends on distance D_ij

[Figure: example graph with vertices 1, 5 and 8 and edge weight J_5,8; plot of J_ij as a function of D_ij]

A cut in the graph represents a clustering solution (partition).
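
The slide states only that the edge weight depends on the distance. As one possible choice, the sketch below assumes a Gaussian decay, so that close points get large weights and distant points small ones; the functional form, the scale `a` and the name `edge_weights` are assumptions for illustration, not taken from the slides:

```python
import numpy as np

def edge_weights(X, a=None):
    """J_ij = exp(-D_ij^2 / (2 a^2)): large for close points, small for distant ones."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    if a is None:
        a = D[np.triu_indices(len(X), k=1)].mean()   # assumed scale: mean pair distance
    J = np.exp(-D ** 2 / (2 * a ** 2))
    np.fill_diagonal(J, 0.0)                          # no self-edges
    return D, J

# Hypothetical data: three points on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
D, J = edge_weights(X)
print(np.round(J, 3))
```

This version connects every pair of points; restricting edges to near neighbors would keep the graph sparse for large N.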


COST OF A CUT, I.E., OF A PARTITION = SUM OF THE WEIGHTS OF ALL CUT EDGES

[Figure: two cuts of the same graph with the cut edges marked: a high-cost cut (high resolution) and a low-cost cut (low resolution)]


Highest cost = sum of all edge weights: each point is a cluster.

Lowest cost = 0: one cluster.

Conclusion – minimization/maximization of the cost is meaningless.
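
The two extremes are easy to check numerically. A minimal sketch, assuming a symmetric weight matrix `J` and a partition given as one cluster label per point; the function name `cut_cost` and the four-vertex graph are made up for illustration:

```python
import numpy as np

def cut_cost(J, labels):
    """Sum of weights J_ij over edges whose endpoints carry different labels."""
    labels = np.asarray(labels)
    cut = labels[:, None] != labels[None, :]
    return float(J[cut].sum() / 2.0)   # symmetric J counts each edge twice

# Hypothetical graph: edges 0-1 and 2-3 are strong, edges 0-2 and 1-3 are weak.
J = np.array([
    [0.0, 3.0, 1.0, 0.0],
    [3.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 1.0, 3.0, 0.0],
])
one_cluster = cut_cost(J, [0, 0, 0, 0])      # nothing is cut
all_singletons = cut_cost(J, [0, 1, 2, 3])   # every edge is cut
two_pairs = cut_cost(J, [0, 0, 1, 1])        # only the weak edges are cut
print(one_cluster, all_singletons, two_pairs)
```

The trivial partitions sit at the two ends of the cost range, so the interesting partitions are found at intermediate costs.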


Clustering: The SPC spirit

M. Blatt, S. Wiseman and E. Domany (1996)

• SPC's idea – consider ALL cuts, i.e. partitions {S}.

• Each partition appears with probability p({S}).

• Measure the correlation between points i,j connected by an edge, over all partitions:

C_ij = probability that the edge i-j was NOT cut.

[Figure: four partitions {S}_1, {S}_2, {S}_3, {S}_4 with probabilities p({S}_1), ..., p({S}_4); the edge i-j is cut only in {S}_1]

C_ij = p({S}_2) + p({S}_3) + p({S}_4)
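
The correlation is just a probability-weighted count of the partitions in which the edge survives. A minimal sketch that mirrors the four-partition example, with made-up probabilities and a hypothetical function name `edge_correlation`:

```python
def edge_correlation(partitions, i, j):
    """C_ij = total probability of the partitions in which i and j share a cluster."""
    return sum(p for labels, p in partitions if labels[i] == labels[j])

# Hypothetical example: four partitions of three points with made-up probabilities.
partitions = [
    ([0, 1, 1], 0.1),   # edge 0-1 is cut
    ([0, 0, 1], 0.2),   # edge 0-1 is not cut
    ([0, 0, 0], 0.3),   # edge 0-1 is not cut
    ([1, 1, 0], 0.4),   # edge 0-1 is not cut
]
C_01 = edge_correlation(partitions, 0, 1)
C_12 = edge_correlation(partitions, 1, 2)
print(C_01, C_12)
```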


Clustering: The SPC spirit (cont)

• We have a graph, whose edge values are the correlations C_ij.

[Figure: graph with edge values 0.45, 0.75, 0.75, 1, 0.8, 0.9, 0.2, 0.45, 0.85, 0.7, 0.7, 0.7, 0.9, 1]

• Create the clustering solution by deleting edges for which C_ij < 0.5.
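
After the weak edges are deleted, the clusters are the connected pieces that remain. A minimal sketch using a union-find structure; the function name `clusters_from_correlations` and the correlation values are hypothetical and are not the values from the slide's figure:

```python
def clusters_from_correlations(n, C, threshold=0.5):
    """Connected components of the graph that keeps only edges with C_ij >= threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (i, j), c in C.items():
        if c >= threshold:                  # edges with C_ij < threshold are deleted
            parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Hypothetical correlations on a 5-vertex graph.
C = {(0, 1): 0.9, (1, 2): 0.75, (2, 3): 0.2, (3, 4): 0.85, (0, 2): 0.45}
clusters = clusters_from_correlations(5, C)
print(clusters)
```

Note that vertices 0 and 2 end up together even though their direct edge is deleted, because both remain linked to vertex 1.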


What is p({S})?

• COST OF {S} = H({S}) CORRESPONDS TO THE RESOLUTION

• SOUNDS REASONABLE TO FIND A SOLUTION FOR EACH VALUE OF THE COST/RESOLUTION, E.

• FIX H=E, AND GENERATE PARTITIONS FOR WHICH H({S})=E.

• $$p(\{S\}) = \begin{cases} 1 / (\#\ \text{partitions with } H(\{S\}) = E) & \text{if } H(\{S\}) = E \\ 0 & \text{otherwise} \end{cases}$$


What is p({S})? (Cont)

• Due to computational issues it is easier to generate partitions with an AVERAGE cost E:

INSTEAD OF FINDING PARTITIONS WITH H({S}) = E, FIND PARTITIONS WITH <H({S})> = E

$$p(\{S\}) = \frac{\exp[-H(\{S\})/T]}{Z}$$

Boltzmann distribution; T is the temperature = the resolution parameter
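
For reference, the quantities used on these slides can be written out together. This is a sketch in assumed notation: the Kronecker delta \delta_{S_i,S_j} (equal to 1 when points i and j carry the same label) and the sum over edges \langle i,j \rangle are standard conventions, not symbols taken from the slides.

```latex
\begin{align}
H(\{S\}) &= \sum_{\langle i,j \rangle} J_{ij}\,\bigl(1 - \delta_{S_i,S_j}\bigr)
  && \text{(cost = sum of the weights of cut edges)} \\
Z &= \sum_{\{S\}} \exp\bigl[-H(\{S\})/T\bigr]
  && \text{(normalization, so that } \textstyle\sum_{\{S\}} p(\{S\}) = 1\text{)} \\
\langle H \rangle &= \sum_{\{S\}} p(\{S\})\,H(\{S\})
  && \text{(average cost at temperature } T\text{)} \\
C_{ij} &= \sum_{\{S\}} p(\{S\})\,\delta_{S_i,S_j}
  && \text{(probability that edge } i\text{-}j \text{ is not cut)}
\end{align}
```

Raising T gives costly partitions more weight, so the average cost grows with T, which is why the temperature can serve as the resolution parameter.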


Outline of SPC

• Map data to a graph G.

• Go over resolutions T (minT to maxT in steps of deltaT):

  • Generate thousands (Cycles) of partitions with average cost that corresponds to the current resolution.

  • Calculate pair correlations: C_ij(T).

  • Clusters(T): connected components of C_ij > 0.5
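
The outline can be tried end to end on a tiny graph. The sketch below is a toy under stated assumptions, not the SPC implementation: it replaces the Monte Carlo generation of partitions with an exact sum over all label assignments (feasible only for a handful of points), and it assumes q = 3 possible labels per point. The names `spc_exact`, `q` and `threshold` are hypothetical.

```python
import itertools
import math

def spc_exact(n, J, temperatures, q=3, threshold=0.5):
    """Toy SPC: exact Boltzmann averages over all q**n label assignments.

    J maps edges (i, j) to weights. Returns {T: list of clusters}.
    Real SPC samples partitions by Monte Carlo; enumeration only works for tiny n.
    """
    states = list(itertools.product(range(q), repeat=n))
    # Cost of a state = sum of the weights of its cut edges.
    costs = [sum(w for (i, j), w in J.items() if s[i] != s[j]) for s in states]
    result = {}
    for T in temperatures:
        weights = [math.exp(-h / T) for h in costs]
        Z = sum(weights)
        # C_ij(T) = probability that edge i-j is not cut.
        C = {
            (i, j): sum(w for s, w in zip(states, weights) if s[i] == s[j]) / Z
            for (i, j) in J
        }
        # Clusters(T): connected components of the edges with C_ij > threshold.
        label = list(range(n))
        for (i, j), c in C.items():
            if c > threshold:
                old, new = label[j], label[i]
                label = [new if x == old else x for x in label]
        groups = {}
        for v, l in enumerate(label):
            groups.setdefault(l, []).append(v)
        result[T] = sorted(groups.values())
    return result

# Hypothetical graph: two tight pairs (0-1 and 2-3) joined by one weak edge (1-2).
J = {(0, 1): 1.0, (2, 3): 1.0, (1, 2): 0.1}
out = spc_exact(4, J, temperatures=[0.02, 0.3, 5.0])
for T, clusters in out.items():
    print(T, clusters)
```

Scanning T from low to high shows one cluster splitting into the two pairs and then into single points, which is the sequence a dendrogram records.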


Super-Paramagnetic Clustering (SPC)

• Example: N=4800 points in D=2


Output of SPC

Size of largest clusters as function of T

Dendrogram

Stable clusters "live" over a large range ΔT

A function χ(T) that peaks when stable clusters break


Identify the stable clusters


Same data - Average Linkage

Examining this cluster

No analog to χ(T)


Advantages of SPC

• Scans all resolutions (T)

• Robust against noise and initialization - calculates collective correlations.

• Identifies “natural” (χ) and stable (ΔT) clusters

• No need to pre-specify number of clusters

• Clusters can be any shape

• Can use distance matrix as input (vs coordinates)