Computational Biology (IST)
Technical University of Lisbon
Ana Teresa Freitas
2016/2017
Microarray data analysis
Microarrays
• Rows represent genes
• Columns represent samples
• Many problems may be solved using clustering
• Example of microarray dataset (figure)
Microarray data
• Gi: expression levels of gene i across all samples (row i)
• Sj: expression levels of all genes in one sample j (column j)
• Typical examples of samples: heat shock, phases of the cell cycle, cancer, normal, ...
Microarray data
• Rows: genes; columns: mRNA samples
• Entry (i, j): expression level of gene i in mRNA sample j, typically reported as log(treated expression value / control expression value)

gene   sample1  sample2  sample3  sample4  sample5  ...
1       0.46     0.30     0.80     1.51     0.90    ...
2      -0.10     0.49     0.24     0.06     0.46    ...
3       0.15     0.74     0.04     0.10     0.20    ...
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
5      -0.06     1.06     1.35     1.09    -1.09    ...
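Only the log-ratio table itself is on the slide; as an illustration, here is a minimal numpy sketch of how such log ratios could be computed from raw treated/control intensities. The intensity values, the variable names and the choice of base-2 logarithm are assumptions, not taken from the slides:

```python
import numpy as np

# Hypothetical raw intensities: rows = genes, columns = samples.
treated = np.array([[137.6, 123.1, 174.1],
                    [ 93.3, 140.4, 118.1]])
control = np.array([[100.0, 100.0, 100.0],
                    [100.0, 100.0, 100.0]])

# Log ratio of treated vs. control expression, as on the slide.
# Base 2 is a common convention for microarray data (an assumption here).
log_ratios = np.log2(treated / control)
print(log_ratios)  # e.g., gene 1, sample 1: log2(137.6/100) ~ 0.46
```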
What do we actually measure?
• We measure the signal of the cDNA target(s) that hybridize(s) to the probe (plus backgrounds, ratios, standard deviations, dust, etc.)
• What do we wish to know (an abstraction)? [mRNA]_{1,a}, [mRNA]_{1,b}, ..., [mRNA]_{N,a}, [mRNA]_{N,b}, where N = number of genes and a, b = the two dye colors
Factors with impact on the signal level
• Amount of mRNA
• Labeling efficiencies
• Quality of the RNA
• Laser/dye combination
• Detection efficiency of the photomultiplier
• ...
Typical Assumption
The mRNA concentration is assumed proportional to the measured signal:
$$[\mathrm{mRNA}]_{n,a} \propto \mathrm{signal}_{n,a}$$
so with a "normalization constant" k,
$$[\mathrm{mRNA}]_{n,a} = k \cdot \mathrm{signal}_{n,a}$$
where n = gene index and a = color.
Low-level analysis
• Image analysis: computation of the probes' intensities/signals
• Normalization: the attempt to compensate for systematic technical differences between chips, so that the systematic biological differences between samples can be seen more clearly. Statisticians use the term 'bias' to describe systematic errors, which affect a large number of genes.
Normalization
Sources of systematic errors:
• Different incorporation efficiency of dyes
• Different amounts of mRNA
• Experimenter/protocol issues (comparing chips processed by different labs)
• Different scanning parameters
• Batch bias
Normalization
Two problems:
• How to detect biases? Which genes should be used to estimate the biases among chips?
• How to remove the biases?
Which genes to use for bias detection?
• All genes on the chip
  – Assumption: most of the genes are equally expressed in the compared samples; the proportion of differentially expressed genes is low (<20%).
  – Limits:
    – Not appropriate when comparing highly heterogeneous samples (different tissues)
    – Not appropriate for the analysis of 'dedicated chips' (apoptosis chips, inflammation chips, etc.)
Which genes to use for bias detection?
• Housekeeping genes
  – Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples
    • Affymetrix novel chips: a 'normalization set' of 100 genes
    • NHGRI's cDNA microarrays: a set of 70 "housekeeping" genes
  – Limits:
    • The validity of the assumption is questionable
    • Housekeeping genes are usually expressed at high levels, so they are not informative in the low-intensity range
Normalization methods
• Global normalization (scaling): forces the chips to have equal mean (median) intensity
• Intensity-dependent normalization (lowess): forces equal means at all intensity levels
• Quantile normalization: forces the chips to have identical intensity distributions
Quantile Normalization
• Sort each column of the data matrix according to the genes' (probes') intensities in each chip
• Compute the mean intensity at each rank across the chips
• Replace each intensity by the mean intensity at its rank
• Re-order the columns to their original state, so that each row again corresponds to a gene
(Figure: Chip #1, Chip #2, Chip #3 and the resulting "average chip".)
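The steps above translate almost directly into array operations. A minimal numpy sketch, where the function name `quantile_normalize` and the toy matrix are assumptions:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes x chips matrix, following the steps above."""
    # Rank of each value within its chip (ties broken arbitrarily, a simplification).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    sorted_X = np.sort(X, axis=0)        # step 1: sort each column
    rank_means = sorted_X.mean(axis=1)   # step 2: mean intensity at each rank
    return rank_means[ranks]             # steps 3-4: substitute and restore order

# Toy example: 4 genes x 3 chips.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))  # every column now has the same distribution
```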
Quantile Normalization
(Figure: intensity distributions of the chips before and after quantile normalization.)
What is Cluster Analysis?
• Cluster: a collection of data objects that are
  – similar to one another within the same cluster
  – dissimilar to the objects in other clusters
• Cluster analysis: grouping a set of data objects into clusters
• Clustering is unsupervised classification: there are no predefined classes
• Typical applications:
  – As a stand-alone tool to gain insight into the data distribution
  – As a preprocessing step for other algorithms
Things to study (1)
• Clustering (grouping) genes: finding groups of co-regulated genes.
(Figure: expression levels across time of two clusters of co-regulated genes.)
Things to study (2)
• Clustering (grouping) samples: finding groups of samples with similar genetic profiles (e.g., cancer types).
(Figure: sample profiles suggesting groups of similar behaviour.)
Things to study (3)
• Classifying genes: deciding whether a gene is co-regulated with some known gene(s), based on their expression profiles across samples.
(Figure: expression profiles across samples of annotated gene 1, annotated gene 2 and an unknown gene.)
Co-regulation? Similar biological function? Same transcription factor?
Things to study (4)
• Classifying samples: classifying new samples based on a set of already classified samples (example: cancer versus normal; different types of cancer; ...).
(Figure: classified samples of types A and B, plus samples to be classified.)
Things to study (5)
• Selecting genes:
  a) Deciding if a given gene, in isolation, behaves differently in a control versus an experimental situation (e.g., cancer vs. normal, two types of cancer, treatment vs. no treatment).
  b) Selecting which group of genes is significantly different in a control versus an experimental situation (same examples).
  c) Selecting which group of genes is relevant for a given classification problem.
Clustering methods
• Similarity-based (needs a similarity function)
  – Construct a partition
  – Agglomerative, bottom-up
  – Searching for an optimal partition
  – Typically "hard" clustering
• Model-based (latent models, probabilistic or algebraic)
  – First compute the model
  – Clusters are obtained easily once the model is available
  – Typically "soft" clustering
Similarity-based clustering
• Define a similarity function to measure the similarity between two objects
• Common criterion: find a partition that
  – maximizes intra-cluster similarity
  – minimizes inter-cluster similarity
• Two ways to construct the partition:
  – Hierarchical (e.g., agglomerative hierarchical clustering)
  – Search, starting from a random partition (e.g., K-means)
Agglomerative Hierarchical Clustering
• Given a similarity function to measure the similarity between two objects,
• gradually group similar objects together in a bottom-up fashion,
• and stop when some stopping criterion is met.
• Variations: different ways to compute group similarity based on individual object similarity.
Distance Metrics
• For clustering algorithms, computing a distance between gene vectors or experiment vectors is a necessary step.
• Distance metrics can be classified as
  – metric distances
  – semi-metric distances
• Metric distances satisfy:
  1. d_ab >= 0
  2. d_ab = d_ba
  3. d_aa = 0
  4. d_ab <= d_ac + d_cb (triangle inequality)
• Semi-metric distances obey 1) to 3) but fail 4).
Distance Metrics
Minkowski distance:
$$d(i,j) = \left(|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\right)^{1/q}$$
If q = 1, d is the Manhattan distance (also a metric distance):
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
If q = 2, d is the Euclidean distance (metric distance):
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$
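A small numpy sketch of these distances, reusing the sample log-ratio rows from the earlier table; the function name is an assumption:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two expression vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
y = np.array([-0.10, 0.49, 0.24, 0.06, 0.46])

print(minkowski(x, y, 1))  # Manhattan distance (q = 1)
print(minkowski(x, y, 2))  # Euclidean distance (q = 2)
```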
Distance Metrics
Pearson correlation coefficient (semi-metric distance)
$$d(i,j) = \frac{\sum_{k=1}^{n}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{n}(x_{ik}-\bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{n}(x_{jk}-\bar{x}_j)^2}}, \qquad -1 \le d(i,j) \le +1$$
where $\bar{x}_i$ and $\bar{x}_j$ are the mean expression levels of the two vectors.
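A direct numpy transcription of the formula above; the function name is an assumption:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two expression vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
y = np.array([0.15, 0.74, 0.04, 0.10, 0.20])
print(pearson(x, y))  # value in [-1, +1]; equals np.corrcoef(x, y)[0, 1]
```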
Distance Metrics
Entropy-based distances: Mutual Information (semi-metric distance)
• Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.
• MI is a measure of the additional information known about one expression pattern when given another.
• MI is not based on linear models and can therefore also detect non-linear dependencies (see figure). A small estimator sketch follows.
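A common way to estimate MI for continuous expression values is to discretize them into bins first. This numpy sketch does exactly that; the bin count, function name and toy signals are assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based MI estimate between two expression vectors (in nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()            # joint distribution P(A, B)
    px = pxy.sum(axis=1, keepdims=True)  # marginal P(A)
    py = pxy.sum(axis=0, keepdims=True)  # marginal P(B)
    nz = pxy > 0                         # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(mutual_information(x, x ** 2))                 # non-linear dependence: MI clearly > 0
print(mutual_information(x, rng.normal(size=500)))   # independent signals: MI near 0
```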
Similarity-induced Structure
(Figure: the hierarchy of groupings induced by pairwise similarities.)
How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
• Single-link algorithm: s(g1, g2) = similarity of the closest pair
• Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
• Average-link algorithm: s(g1, g2) = average similarity over all pairs
(A scipy sketch of the three linkages follows.)
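With scipy, these three linkage strategies are one parameter away from each other. A minimal sketch; the toy gene profiles are an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 5 gene profiles with 3 samples each.
genes = np.array([[0.1, 0.2, 0.1],
                  [0.2, 0.1, 0.2],
                  [1.9, 2.0, 2.1],
                  [2.1, 1.9, 2.0],
                  [5.0, 5.1, 4.9]])

for method in ("single", "complete", "average"):
    Z = linkage(genes, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```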
Comparison of the Three Methods
• Single-link
  – "Loose" clusters
  – Individual decision, sensitive to outliers
• Complete-link
  – "Tight" clusters
  – Individual decision, sensitive to outliers
• Average-link
  – "In between"
  – Group decision, insensitive to outliers
• Which one is the best? It depends on what you need!
Hierarchical (agglomerative) clustering
• Strictly speaking, agglomerative clustering does not produce clusters, but a dendrogram.
• Cutting the dendrogram at a certain level yields clusters.
• Dendrogram cutting is a problem analogous to the selection of K in K-means clustering.
(Figure: a dendrogram over objects 2, 3, 5, 1, 4, with dissimilarity on the vertical axis, cut at a level that yields clusters.)
Example of agglomerative gene clustering (Eisen et al., 1998)
Microarray data from a time course of serum stimulation of primary human fibroblasts.
Experiment: foreskin fibroblasts were grown in culture and were deprived of serum for 48 h. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 h, 2 h, 3 h, 4 h, 8 h, 12 h, 16 h, 20 h and 24 h.
Clustering: correlation coefficient + agglomerative clustering (average-link).
Clusters with a biological interpretation:
(A) cholesterol biosynthesis,
(B) the cell cycle,
(C) the immediate-early response,
(D) signalling and angiogenesis,
(E) wound healing and tissue remodelling.
Data Structures
• Data matrix (n objects, p variables):
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
• Dissimilarity matrix (n x n, lower triangular):
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
• Given k, find the partition into k clusters that optimizes the chosen partitioning criterion:
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: k-means and k-medoids algorithms
  – k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
  – k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
• Step 1: Partition the objects into k non-empty subsets.
• Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
• Step 3: Assign each object to the cluster with the nearest seed point.
• Step 4: Go back to Step 2; stop when no assignment changes.
A minimal implementation of these steps is sketched below.
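A minimal numpy sketch of the iteration above (Lloyd's algorithm). The initialization by randomly chosen objects, the empty-cluster guard and the function name are assumptions:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute centroids (keep the old one if a cluster goes empty).
        new_centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when nothing moves
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 3.0], [4.2, 3.1]])
print(k_means(X, k=2))
```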
The K-Means Clustering Method
• Example (figure: scatter plots of points in [0, 10] x [0, 10] over successive iterations):
  – K = 2
  – Arbitrarily choose K objects as the initial cluster centers
  – Assign each object to the most similar center
  – Update the cluster means
  – Reassign and update again, until the assignments no longer change
Comments on the K-Means Method
• Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters and t the number of iterations. Normally k, t << n.
• For comparison: PAM is O(k(n-k)^2), CLARA is O(ks^2 + k(n-k)).
• Comment: it often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.
• Weaknesses:
  – Applicable only when a mean is defined; what about categorical data?
  – The number of clusters k must be specified in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
• A few variants of k-means differ in:
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies for calculating the cluster means
• Handling categorical data: k-modes (Huang, 1998)
  – Replaces the means of clusters with modes
  – Uses new dissimilarity measures to deal with categorical objects
  – Uses a frequency-based method to update the modes of clusters
• A mixture of categorical and numerical data: the k-prototype method
What is the problem of the k-Means Method?
• The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster. (See the sketch below.)
(Figure: the same data set clustered around means versus around medoids.)
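A minimal alternating-step k-medoids sketch in the spirit of PAM, though simpler than PAM's full swap search; the function name, initialization and toy data are assumptions:

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Simple k-medoids: medoids are actual data points, robust to outliers."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    for _ in range(max_iter):
        labels = D[:, medoid_idx].argmin(axis=1)               # assign to nearest medoid
        new_idx = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            within = D[np.ix_(members, members)].sum(axis=1)   # total distance to cluster mates
            new_idx.append(members[within.argmin()])           # most central object = medoid
        new_idx = np.array(new_idx)
        if np.array_equal(new_idx, medoid_idx):                # stop when medoids are stable
            break
        medoid_idx = new_idx
    return labels, X[medoid_idx]

# An extreme outlier barely moves the medoids, unlike k-means centroids.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [4.0, 3.0], [4.1, 3.1], [100.0, 100.0]])
print(k_medoids(X, k=2))
```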
Problem 1
Consider the following expression matrix, where the expression levels of 2 genes (G1 and G2) were analyzed in 7 healthy/infected tissues (conditions C1 to C7). Consider also the problem of grouping the tissues, given the expression profiles of the genes, using clustering algorithms. (The expression matrix is given in a figure that is not reproduced here.)
• Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the single-link distance to compute the distance between groups (inter-cluster distance).
• How would you use the dendrogram to group the tissues into 2 groups (clusters), and what would those clusters be?
• Determine the groups found by the K-means (K = 2) algorithm when the centroids are initialized with C5 = (4, 3) and C6 = (1, 1).
Biclustering: Motivation
• Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:
  – the gene dimension
  – the condition dimension
• This corresponds to the:
  – analysis of the expression patterns of genes, by comparing the rows of the matrix
  – analysis of the expression patterns of samples, by comparing the columns of the matrix
Biclustering: Motivation
Common objectives pursued when analyzing gene expression data include:
1. Grouping genes according to their expression under multiple conditions.
2. Classifying a new gene, given its expression and the expression of other genes with known classification.
3. Grouping conditions based on the expression of a number of genes.
4. Classifying a new sample, given the expression of the genes under that experimental condition.
What is Biclustering?
• Biclustering = simultaneous clustering of both the rows and the columns of a data matrix.
• The concept can be traced back to the 1970s (Hartigan, 1972), although it was rarely used or studied for a long time.
• The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis.
• The technique is also used in other fields, such as collaborative filtering, information retrieval and data mining.
What is Biclustering?
• We consider an n by m data matrix A = (X, Y), where
  – X = {x1, ..., xn} is the set of n rows
  – Y = {y1, ..., ym} is the set of m columns
  – aij is a numeric value (discrete or real) representing the relation between row i and column j.
• In the case of gene expression matrices:
  – X = set of genes
  – Y = set of conditions
  – aij = expression level of gene i under condition j (a real value).
What is Biclustering?
Gene expression matrix A = (X, Y):

          Condition 1  ...  Condition j  ...  Condition m
Gene 1    a11          ...  a1j          ...  a1m
...       ...               ...               ...
Gene i    ai1          ...  aij          ...  aim
...       ...               ...               ...
Gene n    an1          ...  anj          ...  anm
What is Biclustering?
Given the matrix A = (X, Y), let I be a subset of rows and J a subset of columns:
• (I, Y) = a subset of rows that exhibit similar behavior across the set of all columns = a cluster of rows
• (X, J) = a subset of columns that exhibit similar behavior across the set of all rows = a cluster of columns
What is Biclustering?
• (I, J) = a subset of rows and a subset of columns, where the rows exhibit similar behavior across the columns and vice-versa
  = the sub-matrix of A that contains only the elements aij with row set I and column set J
  = a bicluster
• We want to identify a set of biclusters Bk = (Ik, Jk).
• Each bicluster Bk must satisfy some specific homogeneity criteria.
What is Biclustering?
Example: a 6 x 10 gene expression matrix with entries aij,
X = {G1, G2, G3, G4, G5, G6}
Y = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10}
• Cluster of rows (I, Y): I = {G2, G3, G4}, taken across all conditions in Y.
• Cluster of columns (X, J): J = {C4, C5, C6}, taken across all genes in X.
• Bicluster (I, J) = {{G2, G3, G4}, {C4, C5, C6}}: the sub-matrix at the intersection of the rows in I and the columns in J.
What is Biclustering?
• Biclustering goals:
  – Perform simultaneous clustering on the row and column dimensions of the gene expression matrix, instead of clustering the rows and columns separately.
  – Identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.
• Gene expression data analysis:
  – Identify subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities for every condition.
Madeira, Sara C. and Oliveira, Arlindo L., "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE/ACM Trans. Comput. Biol. Bioinformatics, January 2004.
Bicluster Types
• An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.
• There are four major classes of biclusters:
  1. Biclusters with constant values.
  2. Biclusters with constant values on rows or columns.
  3. Biclusters with coherent values.
  4. Biclusters with coherent evolutions.
Constant Values
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
Constant Values on Rows or Columns

Constant rows:
1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0

Constant columns:
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
Coherent Values

Additive model:
1.0 2.0 5.0 0.0
2.0 3.0 6.0 1.0
4.0 5.0 8.0 3.0
5.0 6.0 9.0 4.0

Multiplicative model:
1.0 2.0 0.5 1.5
2.0 4.0 1.0 3.0
4.0 8.0 2.0 6.0
3.0 6.0 1.5 4.5
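These two models can be written as a_ij = alpha_i + beta_j (additive) and a_ij = alpha_i * beta_j (multiplicative). A small numpy sketch regenerating matrices of this form; the particular row/column effects are inferred so as to reproduce the example matrices above, and are assumptions:

```python
import numpy as np

# Additive model: a_ij = alpha_i + beta_j.
add_row = np.array([1.0, 2.0, 4.0, 5.0])    # per-gene effect alpha_i
add_col = np.array([0.0, 1.0, 4.0, -1.0])   # per-condition effect beta_j
additive = add_row[:, None] + add_col[None, :]

# Multiplicative model: a_ij = alpha_i * beta_j.
mul_row = np.array([1.0, 2.0, 4.0, 3.0])
mul_col = np.array([1.0, 2.0, 0.5, 1.5])
multiplicative = mul_row[:, None] * mul_col[None, :]

print(additive)        # reproduces the additive example above
print(multiplicative)  # reproduces the multiplicative example above
```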
Coherent Evolutions

Overall coherent evolution:
S1 S1 S1 S1
S1 S1 S1 S1
S1 S1 S1 S1
S1 S1 S1 S1

Coherent evolution on the rows:
S1 S1 S1 S1
S2 S2 S2 S2
S3 S3 S3 S3
S4 S4 S4 S4
Coherent Evolutions

Coherent evolution on the columns:
S1 S2 S3 S4
S1 S2 S3 S4
S1 S2 S3 S4
S1 S2 S3 S4

Order-Preserving Sub-Matrix (OPSM):
12 20 15 90
15 27 20 40
35 49 40 49
10 19 13 70
(the values in every row follow the same linear order of the columns)
Algorithms
• In the simplest formulation, where the matrix entries are binary, a bicluster corresponds to a biclique in the corresponding bipartite graph.
• Finding a maximum-size bicluster is then equivalent to finding the maximum edge biclique in a bipartite graph, a problem known to be NP-complete (Peeters, 2003).
• More complex cases, where the actual numeric values in the matrix A are taken into account to compute the quality of a bicluster, have a complexity that is necessarily no lower than this simpler case.
Algorithms
• Given this, the large majority of the algorithms use heuristic approaches to identify biclusters.
• In many cases the algorithm is preceded by a normalization step applied to the data matrix, whose goal is to make the patterns of interest more evident.
• Some algorithms avoid heuristics, but exhibit an exponential worst-case runtime.
Algorithms
• Different objectives:
  – Identify one bicluster.
  – Identify a given number of biclusters.
• Different approaches:
  – Discover one bicluster at a time.
  – Discover one set of biclusters at a time.
  – Discover all biclusters at the same time (simultaneous bicluster identification).
Algorithms: Heuristic Approaches
• Iterative row and column clustering combination:
  – Apply clustering algorithms to the rows and columns of the data matrix, separately.
  – Combine the results using some sort of iterative procedure that merges the two cluster arrangements.
• Divide and conquer:
  – Break the problem into several subproblems that are similar to the original problem but smaller in size.
  – Solve the subproblems recursively.
Algorithms: Heuristic Approaches
• Divide and conquer (continued):
  – Combine the intermediate solutions to create a solution to the original problem.
  – Usually break the matrix into submatrices (biclusters) based on a certain criterion and then continue the biclustering process on the new submatrices.
• Greedy iterative search:
  – Always make the locally optimal choice, in the hope that this choice will lead to a globally good solution.
  – Usually performs greedy row/column addition/removal.
Algorithms
• Exhaustive bicluster enumeration:
  – A number of methods have been used to speed up the exhaustive search.
  – In some cases the algorithms assume restrictions on the size of the biclusters that should be listed.
Measure cluster homogeneity
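(The figures of this slide are not reproduced here.) The homogeneity measure used in the worked example at the end of this section is Cheng and Church's mean squared residue,
$$H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2,$$
where a_iJ, a_Ij and a_IJ are the row, column and overall means of the sub-matrix. A minimal numpy sketch; the function name is an assumption:

```python
import numpy as np

def mean_squared_residue(A):
    """Cheng & Church mean squared residue H(I, J) of a sub-matrix A."""
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    residue = A - row_means - col_means + overall
    return np.mean(residue ** 2)

# A perfectly additive bicluster has H(I, J) = 0.
B = np.array([[1.0, 2.0, 5.0, 0.0],
              [2.0, 3.0, 6.0, 1.0],
              [4.0, 5.0, 8.0, 3.0],
              [5.0, 6.0, 9.0, 4.0]])
print(mean_squared_residue(B))  # 0.0
```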
Cheng and Church's approach in practice:
• Missing values: replaced by random numbers.
• Find one bicluster at a time.
• Hide each bicluster already found by replacing its entries with random numbers, so that the next iteration finds a different one.
Example
• Consider the following expression matrix A = (X, Y), with row subset I and column subset J. (The matrix itself is given in a figure that is not reproduced here.)
• Run the brute-force deletion and addition algorithm to find a biclustering.
Example
• Run Algorithm 2 with δ = 0 (the maximum acceptable mean squared residue score) and α = 1.5 (the threshold for multiple node deletion).
Overall mean: a_IJ = 29/(4x4)
Row means (a_iJ): a_1J = 5/4, a_2J = 8/4, a_3J = 6/4, a_4J = 10/4
Column means (a_Ij): a_I1 = 7/4, a_I2 = 4/4, a_I3 = 9/4, a_I4 = 9/4
H(I,J) = (1/(4x4)) * ((a_11 - a_1J - a_I1 + a_IJ)^2 + (a_12 - a_1J - a_I2 + a_IJ)^2 + (a_13 - a_1J - a_I3 + a_IJ)^2 + (a_14 - a_1J - a_I4 + a_IJ)^2 + (a_21 - a_2J - a_I1 + a_IJ)^2 + ...) = 1.28
http://www.kemaleren.com/cheng-and-church.html
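A sketch of the single node deletion loop in the spirit of Cheng and Church's method, kept deliberately simple: the full algorithm also performs multiple node deletion (with α) and node addition, which are omitted here, and the function name is an assumption:

```python
import numpy as np

def single_node_deletion(A, delta):
    """Greedily drop the worst row/column until H(I, J) <= delta."""
    rows = np.arange(A.shape[0])
    cols = np.arange(A.shape[1])
    while True:
        B = A[np.ix_(rows, cols)]
        resid = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
        if np.mean(resid ** 2) <= delta:       # H(I, J) small enough: done
            return rows, cols
        row_scores = np.mean(resid ** 2, axis=1)  # d(i): mean residue of each row
        col_scores = np.mean(resid ** 2, axis=0)  # d(j): mean residue of each column
        if row_scores.max() >= col_scores.max():  # remove whichever is worse
            rows = np.delete(rows, row_scores.argmax())
        else:
            cols = np.delete(cols, col_scores.argmax())

rng = np.random.default_rng(1)
A = rng.random((6, 6))
print(single_node_deletion(A, delta=0.02))  # indices of the surviving rows and columns
```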