Introduction to Bioinformatics Biological Networks Department of Computing Imperial College London March 4, 2010 Lecture hours 14-15 Nataša Pržulj [email protected]

Data Clustering

• Find relationships and patterns in the data to gain insight into the underlying biology

• Clustering algorithms can be applied to the data to find groups of similar genes/proteins, or groups of similar samples

What is data clustering?

• Clustering of data is a method by which large sets of data are grouped into clusters (groups) of smaller sets of similar data.

• Example: There are a total of 10 balls which are of three different colours. We are interested in clustering the balls into three different groups.

• An intuitive solution is to group balls of the same colour together.

• Identifying similarity by colour was easy. However, we want to extend this to numerical values so that we can deal with biological data, and also to cases where there are more features (not just colour).

Clustering

• Partition a set of elements into subsets called clusters such that

– Elements of the same cluster are similar to each other (homogeneity property, H)

– Elements from different clusters are different (separation property, S)

Clustering Algorithms

• A clustering algorithm attempts to find natural groups of components (or data) based on some notion of similarity over the features describing them.

• Many clustering algorithms also compute the centroid of each group of data points.

• To determine cluster membership, many algorithms evaluate the distance between a point and the cluster centroids.

• The output from a clustering algorithm is typically a statistical description of the cluster centroids, together with the number of components in each cluster.

Cluster centroid:

• The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster.
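The centroid definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming points are given as equal-length tuples of numbers; the function name `centroid` and the sample data are made up for this example.

```python
def centroid(points):
    """Return the component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    # Average each coordinate (parameter) separately over all points.
    return tuple(sum(p[i] for p in points) / n for i in range(dims))


cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
print(centroid(cluster))  # (3.0, 5.0)
```

Note that the centroid does not have to coincide with any of the points in the cluster.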

Distance:

• Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean distance, which for two points p = (p1, p2, ...) and q = (q1, q2, ...) is given by:

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots} = \sqrt{\sum_i (p_i - q_i)^2}$$


• There are many possible distance metrics.

• Some theoretical (and intuitive) properties of distance metrics:

– The distance between two items (elements) must be greater than or equal to zero; distances cannot be negative.

– The distance between an item and itself must be zero.

– Conversely, if the distance between two items is zero, then the items must be identical.

– The distance between item A and item B must be the same as the distance between item B and item A.

– The distance between item A and item C must be less than or equal to the sum of the distance between items A and B and the distance between items B and C (triangle inequality).
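The properties above can be checked numerically on a small sample of items. The sketch below is only a spot check on the given items, not a proof; the helper name `satisfies_metric_properties`, the tolerance and the sample points are assumptions made for illustration. Squared Euclidean distance is included as an example of a measure that fails the triangle inequality.

```python
import itertools
import math


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    """Squared Euclidean distance; not a metric (triangle inequality can fail)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def satisfies_metric_properties(dist, items, tol=1e-12):
    """Check the listed properties of dist on every combination of items."""
    for a in items:
        if abs(dist(a, a)) > tol:          # distance to itself must be zero
            return False
    for a, b in itertools.product(items, repeat=2):
        d = dist(a, b)
        if d < 0:                          # non-negativity
            return False
        if a != b and d <= tol:            # zero distance only for identical items
            return False
        if abs(d - dist(b, a)) > tol:      # symmetry
            return False
    for a, b, c in itertools.product(items, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:  # triangle inequality
            return False
    return True


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]
print(satisfies_metric_properties(euclidean, points))          # True
print(satisfies_metric_properties(squared_euclidean, points))  # False
```

For the squared distance, going from (0, 0) to (6, 8) directly costs 100, while going via (3, 4) costs 25 + 25 = 50, which violates the triangle inequality.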


Example distances:

• Euclidean (L2) distance

• Manhattan (L1) distance

• Lm: (|x1-x2|^m + |y1-y2|^m)^(1/m)

• L∞: max(|x1-x2|, |y1-y2|)

• Inner product: x1x2 + y1y2

• Correlation coefficient

• For simplicity we will concentrate on Euclidean and Manhattan distances
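The measures listed above can be written as short Python functions. This is a sketch assuming points are equal-length tuples of numbers; the function names are chosen for this example and are not from the lecture.

```python
import math


def manhattan(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def minkowski(p, q, m):
    """Lm distance for a finite exponent m >= 1."""
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1.0 / m)


def l_infinity(p, q):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def inner_product(p, q):
    """Inner product of the two points, treated as vectors."""
    return sum(a * b for a, b in zip(p, q))


p, q = (1, 2), (4, 6)
print(manhattan(p, q))      # 7
print(euclidean(p, q))      # 5.0
print(l_infinity(p, q))     # 4
print(inner_product(p, q))  # 16
```

With `m=1` and `m=2`, `minkowski` agrees with `manhattan` and `euclidean` up to floating-point rounding.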


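Distance Measures: Minkowski Metric

• Suppose two objects x and y both have d features: x = (x_1, ..., x_d) and y = (y_1, ..., y_d).

• The Minkowski metric is defined as:

$$L_m(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^m \right)^{1/m}$$

Commonly used Minkowski metrics:

• m = 1: Manhattan (L1) distance, $\sum_{i=1}^{d} |x_i - y_i|$

• m = 2: Euclidean (L2) distance, $\sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

• m → ∞: L∞ distance, $\max_i |x_i - y_i|$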


Examples of Minkowski metrics:
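As a small worked example (the numbers are illustrative and are not taken from the lecture figure), take the Minkowski distance in its usual form with exponent m, and two objects x and y with three features each:

```latex
\[
L_m(x, y) = \Bigl(\sum_{i} |x_i - y_i|^m\Bigr)^{1/m}, \qquad
x = (2, 1, 5),\quad y = (5, 5, 5),\quad |x - y| = (3, 4, 0)
\]
\[
\begin{aligned}
L_1(x, y) &= 3 + 4 + 0 = 7 \\
L_2(x, y) &= \sqrt{3^2 + 4^2 + 0^2} = \sqrt{25} = 5 \\
L_\infty(x, y) &= \max(3, 4, 0) = 4
\end{aligned}
\]
```

The same pair of objects is therefore "7 apart", "5 apart" or "4 apart" depending on the exponent, which is why the choice of distance measure matters for clustering.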


Distance/Similarity matrices:

• Clustering is based on distances, collected in a distance/similarity matrix:

– Represents the distance between objects

– Only need half the matrix, since it is symmetric
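A distance matrix for a handful of points could be built as follows. This sketch assumes Euclidean distance and points given as tuples; it computes only the upper triangle and mirrors it, which is where the symmetry saves work. The name `distance_matrix` is chosen for this example.

```python
import math


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(points, dist=euclidean):
    """Return the full symmetric matrix of pairwise distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):      # upper triangle only
            d = dist(points[i], points[j])
            matrix[i][j] = d
            matrix[j][i] = d           # mirror into the lower triangle
    return matrix


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for row in distance_matrix(points):
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```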


Hierarchical vs Non-hierarchical:

• Hierarchical clustering is one of the most commonly used methods for identifying groups of closely related genes or tissues. It successively links genes or samples with similar profiles to form a tree structure.

• K-means clustering is a method for non-hierarchical (flat) clustering that requires the analyst to supply the number of clusters in advance and then allocates genes and samples to clusters appropriately.


Hierarchical Clustering:

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
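The four steps above can be sketched directly in Python. This is a deliberately naive illustration rather than an efficient implementation: it recomputes cluster-to-cluster distances from the original matrix on every pass instead of updating a stored matrix. The function name `agglomerative`, the `linkage` parameter and the example matrix `D` are assumptions made for this sketch; `linkage=min` takes the smallest cross-cluster distance and `linkage=max` the largest.

```python
def agglomerative(dist, linkage=min):
    """Merge clusters pairwise until one remains.

    dist is a symmetric NxN matrix (list of lists). Returns the list of
    merges as (cluster_a, cluster_b, distance), in the order performed.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Step 3: distance between two clusters, from item distances.
                d = linkage(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Step 4: the loop repeats until a single cluster of size N remains.
    return merges


D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
for merge in agglomerative(D):
    print(merge)
# ((0,), (1,), 2)
# ((2,), (3,), 4)
# ((0, 1), (2, 3), 5)
```

The recorded merge distances are what a dendrogram would draw as branch heights.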


Hierarchical Clustering:

1. Scan the matrix for the minimum

2. Join items into one node

3. Update matrix and repeat from step 1


Hierarchical Clustering:

Distance between two points – easy to compute

Distance between two clusters – harder to compute:

1. Single-Link Method / Nearest Neighbor

2. Complete-Link / Furthest Neighbor

3. Average of all cross-cluster pairs


Hierarchical Clustering:

1. Single-Link Method / Nearest Neighbor (also called the connectedness, or minimum method): the distance between one cluster and another cluster is equal to the shortest distance from any member of one cluster to any member of the other cluster.

2. Complete-Link / Furthest Neighbor (also called the diameter or maximum method): the distance between one cluster and another is equal to the longest distance from any member of one cluster to any member of the other cluster.

3. Average-link clustering: the distance between one cluster and another cluster is equal to the average distance from any member of one cluster to any member of the other cluster.
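The three linkage rules differ only in how they summarise the cross-cluster pairwise distances. A minimal sketch, assuming clusters are lists of items and `dist` is any item-to-item distance function (the one-dimensional `abs_diff` and the sample clusters are made up for illustration):

```python
def single_link(a, b, dist):
    """Shortest distance between any member of a and any member of b."""
    return min(dist(p, q) for p in a for q in b)


def complete_link(a, b, dist):
    """Longest distance between any member of a and any member of b."""
    return max(dist(p, q) for p in a for q in b)


def average_link(a, b, dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def abs_diff(p, q):
    """Distance between two numbers on a line."""
    return abs(p - q)


cluster_a = [1, 2]
cluster_b = [5, 9]
# Cross-cluster distances: |1-5|=4, |1-9|=8, |2-5|=3, |2-9|=7
print(single_link(cluster_a, cluster_b, abs_diff))    # 3
print(complete_link(cluster_a, cluster_b, abs_diff))  # 8
print(average_link(cluster_a, cluster_b, abs_diff))   # 5.5
```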


Clustering Algorithms

Hierarchical Clustering:

Example: Single-Link (Minimum) Method:

Resulting Tree, or Dendrogram:

Example: Complete-Link (Maximum) Method:

Resulting Tree, or Dendrogram:

Hierarchical Clustering:

In a dendrogram, the length of each tree branch represents the distance between the clusters it joins.

Different dendrograms may arise when different linkage methods are used.

K-Means Clustering:

• Basic idea: use cluster centroids (means) to represent clusters.

• Assign data elements to the closest cluster (centroid).

• Goal: Minimize intra-cluster dissimilarity.


K-Means Clustering:

• Pick (usually randomly) k points as centers of k clusters.

• Compute distances between a non-center point v and each of the k center points.

• Find the minimum distance, say it is to center point Ci, and assign v to the cluster defined by Ci.

• Do this for all non-center points and obtain k non-overlapping clusters containing all the points.

• For each cluster, compute its new center, which is the point with the minimum sum of distances from that point to all other points in the cluster.

• Repeat until the algorithm converges, i.e., the same set of centers is chosen as in previous iteration.

This results in non-overlapping clusters of potentially different sizes.
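The loop above can be sketched as follows. This is one possible reading, under stated assumptions: it uses Euclidean distance, picks the initial centers at random from the data, and updates each center to the cluster mean, matching the "centroids (means)" idea. Choosing instead the member point with the smallest summed distance to the others, as the step list describes, would give a medoid-based variant. The names `k_means`, `mean_point`, `max_iter` and `seed` are made up for this example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centers, clusters) after the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)    # k distinct data points as initial centers
    clusters = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:     # converged: same centers as last round
            break
        centers = new_centers
    return centers, clusters


points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]
centers, clusters = k_means(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

On this toy data the two well-separated groups of three points end up in separate clusters; on real data the result can depend on the initial centers, so the procedure is often run several times with different seeds.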


K-Means Clustering Example:


K-means vs. Hierarchical clustering:

• Computation Time:

– Hierarchical clustering: O(m n^2 log(n))

– K-means clustering: O(k t m n)

– t: number of iterations

– n: number of objects

– m: dimension of the vectors (objects are m-dimensional vectors)

– k: number of clusters

• Memory Requirements:

– Hierarchical clustering: O(mn + n^2)

– K-means clustering: O(mn + kn)

• Other:

– Hierarchical Clustering:

• Need to select the linkage method

• To perform any analysis, it is necessary to partition the dendrogram into k disjoint clusters by cutting the dendrogram at some point. A limitation is that it is not clear how to choose this k.

– K-means: Need to select K

– In both cases: Need to select distance/similarity measure
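To get a feel for the computation-time bounds above, the following plugs in some purely illustrative values: n = 1000 objects, m = 10 dimensions, k = 5 clusters, t = 20 iterations, and a base-2 logarithm. None of these values come from the lecture, and big-O bounds hide constant factors, so the figures only indicate relative growth.

```latex
\[
m\,n^2 \log_2 n = 10 \cdot 1000^2 \cdot \log_2 1000 \approx 10 \cdot 10^6 \cdot 9.97 \approx 10^8
\]
\[
k\,t\,m\,n = 5 \cdot 20 \cdot 10 \cdot 1000 = 10^6
\]
```

Under these assumptions the hierarchical bound is roughly a hundred times larger, and the gap widens as n grows because of the n^2 term.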

