Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering
Sharlee Climer and Weixiong Zhang
This research was supported in part by NDSEG and Olin Fellowships and by NSF grants IIS-0196057 and ITR/EIA-0113618.
Slide 2
Sharlee Climer, Washington University in St. Louis
Overview
  Introduction
  Example
  Results
  Conclusion
Slide 3
Introduction
Rearrangement clustering
  Rearrange rows of a matrix
  Minimize the sum of the differences between adjacent rows: $\min \sum_{i} d(i, i+1)$
  Rows correspond to objects
  Columns correspond to features
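The objective above can be sketched in a few lines. This is only an illustration: the slide does not fix a particular difference measure d, so Euclidean distance is assumed here, and the name `adjacent_row_cost` and the toy matrix are hypothetical.

```python
import math

def adjacent_row_cost(matrix, order):
    """Sum of d(i, i+1) over rows that end up adjacent under `order`.

    Assumption: d is the Euclidean distance between two rows.
    """
    return sum(math.dist(matrix[a], matrix[b]) for a, b in zip(order, order[1:]))

# Four objects (rows) described by two features (columns).
data = [[0, 0], [3, 4], [0, 0], [3, 4]]
original = adjacent_row_cost(data, [0, 1, 2, 3])    # unlike rows alternate
rearranged = adjacent_row_cost(data, [0, 2, 1, 3])  # identical rows placed together
```

Placing identical rows next to each other lowers the sum, which is what the rearrangement is meant to achieve.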
Slide 4
Introduction
Applications
  Information retrieval
  Manufacturing
  Software engineering
Slide 5
Example
Slide 6
Example
Bond Energy Algorithm (BEA)
  Introduced in 1972 (McCormick, Schweitzer, White)
  Approximate solution
  Still widely used
Slide 7
Example
Slide 8
Example
Optimal solution
  Lenstra (1974) observed equivalence to the Traveling Salesman Problem (TSP)
    Given n cities and the distance between each pair
    Find the shortest cycle visiting every city
  NP-hard problem
Slide 9
Example
Transform into a TSP
  Each object corresponds to a city
  Distance between two cities equals the difference between the corresponding objects
  Dummy city added to the problem
    Costs from the dummy city to all other cities equal a constant
    Location of the dummy city indicates the position to cut the cycle into a path
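The transformation can be illustrated on a tiny instance. The sketch below is not the solver used in the talk: it enumerates every cycle by brute force, which is feasible only for a handful of rows, and it assumes Euclidean distance between rows and a dummy-city constant of 0. The name `rearrange_via_tsp` is hypothetical.

```python
import itertools
import math

def rearrange_via_tsp(rows, dummy_cost=0.0):
    """Order rows by solving a TSP with one added dummy city (brute force)."""
    n = len(rows)
    dummy = n  # index of the added dummy city

    def dist(a, b):
        # Every edge touching the dummy city costs the same constant.
        if a == dummy or b == dummy:
            return dummy_cost
        return math.dist(rows[a], rows[b])  # assumed difference measure

    best_cost, best_path = math.inf, None
    # Fix the dummy city as the start of the cycle and try every order
    # of the real cities after it.
    for perm in itertools.permutations(range(n)):
        cycle = (dummy,) + perm
        cost = sum(dist(cycle[i], cycle[(i + 1) % len(cycle)])
                   for i in range(len(cycle)))
        if cost < best_cost:
            # Cutting the cycle at the dummy city leaves the path `perm`.
            best_cost, best_path = cost, list(perm)
    # Remove the two dummy edges to report the cost of the path alone.
    return best_path, best_cost - 2 * dummy_cost

rows = [[0], [10], [1], [11]]
path, cost = rearrange_via_tsp(rows)
```

Because every city pays the same constant to reach the dummy city, the dummy does not favor any particular cut point; the cheapest cycle simply drops the most expensive adjacency that would otherwise close the loop.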
Slide 10
Example
TSP solvers were extremely slow even for small problems in the 1970s
Massive research efforts to solve the TSP over the last three decades
Current solvers
  Concorde (Applegate, Bixby, Chvatal, Cook, 2001)
  Solved a 15,112-city TSP
Slide 11
Example
Slide 12
Example
BEA offers an approximate solution; the TSP formulation offers an optimal one
We have observed a flaw in the objective function when the objects form natural clusters
  The objective minimizes the sum of the distances between every pair of adjacent rows
  Inter-cluster distances tend to be significantly larger than intra-cluster distances
  The summation is dominated by inter-cluster distances
Slide 13
Example
TSPCluster addresses this flaw
  Add k dummy cities
  k clusters are specified by the output
  TSP solver ignores inter-cluster distances
  Minimizes the sum of intra-cluster distances
  Use a sufficiently small constant for distances to/from dummy cities
  Dummy cities are never adjacent to each other
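A minimal sketch of the k-dummy-city idea, again using brute-force enumeration as a stand-in for a real TSP solver. Several details are assumptions rather than taken from the slides: Euclidean distance between rows, a dummy-to-city constant of 0, and a very large dummy-to-dummy cost (`dummy_pair_cost`) to keep dummy cities from being adjacent. The name `tsp_cluster` is hypothetical.

```python
import itertools
import math

def tsp_cluster(rows, k, dummy_cost=0.0, dummy_pair_cost=1e9):
    """Split rows into k ordered clusters via a TSP with k dummy cities."""
    n = len(rows)  # cities 0..n-1 are real, n..n+k-1 are dummies

    def dist(a, b):
        if a >= n and b >= n:
            return dummy_pair_cost  # assumption: discourage adjacent dummies
        if a >= n or b >= n:
            return dummy_cost       # small constant to/from dummy cities
        return math.dist(rows[a], rows[b])

    first = n  # fix one dummy city at the start of the cycle
    rest = [c for c in range(n + k) if c != first]
    best_cost, best_cycle = math.inf, None
    for perm in itertools.permutations(rest):
        cycle = (first,) + perm
        cost = sum(dist(cycle[i], cycle[(i + 1) % len(cycle)])
                   for i in range(len(cycle)))
        if cost < best_cost:
            best_cost, best_cycle = cost, cycle

    # Cut the cycle at every dummy city; each remaining piece is a cluster.
    clusters, current = [], []
    for city in best_cycle[1:]:
        if city >= n:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    if current:
        clusters.append(current)
    # Subtract the 2k dummy edges: what remains is the intra-cluster sum.
    return clusters, best_cost - 2 * k * dummy_cost

rows = [[0], [10], [1], [11]]
clusters, intra_cost = tsp_cluster(rows, k=2)
```

Each dummy city replaces one adjacency between real cities with two constant-cost edges, so the solver is free to drop the k most expensive adjacencies, and only intra-cluster distances remain in the total.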
Slide 14
Example
Slide 15
Results
Arabidopsis
  499 genes
  25 conditions
Comparison with BEA
  Used BEA similarity measure
  BEA score: 447,070
  TSPCluster score: 452,109 (k = 1)
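The slide does not spell out the BEA similarity measure. A minimal sketch, assuming the usual bond-energy form in which the similarity of two adjacent rows is the sum of the products of their corresponding entries, so that a higher total is better; the name `bea_score` and the toy matrix are hypothetical:

```python
def bea_score(matrix, order):
    """Sum, over rows adjacent in `order`, of the inner product of the two rows."""
    return sum(
        sum(x * y for x, y in zip(matrix[a], matrix[b]))
        for a, b in zip(order, order[1:])
    )

data = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]]
interleaved = bea_score(data, [0, 1, 2, 3])  # unlike rows adjacent
grouped = bea_score(data, [0, 2, 1, 3])      # like rows adjacent
```

Under this reading the measure is a similarity, so a TSP solver, which minimizes, would need it converted to a distance first, for example by negating it.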
Slide 16
Results
[Figure panels: BEA, TSPCluster]
Slide 17
Results
Compared with Cluster (Eisen et al., 1998) and k-ary (Bar-Joseph et al., 2003)
Used Pearson correlation coefficient
  Cluster: 398
  k-ary: 427
  TSPCluster: 436 (k = 1)
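How the scores above are aggregated is not stated on the slide. One plausible reading, shown here only as a sketch, is the sum of the Pearson correlation coefficients of adjacent rows, with higher being better. The names `pearson` and `pearson_score` and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def pearson_score(matrix, order):
    """Assumed score: sum of correlations between rows adjacent in `order`."""
    return sum(pearson(matrix[a], matrix[b]) for a, b in zip(order, order[1:]))

# Rows 0 and 2 rise together; rows 1 and 3 fall together.
profiles = [[1, 2, 3], [3, 2, 1], [2, 4, 6], [6, 4, 2]]
mixed = pearson_score(profiles, [0, 1, 2, 3])    # every adjacent pair anti-correlated
grouped = pearson_score(profiles, [0, 2, 1, 3])  # correlated rows placed together
```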
Slide 18
Results
[Figure panels: Cluster, k-ary, TSPCluster]
Slide 19
Results
TSPCluster with k equal to 2 to 50
How many clusters?
  Average inter-cluster distances
  BEA local peaks: 6, 13, 19, 26, 29, 35, 40, 47
  Pearson correlation coefficient local peaks: 3, 9, 12, 21, 26, 40
Computation time varied
  From less than half a minute to about 3 minutes
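Picking candidate values of k from local peaks can be sketched as follows. The input is assumed to be one average inter-cluster distance per value of k, obtained by running the clustering once per k; the numbers below are made up for illustration and are not the results from the talk, and the name `local_peaks` is hypothetical.

```python
def local_peaks(value_by_k):
    """Return the k values whose score exceeds that of both neighboring k values."""
    ks = sorted(value_by_k)
    return [
        k
        for prev, k, nxt in zip(ks, ks[1:], ks[2:])
        if value_by_k[k] > value_by_k[prev] and value_by_k[k] > value_by_k[nxt]
    ]

# Hypothetical average inter-cluster distances for k = 2..7.
averages = {2: 1.0, 3: 2.0, 4: 1.5, 5: 1.7, 6: 2.4, 7: 2.1}
candidates = local_peaks(averages)
```

The first and last k are never reported because they have only one neighbor to compare against.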
Slide 20
Results
[Figure panels: k = 26, k = 40]
Slide 21
Conclusion
Most problems have errors in their data
  Error introduced by approximation algorithms can't be expected to undo this error
Computers are cheap
Computers and solvers are sophisticated
Don't always have to resort to approximate solutions, even for NP-hard problems
Slide 22
Conclusion
Rearrangement clustering provides a linear ordering
Linear ordering is inherent to many applications
  Information retrieval
  Manufacturing
  Software engineering
Slide 23
Conclusion
Gene data arranged in linear order to examine data
Linear ordering not necessarily essential to gene clustering problems
Current work
  Optimally solve subproblems in clustering algorithms
Slide 24
Questions?