Framework for Sequence Cluster Merging (Also showing importance of domain knowledge)

Arvind Gopu

Masters student, Computer Science & Bioinformatics

Indiana University, Bloomington

http://biokdd.informatics.indiana.edu/~agopu

Email: [email protected]

Introduction

- Sequence clustering is a very important research topic.
  - Bottom-up approach: merge elements recursively up to a certain specificity.
  - Top-down approach: split elements until the desired specificity is achieved.
  - Two important issues: selectivity and sensitivity.
- The sequence clustering problem is unique.
  - No “observable” attributes, unlike most clustering problems. Examples:
    - Supermarket: soda, fruit juice, frozen foods, clothing, etc.
    - Demographic: height, race, etc.
    - Sequence clustering: just a bunch of amino acid characters (with accompanying well-studied sequence comparison/alignment programs).

Introduction …

- Getting back to sequence clustering…
- Fragmentation problem: well known in sequence clustering algorithms.
  - Example: BAG (Sun Kim) achieves 99% accuracy (selective), but at the cost of ~40-50% fragmentation (over-sensitive).
- Solution?
  - Bottom-up merging of the fragmented clusters.

Need for framework

- The suggested bottom-up approach is possible using various sub-methods.
- Framework:
  - Do common and unique tasks seamlessly.
  - Insert new sub-methods easily, with very little hassle.
- Implemented primarily in Perl, with supporting C programs and Unix shell scripts.

Framework Schematic

[Schematic: boxes in the framework diagram]
- Merge Suggestions from Clustering Algorithm
- Prepare Sequence Data
- Generate Combined Profile for Two Fragment Clusters
- Test Merge’bility
- Post-process New Clustering Result
- Enhanced Clustering Result
- Test Scaffold

Framework – Profile Generation

[Framework schematic repeated, with the highlighted step: Generate Combined Profile for Two Fragment Clusters]

Profile Generation – MSA

[Diagram: cluster C1 gives MSA (C1); cluster C2 gives MSA (C2); the two are aligned into MSA (C1, C2), the combined profile]

MSA = Multiple Sequence Alignment

Profile Generation – MSA

- Common first step: MSA profile generation for the two fragment clusters C1 and C2 (Clustalw).
  - MSA (C1) and MSA (C2)
  - Most expensive step in the framework.
- Common second step: combined profile generation (Clustalw).
  - Prof_Align [MSA (C1), MSA (C2)]
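The two common steps above can be sketched as a small command builder. This is only an illustration, not the framework's Perl code: it assumes a `clustalw` binary on the PATH, the file names are hypothetical, and the option names (`-INFILE`, `-ALIGN`, `-OUTFILE`, `-PROFILE1`, `-PROFILE2`, `-PROFILE`) should be checked against the help output of the installed ClustalW version.

```python
def msa_command(fasta_path, aln_path):
    """Command line for a full multiple alignment of one fragment cluster."""
    return ["clustalw", "-INFILE=" + fasta_path, "-ALIGN", "-OUTFILE=" + aln_path]


def profile_align_command(aln1_path, aln2_path, combined_path):
    """Command line for aligning two existing alignments as profiles."""
    return [
        "clustalw",
        "-PROFILE1=" + aln1_path,
        "-PROFILE2=" + aln2_path,
        "-PROFILE",
        "-OUTFILE=" + combined_path,
    ]


def combined_profile_plan(c1_fasta, c2_fasta, prefix):
    """Ordered list of commands: MSA (C1), MSA (C2), then the combined profile.

    All output file names are hypothetical and derived from `prefix`.
    """
    c1_aln = prefix + "_c1.aln"
    c2_aln = prefix + "_c2.aln"
    combined = prefix + "_combined.aln"
    return [
        msa_command(c1_fasta, c1_aln),
        msa_command(c2_fasta, c2_aln),
        profile_align_command(c1_aln, c2_aln, combined),
    ]


plan = combined_profile_plan("c1.fasta", "c2.fasta", "merge_test")
```

Each entry of `plan` could then be passed to `subprocess.run(cmd, check=True)`; the two MSA commands are the expensive part, so caching their outputs per cluster would be a natural design choice.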

Profile Generation – MSA explained..

- All of the implemented techniques depend on MSA profiles.
- MSA profile: align more than 2 sequences simultaneously.

Image from http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/

Profile Generation – MSA explained..

Image from http://www.mscs.mu.edu/~cstruble/class/mscs230/fall2002/notes/3

Framework – Merge’bility Test

[Framework schematic repeated, with the highlighted step: Test Merge’bility]

Model Comparison based Merge Test

Model Comparison based Merge Test

- Statistics/machine learning technique based method:
  - Uses relative entropy and statistical measures w.r.t. the Runs test.
- Drawbacks:
  - Almost impossible to nail down threshold values for the z-score or any other statistical measure.
  - Extremely dependent on sample size equality: does not work well when the two fragment sizes vary.

Model Comparison based Merge Test

- Each column in an MSA profile is a probabilistic model (details of construction beyond the scope of this talk).
- Compute similarity between corresponding columns in the two fragments: Kullback-Leibler distance.
  - Need to consider gaps while matching up columns: a challenging task.
  - Also need to screen for random “good” distances: taken care of by using a random model in the distance computation.
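A minimal sketch of the column comparison, assuming each column is modelled as a residue frequency distribution. The pseudocount, the gap handling (gaps are simply skipped) and the symmetrised form are assumptions made for illustration; the talk does not give the actual model construction, and the random-model screening step is not reproduced here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1.0):
    """Residue frequencies for one alignment column; gap characters are skipped."""
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for residue in column:
        if residue in counts:
            counts[residue] += 1.0
    total = sum(counts.values())
    return {aa: count / total for aa, count in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def symmetric_kl(p, q):
    """Average of the two directed divergences, so the score is order-independent."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Columns taken from the same alignment position in the two fragment profiles.
col_f1 = column_distribution("AAAA")
col_f2_similar = column_distribution("AAAG")
col_f2_different = column_distribution("WWWW")

near = symmetric_kl(col_f1, col_f2_similar)
far = symmetric_kl(col_f1, col_f2_different)
```

Here `near` comes out smaller than `far`, which is the behaviour the merge test relies on: similar columns give small distances, dissimilar columns give large ones.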

Model Comparison based Merge Test

Model Comparison based Merge Test

- Using column-wise comparison distance scores, compute a “distance vector”.
  - Symbolic representation for “good”, “bad” and “don’t care” distances (detail abstracted).
- Do a standard statistical test, the Runs test, to check how random the distance vector is…
  - Nice pattern: y | y | y | n | n | y | y | y | n | n | y | y | y
  - Random pattern: y | n | y | y | n | n | n | y | n | y | n | n | y | y
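The Runs test on such a vector can be sketched as follows, using the standard Wald-Wolfowitz formulas for the expected number of runs and its variance. The sketch assumes the “don’t care” symbols have already been dropped so that only two symbols remain; the function name and return layout are illustrative, not taken from the framework.

```python
import math


def runs_test(symbols):
    """Wald-Wolfowitz runs test on a two-symbol sequence.

    Returns (runs, expected_runs, std_dev, z).
    """
    values = list(symbols)
    kinds = sorted(set(values))
    if len(kinds) != 2:
        raise ValueError("need exactly two distinct symbols")
    n1 = values.count(kinds[0])
    n2 = values.count(kinds[1])
    n = n1 + n2
    # A new run starts wherever two neighbouring symbols differ.
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    expected = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    std_dev = math.sqrt(variance)
    z = 0.0 if std_dev == 0.0 else (runs - expected) / std_dev
    return runs, expected, std_dev, z


nice = "y y y n n y y y n n y y y".split()
random_like = "y n y y n n n y n y n n y y".split()

nice_result = runs_test(nice)           # 5 runs
random_result = runs_test(random_like)  # 9 runs
```

The nice pattern has fewer runs than expected by chance (negative z), while the random pattern sits slightly above the expected count. Where to put the z-score cut-off is exactly the threshold problem listed under the drawbacks of this method.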

Model Comparison based Merge Test

4) Do Runs test

Model Comparison based Merge Test

- Compute the mean, standard deviation and subsequently the z-score.
  - Threshold to separate “good” and “bad” merges.
- Drawbacks again…
  - The threshold will be sample specific; it is hard to have one threshold for the entire dataset (illustrated in the test results).
  - Failure rate is high if the sample sizes are unequal.

Phylogenetic Tree based Merge Test

Merge’bility Test – Techniques …

- Phylogenetic tree based method:
  - Evolutionary distance based method
    - Drawback: too strict; many false negatives possible. Also hard to nail down a threshold.
  - Evolutionary Least Common Ancestor (LCA) based method
    - Improved performance on both of the previously mentioned issues.

Phylogenetic Tree Evolutionary Distance based Merge Test

Phylogenetic Tree Distance based method

- Clustalw (or other tree generation tools) provides an NJ (neighbor-joining) tree of an MSA profile.
  - Sequence-length-normalized distance from the root for each sequence.
  - 0 < distance < 1
- Define some threshold for distance that separates intra-cluster from inter-cluster distances.

Phylogenetic Tree Distance based method

- Distance between sequences from…
  - Two clusters will be closer to:
    - ‘1’ if the two clusters are not merge’ble; call these “bad distances”.
    - ‘0’ if the two clusters are actually part of the same super cluster.
  - The same cluster will obviously be closer to ‘0’; these constitute “good distances”, which are “don’t care” in our case.
- Count the number of “bad distances”.
  - Gives a good idea of how good a merge is.

Phylogenetic Tree Distance based method

- Good enough? Not yet: the “bad distance” count needs normalization.
  - Why? The number of edges between vertices of the same/different clusters is proportional to the size of the clusters!
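A minimal sketch of the normalized “bad distance” count. The normalizer used here, the number of cross-cluster sequence pairs, is only one possible choice and is an assumption, as are the sequence names, the distance values and the 0.5 threshold, all of which are made up for illustration.

```python
from itertools import product


def bad_distance_fraction(distances, cluster_a, cluster_b, threshold):
    """Fraction of cross-cluster pairs whose distance exceeds `threshold`.

    `distances` maps frozenset({seq1, seq2}) to a distance in (0, 1).
    """
    bad = 0
    for a, b in product(cluster_a, cluster_b):
        if distances[frozenset((a, b))] > threshold:
            bad += 1
    # Assumed normalizer: the total number of cross-cluster pairs.
    return bad / (len(cluster_a) * len(cluster_b))


# Toy data: two fragments and hypothetical pairwise distances.
fragment_1 = ["a1", "a2"]
fragment_2 = ["b1", "b2", "b3"]
toy_distances = {
    frozenset(("a1", "b1")): 0.20,
    frozenset(("a1", "b2")): 0.90,
    frozenset(("a1", "b3")): 0.30,
    frozenset(("a2", "b1")): 0.10,
    frozenset(("a2", "b2")): 0.80,
    frozenset(("a2", "b3")): 0.95,
}

fraction = bad_distance_fraction(toy_distances, fragment_1, fragment_2, threshold=0.5)
```

In this toy case three of the six cross-cluster pairs are “bad”, so the fraction is 0.5 regardless of how large the two fragments are, which is the point of normalizing the raw count.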

Phylogenetic Tree Distance based method

- Once the number of “bad distances” is normalized, this method produced decent results.
  - Normalizing factor? Contentious. What is a good normalizer?
- The method is too strict for unequally sized clusters.
  - Most merges are rejected, leading to an appreciable number of false negatives.
  - Caused by the inherent nature of MSA programs and unequally sized profiles (cluster sizes).

Phylogenetic Tree LCA Coverage based Merge Test

Phy.Tree LCA coverage based method

- Clustalw, Phylip (or other tree generation tools) provide a rooted phylogenetic tree for an MSA profile.
- Looking at the tree, one can easily make out whether a pair of clusters should be merged or not.
  - How? Parse the tree into a usual tree data structure and look for the common ancestor of the sequences of each cluster.
- Example…

Phy.Tree LCA coverage based method

Good merge

- Sequences of the two clusters (shaded blue and red) are from the same super cluster.

Phy.Tree LCA coverage based method

Bad merge

- Sequences of the two clusters (shaded blue and red) are from different super clusters.

Phy.Tree LCA coverage based method

- Same LCA for both clusters? Good merge!
- If not … bad merge?
  - Not quite. It is possible that the LCAs are different but cover sequences from either cluster to a considerable extent.
  - Better to use the coverage of the LCAs instead.
- Example…

Phy.Tree LCA coverage based method

Why LCA coverage?

- The second cluster has three sequences, but its LCA covers four more sequences from the other cluster.

Phy.Tree LCA coverage based method

Coverage test:
- For clusters Ci and Ck, choose the smaller cluster, say Ci, i.e. |Ci| < |Ck|.
- Define Cov(LCA[Ci]) as the number of sequences that the LCA of Ci covers.
- The merge is considered good if Cov(LCA[Ci]) > |Ci| (the number of sequences in Ci), where |Ci| < |Ck|,
  - i.e. Cov(LCA[Ci]) / |Ci| > 1,
  - or Cross Coverage(LCA[Ci]) > 0.
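The coverage test can be sketched on a toy rooted tree. The nested-tuple tree representation, the sequence names and the reading of “cross coverage” as the number of covered sequences that do not belong to Ci are assumptions made for illustration; a real run would first parse the tree file produced by Clustalw or Phylip.

```python
def leaves(node):
    """All leaf names under a node; a leaf is a string, an internal node a tuple."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def lca(node, members):
    """Smallest subtree of `node` that contains every name in the set `members`."""
    if isinstance(node, str):
        return node
    for child in node:
        if members <= set(leaves(child)):
            return lca(child, members)
    return node


def cross_coverage(tree, cluster_i, cluster_k):
    """Number of sequences under the smaller cluster's LCA that are not its own."""
    small = set(min(cluster_i, cluster_k, key=len))
    covered = set(leaves(lca(tree, small)))
    return len(covered - small)


def is_good_merge(tree, cluster_i, cluster_k):
    """Cov(LCA[Ci]) / |Ci| > 1, which here is the same as cross coverage > 0."""
    return cross_coverage(tree, cluster_i, cluster_k) > 0


c_i = ["a1", "a2", "a3"]
c_k = ["b1", "b2", "b3", "b4"]

# Interleaved tree: the LCA of the three 'a' sequences is the root.
interleaved = ((("a1", "b1"), ("a2", "b2")), (("a3", "b3"), "b4"))
# Separated tree: each cluster forms its own subtree.
separated = (("a1", ("a2", "a3")), (("b1", "b2"), ("b3", "b4")))
```

With `interleaved`, the LCA of the three-sequence cluster covers four sequences of the other cluster, so the merge is accepted; with `separated`, the cross coverage is zero and the merge is rejected.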

Phy.Tree LCA coverage based method

- Advantages:
  - Sample size difference does not play a big role.
  - Demarcating between “good” and “bad” merges is much simpler and more straightforward.
  - Shown to work really well on a variety of data sizes and difficulty levels (see the test results).
- Possible weakness:
  - Bound to fail for extremely small fragments (say 2 sequences each): hard not to have a common LCA!

Test Results – 4 datasets (from COG database)

Test Results – Data set 1

DATA: COG {0001, 0005} (Real Size: 35, 30)

Observed outcome per merge’bility test method, against the expected outcome:

| Fragment size n (F1) | Fragment size n (F2) | Expected (Good / Bad) | Model Comparison (0.0001) | Phy.tree Distance | Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Good | Good | Good |
| 10 | 10 | Bad | Bad | Bad | Bad |
| 10 | 5 | Good | Good | Good | Good |
| 10 | 5 | Bad | Bad | Bad | Bad |
| 10 | 3 | Good | Good | Good | Good |
| 10 | 3 | Bad | Bad | Bad | Bad |
| 4 | 2 | Good | Good | Good | Good |
| 4 | 2 | Bad | Bad | Bad | Bad |
| 3 | 3 | Good | Good | Good | Good |
| 3 | 3 | Bad | Bad | Bad | Bad |

Test Results – Data set 2

DATA: COG {0142, 0183} (Real Size: 74, 116)

Observed outcome per merge’bility test method, against the expected outcome:

| Fragment size n (F1) | Fragment size n (F2) | Expected (Good / Bad) | Model Comparison (0.001) | Phy.tree Distance | Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Good | Good | Good |
| 10 | 10 | Bad | Bad | Bad | Bad |
| 10 | 5 | Good | Good | Bad | Good |
| 10 | 5 | Bad | Bad | Bad | Bad |
| 10 | 3 | Good | Good | Bad | Good |
| 10 | 3 | Bad | Good | Bad | Bad |
| 4 | 2 | Good | Good | Bad | Good |
| 4 | 2 | Bad | Bad | Bad | Bad |
| 3 | 3 | Good | Good | Bad | Bad |
| 3 | 3 | Bad | Bad | Bad | Bad |

Test Results – Data set 3

DATA: COG {0380, 0383} (Real Size: 15, 13)

Observed outcome per merge’bility test method, against the expected outcome:

| Fragment size n (F1) | Fragment size n (F2) | Expected (Good / Bad) | Model Comparison (0.001 / 0.0005) | Phy.tree Distance | Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Good / Bad | Good | Good |
| 10 | 10 | Bad | Good / Bad | Bad | Bad |
| 10 | 5 | Good | Good / Bad | Bad | Good |
| 10 | 5 | Bad | Good / Bad | Bad | Bad |
| 10 | 3 | Good | Good / Bad | Bad | Good |
| 10 | 3 | Bad | Good / Bad | Bad | Bad |
| 4 | 2 | Good | Good / Good | Good | Bad |
| 4 | 2 | Bad | Bad / Good | Bad | Bad |
| 3 | 3 | Good | Good / Bad | Good | Good |
| 3 | 3 | Bad | Bad / Bad | Bad | Good |

Test Results – Data set 4

DATA: COG {0160, 0161} (Real Size: 79, 49)

Observed outcome per merge’bility test method, against the expected outcome:

| Fragment size n (F1) | Fragment size n (F2) | Expected (Good / Bad) | Model Comparison (0.0001) | Phy.tree Distance | Phy.tree LCA coverage |
|---|---|---|---|---|---|
| 10 | 10 | Good | Bad | Good | Good |
| 10 | 10 | Bad | Good | Good | Bad |
| 10 | 5 | Good | Bad | Good | Good |
| 10 | 5 | Bad | Good | Good | Bad |
| 10 | 3 | Good | Good | Good | Good |
| 10 | 3 | Bad | Good | Bad | Bad |
| 4 | 2 | Good | Good | Good | Good |
| 4 | 2 | Bad | Good | Good | Bad |
| 3 | 3 | Good | Good | Good | Good |
| 3 | 3 | Bad | Good | Good | Bad |
| 2 | 2 | Good | Good | Good | Good |
| 2 | 2 | Bad | Good | Good | Good |

Acknowledgements!

A big thank you to:
- Prof. Sun Kim, advisor
- My parents, brother, grandparents!
- All my colleagues and friends: JH, Zhiping, Scott Martin, SR, Raj, Anshul, Pat Hayes and everyone else!
- Folks at CS & Informatics: CS Systems staff, Lucy, Linda, Wendy, Cheryl, Errissa, Bob!
- Profs. Marty Siegel and Gary Wiggins – GPC.
- RATS folks!

Did I forget someone?! Sorry if I did…