UGM 2006
Miklós Vargyas
What’s new in JKlustor
Overview
• An introduction to JKlustor
  – Brief history of the product
  – Main features
  – Usage examples
  – Performance
• LibMCS, an alternative approach to clustering chemical structures
  – Concepts, motivation
  – Features
  – Performance
• Future of JKlustor
Brief history of JKlustor
• First discovery tool in the JChem package
  – Jarp released in version 1.5.2 (March 22, 2001)
  – Compr 1.5.7 (May 27, 2001)
  – Ward 1.5.9 (Jun 25, 2001)
• API released in JChem 1.6.2 (May 16, 2002)
• Experimental LibMCS first released in JChem 3.0 (Dec 1, 2004)
• New JKlustor GUI to be released in JChem 3.?
JKlustor features
• Similarity based clustering
  – ChemAxon’s topological fingerprint
  – External data points, arbitrary dimension
  – Tanimoto, weighted Euclidean
• Hierarchical clustering: Ward
  – Reciprocal nearest neighbor algorithm
  – Kelley method
• Non-hierarchical clustering: Jarvis-Patrick
• Diversity calculation: Compr
• Structure based clustering: LibMCS
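The two metrics named above can be illustrated with a minimal sketch. It is not JKlustor code: the function names, the representation of a fingerprint as a set of on-bit indices, and the sample values are assumptions made for the example.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is given as a set of on-bit indices (an assumption
    of this sketch, not a JKlustor data structure).
    """
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here for two empty fingerprints
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance between two equally long data points."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


a = {1, 5, 9, 12}
b = {1, 5, 7, 12, 20}
print(tanimoto(a, b))  # 3 common bits / 6 distinct bits = 0.5
print(weighted_euclidean([0.0, 3.0], [4.0, 0.0], [1.0, 1.0]))  # 5.0
```

A dissimilarity is commonly derived as 1 minus the Tanimoto coefficient; whether JKlustor thresholds are expressed that way is not stated on the slide.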
JKlustor usage
• Command line tools
  – Pipelining commands
  – Option flags
  – Structure file/database input
  – Manual creation of cluster views
Pipeline: Input SDFile → GenerateMD → NNeib → JarvisPatrick → CreateView → MarvinView → Picture
JKlustor usage
• Prepare data and run clustering
  generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
  nneib -f 512 -t 0.1 -g -i fingerprints.txt -o neighborlists.txt
  jarp -c 0.2 -y -i neighborlists.txt -o clusters.txt
• View first cluster
  crview -i id -c "clid=1" -s input.sdf -t clusters.txt -o jarp_cluster1.sdf
  mview -c 3 -r 3 jarp_cluster1.sdf
• View centroids, display cluster id and size
  crview -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt -o jarp_centroids.sdf
  mview -c 3 -r 3 -f "clid:size" jarp_centroids.sdf
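To show what the neighbor-list and clustering steps compute conceptually, here is a minimal sketch of the textbook Jarvis-Patrick rule: two items join the same cluster when each appears in the other's neighbor list and they share at least a minimum number of neighbors. It does not reproduce the nneib or jarp option semantics; the function name, the `min_common` parameter, and the sample neighbor lists are assumptions made for illustration.

```python
def jarvis_patrick(neighbors, min_common):
    """Textbook Jarvis-Patrick clustering over precomputed neighbor lists.

    neighbors: dict mapping an item id to the list of its nearest neighbors.
    min_common: minimum number of shared neighbors (hypothetical parameter).
    """
    parent = {i: i for i in neighbors}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    ids = sorted(neighbors)
    nsets = {i: set(neighbors[i]) for i in ids}
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            mutual = j in nsets[i] and i in nsets[j]
            if mutual and len(nsets[i] & nsets[j]) >= min_common:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up neighbor lists for six items.
neighbor_lists = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1],
    3: [4, 2],
    4: [3, 2],
    5: [0, 1],
}
print(jarvis_patrick(neighbor_lists, min_common=1))  # [[0, 1, 2], [3, 4], [5]]
```

Item 5 stays a singleton because none of its neighbors lists it in return, which is how the method isolates outliers.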
JKlustor usage
[Figure: run time (s) versus library size (100 to 100000 structures); series: Ward 512, Jarp 512]
JKlustor performance
• Memory: O(n)
• Time: Jarvis-Patrick O(n^1.5), Ward O(n^2)
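As a rough illustration of what these complexity classes imply, the sketch below computes how much longer a run would take if the library grew tenfold, assuming run time follows the stated power law exactly and ignoring constant factors. The function name and the library sizes are chosen for the example only.

```python
def runtime_growth(n_small, n_large, exponent):
    """Factor by which run time grows when time scales as n ** exponent."""
    return (n_large / n_small) ** exponent


# Growing a library from 10 000 to 100 000 structures (illustrative sizes).
jarp_factor = runtime_growth(10_000, 100_000, 1.5)  # Jarvis-Patrick, O(n^1.5)
ward_factor = runtime_growth(10_000, 100_000, 2.0)  # Ward, O(n^2)
print(round(jarp_factor, 1), round(ward_factor, 1))  # 31.6 100.0
```

Under the same assumption, the O(n) memory footprint would grow only tenfold.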
What is MCS?
• The Maximum Common Substructure of two chemical structures
Clustering by MCS?
• Find the MCS of a group of structures
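The concept can be made concrete with a small exhaustive search. This is a sketch under simplifying assumptions, not the LibMCS algorithm: molecules are reduced to heavy-atom graphs (a list of element labels plus a set of bonds), bond orders are ignored, the common substructure is not required to be connected, and the search is exponential, so it only suits toy inputs. All names are hypothetical.

```python
def max_common_subgraph(labels_a, bonds_a, labels_b, bonds_b):
    """Largest atom mapping {index_in_a: index_in_b} found by backtracking.

    Mapped atoms must carry the same label, and every pair of mapped atoms
    must be bonded in both graphs or in neither (induced subgraph).
    """
    best = {}

    def bonded(bonds, i, j):
        return frozenset((i, j)) in bonds

    def extend(i, mapping, used_b):
        nonlocal best
        # Prune: mapping every remaining atom still could not beat the best.
        if len(mapping) + (len(labels_a) - i) <= len(best):
            return
        if i == len(labels_a):
            best = dict(mapping)  # strictly larger, thanks to the prune above
            return
        for j in range(len(labels_b)):
            if j in used_b or labels_a[i] != labels_b[j]:
                continue
            if all(bonded(bonds_a, i, p) == bonded(bonds_b, j, q)
                   for p, q in mapping.items()):
                mapping[i] = j
                used_b.add(j)
                extend(i + 1, mapping, used_b)
                del mapping[i]
                used_b.remove(j)
        extend(i + 1, mapping, used_b)  # alternative: leave atom i unmapped

    extend(0, {}, set())
    return best


# 1-propanol (C-C-C-O) versus ethanol (C-C-O), heavy atoms only.
propanol = (["C", "C", "C", "O"],
            {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))})
ethanol = (["C", "C", "O"],
           {frozenset((0, 1)), frozenset((1, 2))})
mapping = max_common_subgraph(*propanol, *ethanol)
print(mapping)  # {1: 0, 2: 1, 3: 2}, i.e. the shared C-C-O fragment
```

For a group of structures, one simple (assumed) approach is to fold the search pairwise: take the common substructure of the first two, then intersect it with each further structure.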
Very brief history of LibMCS
• Reaction automapper, based on Maximum Common Subgraph Search
• MCS class API made public
• Customer requested MCS based clustering
  – More intuitive than similarity based
  – Focused set analysis
    • screens: 2000 – 10000 structures
    • lead optimization: 3000 – 5000 structures
  – Should be hierarchical (outliers)
  – Ultimate goal: cluster 5000 compounds in 5 seconds
LibMCS features
• MCS based hierarchical clustering
• Flexible search options
• Hierarchy browser
• Filtering by chemical properties
• Cluster statistics
• No size limitation
• Fast operation
LibMCS – Dendrogram view
LibMCS – Molecule view
LibMCS – Table view
LibMCS – Statistics
LibMCS – Selections
LibMCS – Property filters
LibMCS – Output files
CCCN1CC(=O)SCC(C)C1=O CC1CSC(=O)CN(C2CCCC2)C1=O 0 21 0
CCCN1CC(=O)SCC(C)C1=O CC1CSC(=O)C2CCCN2C1=O 0 21 0
OC(=O)C1CCCN1C(=O)CCS CC(CS)C(=O)N1CCCC1C(O)=O 0 19 0
OC(=O)C1CCCN1C(=O)CCS [H]C1(CCCN1C(=O)CCS)C(O)=O 0 19 0
OC(=O)C1CCCN1C(=O)CCS OC(=O)C1CCCN1C(=O)C2CCCC2SC(=O)C3=CC=CC=C3 0 19 0
OC(=O)C1CCCN1C(=O)CCS OC(=O)C1CCCN1C(=O)C2CCCCC2S 0 19 0
CCC(=O)N(CC1=CC=CC=C1)C(C)C=O CC1SC(=O)C2(C)CC3=CC=CC=C3CN2C1=O 0 20 0
CCC(=O)N(CC1=CC=CC=C1)C(C)C=O CC1CSC(=O)C2CC3=C(CN2C1=O)C=CC=C3 0 20 0
CC1SC(=O)C2CCCN2C1=O CC1SC(=O)C2CCCN2C1=O 0 30 0
CC1SC(=O)CNC1=O CC1SC(=O)CNC1=O 0 29 0
OC(=O)C1CSCCCCCCCCC(CS)C(=O)N1 OC(=O)C1CSCCCCCCCCC(CS)C(=O)N1 0 31 0
CC(S)C(=O)NCC(O)=O CC(S)C(=O)NCC(O)=O 0 24 0
CCC1=CC=CC=C1 CC(NC(CCC1=CC=CC=C1)C(O)=O)C(=O)N2CCCC2C(O)=O 0 22 0
CCC1=CC=CC=C1 CCOC(=O)C(CC1=CC=CC=C1)NC(=O)NC(CC2=CC=CC=C2)C(=O)OCC 0 22 0
OC(=O)C1CCCN1C(=O)NC2=CC=CC=C2 OC(=O)C1CCCN1C(=O)NC2=CC=CC=C2 0 23 0
C\C(Cl)=N/OC(N)=O C\C(Cl)=N/OC(N)=O 0 27
> <Cluster_ID>
1163

> <Element_count>
1

> <Parent_ID>
1

$$$$
  Marvin  05290619172D
 23 24  0  0  0  0            999 V2000
    2.4230   -0.3587    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.1375    0.0538    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1375    0.8788    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4349   -1.1837    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1494   -1.5962    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.8638   -1.1837    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
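The cluster annotations in such an SD file can be read back with a few lines of code. The sketch below relies only on the general SD file layout (a `> <name>` header line, value lines, a blank line, and `$$$$` between records); the function name is hypothetical, and reading `Parent_ID` as the id of the enclosing cluster in the hierarchy is an assumption drawn from the field name.

```python
def read_sdf_data_fields(text):
    """Return one dict of data fields per SD file record."""
    records = []
    for block in text.split("$$$$"):
        fields = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i].strip()
            if line.startswith(">") and "<" in line and line.endswith(">"):
                # Header such as "> <Cluster_ID>": take the text in brackets.
                name = line[line.index("<") + 1:line.rindex(">")]
                value_lines = []
                i += 1
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i].strip())
                    i += 1
                fields[name] = "\n".join(value_lines)
            else:
                i += 1
        if fields:
            records.append(fields)
    return records


# Data fields as they appear in the excerpt; the structure block is omitted.
sample = """\
> <Cluster_ID>
1163

> <Element_count>
1

> <Parent_ID>
1

$$$$
"""
for record in read_sdf_data_fields(sample):
    print(record)
# {'Cluster_ID': '1163', 'Element_count': '1', 'Parent_ID': '1'}
```

Grouping the records by `Parent_ID` would then give the children of each node, if the assumed meaning of that field holds.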
LibMCS – RGroup decomposition
LibMCS – Performance
• Depends on
  – average structure size
  – total diversity
  – minimal required MCS size
  – atom/bond constraints
• Scales linearly
• Maximum speed achieved
  – 1 000 structures in 3 seconds
• Memory requirements
  – 100 000 structures occupy 200 MB
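Taking the quoted figures at face value, a back-of-the-envelope estimate for other library sizes might look like the sketch below. It assumes strictly linear scaling from the two quoted data points, so the time is a best case (the maximum speed will not hold for every data set). The function names are invented for the example.

```python
def best_case_seconds(n_structures):
    """Run time if the quoted maximum speed (1 000 structures in 3 s) held."""
    return n_structures * 3 / 1_000


def estimated_megabytes(n_structures):
    """Memory if usage scaled linearly from 200 MB per 100 000 structures."""
    return n_structures * 200 / 100_000


# The 5 000-compound set named as the ultimate goal earlier in the talk.
print(best_case_seconds(5_000))    # 15.0
print(estimated_megabytes(5_000))  # 10.0
```

Even in this best case, 5 000 compounds would take about 15 seconds, three times the stated goal of 5 seconds.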
LibMCS – Performance
[Figure: LibMCS running time (sec) versus structure count (0 to 35000)]
LibMCS – Further applications
• Find the MCS of existing clusters
• Data retrieval
• Assay analysis
• Compound acquisition
• Combinatorial library profiling
Development plans
• Disconnected MCS
• Multi-group clustering
• More chemical sense (e.g. avoid opening rings, consider chirality)
• Performance tuning (e.g. NN)
• Integrate Ward/Jarp into new GUI
• Additive clustering
• Clustering million compound libraries
• Integrate Chemical Terms
• Integrate molecular descriptors, optimized metrics
Summary
• New tool in JKlustor based on MCS
• More plausible grouping
• Hierarchical with dendrogram browser
• Statistics
• Filtering, coloring, selection
Acknowledgements
• Developers
  – Ferenc Csizmadia, Árpád Tamási, András Volford, Szilárd Doránt
  – Péter Vadász, Nóra Máté
• Special thanks