Indiana University School of David Wild – CICC Quarterly Meeting, Jan 27 2005. Page 1 Projects 1-4 update David Wild CICC Quarterly Meeting January 27

Indiana University School of

Projects 1-4 update

David Wild

CICC Quarterly Meeting

January 27th 2006


CICC-related projects

• Formal CICC projects
1. Innovative cross-screen analysis of NIH DTP Human Tumor Cell Line Data – innovative scientific analysis of NIH HTS data
2. Development of cheminformatics web services and use cases in Taverna – web service & workflow infrastructure
3. Development of a novel interface for the analysis of PubChem HTS data – tools for interacting with lots of complex data
4. A structure storage and searching system for Distributed Drug Discovery – innovative kinds of chemical databases

• Other, related projects
– Fast clustering of very large datasets using Linux clusters
– Smart client for mining drug discovery data (Microsoft supported)


[Diagram: how the projects relate]

• PROJECT 1: Innovative cross-screen analysis of HTS data
• PROJECT 2: Web services & workflows
• PROJECT 3: Visualization, navigation & analysis tools for HTS data
• PROJECT 4: Experimental databases
• SMART CLIENT: General drug discovery web services & workflows
• SMART CLIENT: Smart interfaces (incl. NLP, RSS, agents, etc.)
• FAST PARALLEL CLUSTERING: Using DivKmeans & AVIDD


Desired outcomes by Summer 2006

• A chemical informatics web service infrastructure running at IU
• Several Taverna workflows that use these and other web services, and which demonstrate that the infrastructure can be used to perform complex, relevant operations on PubChem data
• Demonstrated scientific results with the NIH DTP data
• An established Distributed Drug Discovery database linked with PubChem, that shows that our techniques together with PubChem can be employed in ways which benefit humanity in general
• A sandbox PubChem copy with improved functionality and architecture
• One or more novel visualization tools for PubChem data
• Demonstrate the feasibility of fast, accurate clustering of very large datasets (including the whole of PubChem) using the AVIDD Linux Cluster and a parallelized clustering algorithm (DivKmeans)
• Show that .NET and Java-based web services can work well together in a common infrastructure
• Demonstrate the feasibility of a natural language or other straightforward interface for scientists to express their information needs
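
To make the clustering outcome above more concrete, here is a minimal sketch of a divisive (bisecting) k-means in plain Python. It is not BCI's DivKmeans and is not parallelized; the function names, the deterministic seeding, and the rule of always splitting the largest cluster are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: List[Vector], iterations: int = 20):
    """Split points into two groups with a plain 2-means loop.

    Seeds are deterministic (an assumption of this sketch): the first
    point and the point farthest from it. If every point is identical,
    the second group comes back empty.
    """
    seed_a = points[0]
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))
    centers = [list(seed_a), list(seed_b)]
    groups = [list(points), []]
    for _ in range(iterations):
        new_groups = [[], []]
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_groups[nearer].append(p)
        if not new_groups[0] or not new_groups[1]:
            break  # keep the last split that had two non-empty groups
        groups = new_groups
        new_centers = [centroid(g) for g in groups]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return groups


def divisive_kmeans(points: List[Vector], k: int) -> List[List[Vector]]:
    """Start with one cluster and repeatedly bisect the largest one."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters[0]
        if len(target) < 2:
            break  # nothing left to split
        left, right = two_means(target)
        if not right:
            break  # largest cluster holds only identical points
        clusters = [left, right] + clusters[1:]
    return clusters
```

Each bisection works on a single cluster independently of the others, which is one reason a divisive scheme is a natural candidate for running on a Linux cluster.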


[Workflow diagram]

Cluster the compounds in the NIH DTP database by chemical structure, then choose representative compounds from the clusters and dock them into PDB protein files of interest

Services (implementation):
• NIH Database Service (PostgreSQL CHORD)
• Fingerprint Generator (BCI Makebits)
• Cluster Analysis (BCI Divkmeans)
• Table Management (VoTables)
• Plot Visualizer (VoPlot)
• Docking Selector (Script)
• 2D-3D (OpenEye OMEGA)
• Docking (OpenEye FRED)
• 3D Visualizer (JMOL)
• PDB Database Service

Data passed between services:
• SMILES + ID
• Fingerprints
• SMILES + ID + Data
• Cluster Membership
• SMILES + ID + Cluster # + Data
• MOL File
• PDB Structure + Box
• Docked Complex
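
The "choose representative compounds from the clusters" step of this workflow can be sketched as follows. This is an illustration only: the slides do not say how the Docking Selector script picks representatives, so the medoid rule used here (the member with the highest total Tanimoto similarity to its cluster) is an assumption, as are the `Compound` record, the function names, and the toy fingerprints, which are made-up bit sets rather than BCI Makebits output.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, NamedTuple


class Compound(NamedTuple):
    """Hypothetical record mirroring 'SMILES + ID + Cluster #'."""
    compound_id: str
    smiles: str
    fingerprint: FrozenSet[int]  # indices of the "on" bits (toy values)
    cluster: int


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_representatives(compounds: Iterable[Compound]) -> Dict[int, Compound]:
    """Pick one compound per cluster: the one most similar to its cluster mates."""
    by_cluster = defaultdict(list)
    for c in compounds:
        by_cluster[c.cluster].append(c)
    reps = {}
    for cluster_id, members in by_cluster.items():
        reps[cluster_id] = max(
            members,
            key=lambda m: sum(
                tanimoto(m.fingerprint, other.fingerprint) for other in members
            ),
        )
    return reps


# Toy input: the fingerprints are invented, not computed from the SMILES.
compounds = [
    Compound("A", "CCO", frozenset({1, 2, 3}), 0),
    Compound("B", "CCCO", frozenset({1, 2, 3, 4}), 0),
    Compound("C", "CCN", frozenset({1, 2, 9}), 0),
    Compound("D", "c1ccccc1", frozenset({7, 8}), 1),
]
reps = select_representatives(compounds)
```

In the workflow, the SMILES and IDs of the selected compounds would then go on to the 2D-3D conversion and docking services.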


[Workflow diagram]

Services (implementation):
• NIH Database Service (PostgreSQL CHORD)
• PDB Ligand Database Service
• PDB Database Service
• Docking Selector (Script)
• 2D-3D (OpenEye OMEGA)
• Docking (OpenEye FRED)
• 3D Visualizer (JMOL)

Data passed between services:
• SMILES + ID + Data
• NIH SMILES + ID
• MOL File
• Protein
• Documents
• Docked Complex
