
Overview of Chemical Informatics and Cyberinfrastructure Collaboratory

October 18 2006

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

gcf@indiana.edu

http://www.infomall.org

http://www.chembiogrid.org

Activities

Local teams, successful prototypes and international collaboration set up in 3 initial major focus areas

• Chemical Informatics Cyberinfrastructure/Grids with services, workflows and demonstration uses, building on success in other applications (LEAD) and showing distributed integration of academic and commercial tools

• Computational Chemistry Cyberinfrastructure/Grids with simulation, databases and TeraGrid use

• Education with courses and degrees

Review of activities suggests we also formalize work in two further areas

• Chemical Informatics Research – model applicability and data-mining

• Interfacing with the User - interaction tools and portals optimized for particular customer groups

Also have started an activity to identify “customers” for Cyberinfrastructure and its implied Chemistry eScience model

CICC Senior Personnel

Geoffrey C. Fox, Mu-Hyun (Mookie) Baik, Dennis B. Gannon, Marlon Pierce, Beth A. Plale, Gary D. Wiggins, David J. Wild, Yuqing (Melanie) Wu

Peter T. Cherbas, Mehmet M. Dalkilic, Charles H. Davis, A. Keith Dunker, Kelsey M. Forsythe, Kevin E. Gilbert, John C. Huffman, Malika Mahoui, Daniel J. Mindiola, Santiago D. Schnell, William Scott, Craig A. Stewart, David R. Williams

From Biology, Chemistry, Computer Science, Informatics at IU Bloomington and IUPUI (Indianapolis)

CICC Infrastructure Vision

Drug Discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology

ChemBioGrid set up as distributed cyberinfrastructure in eScience model

ChemBioGrid will provide portals (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses

ChemBioGrid will provide services to manipulate this data and combine it in workflows; it will have convenient ways to submit and manage multiple jobs

ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar

The services include open-source software like CDK, commercial code from vendors such as BCI, OpenEye, Gaussian and Google, and any user-contributed programs

ChemBioGrid will define open interfaces to use for a particular type of service allowing plug and play choice between different implementations
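The plug-and-play idea can be illustrated with a minimal sketch: one interface for a hypothetical similarity service and two interchangeable implementations behind it. The names `SimilarityService`, `ExactMatch`, `CharOverlap` and `rank` are illustrative assumptions, not ChemBioGrid interfaces, and the scoring rules are toys chosen only to make the example runnable.

```python
from typing import Protocol


class SimilarityService(Protocol):
    """Hypothetical open interface: any implementation must offer score()."""

    def score(self, a: str, b: str) -> float: ...


class ExactMatch:
    """Toy implementation: 1.0 only when the two strings are identical."""

    def score(self, a: str, b: str) -> float:
        return 1.0 if a == b else 0.0


class CharOverlap:
    """Toy implementation: Jaccard overlap of the characters in each string."""

    def score(self, a: str, b: str) -> float:
        sa, sb = set(a), set(b)
        union = sa | sb
        return len(sa & sb) / len(union) if union else 1.0


def rank(service: SimilarityService, query: str, candidates: list[str]) -> list[str]:
    """Client code depends only on the interface, not on a particular vendor."""
    return sorted(candidates, key=lambda c: service.score(query, c), reverse=True)
```

Because `rank` only relies on the `score` method, either implementation can be swapped in without changing the caller, which is the kind of choice between implementations the slide describes.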

CICC Combines Grid Computing with Chemical Informatics

CICC: Chemical Informatics and Cyberinfrastructure Collaboratory

Funded by the National Institutes of Health

www.chembiogrid.org

Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

Science and Cyberinfrastructure

Large Scale Computing Challenges

Chemical Informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.

CICC is an NIH-funded project to support chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.

CICC supports the NIH mission by combining state of the art chemical informatics techniques with

• World class high performance computing

• National-scale computing resources (TeraGrid)

• Internet-standard web services

• International activities for service orchestration

• Open distributed computing infrastructure for scientists world wide

[Figure: pipeline diagram with boxes labeled NIH PubMed DataBase; OSCAR Text Analysis; POVRay Parallel Rendering; Initial 3D Structure Calculation; Toxicity Filtering; Cluster Grouping; Docking; Molecular Mechanics Calculations; Quantum Mechanics Calculations; IU’s Varuna DataBase; NIH PubChem DataBase]

Chemical informatics text analysis programs can process hundreds of thousands of abstracts of online journal articles to extract chemical signatures of potential drugs.

OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.
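A “pleasingly parallel” task is one where each item can be processed without waiting on any other. The sketch below shows that pattern under loud assumptions: `toy_score` is a made-up placeholder for an expensive per-molecule step such as docking, and a thread pool stands in for the cluster or Grid resources the slides refer to.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(molecule: str) -> int:
    """Placeholder (assumption) for an expensive per-molecule task such as docking."""
    return sum(ord(ch) for ch in molecule) % 100


def score_all(molecules: list[str], workers: int = 4) -> dict[str, int]:
    """Score every molecule independently; no task needs another task's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(toy_score, molecules))
    return dict(zip(molecules, scores))


def top_ranked(scores: dict[str, int], n: int) -> list[str]:
    """Return the n highest-scoring molecules for further examination."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Threads are used here only to keep the sketch short; CPU-bound work of this kind would more plausibly be farmed out as separate processes or batch jobs.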

Big Red (and the TeraGrid) will also enable us to perform time-consuming, multi-step Quantum Chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.

CICC Prototype Web Services

Basic cheminformatics

• Molecular weights

• Molecular formulae

• Tanimoto similarity

• 2D Structure diagrams

• Molecular descriptors

• 3D structures

• InChI generation/search

• CMLRSS

• R and Excel

Application based services

• Compare (NIH)

• Toxicity predictions (ToxTree)

• Literature extraction (OSCAR3)

• Clustering (BCI Toolkit)

• Docking, filtering, ... (OpenEye)

• Varuna simulation
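For the Tanimoto similarity service listed among the basic cheminformatics services, the underlying calculation is small enough to sketch. For two fingerprints with on-bit sets A and B, the coefficient is |A ∩ B| / |A ∪ B|. The function below is a minimal illustration, not the CICC service code; representing a fingerprint as a set of on-bit indices and returning 0.0 for two empty fingerprints are assumptions made here.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union
```

For example, fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving 0.5.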

Next steps?

• Define WSDL interfaces to enable global production of compatible Web services; refine CML

• Add more services (identify gaps)

• Add more databases, including 3D structural info

• Demonstrate use of services in other pipelining tools (KDE, Knime – Pipeline Pilot already done)

• Extend Computational Chemistry (Varuna) Services

• Routine TeraGrid and Big Red use “Production” on OSCAR3 CDK Gamess Jaguar

• Develop more training material

Key Ideas

• Add value to PubChem with additional distributed services and databases

• Develop nifty ideas like VOTables

• Wrapping existing code in web services is not difficult

• Provide “core” (CDK) services and exemplars of typical tools

• Provide access to key databases via a web service interface

• Provide access to major Compute Grids
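To give a feel for why wrapping existing code as a web service need not be difficult, here is a small sketch. It is an assumption-laden stand-in: the slides refer to WSDL-described services, whereas this example uses a plain HTTP/JSON endpoint built on Python's standard WSGI interface. The `molecular_weight` routine, the `formula` query parameter and the four-element weight table are all illustrative.

```python
import json
import re
from urllib.parse import parse_qs

# Standard atomic weights for a few elements (rounded); illustrative subset only.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """The 'existing code': weight of a simple formula such as C2H6O."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * int(count or 1)
    return total


def app(environ, start_response):
    """WSGI wrapper: a request with ?formula=C2H6O returns the weight as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    formula = query.get("formula", [""])[0]
    try:
        body = json.dumps({"formula": formula, "weight": molecular_weight(formula)})
        status = "200 OK"
    except KeyError:
        # Raised when the formula contains an element missing from the table.
        body = json.dumps({"error": "unsupported element"})
        status = "400 Bad Request"
    start_response(status, [("Content-Type", "application/json")])
    return [body.encode("utf-8")]
```

The wrapper adds only request parsing and response formatting around the unchanged routine. It could be served locally with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.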

Web Service Locations

Indiana University: Clustering, VOTables, OSCAR3, Toxicity classification, Database services

Penn State University (now moved to IU), CDK based services: Fingerprints, Similarity calculations, 2D structure diagrams, Molecular descriptors

Cambridge University: InChI generation / search, CMLRSS, OpenBabel

InfoChem: SPRESI database

SDSC: Typical TeraGrid Site

NIH: PubChem ….., Compare …..

Cheminformatics Education at IU

Linked to bioinformatics in Indiana University’s School of Informatics

• School of Informatics degree programs BS, MS, PhD

Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses

• Bioinformatics MS and track on PhD

• Chemical Informatics MS and track on PhD

• Informatics Undergraduates can choose a chemistry cognate (change to Life Sciences)

PhD in Informatics started in August 2005 and offers tracks in

• bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!

Good employer interest but modest student understanding of value of Cheminformatics degree

3 core courses in Cheminformatics plus seminar/independent studies

Significant interest in distance education version of introductory Cheminformatics course (enrollment promising in Distance Graduate Certificate in Chemical Informatics)

Current Status

Web site http://www.chembiogrid.org

Wiki chosen to support project as a shared editable web space

Building Collaboratory involving PubChem – Global Information System accessible anywhere and at any time – enhance PubChem with distributed tools (clustering, simulation, annotation etc.) and data

Adopted Taverna as workflow system, as it is popular in Bioinformatics, but we will evaluate other systems such as GPEL from LEAD

Demonstrated CI-enhanced Chemistry simulations

Initiated Data-mining, User interface and Chemical Informatics tools research

Prototyped large set of runs on local Big Red 23 Teraflop supercomputer (OSCAR3 and modeling moving to CDK Gamess Jaguar)

Initial results discussed at conferences/workshops/papers

• Gordon Conferences, ACS, SDSC tutorial

First new Cheminformatics courses offered

Advisory board set up and met – this is second meeting

Videoconferencing-based meetings with Peter Murray-Rust and group at Cambridge roughly every 2-3 weeks

Good or potentially good interactions with Local HTS in CGB, NIH DTP, Scripps, Lilly and Michigan ECCR

MLSCN Post-HTS Biology Decision Support

Percent Inhibition or IC50 data is retrieved from HTS

Question: Was this screen successful?

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Compounds submitted to PubChem

Workflows encoding distribution analysis of screening results

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Chem-informatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at PubChem, ECCR centers and MLSCN centers

Workflows encoding plate & control well statistics, distribution analysis, etc

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.

[Diagram labels: CHEMINFORMATICS, PROCESS, GRIDS]
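The slides do not say which plate and control-well statistics these workflows compute. One widely used screen-quality measure is the Z'-factor, Z' = 1 - 3(σ_pos + σ_neg) / |μ_pos - μ_neg|, computed from positive and negative control wells. The sketch below shows it alongside a percent-inhibition normalization. The choice of these two statistics, and the convention that negative controls map to 0% and positive controls to 100% inhibition, are assumptions made for illustration.

```python
from statistics import mean, stdev


def z_prime(pos: list[float], neg: list[float]) -> float:
    """Z'-factor from positive and negative control wells (sample std. dev.)."""
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))


def percent_inhibition(signal: float, pos: list[float], neg: list[float]) -> float:
    """Scale a raw well signal so neg controls map to 0 and pos controls to 100."""
    return 100.0 * (mean(neg) - signal) / (mean(neg) - mean(pos))
```

A Z'-factor near 1 indicates well-separated controls, which bears on the question “Was this screen successful?”. The distribution of percent-inhibition values across a plate is one input to choosing active/inactive cutoffs.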

Example HTS workflow: finding cell-protein relationships

A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)

The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.

Similar structures to the ligand can be browsed using client portlets.

Similar structures are filtered for druggability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.

Docking results and activity patterns are fed into R services for building of activity models and correlations

Least Squares Regression

Random Forests

Neural Nets
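Of the three model types named above, least squares regression is the simplest to show. The sketch below fits a straight line relating one variable (for instance a docking score) to another (for instance measured activity). It is written in plain Python as a stand-in for the R services the slides mention, and the function name `fit_line` is an assumption.

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)  # spread of x about its mean
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # co-variation
    slope = sxy / sxx
    return slope, my - slope * mx
```

The slope and intercept minimize the sum of squared residuals. The function assumes at least two distinct x values, since otherwise `sxx` is zero.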

Varuna environment for molecular modeling (Baik, IU)

[Diagram: Researcher; Chemical Concepts; Experiments; Papers etc.; Simulation Service (FORTRAN Code, Scripts); QM Database; QM/MM Database; Reaction DB; DB Service (Queries, Clustering, Curation, etc.); PubChem, PDB, NCI, etc.; ChemBioGrid; Condor “Flocks”; TeraGrid Supercomputers]

Methods Development at the CICC

Tagging methods for web-based annotation exploiting del.icio.us and Connotea

Development of QSAR model interpretability and applicability methods

RNN-Profiles for exploration of chemical spaces

VisualiSAR - SAR through visual analysis

• See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

Visual Similarity Matrices for High Volume Datasets

• See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php

Fast, accurate clustering using parallel Divisive K-means

Mapping of Natural Language queries to use cases and workflows

Advanced data mining models for drug discovery information

Structure of Proposal

a) Define audience that we are targeting

b) Cyberinfrastructure Framework with Key services: Registry, Computing, portal, workflow

• Exemplar Chemoinformatics Services

• Exemplar workflows using services

• WSDL defined for key cases to allow others to contribute

• Tutorial

c) Education

d) IT/Cyber-enhanced Computational Chemistry

e) Cheminformatics Research

• Systems

• Tools and Modeling

Questions

We expect to respond to “big” NIH RFP in about 4 months

Should we partner with Michigan?

Who is “customer” and how do we get more?

• Do/Should chemists want our or more generally NIH’s product?

• Interactions with “large” and “small” industry

What is balance between infrastructure, computational chemistry, Cheminformatics tools and research, chemical informatics systems and interfaces?

Should we stress literature (OSCAR3) project?

Balance of applications and generic capabilities?

How should we structure education component?

• Field does not have strong student appeal compared to Bioinformatics

We are strong in Computer Sciences (Grids/Cyberinfrastructure) but doubtful if any CS reviewers

• We are strong in Cheminformatics systems but it is not clear this is a recognized activity, and how do we justify the claim that Grids/Cyberinfrastructure/Open Access are “good”?

Should we link more with biology?

Covering our bases: Who are our “Customers”?

"Classical Chemical Informatics" - Contents: Structure-Based Drug Design; Generation, Curation and Refinement of Protein-Ligand Interactions; Docking, Homology Modeling, QSAR

"New Areas to Conquer": Chemical Literature Processing; Cellular Pharmacokinetics; Traditional Chemical Research fields that were so far not reached by Informatics

Cyberinfrastructure: Webservices, Workflows, HTS-Tools, new DBs

INDIANA-MICHIGAN Chemical Informatics Center

Rest of the World

Cheminfo-Aware Science Community

Cheminfo-Ignorant Science Community

What do we need to conquer the traditional chemical Research Community?

PubChem; other DB's

Chemist

- only interest in a small subset.

- want more DATA on this small set.

Computational Tools, In-house DB's

- High-Fidelity Structural Data, Redox Potentials, Spectroscopy, Transition State Structures, Energies, Molecular Orbitals…..

“Departments” of the future Center

Computer Science: Develop scalable, robust and efficient Containers & Cyberinfrastructure

Informatics: Develop new services, data structures, algorithms, tools

Infrastructure/Technology Developers and Providers

Build Cyberinfrastructure, design databases, workflow, support Web services with interface standards, wrap codes as services; Support infrastructure

Medicinal Chemistry: Develop new models, produce new scientific concepts, new methods

Chemistry: Conquer new fields, increase the information content

Application Scientists (Customers)

Core group develops requirements for infrastructure and codes as services and tests infrastructure with key exemplar projects. Allow broad use by all
