Gregory Landrum, NIBR IT, Novartis Institutes for BioMedical Research, Basel
5th KNIME Users Group Meeting
Zurich, 2 February 2012
KNIME in NIBR: Stories from Industry
Basel, Switzerland
KNIME in NIBR
§ Infrastructure
§ Node development
• Open-source & in-house
• Sponsored
§ Examples
Infrastructure
§ Enterprise servers + cluster integration running in Cambridge, Basel
§ Standard releases for Windows, Linux, Mac
§ Nightly builds for users comfortable on the bleeding/leading edge
Node development : open source
§ Chemistry nodes based on the RDKit
• Open-source cheminformatics toolkit
• Usable from C++, Python, Java
• NIBR scientists/developers actively participate
• www.rdkit.org
§ Standard cheminformatics tasks + some nice extras
§ Developed both in-house and together with knime.com
Node development : in house
§ Connections to internal data sources
§ Wrappers around in-house developed algorithms
§ Connection to our web service framework for cheminformatics services
Generic CIx service node
Sponsored node development
§ Modifications to naïve Bayes nodes to support fingerprints
§ Fingerprint naïve Bayes supporting unbalanced datasets
§ Database schema browser
§ Improvements to Python integration
§ Improvements to database connector, readers
§ Ensemble tree classifier (in progress)
Case studies
Combining databases
§ Question: what kind of activity might I expect to see for a given compound?
§ Do a similarity search in our database of internal compounds
§ Look up assays where those compounds have been tested
§ More browsing of those results: where are those neighbors most active?
p(Activity) > 8
§ More browsing of those results: show me the most active neighbors
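The lookup described in these bullets can be sketched in plain Python. This is only an illustration: the compound IDs, assay names and values are invented, and reading p(Activity) as a value where higher means more active is an assumption, not something the slides state.

```python
from collections import defaultdict

# Hypothetical similarity-search hits: (compound_id, similarity to the query).
neighbors = [("CPD-1", 0.91), ("CPD-2", 0.84), ("CPD-3", 0.78)]

# Hypothetical assay results: (compound_id, assay, p_activity).
assay_results = [
    ("CPD-1", "assay_A", 8.6),
    ("CPD-1", "assay_B", 5.2),
    ("CPD-2", "assay_A", 8.1),
    ("CPD-3", "assay_C", 9.0),
    ("CPD-9", "assay_A", 9.5),  # not a neighbor, so it is ignored
]


def active_assays(neighbors, assay_results, threshold=8.0):
    """Group neighbors whose p_activity exceeds the threshold by assay."""
    neighbor_ids = {cid for cid, _ in neighbors}
    by_assay = defaultdict(list)
    for cid, assay, p_act in assay_results:
        if cid in neighbor_ids and p_act > threshold:
            by_assay[assay].append((cid, p_act))
    # Most active neighbor first within each assay.
    return {a: sorted(hits, key=lambda h: -h[1]) for a, hits in by_assay.items()}


summary = active_assays(neighbors, assay_results)
```

Here `summary` maps each assay to the neighbors that pass the p(Activity) > 8 cutoff, which corresponds to asking where the neighbors are most active.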
Parallel virtual screening example
§ Goal: find some interesting compounds to be screened for a new project
§ 2D similarity searches across two databases:
• NIBR powder archive
• Catalogs from trusted vendors
§ About 7 million compounds total.
§ Use several different fingerprints
Finton Sirockin (GDC/CADD)
The basic process
§ Generate fingerprints for database and queries
§ Calculate similarities with the Erlwood Fingerprint Similarity node
§ Sort, filter, standardize
§ Report
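A minimal sketch of these four steps, assuming fingerprints are represented as sets of on-bit indices and that similarity is the Tanimoto coefficient. The slides do not name the metric used by the similarity node, so Tanimoto is an assumption here, and the toy fingerprints and the `screen` helper are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def screen(query_fp, database, min_similarity=0.5):
    """Return (compound_id, similarity) pairs above the cutoff, best first."""
    scored = ((cid, tanimoto(query_fp, fp)) for cid, fp in database.items())
    hits = [(cid, sim) for cid, sim in scored if sim >= min_similarity]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints: sets of on-bit indices (invented values).
database = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 5},
    "C": {7, 8, 9},
}
query = {1, 2, 3}
report = screen(query, database)
```

With these toy values, compound A shares three of four bits with the query and B two of four, so both pass the 0.5 cutoff while C is filtered out.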
Combining the pieces
• Workflow is run for each query
• Fingerprints calculated for each type of search
- 600 – 11 000 s
- Need to be calculated only once, even for n queries
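The point about calculating database fingerprints only once can be illustrated with a call counter. The `fingerprint` function below is a stand-in (the set of characters in a string, not real chemistry), and the strings are arbitrary examples rather than compounds from the slides.

```python
calls = 0


def fingerprint(smiles):
    """Stand-in fingerprint: the set of characters in the string."""
    global calls
    calls += 1
    return frozenset(smiles)


def overlap(fp_a, fp_b):
    """Tanimoto-style overlap of two feature sets."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


database = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
queries = ["CCO", "CCCl", "c1ccncc1"]

# Database fingerprints are computed once and shared by every query run.
db_fps = {smi: fingerprint(smi) for smi in database}

best = {}
for q in queries:
    q_fp = fingerprint(q)
    best[q] = max(db_fps, key=lambda smi: overlap(q_fp, db_fps[smi]))
```

The counter ends at `len(database) + len(queries)` rather than `len(database) * len(queries)`, which is why the expensive fingerprinting step is amortized over all n queries.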
Cluster usage reporting
§ Present a dashboard with a comprehensive view of current and historical usage of our HPC cluster infrastructure
§ Three phases of processing:
• Input from raw SGE files off of the clusters at each site
• Steps A–C: data pre-processing, filtering & date-time object conversion
- All logs are gathered into a single file kept in RAM
- Java nodes convert Unix time to KNIME date objects
- Bash nodes run awk manipulations, which are faster natively on Linux
• Steps D–I: execute concurrently
- KNIME Statistics and grouping nodes are heavily used
- Step H spawns cluster jobs to gather user usage statistics
§ Present summarized and aggregated data using Spotfire
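The pre-processing and grouping steps can be sketched as follows. The record layout `user:start_epoch:end_epoch` is a simplified, hypothetical stand-in (real SGE accounting lines carry many more fields), and the user names and timestamps are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified records: "user:start_epoch:end_epoch".
raw_lines = [
    "alice:1328140800:1328144400",
    "bob:1328148000:1328155200",
    "alice:1328227200:1328229000",
]


def to_datetime(epoch_text):
    """Convert Unix seconds (as text) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch_text), tz=timezone.utc)


def parse(line):
    """Split one record into user, start time and end time."""
    user, start, end = line.split(":")
    return user, to_datetime(start), to_datetime(end)


def usage_by_user_and_day(lines):
    """Sum wall-clock hours per (user, start date)."""
    totals = defaultdict(float)
    for line in lines:
        user, start, end = parse(line)
        hours = (end - start).total_seconds() / 3600
        totals[(user, start.date().isoformat())] += hours
    return dict(totals)


usage = usage_by_user_and_day(raw_lines)
```

The resulting dictionary is the kind of small summary table that could be written to a CSV file for a dashboard to read.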
Mike Derby (NIBR IT) Varun Shivashankar (NIBR IT)
The workflow
• Usage data input: original logs, 2–4 GB each, from 4 clusters
• Resulting file of summarized data: user_usage_DUS.csv (1.9M)
The complexity
The report: historical data
The dashboard
Written out to a UNC path and read every few minutes by Spotfire Server. Data are generated either from scripts or by KNIME running headless.
Predicting which target a molecule will hit
§ Goal: build a model to predict which of a set of targets a molecule is most likely to hit
§ Method: RDKit atom-pair fingerprints and a new KNIME learner that builds ensembles of truncated decision trees (sponsored development with knime.com)
§ Validation data set: active molecules from 50 different ChEMBL assays [1]
[1] Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 51, 1831-1839 (2011).
Predicting which target a molecule will hit
§ 11561 data points, 50 classes
§ 50 trees, random descriptor selection
About that scaling…
Predicting which target a molecule will hit
§ 11561 data points, 50 classes
§ 50 trees, random descriptor selection
§ out-of-bag prediction error: 5.8%
§ mean error from cross validation: 4.2%
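To make the out-of-bag figure concrete, here is a minimal pure-Python sketch of a bagged ensemble with random descriptor selection. It is not the KNIME learner: depth-1 trees (stumps) stand in for "truncated" trees, the bit-vector rows and class labels are invented, and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels, default=None):
    """Most common label, or the default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(rows, n_candidates, rng):
    """Depth-1 tree: best single bit among a random subset of descriptors."""
    n_features = len(rows[0][0])
    fallback = majority([y for _, y in rows])
    best = None
    for f in rng.sample(range(n_features), n_candidates):
        off = majority([y for x, y in rows if x[f] == 0], fallback)
        on = majority([y for x, y in rows if x[f] == 1], fallback)
        errors = sum((on if x[f] else off) != y for x, y in rows)
        if best is None or errors < best[0]:
            best = (errors, f, off, on)
    _, f, off, on = best
    return lambda x: on if x[f] else off


def fit_ensemble(rows, n_trees=50, n_candidates=2, seed=0):
    """Bagged stumps plus the out-of-bag (OOB) error estimate."""
    rng = random.Random(seed)
    trees, oob_votes = [], [Counter() for _ in rows]
    for _ in range(n_trees):
        picked = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        tree = fit_stump([rows[i] for i in picked], n_candidates, rng)
        trees.append(tree)
        # Rows this tree never saw vote on their own label.
        for i in set(range(len(rows))) - set(picked):
            oob_votes[i][tree(rows[i][0])] += 1
    scored = [(v.most_common(1)[0][0], rows[i][1])
              for i, v in enumerate(oob_votes) if v]
    oob_error = (sum(p != y for p, y in scored) / len(scored)
                 if scored else float("nan"))
    return trees, oob_error


def predict(trees, x):
    """Majority vote over all trees."""
    return Counter(t(x) for t in trees).most_common(1)[0][0]


# Invented toy data: 4-bit fingerprints with two classes.
rows = [((1, 1, 0, 1), "class_a"), ((1, 1, 1, 0), "class_a"),
        ((1, 0, 0, 0), "class_a"), ((1, 1, 1, 1), "class_a"),
        ((0, 0, 1, 0), "class_b"), ((0, 0, 0, 1), "class_b"),
        ((0, 1, 1, 1), "class_b"), ((0, 0, 0, 0), "class_b")]
trees, oob_error = fit_ensemble(rows)
label = predict(trees, (1, 0, 1, 1))
```

Each row is scored only by trees whose bootstrap sample did not contain it, so the OOB error is an internal estimate that needs no separate hold-out set, whereas cross validation refits the model on each fold.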
Predicting which target a molecule will hit
§ mistakes tend to be in families
Drilling into the confusion matrix
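One way to check whether mistakes cluster in families is to count misclassifications that stay within a family. The sketch below assumes "families" means groups of related targets; the target names, family mapping and predictions are all invented for illustration.

```python
from collections import Counter

# Hypothetical target -> family mapping.
family = {
    "kinase_1": "kinase", "kinase_2": "kinase",
    "gpcr_1": "GPCR", "gpcr_2": "GPCR",
}

# Hypothetical (true target, predicted target) pairs.
pairs = [
    ("kinase_1", "kinase_1"), ("kinase_1", "kinase_2"),
    ("kinase_2", "kinase_2"), ("gpcr_1", "gpcr_2"),
    ("gpcr_1", "gpcr_1"), ("gpcr_2", "kinase_1"),
]

# Confusion matrix as a sparse mapping: (true, predicted) -> count.
confusion = Counter(pairs)

mistakes = [(t, p) for t, p in pairs if t != p]
within_family = sum(family[t] == family[p] for t, p in mistakes)
fraction_within = within_family / len(mistakes)
```

A high `fraction_within` would indicate that most errors confuse a target with a close relative rather than with an unrelated one.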
Acknowledgements
§ NIBR
• John Davies (CPC)
• Richard Lewis (GDC)
• Steve Litster (NIBR IT)
• Andy Palmer (NIBR IT)
• Patrick Warren (NIBR IT)
• Case studies
- Finton Sirockin (GDC)
- Mike Derby (NIBR IT)
- Varun Shivashankar (NIBR IT)
- John Davies (CPC)
• Node development
- Manuel Schwarze (NIBR IT)
- Dillip Kumar Mohanty (NIBR IT)
- Sudip Ghosh (NIBR IT)
• Marc Litherland (NIBR IT)
§ knime.com
• Michael Berthold
• Bernd Wiswedel
• Thorsten Meinl
• Peter Ohl
§ Simon Richards (Lilly)
Teach • Discover • Treat
the power of collaborative efforts
Join the Teach-Discover-Treat initiative: participate in our symposium* and compete on one or more challenges!
*ACS Spring Meeting, March 25th, 1:30pm to 5:00pm, San Diego Convention Center, Room 26A
Goal: Provide high quality computational chemistry tutorials that impact education and drug discovery for neglected diseases
§ Requirements: use freely available software tools; datasets will be provided with a focus on targets for neglected diseases
§ Criteria to judge: quality of the model (statistical measures), clarity of the tutorial (suitable for undergraduate course), innovative application of computational technique(s)
§ Awards: travel awards to cover travel expenses for presenting work at COMP symposium
§ Presentation of Awardees at ACS Spring 2013 meeting (New Orleans)
More information and access to data sets coming in March. Bookmark www.teach-discover-treat.org