KNIME in NIBR: Stories from Industry

Gregory Landrum, NIBR IT, Novartis Institutes for BioMedical Research, Basel

5th KNIME Users Group Meeting

Zurich, 2 February 2012

Basel, Switzerland

KNIME in NIBR

§  Infrastructure

§  Node development
   •  Open-source & in-house
   •  Sponsored

§  Examples

Infrastructure

§  Enterprise servers + cluster integration running in Cambridge, Basel

§  Standard releases for Windows, Linux, Mac

§  Nightly builds for users comfortable on the leading (bleeding) edge

Node development: open source

§  Chemistry nodes based on the RDKit
   •  open-source cheminformatics toolkit
   •  usable from C++, Python, Java
   •  NIBR scientists/developers actively participate
   •  www.rdkit.org

§  Standard cheminformatics tasks + some nice extras

§  Developed both in-house and together with knime.com

Node development: in house

§  Connections to internal data sources

§  Wrappers around in-house developed algorithms

§  Connection to our web service framework for cheminformatics services

Generic CIx service node

Sponsored node development

§  Modifications to naïve Bayes nodes to support fingerprints

§  Fingerprint naïve Bayes supporting unbalanced datasets

§  Database schema browser

§  Improvements to python integration

§  Improvements to database connector, readers

§  Ensemble tree classifier (in progress)

Case studies

Combining databases

§  Question: what kind of activity might I expect to see for a given compound?

§  Do a similarity search in our database of internal compounds

§  Look up assays where those compounds have been tested

§  More browsing of those results: where are those neighbors most active?

p(Activity) > 8

§  More browsing of those results: show me the most active neighbors
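The lookup chain above (similarity search, assay lookup, activity filter) can be sketched in a few lines of Python. This is only an illustration, not the KNIME workflow itself: the compound IDs, assay IDs and values are invented, the two lists stand in for the internal databases, and the threshold of 8 assumes that the "p(Activity) > 8" label refers to a cutoff on a per-measurement activity value.

```python
# Hypothetical stand-ins for the two databases; every ID and value is invented.
neighbors = [  # (compound_id, similarity to the query) from a similarity search
    ("CPD-1", 0.91),
    ("CPD-2", 0.84),
    ("CPD-3", 0.77),
]
assay_results = [  # (compound_id, assay_id, activity value)
    ("CPD-1", "ASSAY-A", 8.4),
    ("CPD-1", "ASSAY-B", 6.1),
    ("CPD-2", "ASSAY-A", 8.9),
    ("CPD-3", "ASSAY-C", 7.2),
    ("CPD-9", "ASSAY-A", 9.5),  # not a neighbor of the query, so it is ignored
]


def most_active_neighbors(neighbors, assay_results, threshold=8.0):
    """Join neighbors with assay results; keep rows with activity > threshold."""
    similarity = dict(neighbors)
    hits = [
        (cid, assay, activity, similarity[cid])
        for cid, assay, activity in assay_results
        if cid in similarity and activity > threshold
    ]
    # Most active measurements first
    return sorted(hits, key=lambda row: row[2], reverse=True)


def assays_by_hit_count(hits):
    """Count how many active neighbors each assay has."""
    counts = {}
    for _, assay, _, _ in hits:
        counts[assay] = counts.get(assay, 0) + 1
    return counts


hits = most_active_neighbors(neighbors, assay_results)
for row in hits:
    print(row)
print(assays_by_hit_count(hits))
```

Counting hits per assay answers "where are those neighbors most active?", while the sorted rows answer "show me the most active neighbors".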

Parallel virtual screening example

§  Goal: find some interesting compounds to be screened for a new project

§  2D similarity searches across two databases:
   •  NIBR powder archive
   •  Catalogs from trusted vendors

§  About 7 million compounds total.

§  Use several different fingerprints

Finton Sirockin (GDC/CADD)

The basic process

§  Generate fingerprints for database and queries

§  Calculate similarities with the Erlwood Fingerprint Similarity node

§  Sort, filter, standardize

§  Report
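As a rough illustration of the similarity, sort and filter steps, the sketch below scores a query against a tiny database. It assumes the Tanimoto coefficient as the similarity measure (the slides do not say which measure was used) and represents each fingerprint as a set of on-bit indices; the molecule names, bit sets and the `screen` helper are invented for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def screen(query_fp, database, min_similarity=0.5, top_n=10):
    """Score every database entry, filter by similarity, sort best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    kept = [row for row in scored if row[1] >= min_similarity]
    kept.sort(key=lambda row: row[1], reverse=True)
    return kept[:top_n]


# Invented toy fingerprints
database = {
    "mol-1": {1, 2, 3, 4},
    "mol-2": {1, 2, 7},
    "mol-3": {8, 9},
}
query = {1, 2, 3}

result = screen(query, database)
print(result)  # [('mol-1', 0.75), ('mol-2', 0.5)]
```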

Combining the pieces

• Workflow is run for each query

• Fingerprints calculated for each type of search
   -  Takes 600 to 11,000 s
   -  Needs to be calculated only once, even for n queries
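The "calculate once, reuse for n queries" point can be sketched with a small cache. Everything here is hypothetical: `FingerprintStore` is an illustrative helper, and `toy_fingerprint` is a placeholder (the set of distinct characters in a string), not a real chemical fingerprint.

```python
class FingerprintStore:
    """Caches database fingerprints so each one is computed only once."""

    def __init__(self, fingerprint_fn):
        self._fn = fingerprint_fn
        self._cache = {}
        self.computations = 0  # number of times the expensive step actually ran

    def get(self, mol_id, structure):
        if mol_id not in self._cache:
            self._cache[mol_id] = self._fn(structure)
            self.computations += 1
        return self._cache[mol_id]


def toy_fingerprint(structure):
    """Placeholder fingerprint: the set of distinct characters in the string."""
    return frozenset(structure)


def overlap(fp_a, fp_b):
    """Shared features divided by total distinct features."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


database = {"mol-1": "CCO", "mol-2": "CCN", "mol-3": "c1ccccc1"}
queries = ["CCO", "CN"]

store = FingerprintStore(toy_fingerprint)
best = {}
for query in queries:
    query_fp = toy_fingerprint(query)
    scores = {
        mol_id: overlap(query_fp, store.get(mol_id, structure))
        for mol_id, structure in database.items()
    }
    best[query] = max(scores, key=scores.get)

print(best)
print(store.computations)  # database fingerprints computed: 3, not 6
```

With two queries against three database entries, the expensive step runs three times rather than six; the saving grows with the number of queries.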

Cluster usage reporting

§  Present a dashboard with a comprehensive view of current and historical usage of our HPC cluster infrastructure

§  Three phases of processing:
   •  Input from raw SGE files off the clusters at each site
   •  Steps A-C: data pre-processing, filtering & date-time object conversion
      -  All logs are gathered into a single file kept in RAM
      -  Java nodes convert Unix time to KNIME date objects
      -  Bash nodes for awk manipulations, which are faster natively in Linux
   •  Steps D-I: execute concurrently
      -  KNIME Statistics and grouping are heavily used
      -  Step H spawns cluster jobs to gather user usage statistics

§  Present summarized and aggregated data using Spotfire
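The time conversion and grouping steps can be illustrated outside KNIME with the Python standard library. The record layout below ("user,start_epoch,end_epoch") is a simplified, hypothetical stand-in for the raw SGE logs, and the user names and timestamps are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified job records: "user,start_epoch,end_epoch"
RAW_LOG = """\
alice,1328140800,1328144400
bob,1328148000,1328149800
alice,1328227200,1328234400
"""


def parse_jobs(text):
    """Yield (user, start, end) with Unix timestamps converted to UTC datetimes."""
    for line in text.splitlines():
        if not line.strip():
            continue
        user, start, end = line.split(",")
        start_dt = datetime.fromtimestamp(int(start), tz=timezone.utc)
        end_dt = datetime.fromtimestamp(int(end), tz=timezone.utc)
        yield user, start_dt, end_dt


def wallclock_hours_by_user_and_day(jobs):
    """Sum wall-clock hours per (user, start date)."""
    totals = defaultdict(float)
    for user, start_dt, end_dt in jobs:
        hours = (end_dt - start_dt).total_seconds() / 3600
        totals[(user, start_dt.date().isoformat())] += hours
    return dict(totals)


usage = wallclock_hours_by_user_and_day(parse_jobs(RAW_LOG))
for key, hours in sorted(usage.items()):
    print(key, hours)
```

Grouping by start date is a simplification; a job that spans midnight is counted entirely on the day it started.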

Mike Derby (NIBR IT) Varun Shivashankar (NIBR IT)

The workflow

•  Usage data input: original logs, 2 GB to 4 GB in size, × 4 clusters

•  Resulting file of summarized data: user_usage_DUS.csv, 1.9M

The complexity

The report: historical data

The dashboard

Written out to a UNC path and read every few minutes by Spotfire Server. Data is generated either from scripts or by KNIME running headless.
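When one process writes a file that another process polls, a common precaution is to write to a temporary file and then swap it into place, so the reader never sees a half-written file. The sketch below shows that pattern; it is an assumption about how such a hand-off could be done, not a description of the actual setup. `publish_csv`, the file name and the rows are invented, and the demo writes to a local temporary directory rather than a UNC path. `os.replace` is atomic on a single local POSIX filesystem; behavior on network shares may differ.

```python
import csv
import os
import tempfile


def publish_csv(rows, header, target_path):
    """Write to a temporary file next to the target, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # Readers see either the old file or the complete new one
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway directory (stand-in for the shared path)
demo_dir = tempfile.mkdtemp()
demo_target = os.path.join(demo_dir, "usage_summary.csv")
publish_csv([("alice", 3.0), ("bob", 0.5)], ("user", "hours"), demo_target)
```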

Predicting which target a molecule will hit

§  Goal: build a model to predict which of a set of targets a molecule is most likely to hit

§  Method: RDKit atom-pair fingerprints and a new KNIME learner that builds ensembles of truncated decision trees (sponsored development with knime.com)

§  Validation data set: active molecules from 50 different ChEMBL assays [1]

[1] Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 51, 1831-1839 (2011).


§  11561 data points, 50 classes

§  50 trees, random descriptor selection

About that scaling…


§  11561 data points, 50 classes

§  50 trees, random descriptor selection

§  out-of-bag prediction error: 5.8%

§  mean error from cross validation: 4.2%
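For readers unfamiliar with out-of-bag error, here is a minimal pure-Python sketch of a bagged ensemble of depth-limited ("truncated") trees on binary features. It is not the KNIME learner: the Gini split criterion, the per-split random feature choice, the parameter names (`max_depth`, `n_try`) and the toy data are all assumptions made for illustration. Each tree is trained on a bootstrap sample, and each training row is scored only by the trees that did not see it.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth, max_depth, n_try, rng):
    """Grow a tree truncated at max_depth, trying n_try random features per split."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    n_features = len(rows[0])
    for f in rng.sample(range(n_features), min(n_try, n_features)):
        on = [lab for row, lab in zip(rows, labels) if row[f]]
        off = [lab for row, lab in zip(rows, labels) if not row[f]]
        if not on or not off:
            continue  # this feature does not split the data
        score = (len(on) * gini(on) + len(off) * gini(off)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority
    f = best[1]
    on_idx = [i for i, row in enumerate(rows) if row[f]]
    off_idx = [i for i, row in enumerate(rows) if not row[f]]
    return (
        f,
        build_tree([rows[i] for i in on_idx], [labels[i] for i in on_idx],
                   depth + 1, max_depth, n_try, rng),
        build_tree([rows[i] for i in off_idx], [labels[i] for i in off_idx],
                   depth + 1, max_depth, n_try, rng),
    )


def predict(tree, row):
    # Internal nodes are (feature, on_branch, off_branch); leaves are labels
    while isinstance(tree, tuple):
        f, on_branch, off_branch = tree
        tree = on_branch if row[f] else off_branch
    return tree


def fit_ensemble(rows, labels, n_trees=50, max_depth=3, n_try=2, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = build_tree([rows[i] for i in sample], [labels[i] for i in sample],
                          0, max_depth, n_try, rng)
        trees.append(tree)
        for i in set(range(n)) - set(sample):  # rows this tree never saw
            oob_votes[i][predict(tree, rows[i])] += 1
    scored = [(votes.most_common(1)[0][0], labels[i])
              for i, votes in enumerate(oob_votes) if votes]
    oob_error = (sum(p != t for p, t in scored) / len(scored)
                 if scored else float("nan"))
    return trees, oob_error


def predict_ensemble(trees, row):
    return Counter(predict(t, row) for t in trees).most_common(1)[0][0]


# Invented toy data: three classes, six binary "fingerprint" bits
rows = [(1, 1, 0, 0, 0, 0)] * 4 + [(0, 0, 1, 1, 0, 0)] * 4 + [(0, 0, 0, 0, 1, 1)] * 4
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

trees, oob_error = fit_ensemble(rows, labels)
print("out-of-bag error:", oob_error)
print("prediction for first row:", predict_ensemble(trees, rows[0]))
```

Because out-of-bag rows were not used to train the trees that score them, the estimate comes for free during training, without a separate cross-validation run.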


§  mistakes tend to be in families

Drilling into the confusion matrix

Acknowledgements

§  NIBR
   •  John Davies (CPC)
   •  Richard Lewis (GDC)
   •  Steve Litster (NIBR IT)
   •  Andy Palmer (NIBR IT)
   •  Patrick Warren (NIBR IT)
   •  Case studies
      -  Finton Sirockin (GDC)
      -  Mike Derby (NIBR IT)
      -  Varun Shivashankar (NIBR IT)
      -  John Davies (CPC)
   •  Node development
      -  Manuel Schwarze (NIBR IT)
      -  Dillip Kumar Mohanty (NIBR IT)
      -  Sudip Ghosh (NIBR IT)
   •  Marc Litherland (NIBR IT)

§  knime.com
   •  Michael Berthold
   •  Bernd Wiswedel
   •  Thorsten Meinl
   •  Peter Ohl

§  Simon Richards (Lilly)

Teach • Discover • Treat

the power of collaborative efforts

Join the Teach-Discover-Treat initiative: participate in our symposium* and compete on one or more challenges!

*ACS Spring Meeting, March 25th, 1:30pm to 5:00pm, San Diego Convention Center, Room 26A

Goal: Provide high-quality computational chemistry tutorials that impact education and drug discovery for neglected diseases

§  Requirements: use freely available software tools; datasets will be provided with a focus on targets for neglected diseases

§  Criteria to judge: quality of the model (statistical measures), clarity of the tutorial (suitable for undergraduate course), innovative application of computational technique(s)

§  Awards: travel awards to cover travel expenses for presenting work at COMP symposium

§  Presentation of Awardees at ACS Spring 2013 meeting (New Orleans)

More information and access to data sets coming in March. Bookmark www.teach-discover-treat.org
