Gregory Landrum, NIBR IT, Novartis Institutes for BioMedical Research, Basel, Switzerland. 5th KNIME Users Group Meeting, Zurich, 2 February 2012. KNIME in NIBR: Stories from Industry.

KNIME in NIBR Stories from Industry

Page 1: KNIME in NIBR Stories from Industry

Gregory Landrum NIBR IT Novartis Institutes for BioMedical Research, Basel

5th KNIME Users Group Meeting

Zurich, 2 February 2012

KNIME in NIBR: Stories from Industry

Basel, Switzerland

Page 2: KNIME in NIBR Stories from Industry

KNIME in NIBR

§  Infrastructure

§  Node development
   •  Open-source & in-house
   •  Sponsored

§  Examples

Page 3: KNIME in NIBR Stories from Industry

Infrastructure

§  Enterprise servers + cluster integration running in Cambridge, Basel

§  Standard releases for Windows, Linux, Mac

§  Nightly builds for users comfortable on the bleeding (leading) edge

Page 4: KNIME in NIBR Stories from Industry

Node development : open source

§  Chemistry nodes based on the RDKit
   •  open-source cheminformatics toolkit
   •  usable from C++, Python, Java
   •  NIBR scientists/developers actively participate
   •  www.rdkit.org

§  Standard cheminformatics tasks + some nice extras

§  Developed both in-house and together with knime.com

Page 5: KNIME in NIBR Stories from Industry

Node development : in house

§  Connections to internal data sources

§  Wrappers around in-house developed algorithms

§  Connection to our web service framework for cheminformatics services

Page 6: KNIME in NIBR Stories from Industry

Generic CIx service node

Page 7: KNIME in NIBR Stories from Industry

Sponsored node development

§  Modifications to naïve Bayes nodes to support fingerprints

§  Fingerprint naïve Bayes supporting unbalanced datasets

§  Database schema browser

§  Improvements to Python integration

§  Improvements to database connector, readers

§  Ensemble tree classifier (in progress)

Page 8: KNIME in NIBR Stories from Industry

Case studies

Page 9: KNIME in NIBR Stories from Industry

Combining databases

§  Question: what kind of activity might I expect to see for a given compound?

§  Do a similarity search in our database of internal compounds

§  Look up assays where those compounds have been tested
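The two steps on this slide (similarity search, then assay lookup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the compound IDs, fingerprints, assay names, and activity values are invented toy data, fingerprints are modelled as plain sets of on-bits, and the cutoff of 8 simply mirrors the p(Activity) > 8 filter used in the browsing slides. The real workflow queries internal databases through KNIME nodes.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical compound database: compound id -> set of on-bits.
COMPOUNDS = {
    "cmpd-1": {1, 2, 3, 4},
    "cmpd-2": {1, 2, 3, 5},
    "cmpd-3": {7, 8, 9},
}

# Hypothetical assay table: (compound id, assay name, pActivity).
ASSAY_RESULTS = [
    ("cmpd-1", "assay-A", 8.3),
    ("cmpd-2", "assay-A", 6.1),
    ("cmpd-2", "assay-B", 8.9),
    ("cmpd-3", "assay-C", 9.0),
]


def find_neighbors(query_fp, threshold):
    """Step 1: similarity search, keeping compounds at or above the threshold."""
    scored = ((cid, tanimoto(query_fp, fp)) for cid, fp in COMPOUNDS.items())
    return {cid: sim for cid, sim in scored if sim >= threshold}


def assays_for(compound_ids):
    """Step 2: look up the assays in which those compounds were tested."""
    by_assay = defaultdict(list)
    for cid, assay, p_activity in ASSAY_RESULTS:
        if cid in compound_ids:
            by_assay[assay].append((cid, p_activity))
    return dict(by_assay)


query_fp = {1, 2, 3, 6}
neighbors = find_neighbors(query_fp, threshold=0.5)
assays = assays_for(neighbors)

# Follow-up browsing: keep only neighbor results with pActivity above 8.
highly_active = [
    (cid, assay, p)
    for assay, rows in assays.items()
    for cid, p in rows
    if p > 8
]
```

Keeping the search and the lookup as separate functions mirrors the way the two databases are combined: the neighbor IDs are the only thing passed from one step to the next.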

Page 10: KNIME in NIBR Stories from Industry

Combining databases

§  More browsing of those results: where are those neighbors most active?

p(Activity) > 8

Page 11: KNIME in NIBR Stories from Industry

Combining databases

§  More browsing of those results: show me the most active neighbors

p(Activity) > 8

Page 12: KNIME in NIBR Stories from Industry

Parallel virtual screening example

§  Goal: find some interesting compounds to be screened for a new project

§  2D similarity searches across two databases:
   •  NIBR powder archive
   •  Catalogs from trusted vendors

§  About 7 million compounds total.

§  Use several different fingerprints

Finton Sirockin (GDC/CADD)

Page 13: KNIME in NIBR Stories from Industry

The basic process

§  Generate fingerprints for database and queries

§  Calculate similarities with the Erlwood Fingerprint Similarity node

§  Sort, filter, standardize

§  Report
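As a rough illustration of the four steps above, the following Python sketch generates fingerprints, scores each database compound against the queries, then sorts, filters, and prints a small report. Everything here is an assumption made for the example: `toy_fingerprint` hashes character n-grams of a SMILES string and is only a stand-in for a real 2D fingerprint, the compounds are made up, and the plain Tanimoto function takes the place of the Erlwood Fingerprint Similarity node used in the actual workflow.

```python
import zlib


def toy_fingerprint(smiles, n_bits=1024, width=3):
    """Toy stand-in for a real fingerprint: hash character n-grams to bit positions."""
    last_start = max(1, len(smiles) - width + 1)
    return frozenset(
        zlib.crc32(smiles[i:i + width].encode("ascii")) % n_bits
        for i in range(last_start)
    )


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def screen(queries, database, threshold=0.3, top_n=10):
    """Score every database compound by its best similarity to any query."""
    query_fps = [toy_fingerprint(q) for q in queries]
    rows = []
    for cid, smiles in database.items():
        fp = toy_fingerprint(smiles)
        best = max(tanimoto(fp, q) for q in query_fps)
        if best >= threshold:                    # filter
            rows.append((cid, round(best, 3)))
    rows.sort(key=lambda r: (-r[1], r[0]))       # sort: most similar first, ties by id
    return rows[:top_n]


# Hypothetical database of SMILES strings keyed by compound id.
DATABASE = {
    "cmpd-1": "CCOC(=O)c1ccccc1",
    "cmpd-2": "CCOC(=O)c1ccccc1O",
    "cmpd-3": "CCN(CC)CC",
}

report = screen(["CCOC(=O)c1ccccc1"], DATABASE, threshold=0.0)
for cid, similarity in report:                   # report
    print(f"{cid}\t{similarity:.3f}")
```

Taking the best similarity over all queries is one possible way to merge multi-query results; the slides do not say how the real workflow combines them.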

Page 14: KNIME in NIBR Stories from Industry

Combining the pieces

• Workflow is run for each query

• Fingerprints calculated for each type of search
   -  600 – 11 000 s
   -  Needs to be calculated only once, even for n queries
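The point that database fingerprints need to be calculated only once, even when the workflow runs for n queries, can be illustrated with a small on-disk cache. This is a sketch under assumptions, not the mechanism used in the KNIME workflow: the two fingerprint "types" are trivial placeholders, the cache is a pickle file per fingerprint type, and the function and variable names are invented for the example.

```python
import pickle
import tempfile
from pathlib import Path


def char_fp(smiles):
    """Placeholder fingerprint type 1: the set of characters in the SMILES."""
    return frozenset(smiles)


def pair_fp(smiles):
    """Placeholder fingerprint type 2: the set of adjacent character pairs."""
    return frozenset(zip(smiles, smiles[1:]))


FP_TYPES = {"chars": char_fp, "pairs": pair_fp}


def load_or_compute(cache_dir, fp_type, database):
    """Return (fingerprints, from_cache); compute and store them on first use."""
    path = Path(cache_dir) / f"{fp_type}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh), True
    fps = {cid: FP_TYPES[fp_type](smi) for cid, smi in database.items()}
    with path.open("wb") as fh:
        pickle.dump(fps, fh)
    return fps, False


# Hypothetical database; in the talk this is about 7 million compounds.
DATABASE = {"cmpd-1": "CCO", "cmpd-2": "CCN", "cmpd-3": "c1ccccc1"}

with tempfile.TemporaryDirectory() as cache_dir:
    history = []
    for query in ("CCO", "CCN", "c1ccccc1"):     # the workflow runs once per query
        for fp_type in FP_TYPES:                 # one search per fingerprint type
            _, from_cache = load_or_compute(cache_dir, fp_type, DATABASE)
            history.append((query, fp_type, from_cache))
```

Only the first query pays the fingerprinting cost for each type; later queries load the stored result, which is what makes a 600 to 11 000 s calculation tolerable.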

Page 15: KNIME in NIBR Stories from Industry

Cluster usage reporting

§  Present a dashboard with a comprehensive view of current and historical usage of our HPC cluster infrastructure

§  Three phases of processing:
   •  Input from raw SGE files off of the clusters at each site
   •  Steps A-C: data pre-processing, filtering & date-time object conversion
      -  All logs are gathered into a single file kept in RAM
      -  Use of Java nodes to convert Unix time to KNIME date objects
      -  Bash nodes for awk manipulations, which are faster natively in Linux
   •  Steps D-I: execute concurrently
      -  KNIME Statistics and grouping are heavily used
      -  Step H spawns cluster jobs to gather user usage statistics

§  Present summarized and aggregated data using Spotfire

Mike Derby (NIBR IT) Varun Shivashankar (NIBR IT)
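The pre-processing and grouping steps described on this slide can be sketched in Python as follows. The workflow itself uses Java, Bash/awk, and KNIME grouping nodes, so this is only an analogy. The record layout (`owner:start:end:slots`, with times as Unix epoch seconds) is a simplified, hypothetical format invented for the example and is not the real SGE log format; the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified log records: owner:start_epoch:end_epoch:slots
SAMPLE_LOG = """\
alice:1325376000:1325379600:4
bob:1325380000:1325383600:1
alice:1328054400:1328058000:2
"""


def parse_record(line):
    """Split one record and convert Unix times to timezone-aware datetimes."""
    owner, start, end, slots = line.strip().split(":")
    started = datetime.fromtimestamp(int(start), tz=timezone.utc)
    ended = datetime.fromtimestamp(int(end), tz=timezone.utc)
    return owner, started, ended, int(slots)


def core_hours_by_user_month(lines):
    """Group records by (user, month of job start) and sum wall-clock hours x slots."""
    usage = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        owner, started, ended, slots = parse_record(line)
        hours = (ended - started).total_seconds() / 3600.0
        usage[(owner, started.strftime("%Y-%m"))] += hours * slots
    return dict(usage)


usage = core_hours_by_user_month(SAMPLE_LOG.splitlines())
for (owner, month), core_hours in sorted(usage.items()):
    print(f"{month}  {owner:<6} {core_hours:6.1f} core-hours")
```

Converting to real date objects before grouping is what allows the later steps to aggregate by month, week, or any other calendar unit.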

Page 16: KNIME in NIBR Stories from Industry

The workflow

•  Usage data input file: original logs, 2 GB – 4 GB in size, x 4 clusters

•  Resulting file of summarized data: user_usage_DUS.csv (1.9M)

Page 17: KNIME in NIBR Stories from Industry

The complexity

Page 18: KNIME in NIBR Stories from Industry

The report: historical data

Page 19: KNIME in NIBR Stories from Industry

The dashboard

Data is generated either from scripts or from KNIME running headless, written out to a UNC path, and read every few minutes by Spotfire Server.
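Because the summary file is re-read every few minutes, a writer script may want to avoid exposing a half-written file. One common pattern, sketched below, is to write to a temporary file in the same directory and then rename it over the destination. This is an assumption about how such a script could be written, not a description of the scripts used at NIBR; the function name and column names are hypothetical, and whether the rename is truly atomic on a network share depends on the file system and should be checked for the actual UNC target.

```python
import csv
import os
import tempfile


def publish_csv(rows, header, dest_path):
    """Write rows to a temp file next to dest_path, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        # Replace the previous version in a single rename step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway directory; a real script would target the shared path.
with tempfile.TemporaryDirectory() as out_dir:
    target = os.path.join(out_dir, "user_usage.csv")
    publish_csv([("alice", 4.0), ("bob", 1.0)], ("user", "core_hours"), target)
    with open(target, newline="") as fh:
        published = list(csv.reader(fh))
```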

Page 20: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit

§  Goal: build a model to predict which of a set of targets a molecule is most likely to hit

§  Method: use RDKit atom-pair fingerprints and a new KNIME learner that builds ensembles of truncated decision trees (sponsored development with knime.com)

§  Validation data set: active molecules from 50 different ChEMBL assays [1]

[1] Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 51, 1831-1839 (2011).

Page 21: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit

§  11561 data points, 50 classes

§  50 trees, random descriptor selection

Page 22: KNIME in NIBR Stories from Industry

About that scaling…

Page 23: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit

§  11561 data points, 50 classes

§  50 trees, random descriptor selection

§  out-of-bag prediction error: 5.8%

§  mean error from cross validation: 4.2%
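For readers unfamiliar with the out-of-bag figure, the sketch below shows how such an error can be computed for a bagged tree ensemble with random descriptor selection. It is a deliberately small illustration under assumptions and not the KNIME learner: each "tree" is truncated all the way down to a single split on one binary feature, the data set is synthetic with two classes, and all names and parameters are invented for the example.

```python
import random
from collections import Counter


def train_stump(X, y, rows, features):
    """Depth-1 tree: choose the feature whose 0/1 split misclassifies the fewest rows."""
    fallback = Counter(y[i] for i in rows).most_common(1)[0][0]
    best = None
    for f in features:
        sides = {0: Counter(), 1: Counter()}
        for i in rows:
            sides[X[i][f]][y[i]] += 1
        errors = sum(sum(c.values()) - max(c.values(), default=0) for c in sides.values())
        if best is None or errors < best[0]:
            labels = {v: (c.most_common(1)[0][0] if c else fallback) for v, c in sides.items()}
            best = (errors, f, labels)
    return best[1], best[2]


def oob_error(X, y, n_trees=50, n_features=2, seed=0):
    """Fraction of samples misclassified by the trees that never saw them in training."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]               # bootstrap sample
        features = rng.sample(range(len(X[0])), n_features)       # random descriptors
        f, labels = train_stump(X, y, rows, features)
        in_bag = set(rows)
        for i in range(n):
            if i not in in_bag:                                   # out-of-bag only
                votes[i][labels[X[i][f]]] += 1
    scored = [(v.most_common(1)[0][0], y[i]) for i, v in enumerate(votes) if v]
    if not scored:
        return None
    return sum(pred != truth for pred, truth in scored) / len(scored)


# Synthetic demo data: bit 0 tracks the class, bit 1 is a noisy copy, bits 2-3 are noise.
demo_rng = random.Random(42)
y_demo = [demo_rng.randrange(2) for _ in range(40)]
X_demo = [
    [label, label ^ (demo_rng.random() < 0.2), demo_rng.randrange(2), demo_rng.randrange(2)]
    for label in y_demo
]
demo_error = oob_error(X_demo, y_demo)
print(f"out-of-bag error on synthetic data: {demo_error:.3f}")
```

Each sample is scored only by trees whose bootstrap sample did not contain it, so the estimate comes for free during training and can be compared against a separate cross-validation run, as on the slide.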

Page 24: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit

§  mistakes tend to be in families
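One way to check a statement like this is to count, from the confusion matrix, how many misclassifications stay within the same target family. The Python sketch below does that on invented toy labels; the target names, the family mapping, and the predictions are all hypothetical and are not results from the talk.

```python
from collections import Counter

# Hypothetical mapping from target to target family.
FAMILY = {
    "kinase-A": "kinase",
    "kinase-B": "kinase",
    "GPCR-A": "GPCR",
    "GPCR-B": "GPCR",
}

# Toy true and predicted labels for eight molecules.
truth = ["kinase-A", "kinase-A", "kinase-B", "kinase-B",
         "GPCR-A", "GPCR-A", "GPCR-B", "GPCR-B"]
predicted = ["kinase-A", "kinase-B", "kinase-B", "kinase-A",
             "GPCR-A", "GPCR-B", "GPCR-B", "kinase-A"]


def confusion_counts(truth, predicted):
    """Sparse confusion matrix: (true label, predicted label) -> count."""
    return Counter(zip(truth, predicted))


def within_family_error_fraction(counts, family):
    """Share of misclassifications where true and predicted targets share a family."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    total = sum(errors.values())
    same = sum(n for (t, p), n in errors.items() if family[t] == family[p])
    return same / total if total else 0.0


counts = confusion_counts(truth, predicted)
fraction = within_family_error_fraction(counts, FAMILY)
print(f"{fraction:.0%} of errors are within the same family")
```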

Page 25: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix

Page 26: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix

Page 27: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix

Page 28: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix

Page 29: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix

Page 30: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix

Page 31: KNIME in NIBR Stories from Industry

Acknowledgements

§  NIBR
   •  John Davies (CPC)
   •  Richard Lewis (GDC)
   •  Steve Litster (NIBR IT)
   •  Andy Palmer (NIBR IT)
   •  Patrick Warren (NIBR IT)
   •  Case studies
      -  Finton Sirockin (GDC)
      -  Mike Derby (NIBR IT)
      -  Varun Shivashankar (NIBR IT)
      -  John Davies (CPC)
   •  Node development
      -  Manuel Schwarze (NIBR IT)
      -  Dillip Kumar Mohanty (NIBR IT)
      -  Sudip Ghosh (NIBR IT)
   •  Marc Litherland (NIBR IT)

§  knime.com
   •  Michael Berthold
   •  Bernd Wiswedel
   •  Thorsten Meinl
   •  Peter Ohl

§  Simon Richards (Lilly)

Page 32: KNIME in NIBR Stories from Industry

Teach • Discover • Treat

the power of collaborative efforts

Join the Teach-Discover-Treat initiative: participate in our symposium* and compete on one or more challenges!

*ACS Spring Meeting, March 25th, 1:30pm to 5:00pm, San Diego Convention Center, Room 26A

Goal: Provide high quality computational chemistry tutorials that impact education and drug discovery for neglected diseases

§  Requirements: use freely available software tools; datasets will be provided with a focus on targets for neglected diseases

§  Criteria to judge: quality of the model (statistical measures), clarity of the tutorial (suitable for undergraduate course), innovative application of computational technique(s)

§  Awards: travel awards to cover travel expenses for presenting work at COMP symposium

§  Presentation of Awardees at ACS Spring 2013 meeting (New Orleans)

More information and access to data sets coming in March. Bookmark www.teach-discover-treat.org