Agile large-scale machine-learning pipelines in drug discovery


Ola Spjuth
Department of Pharmaceutical Biosciences and Science for Life Laboratory
Uppsala University, [email protected]

Outline

- My research in perspective
- Our approach to machine learning in ligand-based modeling
- Challenges when data grows
- Automation workflows/pipelines
- HPC, Cloud Computing and Big Data Analytics

From data to insights
- We have access to a wealth of information
- Data mining and predictive modeling can be useful

History: Bioclipse, an open source workbench for the life sciences

O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397.

Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.

- Open source
- Eclipse plugin architecture

- How is the compound metabolized?
- Are any of its metabolites reactive/toxic?
- Is it toxic?
- Chemical liabilities (drug safety, alerts)
- Adverse effects?

[Figure callouts on the compound structure: "Here?", "Here?"]

Can we, based on existing experimental studies, IT, and statistical models, predict the outcome for new compounds?

Starting out in 2008 with a challenge:

- Build a system with predictive models which runs on the client
- Initial problem: Site-of-metabolism prediction

Site-of-metabolism (SOM) predictions: MetaPrint2D

L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362.

Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J Chem Inf Model. 2007 Mar-Apr;47(2):583-90.

[Figure labels: Reaction database; Circular fingerprints; MetaPrint2D database; Mapping; atoms marked as highest, medium, or low probability of metabolism]

Predicting SOM in silico
- Cheap
- Reasonably effective
- Fast, can be used in earlier steps than optimization
- Results on par with other tools
- Used a lot at e.g. AstraZeneca
- Wet lab experiments are slow, expensive and not exact


Bioclipse and MetaPrint2D

Next challenge: Extend to general predictive models

- Fast predictive models that allow for instant updates upon structural changes
- Span from virtual screening to lead optimization

Bioclipse Decision Support

- Integrate various predictive methods
  - Similarity searches (InChI, signatures, fingerprints)
  - Structural alerts (toxicophores)
  - QSAR models (classification, regression)
- Visual interpretation
  - Highlight important substructures

O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and Scott Boyer. Integrated decision support for assessing chemical liabilities. Accepted in J. Chem. Inf. Model, 2011.
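The similarity searches listed above can be illustrated with a small sketch. It assumes fingerprints are represented as sets of integer feature identifiers and uses the Tanimoto coefficient as the similarity measure; the slides do not name a specific measure, so that choice, the function names and the toy library are assumptions made for illustration only.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of feature identifiers."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


def most_similar(query, library, k=3):
    """Return the k library entries most similar to the query, best first."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    return sorted(scored, reverse=True)[:k]


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
library = {
    "mol_a": frozenset({1, 4, 7, 9}),
    "mol_b": frozenset({1, 4, 8}),
    "mol_c": frozenset({2, 3}),
}
query = frozenset({1, 4, 7})
hits = most_similar(query, library, k=2)
print(hits)
```

Set-based fingerprints keep the example dependency-free; a production system would more likely use bit vectors or hashed signatures.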

- Mutagenicity: ability of a substance to induce mutations to DNA
- Carcinogenic Potency Database (CPDB)
- Aryl hydrocarbon receptor (AHR): transcription factor involved in metabolizing enzymes, important target because of a promiscuous ligand binding site

Ligand-based predictive modeling
- Quantitative Structure-Activity Relationship (QSAR)
- Start with a dataset of chemical structures with a measured property to model (inhibition, toxicity, etc.)
- Describe chemicals using descriptors
- Make use of statistical modeling to relate chemical structures to a response


Machine learning pipelines
- Preprocessing
- Model building
- Validation
- Reporting
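The four stages can be sketched as plain functions chained together. This is a toy illustration, not the pipeline used in the work presented: the record layout, the mean-value "model" and all function names are assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def preprocess(records):
    """Drop records with a missing response and remove duplicate identifiers."""
    seen, clean = set(), []
    for mol_id, features, response in records:
        if response is None or mol_id in seen:
            continue
        seen.add(mol_id)
        clean.append((mol_id, features, response))
    return clean


def build_model(training):
    """Toy stand-in for a learner: always predict the mean training response."""
    baseline = mean(r for _, _, r in training)
    return lambda features: baseline


def validate(model, test):
    """Mean absolute error on held-out records."""
    return mean(abs(model(f) - r) for _, f, r in test)


def report(n_train, n_test, mae):
    return f"train={n_train} test={n_test} MAE={mae:.2f}"


def run_pipeline(records, n_test):
    clean = preprocess(records)
    train, test = clean[:-n_test], clean[-n_test:]
    model = build_model(train)
    return report(len(train), len(test), validate(model, test))


# Hypothetical records: (identifier, descriptor vector, measured response).
records = [
    ("m1", [0.1], 1.0), ("m2", [0.2], 2.0), ("m2", [0.2], 2.0),
    ("m3", [0.3], None), ("m4", [0.4], 3.0), ("m5", [0.5], 4.0),
]
summary = run_pipeline(records, n_test=1)
print(summary)
```

Keeping each stage as a separate function with explicit inputs and outputs is what later makes it possible to hand the stages to a workflow system.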

QSAR modeling
- Signatures [1] descriptor in CDK [2]
  - Canonical representation of atom environments
- Support Vector Machine (SVM)
  - Robust modeling

[1] Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and Computer Sciences, 2003, 43, 707-720.
[2] Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.


Local interpretation of nonlinear QSAR models
- Method
  - Compute gradient of decision function for prediction
  - Extract descriptor(s) with largest component in the gradient
- Demonstrated on RF, SVM, and PLS

Carlsson, L., Helgee, E.A., and Boyer, S. Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49, 11 (Nov 2009), 2551-2558.

Lars Carlsson, AstraZeneca R&D
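The two-step method (gradient of the decision function, then the descriptor with the largest gradient component) might look roughly like the sketch below. It assumes an RBF-kernel decision function with made-up support vectors, estimates the gradient numerically by central differences, and ranks components by absolute value; all of these are illustrative assumptions rather than the published implementation.

```python
import math


def decision_function(x, support_vectors, coefs, gamma, bias):
    """RBF-kernel decision function: sum_i c_i * exp(-gamma * ||x - s_i||^2) + bias."""
    total = bias
    for s, c in zip(support_vectors, coefs):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += c * math.exp(-gamma * sq_dist)
    return total


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at point x."""
    grad = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def most_important_descriptor(f, x, names):
    """Name and gradient value of the descriptor with the largest |gradient| component."""
    grad = numerical_gradient(f, x)
    j = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return names[j], grad[j]


# Hypothetical model parameters and descriptor names.
support_vectors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
coefs = [1.0, -0.5]
names = ["sig_A", "sig_B", "sig_C"]


def f(x):
    return decision_function(x, support_vectors, coefs, gamma=0.5, bias=0.0)


name, component = most_important_descriptor(f, [0.0, 0.0, 0.0], names)
print(name, round(component, 4))
```

Because the gradient is evaluated at the compound being predicted, the ranking is local: a different compound can highlight a different descriptor under the same model.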



Advantages:
- Fast: Run on local computers
- Interpretable results: Can be used for hypothesis generation
- General: Can integrate any modeling technique and be applied to any data set
- Extensible: Very easy to add new components


Next challenge: Simple model building

Build a solution where:
- Scientists can build accurate models without modeling expertise, in order to aid their decision making
- These models can be combined with other models

Simple model building with graphical wizards

Next challenge: Predict using distributed services
- OpenTox: European project for creating an interoperable framework for toxicity predictions
- Academia and industry
- Parts
  - Ontology and API
  - Query and invocation of predictive services
  - Methods and algorithms
  - Authentication and authorization


[Figure labels: Model discovery; Predictions]

Bioclipse and OpenTox

Collaboration with the OpenTox project


OpenTox in Bioclipse

- Rich user interface for OpenTox!
- Screenshot with OpenTox predictions run in DSView
- Safety profiles; rich clients allow for a rich, customizable GUI


Summary of Bioclipse Decision Support

- Flexible, general method
  - Apply to any collection of molecules
- State-of-the-art machine-learning methods
  - Handles large data sets
  - Fast predictions

Advantages with the DS method
- Fast: Can run on local computer
  - Instant predictions, calculate as you draw
- Interpretable results: Can be used for hypothesis generation
- General: Apply any modeling technique to any data set
- Extensible: Very easy to add new components
- Open: Free, open source

Observations
- Predictive drug discovery is becoming data-intensive
  - High throughput technologies
    - Drug/chemical screening
    - Molecular biology (omics)
  - More and bigger publicly available data sources
- Data is continuously updated
- We need scalable and automated methods for predictive modeling
- Keeping predictive models up to date is challenging
- Versioning of models is not trivial


Challenges with bigger data sets for machine learning
- Modeling time increases
  - Reduce/avoid parameter tuning
  - Run on high-performance e-infrastructures
  - Use approximate methods
- Not all implementations can handle dataset sizes
  - Use sparse implementations

Determine parameter intervals for modeling (sweetspot)

J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg. Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211-3217.

- SVM: Cost and Gamma parameters
- Signatures: Heights
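A search over these three parameters can be sketched as a plain grid enumeration. The value ranges below and the scoring function are invented for illustration; they are not the intervals or results reported in the benchmarking study, and in practice the score would be a cross-validated performance measure.

```python
import math
from itertools import product


def parameter_grid(costs, gammas, height_ranges):
    """Yield every combination of SVM cost, SVM gamma and signature height range."""
    for cost, gamma, (h_start, h_end) in product(costs, gammas, height_ranges):
        yield {"cost": cost, "gamma": gamma,
               "height_start": h_start, "height_end": h_end}


def grid_search(grid, score):
    """Return the parameter combination with the highest score."""
    return max(grid, key=score)


def toy_score(p):
    """Stand-in for a cross-validated score; it peaks at an arbitrary point by construction."""
    return (-abs(math.log2(p["cost"]) - 1)
            - abs(math.log2(p["gamma"]) + 6)
            + p["height_end"])


# Illustrative ranges only (powers of two are a common choice for SVM parameters).
costs = [2 ** k for k in range(-1, 4)]
gammas = [2 ** k for k in range(-8, -3)]
heights = [(0, 1), (0, 2), (0, 3)]

grid = list(parameter_grid(costs, gammas, heights))
best = grid_search(grid, toy_score)
print(len(grid), best)
```

The grid size grows multiplicatively with each parameter, which is why narrowing the intervals in advance reduces modeling time on large data sets.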

Example 1: Modeling large number of observations

Jonathan Alvarsson

Example 2: Target predictions

Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L, Wikberg JE, Noeske T. Ligand-based target prediction with signature fingerprints. J Chem Inf Model. 2014 Oct 27;54(10):2647-53

Challenge with running on HPC
- Reduce manual work
- Automate data preprocessing and modeling
- Support modeling life cycle (build, validate, document, version, publish, re-train)

- Automating model building is not trivial
- Aim: Agile, component-based architecture

Example application: Training large number of datasets

- Aim: Build models for hundreds of targets
- Challenge to extract
- Challenge to automate model building

Data sources

Samuel Lampa

Automating analysis on HPC clusters
- Workflow systems can aid development and deployment
- We used the Luigi system
- Integrate with queuing system (SLURM)

Train and assess model

Samuel Lampa

https://github.com/spotify/luigi
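The core idea behind such a workflow system, tasks that declare their dependencies and outputs and are skipped when the output already exists, can be sketched in plain Python. The classes below imitate that pattern but are not the Luigi API; the task names, file names and data are hypothetical, and on a cluster each run step would submit a job to the queuing system rather than execute locally.

```python
import os
import tempfile


class Task:
    """Minimal stand-in for a workflow task (illustrative; not the Luigi API)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


def build(task, log):
    """Run dependencies first, skipping any task whose output already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()  # on an HPC cluster this step would be submitted as a batch job
    log.append(type(task).__name__)


class ExtractData(Task):
    def output(self):
        return os.path.join(self.workdir, "dataset.txt")

    def run(self):
        with open(self.output(), "w") as fh:
            fh.write("CCO\t1\nCCN\t0\n")


class TrainModel(Task):
    def requires(self):
        return [ExtractData(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "model.txt")

    def run(self):
        with open(self.requires()[0].output()) as src:
            n_rows = sum(1 for _ in src)
        with open(self.output(), "w") as fh:
            fh.write(f"trained on {n_rows} rows\n")


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(TrainModel(workdir), first_run)   # runs both tasks
build(TrainModel(workdir), second_run)  # nothing to do: outputs exist
print(first_run, second_run)
```

Using output files as the completion marker is what makes a failed or interrupted run resumable without repeating finished steps.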

Example ML pipeline (unpublished data)

Publishing models

- Publish models for easy access and consumption
- We used the P2 (OSGi) provisioning system

v. 1.3, v. 1.2, v. 1.1

Use models

Reactive/continuous modeling

[Figure labels: Data sources; Coordinate, Integrate, Version, Monitor; Train and assess model; Publish models; Archive models; User; Bioclipse]


Model building WFs on HPC is not trivial
- Many workflow systems exist
- DSLs vs APIs
- Dynamic input/output in e.g. cross-validation not supported out of the box
- Time-consuming to create WFs
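The cross-validation point can be made concrete with a short sketch: the number of train/test tasks and their output files depends on a runtime parameter (the number of folds), so a workflow with a fixed, statically declared graph cannot express it directly. The fold-splitting scheme, function names and file-name pattern below are assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def plan_cv_tasks(dataset_name, n_samples, k):
    """One task description per fold; how many there are is known only once k is given."""
    return [
        {"output": f"{dataset_name}.fold{i}.result",
         "n_train": len(train), "n_test": len(test)}
        for i, (train, test) in enumerate(k_fold_indices(n_samples, k))
    ]


# Hypothetical data set with 10 observations and 3 folds.
tasks = plan_cv_tasks("target_x", n_samples=10, k=3)
for task in tasks:
    print(task)
```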

Workflows can be useful but are not (yet) the silver bullet we sought

O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen, M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D. Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.

Could cloud computing improve things?

QSAR Modeling on Amazon Elastic Cloud

B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson, and O. Spjuth. Scaling predictive modeling in drug development with cloud computing. J. Chem. Inf. Model., 2015, 55 (1), pp 19-25

Private clouds
- We set up an OpenStack system at UPPMAX (our HPC center)
- Primarily Infrastructure as a Service (IaaS): users can run virtual machines
- Platform as a Service (PaaS): Hadoop and Spark
- Our question: Can this be useful for model building?

Managing Virtual Machine Images
- Open catalogue of VMIs
- Hosted at Uppsala University
- www.bioimg.org

M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth. BioImg.org: A catalogue of virtual machine images for the life sciences. Accepted in Bioinformatics and Biology Insights.

Cloud computing enables Big Data Analytics
- Hadoop
  - Open source MapReduce, suited for massively parallel tasks
  - Distributed execution, high availability, fault tolerant, can be run on commodity hardware
  - E.g. Google, Facebook and Twitter use it

- Hadoop File System (HDFS) distributes data on nodes, computing done in parallel
- Bring computations to data
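The MapReduce pattern itself can be shown in a few lines of single-process Python. This sketch only mimics the three phases (map, shuffle, reduce) on in-memory partitions; it does not use Hadoop, and the molecule identifiers and signature strings are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (signature, 1) for every signature found in one molecule record."""
    _mol_id, signatures = record
    for sig in signatures:
        yield sig, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def map_reduce(partitions):
    # Each partition is mapped independently, as it would be on the node storing that block.
    mapped = [pair for part in partitions for rec in part for pair in map_phase(rec)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Two hypothetical data partitions, standing in for blocks stored on different nodes.
partitions = [
    [("m1", ["C", "CO"]), ("m2", ["C", "CN"])],
    [("m3", ["CO", "C"])],
]
counts = map_reduce(partitions)
print(counts)
```

Because the map step needs only its own partition, the work can run where the data already resides, which is the sense of bringing computations to data.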

Hadoop (MapReduce) for massively parallel analysis

Evaluating Hadoop for next-generation sequencing
- Compare Hadoop and HPC
- Create as identical pipelines as possible
- Calculate efficiency as function of data size
- Conclusion: Hadoop pipeline scales better than HPC and is economical for current data sizes

Alexey Siretskiy, former postdoc at UPPMAX

A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26.

A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE 10th International Conference on (2014), vol. 1, pp. 317-323.

Spark
- Adds caching to Hadoop (MapReduce): in-memory computing
- Good for iterative algorithms
- We applied it for ligand-based virtual screening

With Åke Edlund, HPCViz, KTH

L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 2, pp. 27-32, 2013.

Large-scale machine learning on Spark
- Ongoing project: Create a large-scale machine learning pipeline for QSAR using Spark ML as an alternative to the Luigi workflow system
- Apply to large data sets
- Apply to many data sets
- Compare Spark with workflows on batch systems
- Aim: Use for Reactive Modeling

Some conclusions so far on cloud computing and Hadoop/Spark for bioinformatics
- Cloud computing
  - Easy provisioning of infrastructures, services and platforms
- Hadoop
  - Scalable and efficient, but at the price of software incompatibility
- Spark
  - Improves over Hadoop with in-memory computing and a more intuitive interface
- Current working hypothesis: Spark is more advantageous than workflows on batch systems for machine learning pipelines

Conformal prediction
- Seek answer to: How good is your prediction?

- Traditional machine learning algorithms:
  - Simple predictions (e.g. Class A, 8.45)
- Conformal predictions
  - Prediction intervals for a given confidence level
  - Based on a consistent and well-defined mathematical framework [1]

[1] Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic learning in a random world; Springer: New York, 2005.
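To show how a point prediction such as 8.45 becomes an interval, here is a minimal sketch of split (inductive) conformal prediction for regression. It assumes the nonconformity score is the absolute residual on a held-out calibration set; the residual values and the function name are made up for illustration, and the projects described in this talk may use other variants.

```python
import math


def conformal_interval(calibration_residuals, prediction, confidence):
    """Prediction interval from absolute residuals on a held-out calibration set.

    The half-width is the ceil((n + 1) * confidence)-th smallest residual.
    If that rank exceeds n, too few calibration points are available for the
    requested confidence and the interval is unbounded.
    """
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        return (-math.inf, math.inf)
    half_width = sorted(calibration_residuals)[rank - 1]
    return (prediction - half_width, prediction + half_width)


# Hypothetical |observed - predicted| values for nine calibration compounds.
residuals = [0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]

low, high = conformal_interval(residuals, prediction=8.45, confidence=0.8)
print(f"80% interval: [{low:.2f}, {high:.2f}]")

wide = conformal_interval(residuals, prediction=8.45, confidence=0.95)
print("95% interval:", wide)
```

With only nine calibration points the 95% level cannot be guaranteed by a finite interval, which illustrates why the size of the calibration set limits the confidence levels that are useful in practice.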

Conformal predictions

Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014), 1596-1603.

Some projects on Conformal Predictions (ongoing projects)
- CP Feature Highlighting
- CP in Spark
- Large-scale model building in cheminformatics and virtual screening

Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal Prediction Classification Models. Statistical Learning and Data Sciences. Springer International Publishing; 2015. pp. 323-334.

Capuccini M, Carlsson L, Norinder U, and Spjuth O. Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence. Submitted.

Two pilots for clinical data management

CML, Lucia Cavelier

MDR, Åsa Melhus

e-Science (cyberinfrastructure, big data)
- Systematic and advanced use of computers in research
- High-performance computing
- Distributed data, Big data
- Enabling science!

www.e-science.se
www.essenceofescience.se

Acknowledgements

Workflows: Samuel Lampa, David Kreil, Maciej Kańduła
BioImg.org: Martin Dahlö, Frédéric Haziza, Mentell Design
Hadoop & Spark: Alexey Siretskiy, Åke Edlund, Izhar ul Hassan, Marco Capuccini, Staffan Arvidsson
Cloud computing: Frédéric Haziza, Tore Sundqvist, Behrooz Torabi, Salman Toor, Andreas Hellander
Predictive modeling: Lars Carlsson, Ernst Ahlberg-Helgee, Martin Eklund, Ulf Norinder, Wesley Schaal, Jonathan Alvarsson
Bioclipse: Arvid Berg, Egon Willighagen, all Bioclipse and CDK contributors

Thank you
[email protected]