Agile large-scale machine-learning pipelines in drug discovery
Ola Spjuth
Department of Pharmaceutical Biosciences and Science for Life Laboratory
Uppsala University
Outline
My research in perspective
Our approach to machine learning in ligand-based modeling
Challenges when data grows
Automation workflows/pipelines
HPC, Cloud Computing and Big Data Analytics
From data to insights
We have access to a wealth of information
Data mining and predictive modeling can be useful
History: Bioclipse, an open source workbench for the life sciences
O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397
Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
Open source
Eclipse plugin architecture
How is the compound metabolized?
Are any of its metabolites reactive/toxic?
Is it toxic?
Chemical liabilities (drug safety, alerts)
Adverse effects?
Can we, based on existing experimental studies, IT, and statistical models, predict the outcome for new compounds?
Starting out in 2008 with a challenge:
Build a system with predictive models which runs on the client
Initial problem: Site-of-metabolism prediction
Site-of-metabolism (SOM) predictions: MetaPrint2D
L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362
Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J Chem Inf Model. 2007 Mar-Apr;47(2):583-90.
[Figure labels: Reaction database; MetaPrint2D database; Circular fingerprints; Mapping; Highest, medium, and low probability of metabolism]
Predicting SOM in silico
Cheap
Reasonably effective
Fast, can be used in earlier steps than optimization
Results on par with other tools
Used a lot at e.g. AstraZeneca
Wet lab experiments are slow, expensive and not exact
Bioclipse and MetaPrint2D
Next challenge: Extend to general predictive models
Fast predictive models, allow for instant updates upon structural changes
Span from virtual screening to lead optimization
Bioclipse Decision Support
Integrate various predictive methods
Similarity searches (InChI, signatures, fingerprints)
Structural alerts (toxicophores)
QSAR models (classification, regression)
Visual interpretation
Highlight important substructures
O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and Scott Boyer. Integrated decision support for assessing chemical liabilities. Accepted in J. Chem. Inf. Model, 2011.
Mutagenicity: ability of a substance to induce mutations to DNA
Carcinogenic Potency Database (CPDB)
Aryl hydrocarbon receptor (AHR): transcription factor involved in metabolizing enzymes, important target because of a promiscuous ligand binding site
Ligand-based predictive modeling
Quantitative Structure-Activity Relationship (QSAR)
Start with a dataset of chemical structures with measured property to model (inhibition, toxicity, etc.)
Describe chemicals using descriptors
Make use of statistical modeling to relate chemical structures to a response
Machine learning pipelines
Preprocessing
Model building
Validation
Reporting
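The four stages named on this slide can be sketched as plain functions chained together. This is only an illustration under stated assumptions: the record layout, the function names, and the trivial "mean activity" model are hypothetical stand-ins, not the tooling used in the talk.

```python
from statistics import mean

def preprocess(records):
    """Drop records with a missing response and deduplicate by compound id."""
    seen, clean = set(), []
    for rec in records:
        if rec["activity"] is None or rec["id"] in seen:
            continue
        seen.add(rec["id"])
        clean.append(rec)
    return clean

def build_model(train):
    """Toy model (assumption): always predict the mean training activity."""
    return {"mean_activity": mean(r["activity"] for r in train)}

def validate(model, test):
    """Mean absolute error of the toy model on held-out records."""
    return mean(abs(r["activity"] - model["mean_activity"]) for r in test)

def report(model, mae):
    """Render a one-line summary of the model and its validation error."""
    return f"model={model['mean_activity']:.2f} MAE={mae:.2f}"

def run_pipeline(records, n_test=1):
    """Preprocessing -> model building -> validation -> reporting."""
    clean = preprocess(records)
    train, test = clean[:-n_test], clean[-n_test:]
    model = build_model(train)
    return report(model, validate(model, test))

records = [
    {"id": "c1", "activity": 1.0},
    {"id": "c2", "activity": 3.0},
    {"id": "c2", "activity": 3.0},   # duplicate, removed in preprocessing
    {"id": "c3", "activity": None},  # missing response, removed
    {"id": "c4", "activity": 4.0},
]
summary = run_pipeline(records)
print(summary)
```

Keeping each stage a separate function is what makes it possible to swap in a real descriptor calculation or learner later without touching the other stages.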
QSAR modeling
Signatures [1] descriptor in CDK [2]
Canonical representation of atom environments
Support Vector Machine (SVM)
Robust modeling
[1] Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and Computer Sciences, 2003, 43, 707-720
[2] Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.
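To make "canonical representation of atom environments" concrete, here is a deliberately simplified sketch. It is not the signature algorithm of Faulon et al. nor the CDK implementation: bond orders and ring closures are ignored, and the molecule encoding (a list of element symbols plus index pairs for bonds) is an assumption made for the example.

```python
from collections import Counter

def atom_signature(atoms, neighbours, atom, height, parent=None):
    """Canonical string for the environment of `atom` up to `height` bonds away."""
    if height == 0:
        return atoms[atom]
    # Sorting the child strings makes the result independent of atom numbering.
    children = sorted(
        atom_signature(atoms, neighbours, nb, height - 1, parent=atom)
        for nb in neighbours[atom]
        if nb != parent
    )
    if not children:
        return atoms[atom]
    return atoms[atom] + "(" + "".join(children) + ")"

def signature_descriptor(atoms, bonds, height):
    """Count how often each atom-environment string occurs in the molecule."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return Counter(
        atom_signature(atoms, neighbours, i, height) for i in range(len(atoms))
    )

# Heavy atoms of ethanol, written in two different atom orders.
desc = signature_descriptor(["C", "C", "O"], [(0, 1), (1, 2)], height=1)
desc_reordered = signature_descriptor(["O", "C", "C"], [(0, 1), (1, 2)], height=1)
print(desc)
```

The resulting counts form a sparse vector that a learner such as an SVM can consume; both atom orderings give the same descriptor, which is the point of a canonical representation.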
Local interpretation of nonlinear QSAR models
Method
Compute gradient of decision function for prediction
Extract descriptor(s) with largest component in the gradient
Demonstrated on RF, SVM, and PLS
Carlsson, L., Helgee, E.A., and Boyer, S. Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49, 11 (Nov 2009), 2551-2558.
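The two method steps above can be illustrated with a finite-difference gradient. This is a sketch, not the published implementation: `decision_function` is a hypothetical smooth function standing in for a trained model, and the descriptor names are invented for the example.

```python
import math

def decision_function(x):
    """Hypothetical nonlinear model output; stands in for a trained learner."""
    return math.tanh(0.2 * x[0] - 1.5 * x[1] + 0.1 * x[2] * x[2])

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at the point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def most_important_descriptor(f, x, names):
    """Return the descriptor with the largest absolute gradient component."""
    grad = numerical_gradient(f, x)
    idx = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return names[idx], grad[idx]

names = ["sig_A", "sig_B", "sig_C"]  # hypothetical descriptor names
name, component = most_important_descriptor(decision_function, [1.0, 0.0, 1.0], names)
print(name, component)
```

The sign of the selected component indicates whether increasing that descriptor pushes the prediction up or down for this particular compound, which is what makes the interpretation local.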
Lars Carlsson, AstraZeneca R&D
Bioclipse Decision Support
Advantages:
Fast: Run on local computers
Interpretable results: Can be used for hypothesis generation
General: Can integrate any modeling technique and be applied to any data set
Extensible: Very easy to add new components
Next challenge: Simple model building
Build a solution where:
Scientists can build accurate models without modeling expertise, in order to aid their decision making
These models can be combined with other models
Simple model building with graphical wizards
Next challenge: Predict using distributed services
OpenTox: European project for creating an interoperable framework for toxicity predictions
Academia and industry
Parts
Ontology and API
Query and invocation of predictive services
Methods and algorithms
Authentication and authorization
Bioclipse Decision Support
Model discovery
Predictions
Bioclipse and OpenTox
Collaboration with
OpenTox in Bioclipse
Rich user interface for OpenTox!
Screenshot with OpenTox predictions run in DSView
Safety profiles; rich clients allow for a rich, customizable GUI
Summary of Bioclipse Decision Support
Flexible, general method
Apply to any collection of molecules
State-of-the-art machine-learning methods
Handles large data sets
Fast predictions
Advantages with the DS method
Fast: Can run on local computer
Instant predictions, calculate as you draw
Interpretable results: Can be used for hypothesis generation
General: Apply any modeling technique to any data set
Extensible: Very easy to add new components
Open: Free, open source
Observations
Predictive drug discovery is becoming data-intensive
High throughput technologies
Drug/chemical screening
Molecular biology (omics)
More and bigger publicly available data sources
Data is continuously updated
We need scalable and automated methods for predictive modeling
Keeping predictive models up to date is challenging
Versioning of models not trivial
Challenges with bigger data sets for machine learning
Modeling time increases
Reduce/avoid parameter tuning
Run on high-performance e-infrastructures
Use approximate methods
Not all implementations can handle dataset sizes
Use sparse implementations
Determine parameter intervals for modeling (sweetspot)
J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg. Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211-3217.
SVM: Cost and Gamma parameters
Signatures: Heights
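A parameter sweep over the three quantities named above can be enumerated as a grid. The ranges below are placeholders chosen for illustration, not the intervals reported in the benchmarking study; `parameter_grid` is a hypothetical helper.

```python
from itertools import product

def parameter_grid(costs, gammas, heights):
    """Enumerate every combination of SVM cost, SVM gamma and signature heights."""
    return [
        {"cost": c, "gamma": g, "heights": h}
        for c, g, h in product(costs, gammas, heights)
    ]

# Placeholder ranges (assumptions for illustration only).
costs = [2 ** k for k in range(-1, 4)]      # 0.5 .. 8
gammas = [2 ** k for k in range(-10, -5)]   # 2^-10 .. 2^-6
heights = [(0, 1), (0, 2), (0, 3)]          # (start height, end height)

grid = parameter_grid(costs, gammas, heights)
print(len(grid), "model fits per cross-validation fold")
```

Because the number of fits is the product of the three list lengths, narrowing each interval to a known sweet spot reduces tuning time multiplicatively, which is why it matters once data sets grow.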
Example 1: Modeling large number of observations
Jonathan Alvarsson
Example 2: Target predictions
Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L, Wikberg JE, Noeske T. Ligand-based target prediction with signature fingerprints. J Chem Inf Model. 2014 Oct 27;54(10):2647-53
Challenge with running on HPC
Reduce manual work
Automate data preprocessing and modeling
Support modeling life cycle (build, validate, document, version, publish, re-train)
Automating model building is not trivial
Aim: Agile, component-based architecture
Example application: Training large number of datasets
Aim: Build models for hundreds of targets
Challenge to extract
Challenge to automate model building
Data sources
Samuel Lampa
Automating analysis on HPC clusters
Workflow systems can aid development and deployment
We used the Luigi system
Integrate with queuing system (SLURM)
Train and assess model
Samuel Lampa
https://github.com/spotify/luigi
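The idea behind a dependency-driven workflow system can be sketched with the standard library alone. The classes below imitate the general pattern of tasks that declare requirements and outputs; they are not Luigi's API, and the task names and file contents are hypothetical. On a cluster, the `run` step would typically submit a SLURM batch job rather than compute in-process.

```python
import os
import tempfile

class Task:
    """Minimal stand-in for a workflow task (not Luigi's actual API)."""
    def __init__(self, workdir):
        self.workdir = workdir
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def run(self):
        raise NotImplementedError
    def complete(self):
        # A task counts as done when its output file exists.
        return os.path.exists(self.output())

class ExtractData(Task):
    def output(self):
        return os.path.join(self.workdir, "dataset.csv")
    def run(self):
        with open(self.output(), "w") as fh:
            fh.write("smiles,activity\nCCO,1\nCCN,0\n")

class TrainModel(Task):
    def requires(self):
        return [ExtractData(self.workdir)]
    def output(self):
        return os.path.join(self.workdir, "model.txt")
    def run(self):
        with open(self.requires()[0].output()) as fh:
            n_rows = sum(1 for _ in fh) - 1  # subtract the header line
        with open(self.output(), "w") as fh:
            fh.write(f"trained on {n_rows} compounds\n")

def build(task, log):
    """Run dependencies first; skip tasks whose output already exists."""
    for dep in task.requires():
        build(dep, log)
    if not task.complete():
        task.run()
        log.append(type(task).__name__)

workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(TrainModel(workdir), first_run)
build(TrainModel(workdir), second_run)  # nothing to do: outputs exist
print(first_run, second_run)
```

Skipping completed tasks is what makes re-running a large workflow after a partial failure cheap.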
Example ML pipeline (unpublished data)
Publishing models
Publish models for easy access and consumption
We used P2 (OSGi) provisioning system
v. 1.3, v. 1.2, v. 1.1
Use models
Reactive/continuous modeling
[Figure labels: Data sources; Train and assess model; Coordinate, Integrate, Version, Monitor; Publish models; Archive models; User; Bioclipse]
Model building WFs on HPC is not trivial
Many workflow systems exist
DSLs vs APIs
Dynamic input/output in e.g. cross-validation not supported out of the box
Time-consuming to create WFs
Workflows can be useful but are not (yet) the silver bullet we sought
O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen, M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D. Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.
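The point about dynamic inputs and outputs in cross-validation can be made concrete: the number of fold-level artifacts depends on a parameter `k` that is only known at run time, so a workflow with a statically declared set of outputs cannot describe it directly. The helper names and the file naming scheme below are hypothetical.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

def fold_outputs(dataset_name, k):
    """Output file names a workflow would have to declare: one per fold."""
    return [f"{dataset_name}.fold{i}.model" for i in range(k)]

splits = list(k_fold_splits(10, 3))
outputs = fold_outputs("target_x", 3)
print([len(test) for _, test in splits], outputs)
```

Changing `k` changes both the number of training tasks and the number of outputs, which is the kind of fan-out that needed extra work in the workflow systems discussed here.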
Could cloud computing improve things?
QSAR Modeling on Amazon Elastic Cloud
B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson, and O. Spjuth. Scaling predictive modeling in drug development with cloud computing. J. Chem. Inf. Model., 2015, 55 (1), pp 19-25
Private clouds
We set up an OpenStack system at UPPMAX (our HPC center)
Primarily Infrastructure as a Service (IaaS): users can run virtual machines
Platform-as-a-Service (PaaS): Hadoop and Spark
Our question: Can this be useful for model building?
Open catalogue of VMIs
Hosted at Uppsala University
M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth. BioImg.org: A catalogue of virtual machine images for the life sciences. Accepted in Bioinformatics and Biology Insights.
www.bioimg.org
Managing Virtual Machine Images
Cloud computing enables Big Data Analytics
Hadoop
Open Source Map-Reduce, suited for massively parallel tasks
Distributed execution, high availability, fault tolerant, can be run on commodity hardware
E.g. Google, Facebook and Twitter use it
Hadoop File System (HDFS) distributes data on nodes, computing done in parallel
Bring computations to data
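The map, shuffle, and reduce phases can be imitated in a single process to show the programming model. This sketch does not use Hadoop; the partitions stand in for data blocks stored on different nodes, and the compound records are invented for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Emit one (key, value) pair per record; the key is the activity label."""
    _, label = record
    yield label, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine the values of one key into a single result."""
    return key, sum(values)

def map_reduce(partitions):
    """Run map over every partition, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Two partitions, standing in for blocks on two different nodes.
partitions = [
    [("c1", "active"), ("c2", "inactive")],
    [("c3", "active")],
]
counts = map_reduce(partitions)
print(counts)
```

Because `map_phase` looks at one record at a time, each partition can be processed where it is stored, which is the "bring computations to data" idea.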
Hadoop (MapReduce) for massively parallel analysis
Evaluating Hadoop for next-generation sequencing
Compare Hadoop and HPC
Create as identical pipelines as possible
Calculate efficiency as function of data size
Conclusion: Hadoop pipeline scales better than HPC and is economical for current data sizes
Alexey Siretskiy, former postdoc at UPPMAX
A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26.
A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE 10th International Conference on (2014), vol. 1, pp. 317-323.
Spark
Adds caching to Hadoop (MapReduce): in-memory computing
Good for iterative algorithms
We applied it to ligand-based virtual screening
With Åke Edlund, HPCViz, KTH
L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 2, pp. 27-32, 2013
Large-scale machine learning on Spark
Ongoing project: Create a large-scale machine learning pipeline for QSAR using Spark ML as alternative to Luigi workflow system
Apply to large data sets
Apply to many data sets
Compare Spark with workflows on batch system
Aim: Use for Reactive Modeling
Some conclusions so far on cloud computing and Hadoop/Spark for bioinformatics
Cloud computing
Easy provisioning of infrastructures, services and platforms
Hadoop
Scalable and efficient, but at the price of software incompatibility
Spark
Improves over Hadoop with in-memory computing and a more intuitive interface
Current working hypothesis: Spark more advantageous compared to workflows on batch systems for machine learning pipelines
Conformal prediction
Seek answer to: How good is your prediction?
Traditional machine learning algorithms:
Simple predictions (e.g. Class A, 8.45)
Conformal predictions
Prediction intervals for a given confidence level
Based on a consistent and well-defined mathematical framework [1]
[1] Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic learning in a random world; Springer: New York, 2005.
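For classification, the counterpart of a prediction interval is a set of labels that holds at the chosen confidence level. The sketch below shows the p-value computation of an inductive conformal predictor; the nonconformity scores are made-up numbers, and in practice they would come from an underlying model evaluated on a held-out calibration set.

```python
def p_value(calibration_scores, test_score):
    """Share of nonconformity scores at least as large as the test score.

    The +1 terms count the test example itself among the calibration examples.
    """
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)

def prediction_set(calibration_scores, candidate_scores, confidence):
    """Labels whose p-value exceeds the significance level 1 - confidence."""
    significance = 1 - confidence
    return {
        label
        for label, score in candidate_scores.items()
        if p_value(calibration_scores, score) > significance
    }

# Made-up nonconformity scores (assumption): higher means "stranger".
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
candidates = {"active": 0.25, "inactive": 0.95}

set_80 = prediction_set(calibration, candidates, confidence=0.80)
set_95 = prediction_set(calibration, candidates, confidence=0.95)
print(set_80, set_95)
```

Raising the confidence level can only enlarge the set: at 80% a single label remains, while at 95% both labels must be kept, which answers "how good is your prediction?" for this compound.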
Conformal predictions
Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014), 1596-1603.
Some projects on Conformal Predictions
CP Feature Highlighting
CP in Spark
Large-scale model building in cheminformatics and virtual screening
Ongoing projects
Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal Prediction Classification Models. Statistical Learning and Data Sciences. Springer International Publishing; 2015. pp. 323-334.
Capuccini M, Carlsson L, Norinder U, and Spjuth O. Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence. Submitted.
Two pilots for clinical data management
CML, Lucia Cavelier
MDR, Åsa Melhus
e-Science (cyberinfrastructure, big data)
Systematic and advanced use of computers in research
High-performance computing
Distributed data, Big data
Enabling science!
www.e-science.se
www.essenceofescience.se
Acknowledgements
Workflows: Samuel Lampa, David Kreil, Maciej Kańduła
BioImg.org: Martin Dahlö, Frédéric Haziza, Mentell Design
Hadoop & Spark: Alexey Siretskiy, Åke Edlund, Izhar ul Hassan, Marco Capuccini, Staffan Arvidsson
Cloud computing: Frédéric Haziza, Tore Sundqvist, Behrooz Torabi, Salman Toor, Andreas Hellander
Predictive modeling: Lars Carlsson, Ernst Ahlberg-Helgee, Martin Eklund, Ulf Norinder, Wesley Schaal, Jonathan Alvarsson
Bioclipse: Arvid Berg, Egon Willighagen, All Bioclipse and CDK contributors
Thank you