ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Artificial Intelligence Enabled Multiscale Molecular Simulations

Arvind Ramanathan, Team Lead, Integrative Systems Biology, Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge
ramanathana@ornl.gov | http://ramanathanlab.org

Dmitry I. Liakh (NCCS), Ramakrishnan Kannan (CSMD), Srikanth Yoginath (CSMD), Christopher B. Stanley (CSED), Debsindhu Bhowmik (CSED), Heng Ma (CSED)
Drive integration of simulation, data analytics, and machine learning. Molecules library: AI-driven multiscale molecular simulations

[Figure: convergence of traditional HPC systems — large-scale numerical simulation, scalable data analytics, and deep learning / artificial intelligence]
Multi-scale phenomena in biological systems pose challenges for modeling/simulation at exascale
• Simulations of physical phenomena take 45-60% of supercomputing time
• Coupled to experimental data from facilities such as the Spallation Neutron Source and X-ray scattering/diffraction facilities
• "Exascale simulations will require some analyses… be performed while data is still resident in memory…"
Towards AI-driven simulations: interleaving data analytics + simulations
• In situ analytics
• Reduced data movement and other overheads
• Online monitoring and feedback

Traditional compute + simulations: job scheduler → simulation(s) on "big iron" → data storage (disks, "Big Store") → analytics and visualization on dedicated analytics clusters. Unsustainable at exascale:
• Data movement bottlenecks
• Parallel analytics bottlenecks

Interleaving analytics + simulations: job scheduler drives simulation(s), analytics, and visualization together on the "big iron" in an iterative forward/backward loop, with data storage (disks) off the critical path.
• High-performance framework to monitor and analyze simulations as they are running, with little/no modification to simulation software
• Demonstrated on molecular dynamics simulations, but the framework generalizes for broad applicability
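The interleaving loop can be sketched as follows; `run_md_chunk` and `analyze` are hypothetical stand-ins for a production MD engine and an in situ analytic (here, radius of gyration), not part of any real framework:

```python
import numpy as np

def run_md_chunk(state, n_steps=100):
    """Stand-in for one chunk of an MD simulation: random-walk the coordinates."""
    rng = np.random.default_rng(0)
    return state + 0.01 * rng.standard_normal(state.shape)

def analyze(frame):
    """In situ analytic: radius of gyration of the current frame."""
    centered = frame - frame.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

def interleaved_loop(n_chunks=5):
    state = np.zeros((10, 3))          # 10 particles, 3D coordinates
    metrics = []
    for _ in range(n_chunks):
        state = run_md_chunk(state)    # simulate one chunk
        metrics.append(analyze(state)) # analyze while data is still in memory
        # feedback point: a real framework could steer or respawn simulations here
    return metrics

print(interleaved_loop())
```

The key point the sketch illustrates: analysis runs between simulation chunks on data still resident in memory, so nothing is written to disk before feedback can occur.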
Outline: Can artificial intelligence (AI) techniques be leveraged for accelerating molecular simulations?
• Building effective (low-dimensional) latent representations of simulation datasets:
  – Using deep learning approaches for molecular dynamics (MD) data
  – Scaling convolutional variational autoencoders for MD
• Predicting where we should go next in MD simulations:
  – Building a recurrent autoencoder to predict future steps
• Preliminary work on a reinforcement learning approach for protein folding/docking
A variational approach to encode protein folding with convolutional auto-encoders
[Figure: convolutional VAE architecture — MD simulations → contact matrices (input) → 4 convolution layers → dense → mean/std → sampled latent embedding → dense → 4 deconvolution layers → reconstructed contact matrices (outputs); VAE embedding colored by number of native contacts]

4 convolution layers / 4 deconvolution layers:
1. 64 filters, 3x3 window, 1x1 stride, ReLU
2. 64 filters, 3x3 window, 1x1 stride, ReLU
3. 64 filters, 3x3 window, 2x2 stride, ReLU
4. 64 filters, 3x3 window, 1x1 stride, Sigmoid
Reduced dimension: 3
D. Bhowmik, M. T. Young, S. Gao, A. Ramanathan, BMC Bioinformatics (accepted)
Related work: Hernandez et al. 2017 (arXiv); Doerr et al. 2017 (arXiv)
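The "mean/std → sampled latent embedding" step in the architecture above is the VAE reparameterization trick. A minimal NumPy sketch (the zero-mean, unit-variance inputs are illustrative values, not taken from the trained model):

```python
import numpy as np

def sample_latent(mean, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    Writing the sample this way keeps it differentiable w.r.t. mean and log_var."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(42)
mean = np.zeros(3)      # reduced dimension: 3, matching the slide
log_var = np.zeros(3)   # log-variance of 0 -> unit variance
z = sample_latent(mean, log_var, rng)
print(z.shape)          # (3,)
```

In the full model, `mean` and `log_var` are produced by the dense layer after the convolutional encoder, and `z` feeds the dense layer before the deconvolutional decoder.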
Deep learning reveals "metastable states" in protein folding…
MSMBuilder datasets, Pande group
Scaling & Performance: Enabling DL approaches to achieve near real-time training/prediction
[Figure: training time per epoch (sec) vs. number of GPUs (6, 60, 120, 384), batch size 128]

• Scaling deep learning on Summit to facilitate online training
• DeepEx: a custom-built deep learning stack for Summit
• Exploiting the low-rank structure of scientific data:
  – Accelerate training
  – Scale to larger datasets
• Performance on ResNet-like convolutional nets

S. Yoginath, M. Alam, K. Perumalla, R. Kannan, D. Bhowmik, A. Ramanathan, ORNL Tech Report
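The low-rank idea can be illustrated generically with a truncated SVD: if the data matrix has rank far below its dimensions, a handful of components reconstructs it almost exactly, so downstream training can work on the much smaller factors. This is a NumPy illustration of the principle, not the DeepEx implementation:

```python
import numpy as np

# Synthetic "scientific data" with low-rank structure: a 200x100 matrix of rank 5
rng = np.random.default_rng(0)
U = rng.standard_normal((200, 5))
V = rng.standard_normal((5, 100))
X = U @ V

# Truncated SVD keeps only the top-k singular components
u, s, vt = np.linalg.svd(X, full_matrices=False)
k = 5
X_k = u[:, :k] * s[:k] @ vt[:k]   # rank-k reconstruction

rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print(rel_err < 1e-8)  # rank-5 data is recovered from just 5 components
```

Storing `u[:, :k]`, `s[:k]`, and `vt[:k]` needs (200 + 1 + 100) × 5 numbers instead of 200 × 100, which is the kind of saving that accelerates training and lets larger datasets fit.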
[1] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Gaussian process bandits without regret: An experimental design approach. CoRR, abs/0912.3995, 2009.
[2] S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In AISTATS 2010, Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 273–280, Chia Laguna Resort, Sardinia, Italy, May 2010.
Current platforms for hyperparameter optimization rely on sequential optimization techniques
• Bayesian optimization, bandit optimization, and related methods are usually sequential search processes
• Exponential scaling:
  – The number of samples required to bound our uncertainty about the optimization procedure scales exponentially with the number of search dimensions, as 2^D, where D is the number of dimensions [1, 2].
  – This is often forgotten amid the recent excitement around Bayesian optimization.
• HyperSpace instead focuses on the structure of the search space:
  – Parallelism to exploit the statistical structure of the search space
  – Reveal partial dependencies across parameter spaces
  – Build many surrogate functions in parallel
  – {Prayer}!
HyperSpace: Distributed Parallel Bayesian Optimization
HyperSpace: parallel exploration of large search spaces
1. Define the bounds of each hyperparameter search space.
2. Divide each search space bound into two nearly equal sub-bounds with overlap ɸ, where {ɸ ∈ ℝ | 0 ≤ ɸ ≤ 1}.
3. Create all possible combinations of hyperparameter sub-bounds to form 2^D search spaces (hyperspaces), where D is the number of model hyperparameters.
4. Run Bayesian optimization over each hyperspace in parallel.
M. Todd Young, J. D. Hinkle, R. Kannan, A. Ramanathan, HyperSpace: Massively Parallel Bayesian Optimization, Workshop on High Performance Machine Learning, 2018, Lyon, France
https://github.com/yngtodd/hyperspace
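Steps 1-3 above can be sketched in pure Python. `split_bound` and `make_hyperspaces` are hypothetical names chosen for this illustration (not the HyperSpace library's API); step 4 would hand each resulting hyperspace to its own Bayesian optimizer running in parallel:

```python
from itertools import product

def split_bound(low, high, phi=0.25):
    """Split [low, high] into two nearly equal sub-bounds that overlap
    around the midpoint; phi in [0, 1] controls the overlap width."""
    mid = (low + high) / 2.0
    pad = phi * (high - low) / 2.0
    return [(low, mid + pad), (mid - pad, high)]

def make_hyperspaces(bounds, phi=0.25):
    """All combinations of per-dimension sub-bounds: 2**D hyperspaces
    for D hyperparameter dimensions."""
    halves = [split_bound(lo, hi, phi) for lo, hi in bounds]
    return list(product(*halves))

# Two hyperparameters (D = 2) -> 2**2 = 4 overlapping hyperspaces
bounds = [(0.0, 1.0), (10.0, 100.0)]
spaces = make_hyperspaces(bounds)
print(len(spaces))  # 4
```

The overlap ɸ is what lets neighboring optimizers share information near sub-bound boundaries instead of leaving blind spots at the seams.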
Parallel exploration of large search spaces works better than random- or sequential-search-based optimization
[Figure: comparison of HyperSpace, random search, and SMBO]
https://hpml2018.github.io/HPML2018_7/index.html#/
• Exploiting statistical dependencies across the hyperparameter dimensions leads to a better set of parameters for ML models
Outline: Can AI techniques be leveraged for biological experimental design?
• Building effective (low-dimensional) latent representations of simulation datasets:
  – Using deep learning approaches for molecular dynamics (MD) data
  – Scaling convolutional variational autoencoders for MD
• Predicting where we should go next in MD simulations:
  – Building a recurrent autoencoder to predict future steps
• Preliminary work on a reinforcement learning approach for protein folding/docking
RL-Fold: a naïve design based on native structure
[Figure: RL-Fold workflow — Start: run MD simulations → output trajectories → extract contact matrices → input to CVAE → output latent data → DBSCAN clustering → "Outliers are present?" If yes, save topologies to spawn new simulations and calculate reward; if no, stop and signal the end of the episode. Start and goal states drawn as a coffee pot and a coffee mug. Reward terms: number of native contacts in the i-th sample, RMSD to the native state vs. an RMSD threshold, number of conformers in the cluster of the i-th sample, number of samples, number of clusters.]
• Baseline: work with a manually set threshold for RMSD from the native state
• Deep RL: design a policy network that auto-detects the RMSD threshold
• Hypothesis: RL will allow sampling of "unseen" states
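The "extract contact matrices" step of the loop can be sketched with NumPy. The 8.0 distance cutoff and the toy coordinates below are illustrative assumptions, not values taken from the slides:

```python
import numpy as np

def contact_matrix(coords, cutoff=8.0):
    """Binary contact map: residues i and j are 'in contact' when their
    pairwise distance is below the cutoff (same units as coords)."""
    diff = coords[:, None, :] - coords[None, :, :]   # all pairwise displacements
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # NxN distance matrix
    return (dist < cutoff).astype(np.uint8)

# Toy 4-residue chain spaced 5 units apart along x
coords = np.array([[0.0, 0, 0], [5.0, 0, 0], [10.0, 0, 0], [15.0, 0, 0]])
C = contact_matrix(coords)
print(C.shape)       # (4, 4)
print(int(C.sum()))  # 10: the diagonal plus the 3 adjacent pairs (symmetric)
```

Each MD trajectory frame yields one such matrix; stacks of these matrices are the CVAE's input in the workflow above.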
A pre-trained deep learning model allows RL to explore possible states in protein folding
How does the folding look?
• Within 3-4 iterations, RL reaches near-native-state RMSD
• Further cycles explore misfolded states:
  – These unfold within a few steps of MD simulation
  – Sampling allows exploration of more intermediate states
• Builds on all-atom simulations + RL in a loop
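The near-native-state check relies on RMSD to the native structure. A minimal sketch follows; note it assumes the two conformations share atom ordering and are already superposed (a production implementation would first align them, e.g. with the Kabsch algorithm):

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two Nx3 coordinate sets with the
    same atom ordering. No superposition/alignment is performed here."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

native = np.zeros((5, 3))
sampled = native + 0.5        # every atom displaced by (0.5, 0.5, 0.5)
print(rmsd(sampled, native))  # sqrt(3 * 0.25) = sqrt(0.75) ~= 0.866
```

In the RL loop, this value is compared against the RMSD threshold (manually set in the baseline, auto-detected by the policy network in the deep-RL variant).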
Summary
• Deep learning / AI techniques show promise: learning biophysical characteristics that can be used to guide simulations
• Reinforcement learning: preliminary evidence suggests the approach is feasible for speeding up protein folding simulations!
  – How to integrate with physics-based models?
  – How to build scalable approaches beyond RL?
  – How to integrate with sparse experimental observables?
• Enabling iterative, active, and optimal experimental design
• Extensible library: Molecules, enabling analysis of MD simulations at scale, with Deep(µ)scope supporting AI-driven MD simulations
Some emerging challenges in HPC for multi-scale simulations…
• Design of coupled data-analytic and simulation workflows on OLCF Summit and ALCF A21/Theta
  – In situ analytics approaches are required
  – Streaming applications of ML are different from post-processing of data
• Scaling DL/AI approaches for biomolecular simulations
  – Faster and more efficient training for deep learning / AI approaches
  – Tensor-based approaches to build deep learning algorithms
THANK YOU!
• ORNL LDRD – Exascale computing initiative
• DOE-NCI Joint Design of Advanced Computing Solutions for Cancer (JDACS4C)
• DOE Exascale Computing Project – Cancer Deep Learning Environment (CANDLE)
• OLCF Early Science Access (Summit)
Questions/Comments: ramanathana@ornl.gov