Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Algorithms for In Situ Data Analytics of Next Generation
Molecular Dynamics WorkflowsMichela Taufer
Department of Electrical Engineering and Computer ScienceThe University of Tennessee Knoxville
Acknowledgements
T. Johnston B. Zhang T. Estrada A Liwo
S. Crivelli
M. CuendetE. Deelman
R. da Silva
Sponsors:
H. Weinstein
A. RazaviT. Do
S. Thomas A. PlanteB. Mulligan
Trends in Next-Generation Systems
Source: Lucy Nowell (DOE)
5x
12x
64x
13x
Rising Importance of Ensembles
Source: https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification
Exas
cale
Peta
scal
eTe
rasc
ale
Sierra (2018)
Sequoia (2013)
Purple (2005)
Number of Jobs
Syst
em S
cale
(FLO
PS)
Few Many
Widening I/O Gap
3
Classical Molecular Dynamics Simulations
By Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/
• A MD simulation comprises of hundreds of thousands of MD job
• Each job preforms hundreds of thousands of MD steps
4
Classical Molecular Dynamics Simulations
Forces on single atoms à Acceleration
à Velocity à Position
• A MD step computes forces on single atoms (e.g., bond, angle, dihedrals, nonbond)• Forces are added to
compute acceleration• Acceleration is used to
update velocities • Velocities are used to
update the atom positions• Every n steps, all atom
positions are stored à 3D snapshot or frame
5
Analyzing MD Frames: Present and Future
6
ANALYSIS
Local Storage
MEMORY
CN
MD code
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
Parallel File System
MEMORY
CN
MD code
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
ANALYSIS
Parallel File System
Present Future
In Situ and In Transit Analysis
Core 1
Core 4
Core 7
Core 8
Core 1
Core 1
Core 5
Core 6
Simulation
Nod
e
AnalysisSimulation
Node 1 Node 2 Node 3
Network InterconnectAnalysis
Shar
ed M
emor
y
Example of tools:• DataSpaces (Rutgers U.)• DataStager (GeorgiaTech)
In situ and in transit analysis requires rethinking data algorithms 7
Building a Closed-loop Workflow
Plumed
MD code (e.g., GROMACS)
In-memory Staging AreaDataSpaces
A4MDanalytics
Retriever
Dataflow
IngestorControlflow
Run n-stride simulation steps Collective
variables
Data Generation Data AnalyticsData Storage
Dataflow
Burst Buffer
Parallel File System
(e.g., Lustre)
A4MDanalytics
A4MDanalytics
algorithms
ML-inferred algorithms
Controlflow ControlflowControlflow
Data Feedback
8
Building a Closed-loop Workflow
Plumed
MD code (e.g., GROMACS)
In-memory Staging AreaDataSpaces
In-memory Staging AreaDataSpaces RetrieverRetriever
DataflowDataflow
IngestorIngestorControlflowControlflow
Run n-Stride simulation stepsRun n-Stride simulation steps Collective
variablesCollective variables
Data GenerationData Generation Data AnalyticsData AnalyticsData StorageData Storage
DataflowDataflow
Burst BufferBurst Buffer
Parallel File System
(e.g., Lustre)
Parallel File System
(e.g., Lustre)
ML-inferred algorithmsML-inferred algorithms
ControlflowControlflow ControlflowControlflowControlflowControlflow
Data FeedbackData Feedback
Plumed
MD code (e.g., GROMACS)
In-memory Staging AreaDataSpaces Retriever
Dataflow
IngestorControlflow
Run n-Stride simulation steps Collective
variables
Data Generation Data AnalyticsData Storage
Dataflow
Burst Buffer
Parallel File System
(e.g., Lustre)
ML-inferred algorithms
Controlflow ControlflowControlflow
Data Feedback
A4MDanalytics
A4MDanalytics
A4MDanalytics
algorithms
9
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria and
protein engineering
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria and
protein engineering
A4MD: Protein-Ligand Docking
12
T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)
From 3D Atomic Structures to 3D Points
13
x-y axis
x-z axis
y-z axis
x-y axis
x-z axis
y-z axis
Protein pocket
Docked ligand
Docked ligand
Metadata: each 3D pointrepresents the position of one docked ligand in the protein pocket
Search for Dense Spaces: Octree Clustering
14
Search for Dense Spaces: Octree Clustering
Octree nodes
Metadata: ligand conformations
15
Search for Dense Spaces: Octree Clustering
Near-native ligand structures
(RMSD <= 2A)
Deepest, more dense octant
found by octree clustering
16
Search: Linear in complexity using Mimir- a MapReduce over MPI framework
Case Study: Sampled Conformations - Ligand 1k1l
−1−0.5
00.5
1 −1−0.5
00.5
1
−1
−0.5
0
0.5
1
17
T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)
Case Study: Sampled Conformations - Ligand 1k1l
Near-native ligand structures
(RMSD <= 2A)
Deepest, more dense octant
found by octree clustering
18
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria and
protein engineering
A4MD: Rare Events in MD Simulations
Transformations:
Movements:
20
A4MD: Rare Events in MD Simulations
• We want to capture what is going on in each frame without:• Disrupting the simulation (e.g., stealing CPU and memory on
the node)• Moving all the frames to a central file system and analyzing
them once the simulation is over• Comparing each frame with past frames of the same job• Comparing each frame with frames of other jobs
Frames (or snapshots) of an MD trajectory:
Frame 55 Frame 60 Frame 65 Frame 70 Frame 75 Frame 80
21
From 3D Atomic Structure to a Single Eigenvalue
Drop all but not the backbone atoms (Cα atoms)
CβiCα
j
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
From 3D Atomic Structure to a Single Eigenvalue
Measure the distance between Cα
j and Cβi
Build a bipartite distance matrix by comparing two substructures
i
j
λmaxCompute largest eigenvalue
CβiCα
j d
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
Case Study: Capturing Movement of α-helices
24
Capture movement of structures (α-helices) with respect to each other
1330 1360 1390
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
24
Case Study: Capturing Movement of α-helices
Monitor largest eigenvalue of entire protein
25
Case Study: Capturing Movement of α-helices
Something is changing
Monitor largest eigenvalue of entire protein
26
Case Study: Capturing Movement of α-helices
Individual α-helices (Helix 1, Helix 2, and Helix 3) appear stable
Monitor largest eigenvalue of single α-helices
27
Case Study: Capturing Movement of α-helices
Monitor largest eigenvalue of bipartite distance matrix
First and second α-helices appear stable; third helix moves
28
1330
1360
1390
Case Study: Capturing Movement of α-helices
Analysis: Linear in complexity using local metadata (eigenvalues) with DataSpaces
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria
and protein engineering
A4MD: Proteins with Similar Functions Key principle: proteins with similar sequences have similar functions• Measure millions of protein variants expressed from yeast or bacteria• Structure proteins to produce desired properties (protein engineering)
31
Protein Representations
3D Cartesian representation Surface representationMulti-fold representation
32
From Multi-fold Representation to Image Encoding
33
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
From Multi-fold Representation to Image Encoding
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
34
From Multi-fold Representation to Image Encoding
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
Case Study: High-Throughput Protein Analysis
convolutional neural network
• 62,991 proteins from the Protein Data Bank• Eight biological processes from biological process taxonomy in RCSB-PDB
Proteins as 3D tens
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
36
Google’s Inception-v3, Gem-Net
Challenges and Opportunity
A workflow that integrates both simulations and analytics must have these key properties: • Efficiency: Optimize workflows’ performance and power usage
associated to data movement and analytics• Generality: Build workflows that support different types of
analytics across different MD applications• Non-invasive: Capture data from MD simulations without
rewriting legacy codes or simulation scripts • Portability: Execute combined simulations and analytics across
different platforms and with heterogenous resources • Scalability: (Re)design ML algorithms for knowledge discovery
at scale
37