38
Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of Electrical Engineering and Computer Science The University of Tennessee Knoxville

Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Algorithms for In Situ Data Analytics of Next Generation

Molecular Dynamics WorkflowsMichela Taufer

Department of Electrical Engineering and Computer ScienceThe University of Tennessee Knoxville

Page 2: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Acknowledgements

T. Johnston B. Zhang T. Estrada A Liwo

S. Crivelli

M. CuendetE. Deelman

R. da Silva

Sponsors:

H. Weinstein

A. RazaviT. Do

S. Thomas A. PlanteB. Mulligan

Page 3: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Trends in Next-Generation Systems

Source: Lucy Nowell (DOE)

5x

12x

64x

13x

Rising Importance of Ensembles

Source: https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification

Exas

cale

Peta

scal

eTe

rasc

ale

Sierra (2018)

Sequoia (2013)

Purple (2005)

Number of Jobs

Syst

em S

cale

(FLO

PS)

Few Many

Widening I/O Gap

3

Page 4: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Classical Molecular Dynamics Simulations

By Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/

• A MD simulation comprises of hundreds of thousands of MD job

• Each job preforms hundreds of thousands of MD steps

4

Page 5: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Classical Molecular Dynamics Simulations

Forces on single atoms à Acceleration

à Velocity à Position

• A MD step computes forces on single atoms (e.g., bond, angle, dihedrals, nonbond)• Forces are added to

compute acceleration• Acceleration is used to

update velocities • Velocities are used to

update the atom positions• Every n steps, all atom

positions are stored à 3D snapshot or frame

5

Page 6: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Analyzing MD Frames: Present and Future

6

ANALYSIS

Local Storage

MEMORY

CN

MD code

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

Parallel File System

MEMORY

CN

MD code

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

CN

ANALYSIS

Parallel File System

Present Future

Page 7: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

In Situ and In Transit Analysis

Core 1

Core 4

Core 7

Core 8

Core 1

Core 1

Core 5

Core 6

Simulation

Nod

e

AnalysisSimulation

Node 1 Node 2 Node 3

Network InterconnectAnalysis

Shar

ed M

emor

y

Example of tools:• DataSpaces (Rutgers U.)• DataStager (GeorgiaTech)

In situ and in transit analysis requires rethinking data algorithms 7

Page 8: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Building a Closed-loop Workflow

Plumed

MD code (e.g., GROMACS)

In-memory Staging AreaDataSpaces

A4MDanalytics

Retriever

Dataflow

IngestorControlflow

Run n-stride simulation steps Collective

variables

Data Generation Data AnalyticsData Storage

Dataflow

Burst Buffer

Parallel File System

(e.g., Lustre)

A4MDanalytics

A4MDanalytics

algorithms

ML-inferred algorithms

Controlflow ControlflowControlflow

Data Feedback

8

Page 9: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Building a Closed-loop Workflow

Plumed

MD code (e.g., GROMACS)

In-memory Staging AreaDataSpaces

In-memory Staging AreaDataSpaces RetrieverRetriever

DataflowDataflow

IngestorIngestorControlflowControlflow

Run n-Stride simulation stepsRun n-Stride simulation steps Collective

variablesCollective variables

Data GenerationData Generation Data AnalyticsData AnalyticsData StorageData Storage

DataflowDataflow

Burst BufferBurst Buffer

Parallel File System

(e.g., Lustre)

Parallel File System

(e.g., Lustre)

ML-inferred algorithmsML-inferred algorithms

ControlflowControlflow ControlflowControlflowControlflowControlflow

Data FeedbackData Feedback

Plumed

MD code (e.g., GROMACS)

In-memory Staging AreaDataSpaces Retriever

Dataflow

IngestorControlflow

Run n-Stride simulation steps Collective

variables

Data Generation Data AnalyticsData Storage

Dataflow

Burst Buffer

Parallel File System

(e.g., Lustre)

ML-inferred algorithms

Controlflow ControlflowControlflow

Data Feedback

A4MDanalytics

A4MDanalytics

A4MDanalytics

algorithms

9

Page 10: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Analytics for Molecular Dynamics

• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria and

protein engineering

Page 11: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Analytics for Molecular Dynamics

• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria and

protein engineering

Page 12: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

A4MD: Protein-Ligand Docking

12

T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)

Page 13: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

From 3D Atomic Structures to 3D Points

13

x-y axis

x-z axis

y-z axis

x-y axis

x-z axis

y-z axis

Protein pocket

Docked ligand

Docked ligand

Metadata: each 3D pointrepresents the position of one docked ligand in the protein pocket

Page 14: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Search for Dense Spaces: Octree Clustering

14

Page 15: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Search for Dense Spaces: Octree Clustering

Octree nodes

Metadata: ligand conformations

15

Page 16: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Search for Dense Spaces: Octree Clustering

Near-native ligand structures

(RMSD <= 2A)

Deepest, more dense octant

found by octree clustering

16

Search: Linear in complexity using Mimir- a MapReduce over MPI framework

Page 17: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Sampled Conformations - Ligand 1k1l

−1−0.5

00.5

1 −1−0.5

00.5

1

−1

−0.5

0

0.5

1

17

T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)

Page 18: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Sampled Conformations - Ligand 1k1l

Near-native ligand structures

(RMSD <= 2A)

Deepest, more dense octant

found by octree clustering

18

Page 19: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Analytics for Molecular Dynamics

• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria and

protein engineering

Page 20: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

A4MD: Rare Events in MD Simulations

Transformations:

Movements:

20

Page 21: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

A4MD: Rare Events in MD Simulations

• We want to capture what is going on in each frame without:• Disrupting the simulation (e.g., stealing CPU and memory on

the node)• Moving all the frames to a central file system and analyzing

them once the simulation is over• Comparing each frame with past frames of the same job• Comparing each frame with frames of other jobs

Frames (or snapshots) of an MD trajectory:

Frame 55 Frame 60 Frame 65 Frame 70 Frame 75 Frame 80

21

Page 22: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

From 3D Atomic Structure to a Single Eigenvalue

Drop all but not the backbone atoms (Cα atoms)

CβiCα

j

T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.

Page 23: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

From 3D Atomic Structure to a Single Eigenvalue

Measure the distance between Cα

j and Cβi

Build a bipartite distance matrix by comparing two substructures

i

j

λmaxCompute largest eigenvalue

CβiCα

j d

T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.

Page 24: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Capturing Movement of α-helices

24

Capture movement of structures (α-helices) with respect to each other

1330 1360 1390

T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.

24

Page 25: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Capturing Movement of α-helices

Monitor largest eigenvalue of entire protein

25

Page 26: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Capturing Movement of α-helices

Something is changing

Monitor largest eigenvalue of entire protein

26

Page 27: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Capturing Movement of α-helices

Individual α-helices (Helix 1, Helix 2, and Helix 3) appear stable

Monitor largest eigenvalue of single α-helices

27

Page 28: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: Capturing Movement of α-helices

Monitor largest eigenvalue of bipartite distance matrix

First and second α-helices appear stable; third helix moves

28

Page 29: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

1330

1360

1390

Case Study: Capturing Movement of α-helices

Analysis: Linear in complexity using local metadata (eigenvalues) with DataSpaces

Page 30: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Analytics for Molecular Dynamics

• Drug design and protein-ligand docking• Protein folding and rare events • Protein variants expressed from yeast or bacteria

and protein engineering

Page 31: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

A4MD: Proteins with Similar Functions Key principle: proteins with similar sequences have similar functions• Measure millions of protein variants expressed from yeast or bacteria• Structure proteins to produce desired properties (protein engineering)

31

Page 32: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Protein Representations

3D Cartesian representation Surface representationMulti-fold representation

32

Page 33: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

From Multi-fold Representation to Image Encoding

33

T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.

Page 34: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

From Multi-fold Representation to Image Encoding

T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.

34

Page 35: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

From Multi-fold Representation to Image Encoding

T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.

Page 36: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Case Study: High-Throughput Protein Analysis

convolutional neural network

• 62,991 proteins from the Protein Data Bank• Eight biological processes from biological process taxonomy in RCSB-PDB

Proteins as 3D tens

T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. GraphicEncoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.

36

Google’s Inception-v3, Gem-Net

Page 37: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of

Challenges and Opportunity

A workflow that integrates both simulations and analytics must have these key properties: • Efficiency: Optimize workflows’ performance and power usage

associated to data movement and analytics• Generality: Build workflows that support different types of

analytics across different MD applications• Non-invasive: Capture data from MD simulations without

rewriting legacy codes or simulation scripts • Portability: Execute combined simulations and analytics across

different platforms and with heterogenous resources • Scalability: (Re)design ML algorithms for knowledge discovery

at scale

37

Page 38: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of