
Biospark Getting Started Guide

April 21, 2016

Authors: Elijah Roberts

Roberts Group
Johns Hopkins University

http://biophysics.jhu.edu/roberts/
https://www.assembla.com/spaces/roberts-lab-public/wiki/Tutorials
https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark

Description

The “Biospark Getting Started Guide” describes how to download and execute the Biospark virtual machine image and how to get started running analysis scripts using the Biospark framework.

System Requirements

This document assumes you will be using the latest Biospark virtual machine image available from the Biospark website. If you are using a different configuration, you may need to adjust the instructions in some places.

Page 2: Biospark Getting Started Guide - robertslabjhu.infoThe “Biospark Getting Started Guide” describes how to download and execute the Biospark virtual machine image and how to get

Table of Contents

Chapter 1 Downloading and Running the Biospark Virtual Machine
    1.1 Downloading Oracle VirtualBox and the Biospark virtual machine
    1.2 Installing Oracle VirtualBox and Biospark on Windows
    1.3 Installing Oracle VirtualBox and Biospark on Mac OS X
    1.4 Running the Biospark VM

Chapter 2 Tutorial Introduction
    2.1 Overview of the Biospark software
    2.2 Prerequisites for running the tutorial
    2.3 Understanding the tutorial syntax
    2.4 Tutorial files
    2.5 Using Biospark
        2.5.1 Starting and stopping the Biospark servers
        2.5.2 Accessing the HDFS file system
        2.5.3 Submitting Biospark jobs
        2.5.4 Introduction to SFiles
    2.6 Executing a Jupyter/IPython notebook
    2.7 Changes if not using the Biospark virtual machine

Chapter 3 Analyzing Time-Lapse Microscopy Images of Yeast Cell Growth
    3.1 Overview
    3.2 Uploading the microscopy images into HDFS
    3.3 Aligning and normalizing the microscopy images
        3.3.1 Running the analysis again
    3.4 Segmenting cells in the images
    3.5 Analyzing the growth data

Chapter 4 Calculating Probability Distributions from Monte Carlo Simulations
    4.1 Overview
    4.2 Uploading the simulation data into HDFS
    4.3 Extracting the time series of a specific trajectory
    4.4 Calculating the stationary probability distribution
    4.5 Calculating the time-dependent probability distribution
    4.6 Viewing an RDME trajectory
    4.7 Calculating the spatially resolved, time-dependent probability distribution

Chapter 5 Extracting Structural Dynamics from Molecular Dynamics Simulations
    5.1 Overview
    5.2 Uploading the simulation data into HDFS
    5.3 Calculating the root-mean-square deviation
    5.4 Clustering by pairwise RMSD

Chapter 6 Getting Started with Biospark Scripting
    6.1 Overview
    6.2 Launching a PySpark session in Jupyter Notebook
    6.3 Listing an SFile using PySpark scripting commands

List of Abbreviations

Bibliography


Chapter 1

Downloading and Running the Biospark Virtual Machine

The easiest way to get started using Biospark is to use a downloadable Biospark virtual machine (VM) image. Each image contains a full installation of Biospark and all of the dependencies needed to run it, installed in an Ubuntu Linux operating system. Additionally, the VM contains all of the tools and utilities used in each of the Biospark Tutorials and Protocols. The VM image must be executed in a hypervisor, software that allows your computer to run additional guest instances of operating systems (known as virtual machines) alongside your host operating system. In this tutorial, we describe using Oracle VirtualBox, which is a free and widely used hypervisor.

Keep in mind that Biospark is optimized for data-intensive processing on large clusters. Therefore, you shouldn’t expect to see cluster-level performance on a single machine running Biospark in a VM. Taking advantage of the framework to efficiently process large data sets requires access to a large cluster with Biospark installed. Other Tutorials and Protocols describe how to configure and use Biospark in such an environment. However, the Biospark VM does contain a complete installation, and using it provides the same experience (albeit slower) as using Biospark on a large cluster.

1.1 Downloading Oracle VirtualBox and the Biospark virtual machine

1. Go to the Oracle VirtualBox download page: https://www.virtualbox.org/wiki/Downloads.

2. Download the VirtualBox installer for your platform.

3. Download the VirtualBox Extension Pack; the same file is used for all platforms.

4. Go to the Biospark download page: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark.


5. Download an Ubuntu Biospark virtual machine image.

1.2 Installing Oracle VirtualBox and Biospark on Windows

1. Run the Oracle VirtualBox installer following all prompts.

2. Once the installer has completed, launch VirtualBox with Administrator privileges by right-clicking on the icon on the desktop and then choosing Run as administrator.

3. After VirtualBox starts choose File→Preferences and then choose the Extensions panel.


4. Click on the Add Package button.

5. Select the extension package you downloaded earlier and press the OK button. The extension will load.

6. Press OK to save the preferences and then click the Close button to close the administrative session of VirtualBox.


7. Start VirtualBox again with normal privileges by double clicking on the desktop icon. Once the program has started, choose File→Import Appliance....

8. Select the Biospark VM that you downloaded earlier and press OK. The Biospark VM will appear in the list.

9. Press the Settings button to open the settings for the Biospark VM.


10. Select the System panel and the Motherboard tab. Set the Base Memory option to as large a value as your system can support.

11. Change to the Processor tab and set the Processor(s) option to as large a value as your system can support.


12. Press OK to save your settings and return to the VM list. Click on the Biospark VM and press the Start button to begin your Biospark session.

1.3 Installing Oracle VirtualBox and Biospark on Mac OS X

1. Run the Oracle VirtualBox installer following all prompts.

2. Once the installer has completed, launch the VirtualBox application.

3. After VirtualBox starts choose VirtualBox→Preferences... from the menu.


4. Select the Extensions panel and then click on the Add Package button.

5. Select the extension package you downloaded earlier and press the OK button. You may need to enter an administrative password. The extension will then load.

6. Choose File→Import Appliance... from the menu.


7. Select the Biospark VM that you downloaded earlier and press OK. The Biospark VM will appear in the list.

8. Press the Settings button to open the settings for the Biospark VM.

9. Select the System panel and the Motherboard tab. Set the Base Memory option to as large a value as your system can support.


10. Change to the Processor tab and set the Processor(s) option to as large a value as your system can support.

11. Press OK to save your settings and return to the VM list. Click on the Biospark VM and press the Start button to begin your Biospark session.

1.4 Running the Biospark VM

Once the Biospark VM launches, you should see an Ubuntu desktop. If you need to log in, the username is biospark and the password is biospark. You are now ready to begin the tutorial.


Chapter 2

Tutorial Introduction

2.1 Overview of the Biospark software

Data-intensive statistical analysis, colloquially known as "Big Data," has seen increasing use in bioinformatics in recent years. Other areas of biology generating large numerical datasets from simulations or high-throughput experiments, however, have not yet adopted these new technologies. We have developed a new framework, called Biospark, for storing and analyzing binary numerical data using an open source Hadoop distributed file system and the Spark engine for large-scale parallel data processing using Python [1].

In this tutorial you will only be executing scripts that are included with the Biospark framework, but the real advantage of Biospark comes from writing your own scripts. Other Tutorials and Protocols are designed to assist in the process of writing your own scripts.

2.2 Prerequisites for running the tutorial

This tutorial makes extensive use of the Unix command line for program execution. Some basic knowledge of using a Unix-based OS is required to work through this tutorial. This tutorial also uses Python in places for performing numerical analyses, but all of the necessary Python code is given and knowledge of Python is not required to work through the tutorial.

2.3 Understanding the tutorial syntax

Throughout this tutorial, commands that you need to execute on your Biospark instance are given using the following syntax:

user@host:directory$ command arg1 arg2 argN

Here, directory is the working directory your shell should be in when running the command, command is the command to execute, and arg1, arg2, argN are the command line arguments to use. For example, the line:

user@host:microscopy$ hdfs dfs -ls /user/biospark/bsgs

means you should be in a directory called microscopy and you should execute the command hdfs with the arguments dfs -ls /user/biospark/bsgs. Usually you can copy the command from this tutorial and then paste it into the terminal without problems. Note: The paste command for an Ubuntu terminal is CTRL-SHIFT-V. If you are reading the tutorial on your host computer and executing the commands on a VM in VirtualBox, you can enable copy-and-paste support between the host and guest VMs under Settings→General→Advanced, if it is not already enabled.

Some commands are too long to fit on a single line and are shown split across multiple lines:

user@host:microscopy$ biospark-submit sp_align_frames.py \
    /user/biospark/bsgs/frames.sfile index.txt reference.png \
    /user/biospark/bsgs/frames-aligned.sfile /user/biospark/bsgs/alignment.txt

All parts of a split command should still be executed as a single command in the Unix shell. The backslash character (\) is a line continuation character in Bash, so the multi-line commands can also be copied and pasted directly into the terminal.

2.4 Tutorial files

In order to work through this tutorial you need to download the accompanying package of files for use during the tutorial, named bsgs_tut_files.tgz, from the Tutorials webpage. Once you have downloaded the package you must extract it. Throughout this tutorial, the directory into which you have extracted the tutorial files package will be referred to as $FILES_ROOT. If you set an environment variable with this name pointing to the directory in each new terminal session you launch, copy and paste of commands using the reference will work correctly.

For example, the following commands can be executed after you download the tutorial files package to accomplish the above steps.

1. Start a new terminal by clicking on the Terminal application.

2. Change to the Downloads directory.

user@host:∼$ cd Downloads

3. Extract the tutorial files package.

user@host:Downloads$ tar zxvf bsgs_tut_files.tgz


4. Make a Tutorials directory underneath the biospark user account’s home directory.

user@host:Downloads$ mkdir -p $HOME/Tutorials

5. Move the tutorial files into the Tutorials directory.

user@host:Downloads$ mv bsgs_tut_files $HOME/Tutorials

6. Export an environment variable $FILES_ROOT pointing to the tutorial files.

user@host:Downloads$ export FILES_ROOT=$HOME/Tutorials/bsgs_tut_files
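Note: the exported variable only lasts for the current terminal session. If you would rather set it once and have it apply to every new session, you can append the export line to your ~/.bashrc file (a standard bash convention; this step is optional and not part of the original setup):

user@host:Downloads$ echo 'export FILES_ROOT=$HOME/Tutorials/bsgs_tut_files' >> ~/.bashrc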

2.5 Using Biospark

2.5.1 Starting and stopping the Biospark servers

Biospark uses Apache Hadoop as its underlying platform for distributed file storage and resource management. The Hadoop framework requires several servers to be running. The Biospark VM provides scripts to start and stop these servers, which run under the hadoop user account. Execute the following command to start the Hadoop servers. The biospark account has sudo access; remember that the password for the account is biospark.

user@host:∼$ sudo biospark-start

You will need to run this command whenever the Biospark VM has been restarted. If you need to stop the Hadoop servers, execute the following command.

user@host:∼$ sudo biospark-stop

To see if the servers are running execute the following command:

user@host:∼$ sudo -u hadoop jps

You should see the DataNode and NameNode processes listed; if not, run biospark-stop and then biospark-start as before.
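For reference, jps prints one line per running Java process, for example (the process IDs will differ, and the exact set of services shown depends on the VM's Hadoop configuration):

2345 NameNode
2467 DataNode
2689 SecondaryNameNode
2893 Jps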

2.5.2 Accessing the HDFS file system

Biospark uses Apache Hadoop for distributed file storage. The file system is called the Hadoop distributed file system (HDFS). You can interact with HDFS using the hdfs dfs command, followed by one of the supported file subcommands. For example, to list a directory, execute the -ls subcommand.

user@host:∼$ hdfs dfs -ls /

You can execute hdfs dfs by itself to get a list of valid subcommands.


user@host:∼$ hdfs dfs

HDFS is organized in a similar manner to a traditional Unix-based file system. The root is /, user directories are typically located at /user/username, and a temporary directory is at /tmp. Some utilities that use both local and HDFS files differentiate HDFS locations by using the hdfs://hostname syntax; for example, hdfs://hostname/user/biospark would refer to the biospark user’s directory on the hostname server running HDFS. For the Biospark VM, hostname should always be localhost.

The first time you run this tutorial on a Biospark VM, you should execute the following command to make a bsgs directory under the biospark user directory in HDFS. This directory is used later in the tutorial and assumed to exist.

user@host:∼$ hdfs dfs -mkdir /user/biospark/bsgs
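A few other standard hdfs dfs subcommands are used throughout this tutorial (these are stock Hadoop commands, not Biospark-specific): -put copies a local file into HDFS, -cat prints an HDFS file to standard output, -getmerge concatenates the parts of a multi-part output into a single local file, and -rm -r -f recursively deletes a path. For example (the paths here are illustrative):

user@host:∼$ hdfs dfs -put somefile.txt /user/biospark/bsgs
user@host:∼$ hdfs dfs -cat /user/biospark/bsgs/somefile.txt
user@host:∼$ hdfs dfs -getmerge /user/biospark/bsgs/someoutput merged.txt
user@host:∼$ hdfs dfs -rm -r -f /user/biospark/bsgs/someoutput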

2.5.3 Submitting Biospark jobs

Biospark uses the Spark engine for performing parallel data processing. Biospark jobs are submitted using the biospark-submit command, which wraps Spark’s underlying spark-submit command, automatically adding the options necessary for Spark to find the files needed to use the Biospark framework. The basic syntax for executing the biospark-submit command is as follows.

user@host:∼$ biospark-submit script_name script_arguments+

For example, the following command would submit a job to execute the script sp_sfile_list.py, which lists the contents of the given SFile located in HDFS.

user@host:∼$ biospark-submit sp_sfile_list.py /user/biospark/test.sfile

The default behavior is to submit the job for execution on a single CPU core and executor process. You can use the -c option to increase the number of cores Biospark uses on each executor and the -n option to increase the number of executors. For example, the following command would use four cores on each of six executors, for a total of 24 simultaneous processing tasks.

user@host:∼$ biospark-submit -c 4 -n 6 sp_sfile_list.py /user/biospark/test.sfile

In general, cores must reside on the same Hadoop node while executors are distributed across the cluster. If you are working through this tutorial using the Biospark VM and have made three or more cores available to the VM, you can add a -c option to every biospark-submit command specifying the value as the number of cores you have made available minus one. It is best to reserve at least one core for the Spark and Hadoop processes.

Additionally, you can specify how much memory each executor should be allocated using the -e option. The default is 1g, but you may want to increase this if you have sufficient memory available.

user@host:∼$ biospark-submit -e 2g sp_sfile_list.py /user/biospark/test.sfile
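The options can also be combined. For example, on a VM with four cores available you might reserve one core for the Hadoop and Spark processes and run something like the following (the values shown are hypothetical and should be adjusted to your machine):

user@host:∼$ biospark-submit -c 3 -n 1 -e 2g sp_sfile_list.py /user/biospark/test.sfile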


2.5.4 Introduction to SFiles

An SFile is a container file used by Biospark for storing binary numerical data in HDFS. An SFile is composed of a series of records, each with a name, a data type, and a binary data object. For more details on the SFile file format, see [1]. Storing data in SFiles is the key to efficient processing using Biospark. Throughout this tutorial you will learn how to work with SFiles using the utilities and scripts in the Biospark framework.
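Conceptually, each record can be pictured as a (name, type, data) triple. The following Python sketch is purely illustrative of that layout; it is not the actual Biospark API or the on-disk encoding:

from dataclasses import dataclass

@dataclass
class SFileRecord:
    # Conceptual sketch only; not the real Biospark classes.
    name: str        # record name, e.g. "/frames/frame-000000.png"
    data_type: str   # type string, e.g. "mime:image/png"
    data: bytes      # the opaque binary data object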

2.6 Executing a Jupyter/IPython notebook

This tutorial makes use of Jupyter (formerly called IPython) notebooks for data analysis. In order to execute the notebooks, you must first start a Jupyter server in the root directory of the tutorial.

1. Open a new terminal session.

2. Export an environment variable $FILES_ROOT pointing to the tutorial files.

user@host:∼$ export FILES_ROOT=$HOME/Tutorials/bsgs_tut_files

3. Change into the tutorial root directory in the tutorial package.

user@host:∼$ cd $FILES_ROOT

4. Start the notebook server.

user@host:$FILES_ROOT$ jupyter notebook

The Jupyter server opens a web browser to the address http://localhost:8888/tree for you to interact with the tutorial notebooks. The Jupyter server will continue to run in the terminal window. Minimize the terminal window and start a new terminal instance, or open another terminal tab in the existing window, to continue working on the tutorial.

Some familiarity with using Jupyter notebooks would be helpful in going through this tutorial, but is not required. The primary feature to know is that a notebook is composed of a series of cells. To run the calculation in a cell, click on it and then press SHIFT-Enter on the keyboard.

2.7 Changes if not using the Biospark virtual machine

If you are not using the Biospark VM to work through this tutorial, but rather a custom installation of Biospark, the following changes will need to be made.

• The user name in this tutorial is assumed to be biospark. Substitute your user name in all paths.

• The HDFS server is assumed to be localhost. Replace localhost in all hdfs://localhost URLs with the name of your Hadoop server.

• Make sure to increase the number of cores, the number of executors, and the executor memory size appropriately for your environment.


Chapter 3

Analyzing Time-Lapse Microscopy Images of Yeast Cell Growth

3.1 Overview

3.2 Uploading the microscopy images into HDFS

The first step in analyzing the images is to upload them into HDFS as an SFile. To do so, you use the sfile utility.

1. Change into the microscopy directory in the tutorial package.

user@host:∼$ cd $FILES_ROOT/microscopy

2. The microscopy directory contains a directory named frames that contains 101 frames from a time-lapse microscopy series of Saccharomyces cerevisiae cells growing. Upload these files into an SFile in HDFS named /user/biospark/bsgs/frames.sfile using the following command.

user@host:microscopy$ sfile -cp -t mime:image/png frames/frame-*.png \
    hdfs://localhost/user/biospark/bsgs/frames.sfile

The -cp option specifies that we want to copy a series of source files into individual records in a target SFile. The -t mime:image/png option specifies that the type of the records in the SFile should be mime:image/png.

3. Verify that the SFile was created.

user@host:microscopy$ hdfs dfs -ls /user/biospark/bsgs

4. Check the contents of the SFile using the sfile utility.

user@host:microscopy$ sfile -ls hdfs://localhost/user/biospark/bsgs/frames.sfile

You should see each frame listed as an individual record in the SFile.

/frames/frame-000000.png 1561446 mime:image/png
/frames/frame-000010.png 1561797 mime:image/png
/frames/frame-000020.png 1570617 mime:image/png
/frames/frame-000030.png 1575350 mime:image/png
/frames/frame-000040.png 1576537 mime:image/png


5. You can see further usage information about the sfile utility by executing the following command.

user@host:microscopy$ sfile --help

3.3 Aligning and normalizing the microscopy images

After the images have been uploaded into HDFS, the next step is to align the images to a reference image. The images also need to be brightness normalized to correct for intensity variations during the time-series, which could otherwise negatively impact segmentation accuracy.

1. Copy the first frame to use as the reference image.

user@host:microscopy$ cp frames/frame-000000.png reference.png

2. Then execute the Biospark script sp_align_frames.py that performs the alignment and normalization.

user@host:microscopy$ biospark-submit sp_align_frames.py \
    /user/biospark/bsgs/frames.sfile index.txt reference.png \
    /user/biospark/bsgs/frames-aligned.sfile /user/biospark/bsgs/alignment.txt

The arguments are i) the input SFile in HDFS, ii) the local index file, iii) the local reference image, iv) the name of the output SFile to be created in HDFS containing the aligned and normalized images, and v) the name of the text file to be created in HDFS containing the alignment and normalization parameters for all of the images.

On a single machine, this process will take a few minutes. If it were running on a cluster, all images would be processed in parallel and the time would scale down (nearly) linearly with the number of nodes in the cluster.

3. When the job finishes, you should see a message like “Processed 97 images, saved images into hdfs at: /user/biospark/bsgs/frames-aligned.sfile” in the output. Four of the frames were fluorescence images and were skipped. Check that the output was created.

user@host:microscopy$ hdfs dfs -ls /user/biospark/bsgs

You should see some output like:

Found 3 items
drwxr-xr-x - biospark 0 /user/biospark/bsgs/alignment.txt
drwxr-xr-x - biospark 0 /user/biospark/bsgs/frames-aligned.sfile
-rw-r--r-- 3 biospark 158124724 /user/biospark/bsgs/frames.sfile

4. Now we would like to download the alignment.txt output for analysis. Important: Hadoop outputs are directories, not individual files. The output directory contains the individual parts from the tasks that were working on the job. To see this, list the contents of one of the output directories.

user@host:microscopy$ hdfs dfs -ls /user/biospark/bsgs/alignment.txt


You should see some output like:

Found 3 items
-rw-r--r-- 1 biospark 0 /user/biospark/bsgs/alignment.txt/_SUCCESS
-rw-r--r-- 1 biospark 2799 /user/biospark/bsgs/alignment.txt/part-00000
-rw-r--r-- 1 biospark 477 /user/biospark/bsgs/alignment.txt/part-00001

5. Since the output is not a single file, use the hdfs dfs -cat command to download the files one at a time and redirect the output to a local alignment.txt file.

user@host:microscopy$ hdfs dfs -cat /user/biospark/bsgs/alignment.txt/* > alignment.txt

6. Open the Jupyter notebook browser tab. Click on the Jupyter logo in the upper left corner to go back to the root of the tutorial. Click on the microscopy directory and then click on the analysis.ipynb notebook. The notebook will open in a new browser tab. Execute the top cell by clicking on it and then pressing SHIFT-Enter on the keyboard. Execute the first two cells in the Aligning and normalizing the microscopy images section in the same manner to plot the X offset for each frame. You should see something like the plot shown below.

7. Execute the next cell to show the Y offset.


8. Execute the next two cells to show the image intensity mean and standard deviation that were used to normalize the intensity of each frame.

9. The frames-aligned.sfile directory contains the aligned and normalized images in a series of SFiles. List the contents of the output directory.

user@host:microscopy$ hdfs dfs -ls /user/biospark/bsgs/frames-aligned.sfile

You should see some output like:

Found 3 items
-rw-r--r-- 1 0 /user/biospark/bsgs/frames-aligned.sfile/_SUCCESS
-rw-r--r-- 1 148160163 /user/biospark/bsgs/frames-aligned.sfile/part-r-00000.sfile
-rw-r--r-- 1 24810731 /user/biospark/bsgs/frames-aligned.sfile/part-r-00001.sfile

10. To view the contents of each SFile, we could list the records of each part individually using the sfile utility. But, since Hadoop can automatically deal with directories with files in multiple parts, we can instead submit a Biospark job to list the contents of the entire directory at once in parallel.

user@host:microscopy$ biospark-submit sp_sfile_list.py \
    /user/biospark/bsgs/frames-aligned.sfile

You should see some output like:


frame-aligned-0 protobuf:robertslab.pbuf.NDArray 1796685
frame-aligned-10 protobuf:robertslab.pbuf.NDArray 1796972
frame-aligned-20 protobuf:robertslab.pbuf.NDArray 1807557
frame-aligned-40 protobuf:robertslab.pbuf.NDArray 1813138

Each line shows the name of a record, its type, and the size of its data object. You can see that the aligned and normalized images have been stored as serialized NDArray objects. These objects represent multidimensional arrays similar to the NumPy ndarray class, to which they can be efficiently deserialized.

11. In the next section, we will see how to use the aligned and normalized images.

3.3.1 Running the analysis again

If you want to re-run the analysis for any reason, you must first delete the outputs of the previous job, since by default Hadoop will not overwrite an existing output file. Reminder: Hadoop outputs are directories. You can delete the outputs of this section from HDFS using the following commands.

user@host:microscopy$ hdfs dfs -rm -r -f /user/biospark/bsgs/alignment.txt
user@host:microscopy$ hdfs dfs -rm -r -f /user/biospark/bsgs/frames-aligned.sfile


3.4 Segmenting cells in the images

After the images have been aligned and normalized we can perform the segmentation.

1. First you must define the region of the image to search for cells. Since the cells in the upper right corner of the image grow off the boundary (see below), we want to exclude them from the segmentation. Also, we want to exclude any cells that might grow into the frame during the time course, so we choose a region around the two central cell clusters. The white spot toward the bottom of the frame is a polystyrene bead used for autofocus and would be automatically excluded from segmentation. The image below shows a rectangle starting at x=600, y=200 and ending at x=1250, y=1150, which would be a reasonable choice for an image mask in this case.

2. Define which frames you wish to segment. Depending on the speed of your machine, it may take 30–60 seconds per frame. For purposes of the tutorial, we will segment the first ten frames, which corresponds to a frame range of 0–100, since the tutorial dataset only includes every tenth frame.

3. Execute the Biospark script sp_yeast_segment.py to perform the segmentation.

user@host:microscopy$ biospark-submit sp_yeast_segment.py \
    /user/biospark/bsgs/frames-aligned.sfile index.txt reference.png \
    /user/biospark/bsgs/cells.sfile --image-mask=600,200,1250,1150 \
    --frame-range=0-100 --contour-points=20 --tolerance=1e-1


The arguments are i) the aligned SFile in HDFS, ii) the local index file, iii) the local reference image, iv) the name of the output SFile to be created in HDFS containing the segmentation data, v) the region of the image to use for segmentation, vi) the range of frames to process, vii) the number of points around the circumference of the cell to use in the contour, and viii) the tolerance for the numerical minimizer. Here, the number of contour points is set to 20 and the tolerance to 1e-1 for speed purposes. Settings of 50 and 1e-2, respectively, give better segmentation for real data, but take longer to calculate.

Once again, on a single machine, this analysis will take a few minutes. If it were performed on a cluster, segmentation of individual cells would be done in parallel, resulting in a substantial performance increase.

4. When the job finishes, execute a sp_sfile_list.py job to list the contents of the output file.

user@host:microscopy$ biospark-submit sp_sfile_list.py /user/biospark/bsgs/cells.sfile

You should see some output like:

/0/62e39d076cb047eba25978f34fb49600 protobuf:...pbuf.imaging.yeast.CellContour 414
/0/390acea974014a798757742087dde845 protobuf:...pbuf.imaging.yeast.CellContour 414
/0/7fc006ef42ae4a16825e45fd3db785b7 protobuf:...pbuf.imaging.yeast.CellContour 414
/10/62e39d076cb047eba25978f34fb49600 protobuf:...pbuf.imaging.yeast.CellContour 414

The first part of the record name indicates the frame number and the second part indicates the cell’s globally unique identifier. The cell contour data are stored in a serialized CellContour object.

5. To view the segmentation results, you can execute a job to generate a series of images showing the cell contours drawn on the aligned and normalized images. First, however, we must make a directory for the frames to be stored.

user@host:microscopy$ mkdir cell_frames
user@host:microscopy$ biospark-submit sp_yeast_draw_frames.py \
    /user/biospark/bsgs/frames-aligned.sfile index.txt \
    /user/biospark/bsgs/cells.sfile cell_frames --prefix=frame --scale=0.5

The arguments are i) the aligned SFile in HDFS, ii) the local index file, iii) the SFile in HDFS containing the contours, iv) the name of the directory to store the images, v) a prefix to use for each image filename, and vi) a scaling factor for the images, where 0.5 means scale the image down by one-half.

Once the job finishes, the directory should contain one image for each frame you segmented, and the cell contours will be drawn in red with the first three characters of the cell’s id written to the lower right of the cell.


6. If you have the avconv program installed (it is installed by default on the Biospark VM) you can easily make an MPEG-4 movie from the frames. First, the frames must be renumbered so that they are sequential. The following two commands create a new directory for the movie images and then make soft links to the original images, numbered correctly for avconv.

user@host:microscopy$ mkdir -p movie_images
user@host:microscopy$ image-renumber cell_frames frame movie_images

7. Then run avconv to make a movie from the frames.

user@host:microscopy$ avconv -r 1 -i movie_images/%06d.png -r 1 -b 4000k cells.mp4

The two -r options set the input and output frame rates to 1 frame per second, respectively. They are set low here, since there are so few frames; for a large data set they should be increased. The -b option sets the video bitrate to 4 Mb/s. This value can be adjusted to trade quality for file size. If your system has ffmpeg rather than avconv, see the note after this list.

8. Click on the File Browser and then find the movie file and double click on it to play it.
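Note: if your system provides ffmpeg rather than avconv (the two projects share most of their command line options), the equivalent command would be the following, where -b:v is ffmpeg's spelling of the video bitrate option:

user@host:microscopy$ ffmpeg -r 1 -i movie_images/%06d.png -r 1 -b:v 4000k cells.mp4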


3.5 Analyzing the growth data

In this section you will analyze the growth rate of the cells from the segmentation data. Once a parallel Biospark calculation has completed, the SFile outputs can be downloaded and processed locally in a serial manner to facilitate analysis.

1. Download the individual parts of the output. As with the text files you downloaded earlier, SFiles can be freely concatenated together. Use the hdfs dfs -cat command to download the files.

user@host:microscopy$ hdfs dfs -cat /user/biospark/bsgs/cells.sfile/* > cells.sfile

2. Examine the contents of the local file.

user@host:microscopy$ sfile -ls cells.sfile

3. Open the Jupyter notebook you were using earlier in this Chapter. Scroll down to the Analyzing the growth data section and execute the first three cells to plot the area of each cell as a function of time. You should see something like the plot shown below.

If you segmented fewer frames, your plot may not include as many data points. The full segmentation data set is available in the $FILES_ROOT/microscopy/solutions/ directory as cells.sfile, if you would rather analyze a larger data set. The full growth movie is also located in that directory.


Chapter 4

Calculating Probability Distributions from Monte Carlo Simulations

4.1 Overview

Lattice Microbes ES (LMES) is software for performing stochastic simulation of both well-stirred and spatially resolved biochemical reaction networks; see https://www.assembla.com/spaces/roberts-lab-public/wiki/LMES. LMES simulations are used to probabilistically study the behavior of biochemical reaction-only and reaction-diffusion networks. Typically, many independent trajectories of the system are computed and then analyzed to calculate a desired statistic. The output data sets are often quite large and, if the trajectories are independent, the analysis can be done in parallel. These properties make such an analysis well-suited for the Biospark framework. In this chapter you will learn how to run scripts to calculate probability distributions and other statistics from LMES simulations.

The examples in this chapter use the following reversible bimolecular reaction:

A + B ⇌ C, with forward rate constant k1 and reverse rate constant k2.

Initial conditions are A = 500, B = 500, C = 0 molecules and the rate constants are k1 = 2.408×10⁴ M⁻¹s⁻¹ and k2 = 0.01 s⁻¹. The system volume is 1×10⁻¹⁵ L, which in the reaction-diffusion models is a cube with 1 µm edges. Under these conditions the equilibrium values of the molecular species are A = 250, B = 250, C = 250. The diffusion coefficient of all the molecular species is set to D = 1×10⁻¹³ m²s⁻¹, which is ∼1000x slower than a typical protein in solution, for illustrative purposes. We simulate the well-stirred reaction-only system using the chemical master equation (CME) and the reaction-diffusion system using the reaction-diffusion master equation (RDME); see [2] for numerical details.
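As a quick consistency check of these equilibrium values: 250 molecules in a volume of 1×10⁻¹⁵ L corresponds to a concentration of 250/(NA·V) ≈ 4.15×10⁻⁷ M, so the forward flux k1[A][B] ≈ (2.408×10⁴)(4.15×10⁻⁷)² ≈ 4.2×10⁻⁹ M/s balances the reverse flux k2[C] = (0.01)(4.15×10⁻⁷) ≈ 4.2×10⁻⁹ M/s, as required at equilibrium.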

4.2 Uploading the simulation data into HDFS

LMES can natively generate SFile output files using the -ff sfile -fo filename.sfile option. These files can be uploaded into HDFS using the hdfs dfs -put command.

1. Change into the lmes directory in the tutorial package.

user@host:∼$ cd $FILES_ROOT/lmes

2. The lmes directory contains a file named bimolecular_cme.sfile that contains 50,000,000 data points uniformly sampled from 1000 independent well-stirred simulations. Upload this file into an HDFS file named /user/biospark/bsgs/bimolecular_cme.sfile using the following command.

user@host:lmes$ hdfs dfs -put bimolecular_cme.sfile \
    /user/biospark/bsgs/bimolecular_cme.sfile

3. The lmes directory also contains a file named bimolecular_rdme.sfile that contains the time series for 200 reaction-diffusion simulations. Upload this file into an HDFS file named /user/biospark/bsgs/bimolecular_rdme.sfile using the following command.

user@host:lmes$ hdfs dfs -put bimolecular_rdme.sfile \
    /user/biospark/bsgs/bimolecular_rdme.sfile

4. Verify that the SFiles were uploaded.

user@host:lmes$ hdfs dfs -ls /user/biospark/bsgs

4.3 Extracting the time series of a specific trajectory

Once an LMES SFile output file is stored in HDFS, you can extract individual trajectories from the output to observe their dynamics. The trajectories are stored in an SFile in a compressed format, so extracting every trajectory may consume a large amount of local disk space and is not usually practical. Rather, you should extract a few trajectories to spot-check the behavior of your system.

1. Make a directory for storing the output data. The trajectories will be stored as binary numerical data in an HDF5 container.

user@host:lmes$ mkdir -p data

2. Execute the Biospark script sp_extract_lm_time_series.py, which extracts the trajectories and saves them to local storage.

user@host:lmes$ biospark-submit sp_extract_lm_time_series.py data/time_series.h5 \
    /user/biospark/bsgs/bimolecular_cme.sfile 1 10

The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, and iii+iv) the minimum and maximum trajectory numbers, which define the range to extract.

3. When the job finishes, you should see a message like “Saved 20 data sets to data/time_series.h5” a few lines up in the output. Check that the file was created.

user@host:lmes$ ls -la data/

drwxr-xr-x 3 biospark biospark 102 .
drwxr-xr-x 8 biospark biospark 272 ..
-rw-r--r-- 1 biospark biospark 3611885 time_series.h5

4. Open the Jupyter notebook browser tab. Click on the Jupyter logo in the upper left corner to go back to the root of the tutorial. Click on the lmes directory and then click on the analysis.ipynb notebook. The notebook will open in a new browser tab. Execute the top cell by clicking on it and then pressing SHIFT-Enter on the keyboard. Execute the first two cells in the Extracting the time series of a specific trajectory section in the same manner to plot the time series for trajectory 1.

5. Execute the next cell to show a zoomed-in plot of all ten trajectories.

6. Finally, execute the last cell in the section to close the file.

Note: if you want to run the analysis again, you must first delete the output file data/time_series.h5. In general, Biospark scripts will not overwrite existing files, to guard against accidental deletion of data.

4.4 Calculating the stationary probability distribution

Now you will calculate the stationary probability density function (PDF) of the system by analyzing all of the trajectories. The stationary PDF gives the probability for the system to be in a particular state at equilibrium, when the probabilities are not changing in time.

1. Execute the Biospark script sp_calc_lm_stationary_pdf.py. The script calculates the probability to have a given copy number of a species across every time-point in every trajectory. If the system is not at equilibrium initially, you can specify a minimum time for a data point to be included in the analysis.

user@host:lmes$ biospark-submit sp_calc_lm_stationary_pdf.py data/stationary_pdf.h5 \
    /user/biospark/bsgs/bimolecular_cme.sfile 0 1 2 --skip-less-than=1e3


The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, iii–v) the indices of all of the species for which to calculate the PDF, and vi) the minimum time cutoff for including data in the analysis. Here, you will exclude the first 1000 s of data, which is ∼10× the 100 s relaxation time of this system.

2. When the job finishes, you should see a message like “Saved pdfs from 45000750 records to data/stationary_pdf.h5” a few lines up in the output.

3. Open the Jupyter notebook you were using earlier in this Chapter. Scroll down to the Calculating the stationary probability distribution section and execute the first two cells to plot the stationary PDF of each species. You should see something like the following plot.

Notice that the plots lack density near A=500, B=500, C=0, which is the starting condition for the simulations. This is because you excluded the first 1000 s of data from the analysis.

4. Execute the final cell in the section to close the file.

4.5 Calculating the time-dependent probability distribution

Next you will calculate the time-dependent PDF, which is the probability to be in a given state at a particular time. The underlying PDF is estimated by binning the data across time bins of a particular size. Here, you will use a time bin width of 10 s, which is significantly less than the timescale of the approach to equilibrium. Therefore, you should see the time-dependent behavior of the system as it relaxes from its non-equilibrium initial state.

1. Execute the Biospark script sp_calc_lm_time_pdf.py. The script calculates the probability to have a given number of a species in each of the specified time bins.

user@host:lmes$ biospark-submit sp_calc_lm_time_pdf.py data/time_pdf.h5 \
    /user/biospark/bsgs/bimolecular_cme.sfile 10.0 0 1 2

The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, iii) the width of the time bins to use in the analysis, and iv–vi) the indices of all of the species for which to calculate the PDF. On a single machine, this process will take a few minutes. Remember, if it were running on a Hadoop cluster, all records would be processed in parallel and the time would scale down (nearly) linearly with the number of nodes in the cluster.

2. When the job finishes, you should see a message like “Saved time-dependent pdfs with 1001 time bins from 50000 records per bin to data/time_pdf.h5” a few lines up in the output.


3. Open the Jupyter notebook you were using earlier in this Chapter. Scroll down to the Calculating the time-dependent probability distribution section and execute the first two cells to plot the PDFs of each species at the first two time points. You should see something like the following plot.

You can see that the PDFs at the first two time points are quite different from each other, since the system is not yet at equilibrium. Also, notice the times for the first two data points are 5 s and 15 s, which are the centers of the first two 10 s bins.

4. Execute the next cell in the Jupyter notebook to plot a heatmap of the time-dependent PDF for the first 200 s.

You can see that the spread of the PDF decreases as the system approaches equilibrium. Notice the discrete bands due to the 10 s time bins. You can increase the resolution by decreasing the time bin width, but the analysis will take longer. Eventually, it will be limited by the size of the underlying data set.

5. Execute the next cell in the Jupyter notebook to plot the mean and variance of each species as a function of time. These are calculated from the time-dependent PDF.


6. Execute the final cell in the section to close the file.

4.6 Viewing an RDME trajectory

In this section, you will extract and view a trajectory from a reaction-diffusion simulation. The output SFile stores the number of molecules of each species in every subvolume for every frame of the simulation. In the RDME reaction-diffusion model of the bimolecular reaction system, the volume is subdivided into a 10×10×10 cube, with each subvolume having a length of 100 nm.
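As a consistency check, ten 100 nm subvolumes per side give an edge length of 1 µm, so the total lattice volume is (1 µm)³ = 1×10⁻¹⁵ L, matching the system volume defined in Section 4.1.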

1. Execute the Biospark script sp_extract_lm_rdme_time_series.py, which extracts the trajectories and saves them to an HDF5 data file on local storage.

user@host:lmes$ biospark-submit sp_extract_lm_rdme_time_series.py \
    data/rdme_time_series.h5 /user/biospark/bsgs/bimolecular_rdme.sfile 1 10

The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, and iii+iv) the minimum and maximum trajectory numbers, which define the range to extract.

2. When the job finishes, you should see a message like “Saved 10020 data sets to data/rdme_time_series.h5” a few lines up in the output.

3. Open the Jupyter notebook you were using earlier in this Chapter. Scroll down to the Viewing an RDME trajectory section and execute the first two cells to play an animation showing the position of every C molecule in the system at 1 s intervals. You should see something like the following plot.

You can see that as time increases more C molecules are present in the simulation volume. To stop the animation, select the Kernel→Interrupt menu option in Jupyter.

4. Execute the final cell in the section to close the file.


4.7 Calculating the spatially resolved, time-dependent probability distribution

Finally, you will calculate the time-dependent PDF from the RDME simulations. The output of the analysis is the probability for each subvolume to have a particular number of molecules of each species at a given time. As was the case above, the time-dependent PDF is estimated by binning the data in time. You will again use time bins of width 10 s, which will allow you to monitor the system as it relaxes to equilibrium.

1. Execute the Biospark script sp_calc_lm_rdme_time_pdf.py to calculate the time-dependent PDF for each subvolume using the spatial information in the simulation file.

user@host:lmes$ biospark-submit sp_calc_lm_rdme_time_pdf.py data/rdme_time_pdf.h5 \
    /user/biospark/bsgs/bimolecular_rdme.sfile 10.0 0 1 2 --not-sparse

The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, iii) the width of the time bins to use in the analysis, iv–vi) the indices of all of the species for which to calculate the PDF, and vii) the --not-sparse flag, which tells the script that the lattice is not sparsely populated. If you did have a sparse lattice, you would leave this flag off to obtain better performance. On a single machine, this process will likely take 10–15 minutes. If you would rather not wait for the analysis, there is a copy of the output data in the solutions directory.

2. When the job finishes, you should see a message like “Saved time-dependent pdfs with 101 time bins to data/rdme_time_pdf.h5” a few lines up in the output.

3. Open the Jupyter notebook you were using earlier. Scroll down to the Calculating the spatially resolved, time-dependent probability distribution section and execute the first two cells to plot a heatmap of the time-dependent PDF for each species in subvolume (0,0,0). You should see something like the following plot.

You can see that the probability for the subvolume to contain at least one molecule of A or B starts high and then decreases, while the probability for the subvolume to contain at least one molecule of C starts low and then increases. It is very rare for a subvolume to contain more than four molecules of a given species.

4. Execute the next cell to plot the mean number of C molecules in each subvolume. The columns in the subplot are the different z-planes of the system volume and the subplot rows are different time points. You should see something like the following plot. Blue corresponds to 0 molecules and red to 0.3 molecules.


You can see that the mean number of C molecules per subvolume increases with time, but there is some variability from subvolume to subvolume due to finite sampling. The distribution across the subvolumes is otherwise uniform, as expected.

5. Execute the next cell to plot the relative variance σ²/µ, also known as the Fano factor (F), of the number of molecules in each subvolume. You should see something like the following plot. Here, blue corresponds to F = 0.7 and red to F = 1.3.

A process generating Poisson statistics will have a Fano factor of one. You can see from the plots that the reaction-diffusion system is indeed a Poisson process (a one-line justification is given after this list).

6. Execute the final cell in the section to close the file.
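Note (the justification promised above): a Poisson-distributed copy number has variance equal to its mean, σ² = µ, so its Fano factor is F = σ²/µ = 1. Observing F ≈ 1 in every subvolume is therefore the signature of Poisson statistics.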


Chapter 5

Extracting Structural Dynamics from Molecular Dynamics Simulations

5.1 Overview

Molecular dynamics (MD) is a simulation technique for studying a physical system at the atomic level. In an MD simulation, the equations of motion for all of the atoms are solved in a time-dependent manner, subject to a force field that accounts for interatomic interactions. In biological physics, MD simulations are often used to study the dynamics of macromolecules such as proteins and RNAs.

A common task when analyzing an MD trajectory is to calculate a structural statistic for every frame. MD simulations can generate very large data sets and, if the analysis can be done in parallel, significant speedups can be achieved using the Biospark framework. In this chapter you will learn how to run scripts to calculate how a protein’s structure fluctuates over the course of a trajectory. The examples in this chapter use a 400 ns simulation of a yeast hexokinase (PDB code 1IG8) sampled every 2.5 ns.

5.2 Uploading the simulation data into HDFS

To analyze a trajectory using Biospark, you must first upload the data into HDFS in an SFile container that the scripts can read. Several different MD file formats are in common usage. The MDTraj Python library (http://mdtraj.org/) is capable of reading many different file formats and is used here to parse the MD files into multidimensional arrays, which can be stored in HDFS. You can use the md_to_sfile utility to read an MD output file and upload the trajectory into an SFile in HDFS.
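For reference, the following Python fragment sketches how MDTraj reads such a trajectory (this uses the standard mdtraj.load API with this tutorial's file names; it illustrates the parsing step and is not a required part of the tutorial):

import mdtraj as md

# Load the DCD coordinates, using the PDB file to supply the topology.
traj = md.load('1IG8_simulation.dcd', top='1IG8_structure.pdb')

# traj.xyz is an (n_frames, n_atoms, 3) NumPy array; each 2D slab of
# atomic x, y, z positions corresponds to one NDArray record in the SFile.
print(traj.n_frames, traj.n_atoms)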

1. Change into the md directory in the tutorial package.

user@host:∼$ cd $FILES_ROOT/md

2. Examine the contents of the directory.

user@host:md$ ls -la

You should see some output like:

-rw-r--r-- 1 151520276 1IG8_simulation.dcd
-rw-r--r-- 1 6233965 1IG8_structure.pdb
-rw-r--r-- 1 8754075 1IG8_topology.psf

The directory contains a PDB structure file with the initial atomic coordinates of the protein, ions, and water being simulated; a PSF file with the connectivity of the atoms for use by the force field; and a DCD file storing the atomic coordinates of the atoms as sampled from the MD trajectory.

3. Convert the DCD file to an SFile, saving the output directly into HDFS.

user@host:md$ md_to_sfile -o hdfs://local/user/biospark/bsgs/1IG8_simulation.sfile \
    1IG8_simulation.dcd 1IG8_structure.pdb

The arguments are i) the output SFile to be created, ii) the local trajectory file in a format readable by MDTraj, iii) the local structure file in a format readable by MDTraj. The hdfs://local syntax tells the tool to create the SFile directly on the default local HDFS without first creating an intermediate file. You could alternatively save the SFile to a local directory and then upload it to HDFS, but doing so would require double the local disk space and may be impractical for very large trajectories.
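For completeness, that two-step route would look something like the following, assuming md_to_sfile also accepts a plain local path for its -o argument (hdfs dfs -put is the standard HDFS upload command):

user@host:md$ md_to_sfile -o 1IG8_simulation.sfile 1IG8_simulation.dcd 1IG8_structure.pdb
user@host:md$ hdfs dfs -put 1IG8_simulation.sfile /user/biospark/bsgs/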

4. Verify that the SFile was uploaded.

user@host:md$ hdfs dfs -ls /user/biospark/bsgs

You should see some output like:

Found 1 items
-rw-r--r-- 3 biospark 151521330 /user/biospark/bsgs/1IG8_simulation.sfile

5. Check the contents of the SFile.

user@host:md$ sfile -ls hdfs://localhost/user/biospark/bsgs/1IG8_simulation.sfile

You should see some output like:

/Frames/0 946934 protobuf:robertslab.pbuf.NDArray
/Frames/1 946934 protobuf:robertslab.pbuf.NDArray
/Frames/2 946934 protobuf:robertslab.pbuf.NDArray
/Frames/3 946934 protobuf:robertslab.pbuf.NDArray

Each frame is stored in a serialized NDArray object of two dimensions, where each row is an atom and the three columns are the atom's x, y, and z positions.
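A deserialized frame is therefore just a small NumPy array. The toy example below (illustrative values only) shows the layout and a typical per-frame operation:

import numpy as np

# One frame: each row is an atom; the columns are its x, y, and z positions.
frame = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# For example, the geometric center of the frame is the column-wise mean.
print(frame.mean(axis=0))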

5.3 Calculating the root-mean-square deviation

The root-mean-square deviation (RMSD) is a common structural metric for comparing two conformations of the same protein structure. Here, you will execute a script to align each of the MD frames to the initial crystal structure and then calculate the RMSD with respect to this reference structure.
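For an aligned frame with atom positions x_i and reference positions y_i, RMSD = sqrt( (1/N) Σ_i ||x_i − y_i||² ) over the N selected atoms. MDTraj can superpose and compute this in one call; the sketch below illustrates the library call, not the tutorial script itself:

import mdtraj as md

traj = md.load("1IG8_simulation.dcd", top="1IG8_structure.pdb")
ref = md.load("1IG8_structure.pdb")

# md.rmsd optimally superposes each frame onto the reference before
# computing the deviation; the result is in nanometers.
rmsd = md.rmsd(traj, ref, frame=0)
print(rmsd[:5])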


1. Make a directory to store the output data. The data will be stored as binary numerical data in an HDF5 container.

user@host:md$ mkdir -p data

2. Execute the Biospark script sp_calc_md_rmsd.py, which performs the alignment and RMSDcalculation.

user@host:md$ biospark-submit sp_calc_md_rmsd.py data/rmsd_backbone.h5 \
    /user/biospark/bsgs/1IG8_simulation.sfile 1IG8_structure.pdb \
    --selection="protein and backbone"

The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, iii) the local reference structure, iv) the atoms to use in the calculation. Here, you will align and calculate the RMSD using only the protein backbone atoms.
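The selection string appears to follow MDTraj's atom-selection language (a reasonable assumption, since the tooling is built on MDTraj). You can test a selection against the topology before submitting a job:

import mdtraj as md

top = md.load("1IG8_structure.pdb").topology

# select() returns the integer indices of the atoms matching the expression.
backbone = top.select("protein and backbone")
print(len(backbone))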

3. When the job finishes, you should see a message like “Saved 160 RMSD values to data/rmsd_backbone.h5” a few lines up in the output. Check that the file was created.

user@host:md$ ls -la data/

drwxr-xr-x 3 biospark biospark  102 .
drwxr-xr-x 8 biospark biospark  272 ..
-rw-r--r-- 1 biospark biospark 4278 rmsd_backbone.h5

4. You can view the contents of the HDF5 file using the h5dump program.

user@host:md$ h5dump data/rmsd_backbone.h5

You should see some output like:

HDF5 "data/rmsd_backbone.h5" {GROUP "/" {DATASET "RMSD" {DATATYPE H5T_IEEE_F64LEDATASPACE SIMPLE { ( 160 ) / ( 160 ) }DATA {(0): 0.924585, 1.35522, 1.17194, 1.63237, 1.22204, 1.06669, 1.05949,

5. Open the Jupyter notebook browser tab. Click on the Jupyter logo in the upper left corner to go back to the root of the tutorial. Click on the md directory and then click on the analysis.ipynb notebook. The notebook will open in a new browser tab. Execute the top cell by clicking on it and then pressing SHIFT-Enter on the keyboard. Execute the first two cells in the Calculating the root-mean-square deviation section in the same manner to plot RMSD vs. time.


6. Now, execute the sp_calc_md_rmsd.py script again to perform the calculation over all of the protein atoms.

user@host:md$ biospark-submit sp_calc_md_rmsd.py data/rmsd_protein.h5 \
    /user/biospark/bsgs/1IG8_simulation.sfile 1IG8_structure.pdb \
    --selection="protein"

7. When the job finishes, you should see a message like “Saved 160 RMSD values to data/rmsd_protein.h5” a few lines up in the output. Check that the file was created.

user@host:md$ ls -la data/

drwxr-xr-x 3 biospark biospark  102 .
drwxr-xr-x 8 biospark biospark  272 ..
-rw-r--r-- 1 biospark biospark 4278 rmsd_backbone.h5
-rw-r--r-- 1 biospark biospark 4242 rmsd_protein.h5

8. Open the Jupyter notebook again. Execute the next two cells to plot RMSD of the full protein vs. time.

Notice that there is a good correlation between the backbone-only and all-atom RMSDs.


9. Execute the next cell to plot a comparison between the backbone-only and all-atom RMSDs.

10. Execute the last cell to close the files.

5.4 Clustering by pairwise RMSD

Next you will align each pair of frames and calculate all of the pairwise RMSD values. Using the pairwise distance matrix, you will find the dominant conformational clusters and determine which conformational state the protein is in at each frame of the trajectory. This type of analysis can reveal major structural rearrangements during the trajectory.
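Hierarchical clustering on a distance matrix like this is typically done with SciPy. The sketch below is illustrative (it builds a synthetic distance matrix rather than loading the notebook's data) but shows the same linkage-and-cutoff workflow:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Synthetic symmetric distance matrix standing in for the pairwise RMSDs.
x = np.random.rand(160, 3)
dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))

# Condense the square matrix and build the hierarchical clustering.
z = linkage(squareform(dist, checks=False), method="average")

# Cut the tree so that a chosen number of clusters (e.g., 10) remains.
labels = fcluster(z, t=10, criterion="maxclust")
print(labels[:10])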

1. Execute the Biospark script sp_calc_md_rmsd_pairwise.py, which performs the pairwise alignment and RMSD calculations.

user@host:md$ biospark-submit sp_calc_md_rmsd_pairwise.py data/rmsd_pairwise.h5 \
    /user/biospark/bsgs/1IG8_simulation.sfile 1IG8_structure.pdb \
    --selection="protein and backbone"

The arguments are i) the filename of the local output file to create, ii) the input SFile in HDFS, iii) the local reference structure, iv) the atoms to use in the calculation.

2. When the job finishes, you should see a message like “Saved RMSD values from 160 windows to data/rmsd_pairwise.h5” a few lines up in the output. Check that the file was created.

user@host:md$ ls -la data/

-rw-r--r-- 1 biospark biospark 4278 rmsd_backbone.h5
-rw-r--r-- 1 biospark biospark 107204 rmsd_pairwise.h5
-rw-r--r-- 1 biospark biospark 4242 rmsd_protein.h5


3. Open the Jupyter notebook you were using earlier in this chapter. Scroll down to the Clustering by pairwise RMSD section and execute the first two cells to plot the pairwise RMSD values as a heatmap. You should see something like the following plot.

Notice that the first and second halves of the trajectory each appear to be more self-correlated.

4. Execute the next two cells to calculate a hierarchical clustering of the pairwise RMSD distances and to plot the number of clusters vs. the cutoff distance. You should see something like the following plot.

As the cutoff distance increases, fewer unique clusters are identified. You will use the cutoff value that gives 10 clusters to explore the data a bit further.

5. Execute the next cell to show a dendrogram of the hierarchical clustering. The branches are colored according to the cutoff distance selected to give 10 clusters. You should see something like the following plot.


Notice that three predominant clusters can be identified in the dendrogram.

6. Execute the next cell to show which cluster the trajectory is in during the time-course of the simulation. You should see something like the following plot.

The trajectory starts off in a single cluster and stays there for ∼200 ns. Then the protein conformation jumps to a different cluster and stays there for ∼150 ns. For the last ∼50 ns the trajectory moves to a third cluster. These are the same three dominant clusters identified in the dendrogram.

7. Finally, execute the last cell to close the file.


Chapter 6

Getting Started with Biospark Scripting

6.1 Overview

This final chapter serves as a short introduction to writing your own scripts, which is where the real power of Biospark lies. You will learn how to connect a Jupyter notebook to the Hadoop service and execute a few simple PySpark scripting commands.

6.2 Launching a PySpark session in Jupyter Notebook

1. Open the Jupyter notebook browser tab. Click on the Jupyter logo in the upper left corner to go back to the root of the tutorial. Click on the New button on the right hand side of the window and then select the PySpark notebook type. A new PySpark notebook will open in a browser tab.

2. It will take a few seconds for the PySpark kernel to connect with the Hadoop service. You will see a message like that shown below in the top right corner of the notebook. When the message disappears, you are connected and ready to begin working.


3. You can view the normal Spark output in the terminal session window that is running the Jupyter notebook server.

4. Enter the command below into the first cell and then execute it by pressing SHIFT-Enter on the keyboard. This will make sure you have an active Hadoop session and print the application ID.

print sc.applicationId

6.3 Listing an SFile using PySpark scripting commands

1. Execute the following command in a new cell to create a new RDD for the SFile in HDFS containing the microscopy images you uploaded earlier.

records = sc.newAPIHadoopFile("/user/biospark/bsgs/frames.sfile",
    "robertslab.hadoop.io.SFileInputFormat",
    "robertslab.hadoop.io.SFileHeader",
    "robertslab.hadoop.io.SFileRecord",
    keyConverter="robertslab.spark.sfile.SFileHeaderToPythonConverter",
    valueConverter="robertslab.spark.sfile.SFileRecordToPythonConverter")

2. Paste the following code into a new cell to define a map function for listing the records. Make sure the code is properly formatted in the cell, as pasting will occasionally change spacing, which is important in Python. Execute the cell.

# Define a new function to list the records in the RDD. The record argument is
# a tuple like (key,value).
def listRecords(record):

    # The key is a tuple containing the name and type of the record.
    (name,datatype)=record[0]

    # The value contains the record data.
    data=record[1]

    # Return a new (key,value) tuple where the key contains the name, type,
    # and data length, and the value is empty.
    return ((name,datatype,len(data)),None)

3. Paste the following code into a new cell and execute it to perform the Spark operation. It may take a few seconds to run; the string “Finished” will be printed when the operation completes.


# Perform a Spark operation on the RDD: 1) list the records, 2) select just the keys,
# 3) return the keys as a list.
listing = records.map(listRecords).keys().collect()
print "Finished"

4. Finally, execute the following code in a new cell to print the listing. You should see a list of the records in the SFile.

for line in listing:
    print "%s %s %d"%(line)

5. Since the notebook is connected to Hadoop, it is using resources as long as the kernel is running. Once you are finished with the notebook, close the browser tab and then go back to the main Jupyter tab. Click the Running tab and then press the Shutdown button next to your PySpark notebook. This will release any resources associated with the Hadoop session.

6. An example PySpark notebook like the one you just created is available in the $FILES_ROOT/scripting/ directory as pyspark_test.ipynb if you encounter any problems.

You now know enough to get started using Biospark for your own applications. Have fun!


List of Abbreviations

CME – chemical master equation
HDFS – Hadoop distributed file system
LMES – Lattice Microbes ES
MD – molecular dynamics
PDF – probability density function
RMSD – root-mean-square deviation
RDME – reaction-diffusion master equation
VM – virtual machine
S. cerevisiae – Saccharomyces cerevisiae

