
Research Article

BD5: an open HDF5-based data format to represent quantitative biological dynamics data

Koji Kyoda1,2, Kenneth H. L. Ho1,2, Yukako Tohsato1,2,3, Hiroya Itoga1 and Shuichi Onami1,2,*

1Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe 650-0047, Japan.

2Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe 650-0047, Japan.

3Department of Information Science and Engineering, Ritsumeikan University, Shiga 525-8577, Japan.

*To whom correspondence should be addressed

Short title: BD5 data format for representing quantitative biological dynamics data

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.26.062976; this version posted April 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license.








http://creativecommons.org/licenses/by/4.0/


Abstract

BD5 is a new binary data format based on HDF5 (hierarchical data format version 5). It can be used for representing quantitative biological dynamics data obtained from bioimage informatics techniques and mechanobiological simulations. Biological Dynamics Markup Language (BDML) is an XML (Extensible Markup Language)-based open format that is also used to represent such data; however, it becomes difficult to access quantitative data in BDML files when the file size is large because parsing XML-based files requires large computational resources to first read the whole file sequentially into computer memory. BD5 enables fast random (i.e., direct) access to quantitative data on disk without parsing the entire file. Therefore, it allows practical reuse of data for understanding biological mechanisms underlying the dynamics.



Introduction

Recent advances in bioimage informatics and mechanobiological simulation techniques have led to the production of a large amount of quantitative data on the spatiotemporal dynamics of biological objects ranging from molecules to organisms [1]. A wide variety of such data can be described in an open unified data format, Biological Dynamics Markup Language (BDML), an Extensible Markup Language (XML)-based format [2]. BDML enables efficient development and evaluation of software tools for a wide range of applications.

The XML-based BDML format has the advantages of machine and human readability and of extensibility. However, accessing and retrieving data is often problematic when the BDML file becomes too large (e.g., our programs cannot load a BDML file over 20 GB on a standard workstation). This problem arises because parsing an XML-based file often requires large computational resources to first read the whole file sequentially into computer memory. In fact, many sets of quantitative data stored in the SSBD:database (Systems Science of Biological Dynamics database) [1] were divided into a series of BDML files for each time point to allow software to read them efficiently. One solution to this problem is to use another approach such as the eXtensible Data Model and Format [3] or FieldML [4]. In these formats, the data itself is described in HDF5 binary format and meta-information about the data is described in XML format. HDF5 is a hierarchical data format for storing large scientific data sets (http://www.hdfgroup.org/HDF5/). It is widely used for describing various kinds of large-scale biological data [4-9].



Here, we describe the development of the BD5 data format, based on HDF5, for representing quantitative biological dynamics data in a manner that enables quick access and retrieval.

Materials and Methods

Design and implementation

Here, we extended BDML to support HDF5-based storage of quantitative biological dynamics data. In contrast to XML documents, the HDF5 format allows random (i.e., direct) access to parts of the file without parsing the entire contents. Therefore, HDF5 is a more efficient file format for accessing and retrieving the contents of the file.

We developed the BD5 data format based on HDF5 for representing quantitative data. A BD5 file is organized into two primary structures, datasets and groups. Datasets are array-like objects that store numerical data, whereas groups are hierarchical containers that store datasets and other groups. Detailed information on BD5 is available at http://ssbd.qbic.riken.jp/bdml/. Here, we summarize the major BD5 datasets and groups. The BD5 format has one container named data (Fig. 1). It includes

● scaleUnit dataset for the definition of spatial and time scales and units,
● objectDef dataset for the definition of biological objects,
● featureDef dataset for features of interest,
● numbered groups (0, 1, … , n) corresponding to an index number of a time-ordered sequence,
● trackInfo dataset for the information of tracking of one object to another.

Each of the numbered groups corresponds to an index of a time-ordered sequence and has object and feature groups. For a fixed time interval, the index corresponds to each sequential time point. For example, if the time interval is 2 minutes, group 0 will have t = 0 and group 1 will have t = 1, while tScale is 2 and tUnit is minute (Fig. 1). For irregular time intervals, the index allows a time-ordered sequence to be saved and read in the correct order. If the first time is 0 minutes, the second time is 2 minutes, and the third time is 7 minutes, then group 0 will have t = 0, group 1 will have t = 2, and group 2 will have t = 7. The tUnit is still minute, but the tScale in this case will be 1.
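The two time conventions can be illustrated with a short Python sketch. It assumes that the elapsed time is obtained by multiplying the stored t by tScale, which is how we read the two examples above; the function name elapsed_time is ours and is not part of the BD5 specification.

```python
def elapsed_time(t, t_scale):
    """Convert a stored time value t into elapsed time, in units of tUnit.

    Assumption for illustration: elapsed time = t * tScale.
    """
    return t * t_scale


# Fixed 2-minute interval: t stores the frame index and tScale is 2.
fixed = [elapsed_time(t, 2.0) for t in (0, 1, 2)]

# Irregular intervals: t stores the time itself and tScale is 1.
irregular = [elapsed_time(t, 1.0) for t in (0, 2, 7)]

print(fixed)      # [0.0, 2.0, 4.0]
print(irregular)  # [0.0, 2.0, 7.0]
```

Under this reading, both conventions yield times in minutes, and the group number only fixes the order in which the time points are read.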

Each object group has numbered dataset(s) corresponding to the reference number of the biological object(s) predefined in the objectDef dataset. Each row of a numbered dataset includes an identifier of the object and its spatiotemporal information such as time point and xyz-coordinates (Fig. 2). To represent biological objects such as line and face entities that have an arbitrary number of xyz-coordinates in BD5 format, a tabular dataset is used (Fig. 3). The multiple xyz-coordinates are represented by using a sequential ID (sID) that allows us to connect the xyz-coordinates together to form a line or a face within a biological object.
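As a rough illustration of how sID connects coordinates, the following Python sketch groups rows by object ID and sID while keeping the row order. The rows are plain dictionaries that mimic the columns of Fig. 3b; the function name group_entities and the dictionary representation are assumptions made for this example, not part of BD5.

```python
def group_entities(rows):
    """Collect xyz-coordinates per (ID, sID), preserving the row order."""
    entities = {}
    for row in rows:
        key = (row["ID"], row["sID"])
        entities.setdefault(key, []).append((row["x"], row["y"], row["z"]))
    return entities


# Rows taken from the face example in Fig. 3b (label column omitted).
rows = [
    {"ID": "000100", "t": 0, "entity": "face", "sID": 0, "x": 1, "y": 1, "z": 1},
    {"ID": "000100", "t": 0, "entity": "face", "sID": 0, "x": 2, "y": 3, "z": 0},
    {"ID": "000100", "t": 0, "entity": "face", "sID": 0, "x": 3, "y": 1, "z": 2},
    {"ID": "000100", "t": 0, "entity": "face", "sID": 1, "x": 2, "y": 3, "z": 0},
    {"ID": "000100", "t": 0, "entity": "face", "sID": 1, "x": 4, "y": 5, "z": 4},
    {"ID": "000100", "t": 0, "entity": "face", "sID": 1, "x": 3, "y": 1, "z": 2},
]

faces = group_entities(rows)
for (object_id, s_id), vertices in faces.items():
    print(object_id, s_id, vertices)
```

Each (ID, sID) pair yields one ordered vertex list, so the object 000100 in this example consists of two faces that share an edge.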

Each feature group has numbered dataset(s) corresponding to the reference number of the object(s) predefined in the objectDef dataset. Each row of a numbered dataset includes an identifier of the object, an identifier of the feature (fID) predefined in featureDef, and the value of the feature (Fig. 4). This format allows objects that do not possess all the features defined in featureDef to be recorded, because not all the features can necessarily be measured in practical biological experiments. For example, in the experiment in Fig. 4, an object may have information for fID = 1 (name: center-of-mass GFP signal) but not fID = 0 (name: average GFP signal).
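The row-per-feature layout can be sketched in Python as follows. The featureDef entries and the two rows for object 000002 follow Fig. 4; the row for object 000003 is hypothetical and was added only to show an object that lacks fID = 0. The function name features_by_object is likewise an assumption for illustration.

```python
def features_by_object(rows, feature_def):
    """Map each object ID to {feature name: value} for the features it has."""
    names = {entry["fID"]: entry["name"] for entry in feature_def}
    table = {}
    for row in rows:
        table.setdefault(row["ID"], {})[names[row["fID"]]] = row["value"]
    return table


feature_def = [
    {"fID": 0, "name": "average GFP signal", "fUnit": "a.u."},
    {"fID": 1, "name": "center of mass GFP signal", "fUnit": "a.u."},
]

rows = [
    {"ID": "000002", "fID": 0, "value": 153},
    {"ID": "000002", "fID": 1, "value": 214},
    {"ID": "000003", "fID": 1, "value": 180},  # hypothetical: no fID = 0 row
]

table = features_by_object(rows, feature_def)
print(table["000002"])
print(table["000003"])
```

Because a missing measurement is simply a missing row, no placeholder value is needed for features that could not be measured.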



The trackInfo dataset enables information of the objects to be linked between different time points or time frames (Fig. 2). For example, when a cell at t = 0 divides into two daughter cells at t = 1, it has links from the parent cell to the daughter cells. The trackInfo dataset can be used to represent not only phenomena such as cell division but also those such as cell fusion.
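A minimal Python sketch of how such links might be interpreted is given below. It uses the from/to pairs shown in Fig. 2 and treats an object with several outgoing links as a division and an object with several incoming links as a fusion; the function name classify_links and this classification rule are assumptions for illustration.

```python
from collections import defaultdict


def classify_links(track_info):
    """Split (from, to) links into divisions and fusions.

    A source with more than one target is reported as a division;
    a target with more than one source is reported as a fusion.
    """
    children = defaultdict(list)
    parents = defaultdict(list)
    for source, target in track_info:
        children[source].append(target)
        parents[target].append(source)
    divisions = {s: t for s, t in children.items() if len(t) > 1}
    fusions = {t: s for t, s in parents.items() if len(s) > 1}
    return divisions, fusions


# from/to pairs of the trackInfo dataset in Fig. 2
track_info = [
    ("000002", "001002"),
    ("000003", "001003"),
    ("000003", "001004"),
]

divisions, fusions = classify_links(track_info)
print(divisions)  # {'000003': ['001003', '001004']}
print(fusions)    # {}
```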

To allow the use of BD5 to describe quantitative data, we needed to update the BDML format so that it could be used to describe the corresponding meta-information. The latest version of BDML (version 3.0) can handle an external file by using the extFile element (Fig. 5). The bd5File element that we introduced within the extFile element can be used to point to an external BD5 file. In addition, this update allows the designation of multiple contact persons and the use of a unique persistent digital identifier, ORCID (https://orcid.org) in its format.
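As an illustration, the name of the external BD5 file could be read from a BDML version 3.0 document with the Python standard library as sketched below. The embedded document is a shortened skeleton modeled on Fig. 5, not a complete or validated BDML file, and the helper names local_name and find_bd5_files are ours. The lookup matches only the bd5File element name, so it does not depend on how the enclosing external-file element is capitalized.

```python
import xml.etree.ElementTree as ET

# Shortened skeleton modeled on Fig. 5 (illustrative, not schema-validated).
BDML_SKELETON = """<bdml version="3.0" xmlns="http://ssbd.qbic.riken.jp/bdml">
  <contact>
    <person>
      <first-name>Shuichi</first-name>
      <last-name>Onami</last-name>
      <ORCID>0000-0002-8255-1724</ORCID>
    </person>
  </contact>
  <extfile>
    <bd5File>wt_N2_030116_02_bd5.h5</bd5File>
  </extfile>
</bdml>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def find_bd5_files(bdml_text):
    """Return the text of every bd5File element found in the document."""
    root = ET.fromstring(bdml_text)
    return [
        element.text.strip()
        for element in root.iter()
        if local_name(element.tag) == "bd5File" and element.text
    ]


print(find_bd5_files(BDML_SKELETON))  # ['wt_N2_030116_02_bd5.h5']
```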

Results

Validation

To evaluate the performance of the BD5 format, we first compared the data access time between XML- and HDF5-based files (i.e., between pairs of BDML and BD5 files containing equivalent data). We measured the time for accessing coordinate data at a randomly selected time point in the BDML and BD5 files (334 pairs of files) by using a Python-based program (Fig. 6a). The results indicate that access to the HDF5-based files was consistently faster than access to the corresponding XML-based files. Therefore, BD5, the new HDF5-based format, enables practical access to quantitative data for further analysis.



File size can be a critical benchmark for a data format because the transfer of large files often fails. Therefore, we next compared the disk space requirements of the XML- and HDF5-based files by comparing the sizes of BDML and BD5 files (450 pairs of files) (Fig. 6b). The BD5 format reduced the file size by ~85% compared with the BDML format when the data is large. When the data is small (< 300 MB), the size of the BD5 file is close to, but still less than, that of the corresponding BDML file. Because HDF5-based files for large data are much smaller than the equivalent XML-based files, the BD5 format enables, in theory, fast transfer of large quantitative data to and from computers on the network and on the internet.

In addition, we determined the relationships between access time and file size for BDML and BD5 files (Fig. 7). In BD5, access to the coordinate data was fast even when the file size was large. This fast data access in BD5 originates from its random access to data. In BDML, the access time increased linearly with file size. This result suggests that parsing of XML was the main bottleneck of data access. Quantitative biological dynamics data tends to be large due to the advances in live-cell imaging techniques and imaging equipment. We anticipate that BD5 will play a key role in fast access to such large data sets.

Software tools and usage related to BD5

So that BD5-based tools can be used for data stored in older BDML files, we provide a C++-based software tool named BDML2BD5. By using this tool, BDML files can be converted into BD5 files. To compile the tool, the HDF5 library is required for HDF5 data writing, and CodeSynthesis XSD (http://www.codesynthesis.com/products/xsd/) is required as the XML Schema to C++ data binding compiler for the BDML schema. All source code and the executable file of BDML2BD5 are available at https://github.com/openssbd/BDML2BD5/.

We also provide a program, bd5lint, for detecting bugs and inconsistencies in BD5 files. The program checks the structure of BD5 files, and checks that the ordered numbered datasets in object and feature groups correspond to the reference numbers of the objects and features predefined in the objectDef and featureDef datasets. It also checks the consistency of the dimensions declared and the actual dimensions used within the datasets. It provides type checking of the data and error warnings if the data do not conform to the BD5 specification. The Python source code is available at https://github.com/openssbd/bd5lint/.
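To give a flavor of one such check, the Python sketch below verifies that the numbered datasets in each object group refer to an oID defined in objectDef. It is not the bd5lint implementation; the function name check_object_references is hypothetical, and nested dictionaries stand in for the HDF5 groups so that the example runs without an HDF5 library.

```python
def check_object_references(data):
    """Warn about object datasets whose name is not an oID in objectDef."""
    known = {str(row["oID"]) for row in data["objectDef"]}
    warnings = []
    for name in data.keys():
        if not name.isdigit():
            continue  # skip scaleUnit, objectDef, featureDef and trackInfo
        for reference in data[name]["object"].keys():
            if reference not in known:
                warnings.append(
                    f"group {name}: object dataset {reference} not in objectDef"
                )
    return warnings


# Toy stand-in for the data group; dataset "1" in group "1" is undefined.
data = {
    "objectDef": [{"oID": 0, "name": "nucleus"}],
    "0": {"object": {"0": []}},
    "1": {"object": {"0": [], "1": []}},
}

print(check_object_references(data))
```

The same traversal pattern would extend to feature groups and featureDef, which the real tool is described as checking as well.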

We also provide several Python-based programs for data analysis using BD5 files. These programs are available as Jupyter Notebook files at https://github.com/openssbd/BDML-BD5/. An example is a program that counts the number of biological objects in each numbered group of the time-ordered sequence. By using this program, we can obtain the proliferation curve of Caenorhabditis elegans embryogenesis. The program can be modified to obtain similar information for other organisms such as Danio rerio and Drosophila melanogaster.
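The counting idea can be sketched in a few lines of Python. This is not the notebook code from the repository above; the function name count_objects is an assumption, and nested dictionaries populated with the object IDs of Fig. 2 stand in for the data group of a BD5 file.

```python
def count_objects(data, object_ref="0"):
    """Return (time index, number of rows) pairs, sorted by time index."""
    counts = []
    for name in data.keys():
        # Only the numbered groups belong to the time-ordered sequence.
        if name.isdigit() and object_ref in data[name]["object"]:
            counts.append((int(name), len(data[name]["object"][object_ref])))
    return sorted(counts)


# Toy stand-in for the data group, using the object IDs shown in Fig. 2.
data = {
    "0": {"object": {"0": [{"ID": "000002"}, {"ID": "000003"}]}},
    "1": {"object": {"0": [{"ID": "001002"}, {"ID": "001003"}, {"ID": "001004"}]}},
    "trackInfo": [],
}

print(count_objects(data))  # [(0, 2), (1, 3)]
```

Since groups opened with the h5py package expose the same keys(), in, and [] operations, passing the data group of a file opened with h5py.File should presumably work in the same way; sorting by the integer index matters there because group names are strings and would otherwise be ordered as text.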

Discussion

In this study, we developed a new BD5 data format based on HDF5 for representing quantitative biological dynamics data. Compared with BDML, which is based on XML, the BD5 format has two advantages: (a) fast access to and retrieval of quantitative data because of random access to the HDF5-based file, and (b) fast transfer of files containing large quantitative data because the file size is dramatically reduced. A drawback of BD5 is that its human readability is lower than that of the BDML format. BD5 files cannot be read in text editors because they are binary. However, The HDF Group provides a software tool named HDFView that enables the user to open and read all HDF5-based files (https://www.hdfgroup.org/downloads/hdfview/). This tool can compensate for the lack of human readability.

The BD5 format has already been used in the latest version of SSBD:database (http://ssbd.qbic.riken.jp), which is one of the major databases for sharing bioimage data and quantitative biological dynamics data [10]. Over 687 files, which include a wide variety of quantitative biological dynamics data from molecules to cells to organisms, are available. This demonstrates that the BD5 format has high functionality and flexibility for representing quantitative biological dynamics data. SSBD:database also provides a RESTful API (i.e., an API (application programming interface) that allows applications to access data and interact with external software tools) through the use of the webservice h5serv (https://github.com/HDFGroup/h5serv). This enables SSBD:database to provide a web service for users to access quantitative data stored in BD5 files (http://ssbd.qbic.riken.jp/restfulapi/). Because HDF5 and XML are supported by many software platforms, BD5 is a promising data format for storing quantitative biological dynamics data.

Like BDML, BD5 can represent quantitative biological dynamics data that is associated with, but independent of, microscopy images. Such data has often been represented as regions of interest (ROIs) on the corresponding microscopy images; for example, the ROIs in the OME data model (https://docs.openmicroscopy.org/ome-model/) and segmentation channels in Cell Feature Explorer (https://cfe.allencell.org). However, not all data can be represented as an ROI on a microscopy image. For example, in an automated cell lineage tracing study of Caenorhabditis elegans, each nucleus was represented as a sphere with center and radius, independently of the z-stack images [11]. Such flexible representation of BD5 (and also BDML) enables us to represent quantitative biological dynamics data obtained not only from bioimage informatics but also from mechanobiological simulation techniques.

Funding

This work was supported in part by the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency (JST); Core Research for Evolutionary Science and Technology (CREST) Grant Number JPMJCR1511, JST; JSPS KAKENHI Grant Number JP18H05412; the Strategic Programs for R&D (President’s Discretionary Fund) of RIKEN, Japan; and Open Life Science Platform, RIKEN, Japan.

Acknowledgements

We are grateful to the members of the Onami laboratory, RIKEN Center for Biosystems Dynamics Research, Japan for feedback and discussions.

REFERENCES

1. Tohsato Y, Ho KH, Kyoda K, Onami S. SSBD: a database of quantitative data of spatiotemporal dynamics of biological phenomena. Bioinformatics. 2016;32:3471-3479.

2. Kyoda K, Tohsato Y, Ho KH, Onami S. Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics. 2015;31:1044-1052.

3. Clarke JA, Mark ER. Enhancements to the eXtensible Data Model and Format (XDMF). Proceedings of the 2007 DoD High Performance Computing Modernization Program Users Group Conference; 2007 Jun; Washington DC, USA. IEEE Computer Society. pp. 322-327.

4. Britten RD, Christie GR, Little C, Miller AK, Bradley C, Wu A, et al. FieldML, a proposed open standard for the Physiome project for mathematical model representation. Med Biol Eng Comput. 2013;51:1191-1207.

5. Baker M. Quantitative data: learning to share. Nature Meth. 2012;9:39-41.

6. Dougherty MT, Folk MJ, Zadok E, Bernstein HJ, Bernstein FC, Eliceiri KW, et al. Unifying biological image formats with HDF5. Commun ACM. 2009;52:42-47.

7. Hoffman MM, Buske OJ, Noble WS. The Genomedata format for storing large-scale functional genomics data. Bioinformatics. 2010;26:1458-1459.

8. Millard BL, Niepel M, Menden MP, Muhlich JL, Sorger PK. Adaptive informatics for multifactorial and high-content biological data. Nature Meth. 2011;8:487-492.

9. Wilhelm M, Kirchner M, Steen JA, Steen H. mz5: space- and time-efficient storage of mass spectrometry data sets. Mol Cell Proteomics. 2012;11:O111.011379.

10. Dance A. Find a home for every imaging data set. Nature. 2020;579:162-163.

11. Bao Z, Murray JI, Boyle T, Ooi SL, Sandel MJ, Waterston RH. Automated cell lineage tracing in Caenorhabditis elegans. Proc Natl Acad Sci USA. 2006;103:2707-2712.



Figure 1. Outline of the BD5 data format. The data group includes scaleUnit, objectDef, featureDef, and trackInfo datasets, together with groups numbered to correspond to the index number of the time-ordered sequence. Each numbered group has spatial information about biological objects and numerical information about features related to the objects. Solid and dashed boxes represent the required and optional elements, respectively.

Figure 2. An example of the description of the spatiotemporal information of biological objects and their tracking information. The dataset name in the object group corresponds to the identifier (ID) of the biological object (red). Each row in the dataset must have a unique ID and its spatiotemporal information. A label can optionally be attached to each object. The tracking information, including object divisions and fusions, can be stored in the trackInfo dataset.

Figure 3. An example of the description of the spatiotemporal information based on line (a) and face entities (b). The sequential identifier (sID) represents a set of coordinates that can be connected beginning at the top to describe an entity within one biological object.

Figure 4. An example of the description of the feature information related to biological objects. The example object is a nucleus expressing green fluorescent protein (GFP) at t = 0 in a time series. The dataset name in the feature group corresponds to the identifier (ID) of the biological object. Each row in the dataset has object ID, feature fID (blue), and the feature value. In this example, fID is 0 or 1 depending on whether the data is the average or the center-of-mass GFP signal, respectively, as defined in featureDef. Object ID is 0 if the object is a nucleus. Feature value is the fluorescence intensity expressed in a.u. (arbitrary units).



Figure 5. A skeleton of a BDML version 3.0 file for describing meta-information. This version allows the use of an external file for describing the data itself, designation of multiple contact persons, and the use of ORCID, a unique persistent digital identifier of the research scientist.

Figure 6. Comparison between BD5 and BDML data formats. a) Access times of the BDML and BD5 files. Access time was measured as the time for accessing and displaying xyz-coordinate data at a randomly selected time point stored in the BDML and BD5 files. The time was measured on an Intel Xeon CPU 2.8 GHz processor with 32 GB of main memory. Each dot represents a biological quantitative data set. We used 334 biological quantitative data sets, each of which has coordinate data and is stored in SSBD:database as a single BDML file. BD5 files were generated from the BDML files by using the BDML2BD5 program. b) Size of the BDML and BD5 files. Each dot represents a biological quantitative data set. In this comparison, we used 450 biological quantitative data sets stored in SSBD:database as BDML files. As above, the BD5 files were generated from the BDML files by using the BDML2BD5 program. The dashed line represents the linear regression line for all dots. The data within the small rectangle near the origin in the large graph is plotted on expanded axes in the insert.

Figure 7. Relationship between access time and file size for BDML and BD5 files. The time for accessing and displaying coordinate data at a randomly selected time point is plotted against file size. Each cross represents a BDML file; each dot represents a BD5 file. In the comparison, we used BDML and BD5 files of the 334 biological quantitative data sets described in Fig. 6a.




Figure 1: Kyoda et al.

Hierarchy shown in the figure (time-ordered sequence: groups 0, 1, 2, …):

data
    scaleUnit
    objectDef
    featureDef
    0
        object
            0
            …
        feature
            0
            …
    1
    2
    …
    trackInfo

scaleUnit
dimension  xScale  yScale  zScale  sUnit       tScale  tUnit
3D+T       0.09    0.09    1.0     micrometer  2.0     minute

objectDef
oID  name
0    nucleus

featureDef
fID  name                       fUnit
0    average GFP signal         a.u.
1    center of mass GFP signal  a.u.

[Figure 2 content: the data group with numbered groups 0, 1, 2, … and the trackInfo dataset; object labels A and B at t = 0 and A, C, and D at t = 1.]

objectDef
oID  name
0    nucleus

Dataset 0 in the object group of group 0
ID      t  entity  x    y    z     radius  label
000002  0  sphere  380  366  16.1  3.6     A
000003  0  sphere  387  153  16.6  3.87    B

Dataset 0 in the object group of group 1
ID      t  entity  x    y    z     radius  label
001002  1  sphere  380  366  16.1  3.6     A
001003  1  sphere  387  153  19.6  3.87    C
001004  1  sphere  386  151  13.1  3.87    D

trackInfo
from    to
000002  001002
000003  001003
000003  001004


[Figure 3 content]

a) Line entity (label E)
ID      t  entity  sID  x  y   z  label
000001  0  line    0    2  1   0  E
000001  0  line    0    2  10  0  E
000001  0  line    0    9  13  0  E
000001  0  line    0    8  4   0  E
000001  0  line    0    2  1   0  E

Vertex labels in the diagram: (2, 1, 0), (2, 10, 0), (9, 13, 0), (8, 4, 1)

b) Face entity (label F)
ID      t  entity  sID  x  y  z  label
000100  0  face    0    1  1  1  F
000100  0  face    0    2  3  0  F
000100  0  face    0    3  1  2  F
000100  0  face    1    2  3  0  F
000100  0  face    1    4  5  4  F
000100  0  face    1    3  1  2  F
…       …  …       …    …  …  …  …

Vertex labels in the diagram: (1, 1, 1), (2, 3, 0), (3, 1, 2), (4, 5, 4)


[Figure 4 content: the data group with featureDef and numbered groups 0, 1, …, each holding object and feature groups; the example object is nucleus A at t = 0 with its GFP signal.]

Row of dataset 0 in the object group of group 0
000002  0  sphere  380  366  16.1  3.6  A

Dataset 0 in the feature group of group 0
ID      fID  value
000002  0    153
000002  1    214

featureDef
fID  name                       fUnit
0    average GFP signal         a.u.
1    center of mass GFP signal  a.u.

[Figure 5 content]

    <bdml version="3.0" xmlns="http://ssbd.qbic.riken.jp/bdml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ssbd.qbic.riken.jp/bdml http://ssbd.qbic.riken.jp/bdml/bdml3.0.xsd">
      <info>
        ...
      </info>
      <summary>
        ...
      </summary>
      <contact>
        <person>
          <first-name>Shuichi</first-name>
          <last-name>Onami</last-name>
          <ORCID>0000-0002-8255-1724</ORCID>
          <affiliation>
            ...
          </affiliation>
        </person>
      </contact>
      <methods>
        ...
      </methods>
      <extfile>
        <bd5File>wt_N2_030116_02_bd5.h5</bd5File>
      </extfile>
    </bdml>

[Figure 6 plots. Axes recovered from the panels: access time for BDML file (s) versus access time for BD5 file (s); BDML file size (bytes) versus BD5 file size (bytes).]

[Figure 7 plot. Axes: file size (bytes) versus access time (s). Legend: BDML files, BD5 files.]
