An Efficient and Extensible Format, Library, and API for Binary Trajectory Data from Molecular Simulations
Magnus Lundborg,[a] Rossen Apostolov,[b] Daniel Spångberg,[c] Anders Gärdenäs,[d]
David van der Spoel,[d] and Erik Lindahl*[a,e]
Molecular dynamics simulation is an important application in
theoretical chemistry, and with the large high-performance com-
puting resources available today the programs also generate
huge amounts of output data. In particular in life sciences, with
complex biomolecules such as proteins, simulation projects reg-
ularly deal with several terabytes of data. Apart from the need
for more cost-efficient storage, it is increasingly important to be
able to archive data, secure the integrity against disk or file
transfer errors, to provide rapid access, and facilitate exchange of
data through open interfaces. There is already a whole range of
different formats used, but few if any of them (including our pre-
vious ones) fulfill all these goals. To address these shortcomings,
we present “Trajectory Next Generation” (TNG)—a flexible but
highly optimized and efficient file format designed with interoperability
in mind. TNG provides both state-of-the-art multiframe
compression and a container framework that will make it
possible to extend it with new compression algorithms without
modifications in programs using it. TNG will be the new file for-
mat in the next major release of the GROMACS package, but it
has been implemented as a separate library and API with liberal
licensing to enable wide adoption both in academic and commercial
codes. © 2013 Wiley Periodicals, Inc.
DOI: 10.1002/jcc.23495
Introduction
Computer simulations constitute a major and powerful tool for
investigating the atomistic behavior of molecular systems, and
the rapid growth of computational power means many appli-
cations are generating more output data than ever. Both mas-
sively parallel single simulations on supercomputers and
distributed computing projects relying on ensemble modeling
can easily generate tens of terabytes of data for a single pro-
ject. In the past few decades, dozens of software packages
have been developed that implement methods such as molec-
ular dynamics (MD) or Monte Carlo for molecular simulations.
Because both these applications rely on inherently stochastic
processes to generate sufficient sampling of a complex system,
this data expansion is a natural consequence of advances in
the field, and if anything the growth rate is increasing. In
many cases, the transfer, storage, analysis, archival, and
post-simulation manipulation of data have become just as challenging
as the simulation itself. Both for our own molecular simulation
code and in the community, there is a shortage of
extensible file formats that combine the highest possible
compression, quick random access, all necessary information
contained in a single file that is easily exchanged, as well as
strong integrity checks and the ability to, for example, validate
data with modern digital signatures. There are many universal
data exchange formats that are highly flexible (and even preferable
in some cases), but molecular simulation data also has
very special requirements and lossy compression possibilities
that make more specific formats attractive. Development
and adoption of a common, well-designed standard for
data storage in particle simulations will thus bring great bene-
fits to all users and developers alike.
In this work, we present the specifications of a new strictly
specified format for storage of data obtained from molecular
simulations—Trajectory Next Generation (TNG). The standard
builds on a container-payload framework and is flexible, extensible,
and optimized for parallel I/O and multiframe compression,
and aims to address the shortcomings of existing formats such
as the XTC format previously used in GROningen MAchine for
Chemical Simulations (GROMACS). Both the TNG application
programming interface (API) and implementation are open
source and released under the revised BSD license, which in
[a] M. Lundborg, E. Lindahl
Department of Theoretical Physics and Swedish e-Science Research Center,
Royal Institute of Technology, Science for Life Laboratory, Box 1031, SE-171
21 Solna, Sweden
[b] R. Apostolov
PDC Center for High Performance Computing, Royal Institute of Technol-
ogy, Teknikringen 14, SE-100 44 Stockholm, Sweden and Science for Life
Laboratory, Box 1031, SE-171 21 Solna, Sweden
[c] D. Spångberg
Department of Chemistry - Ångström Laboratory, Uppsala Multidisciplinary
Center for Advanced Computational Methods (UPPMAX), Uppsala
University, Box 523, SE-751 20 Uppsala, Sweden
[d] A. Gärdenäs, D. van der Spoel
Department of Cell and Molecular Biology, Uppsala Center for Computa-
tional Chemistry, Uppsala University, Box 596, SE-751 24 Uppsala, Sweden
[e] E. Lindahl
Department of Biochemistry and Biophysics, Center for Biomembrane
Research, Stockholm University, 106 91, Stockholm, Sweden
E-mail: [email protected]
Contract grant sponsor: ERC award 209825; Contract grant sponsor:
ScalaLife project; Contract grant number: EU contract INFSO-RI-261523;
Contract grant sponsor: Swedish e-Science Research Center; Contract
grant sponsor: Swedish research council; Contract grant number: 2010-
491, 2010-5107
260 Journal of Computational Chemistry 2014, 35, 260–269 WWW.CHEMISTRYVIEWS.COM
SOFTWARE NEWS AND UPDATES WWW.C-CHEM.ORG
particular makes it possible to use it as a linked library in
commercial codes without any requirements on the license of
those codes.
MD Applications and Their File Formats
The MD method is a rather intuitive direct implementation of
Newton’s equations of motion and is applicable to a wide range
of molecular systems, which has led to the emergence of many
software applications that implement the method. Wikipedia
lists over 40[1] packages, although the real number of existing
codes is likely to be much larger—most scientists in the field
have written their own implementation at some point. Many of
these applications adopt their own formats for storage of trajec-
tory data; this is natural as the codes have had slightly different
goals, and historically storage was never a significant problem,
but in our opinion none of them (including GROMACS) fulfill all
the criteria we need from a modern simple day-to-day file for-
mat. Below we give an overview of some of the most popular
applications, the formats they use and the reason why (in our
opinion) each format was not sufficient for our usage case.
AMBER formats
AMBER[2,3] (Assisted Model Building with Energy Refinement) is
both a software package and a collection of force fields for
molecular mechanics modeling. It is one of the most widely
used programs and many research groups build additional
functionality on top of the core distribution.
Trajectory data from AMBER simulations are stored in
NetCDF[4,5] format (Network Common Data Form) developed by
Unidata.[6] In fact, the format simply specifies a set of conventions
that have to be used along with the NetCDF libraries. Those libra-
ries can represent arbitrary array-based data and have bindings
for many languages such as C/C++, Fortran (F77 and F90), Java, and
Python. The NetCDF libraries are portable and the format extensi-
ble, and a huge advantage is that it makes it possible to read arbi-
trary subsets of data into many analysis programs that support
NetCDF. It also supports basic nonlossy compression of data
using Zlib. However, the NetCDF library is large; the source code
is in fact even larger than AMBER itself (see Table 1). A few mega-
bytes of source code is not a problem for storage or transfer
today, but with GROMACS’ requirements to run on nonstandard
platforms such as Playstation3 (natively, not Linux), Field-Programmable
Gate Arrays, and future Application-Specific Integrated
Circuit chips, we might need to support every line of such
libraries ourselves, which is not realistic. The generality also makes
it difficult to use molecular simulation-specific lossy compression
without cancelling the advantages of NetCDF.
CHARMM formats
CHARMM[7] was one of the very first packages for molecular simu-
lation, and has very broad adoption with hundreds of modules
implemented. Trajectory data are stored in a DCD format—binary
FORTRAN files with atom coordinates (optionally velocities and
forces). The format does not use compression and does not sup-
port storage of additional data types. The files are not transport-
able between big-endian and little-endian computer
architectures, and require additional tools for conversion,[8] but
the DCD format, or slight variations of it, has been adopted by a
large number of codes, including, for example, the NAMD[9,10]
simulation package (focused on high-performance massively par-
allel simulation) and X-PLOR[11] (structure refinement). It is also
supported by a wide range of analysis and display programs.
Desmond formats
Desmond[12,13] is a software package developed to perform
high-speed MD simulations of biological systems on conven-
tional commodity clusters. Desmond stores trajectory data not in
a single file but in a collection of files each containing a number
of trajectory frames.[14] When the number of files is large, they
are organized in a directory structure. Metadata about the simu-
lation is stored in a separate metadata file. The topology of the
molecular system is saved separately. Various tools are used to
inspect and modify the data in the files as the internal binary
structure of the files is difficult to interpret. This is a very
powerful choice for the very largest simulations when the data
are only handled in the program itself and are not realistic to
compress on-the-fly. However, for more common usage cases, it
is a bit of a hurdle that there is no single file that can be trans-
ferred, and it is not trivial to adopt the Desmond format in other
applications as there is no open library/API for it.
GROMACS formats
GROMACS[15,16] is an MD package mainly used for biomolecu-
lar simulations. It focuses on achieving high performance and
portability across hardware systems, and for full disclosure it
should be noted that it is developed by our team. GROMACS
uses two kinds of formats for storage of output data. TRR is a
full-precision portable-binary data format, while XTC[17] is a
lossy compression format. The latter is available as part of the
xdrfile library. The XTC format is portable and offers very good
compression of the data. In fact, to the best of our knowledge,
it is more efficient than any other available format—if an XTC
file is compressed with gzip the file size increases by a small
fraction. However, it has a number of drawbacks such as no
possibility to store arbitrary user- or meta-data, no indexing
for fast searches (which is complicated when the size of
frames varies with lossy compression), the topology of the
Table 1. Approximate sizes of code bases of three major MD programs
and some data or trajectory I/O libraries (note that NetCDF and HDF5 are
obviously much more general than TNG).
Software   Version   Size (MB)
AMBER      11        4
GROMACS    4.6.3     11
NAMD       2.8       8
NetCDF               5
HDF5                 8
TNG        1.4       0.2
The size of AMBER includes only the compute engine sources; tests,
benchmarks etc. are excluded.
simulated system has to be read from a separate file, and
both the TRR and XTC formats were developed in the 1990s
and are limited to 2^32 particles.
LAMMPS formats
LAMMPS[18,19] (Large-scale Atomic/Molecular Massively Parallel
Simulator) is a classical MD code. It is designed to be highly
flexible and adaptable for simulations not only of biomolecular
systems but also polymers, solid-state materials, and coarse-
grained/mesoscopic systems, and supports a large number of
very special potentials. Due to the more universal nature of
LAMMPS compared to other MD packages, output data are
stored in ASCII text files with a very flexible format that allows
detailed description of arbitrary kinds of data. The data are
not compressed although gzipped files can be processed.
From our point-of-view, the LAMMPS file format is thus
impractical for storage of huge amounts of trajectory data.
Requirements
Because none of the present formats solved the needs we had
for a future standard file format, we have developed a new
container-type format named Trajectory Next Generation
(hereafter called TNG) that fulfills the following requirements:
• Fully architecture-independent, regarding both endianness and the ability to mix single/double precision trajectories and I/O libraries.
• It must be self-sufficient, that is, it should not require any other files for reading, and all the data should be contained in a single file for easy transport.
• Small footprint and high portability of the library, which should be easy to bundle by third parties, or even compile built-in as part of an MD package.
• Built-in support for storage of different data types, for example, arbitrary vectors or floats.
• Custom data storage. The format should be extensible (say if a user wants to store distance restraints statistics or something else), and other versions of the library should be able to skip blocks they cannot interpret.
• Support for future compression algorithms. Inspired by current multimedia formats, we want a format container that is easy to read with a standard API, but the payload itself should be possible to alter under-the-hood as new compression algorithms are implemented.
• To improve over XTC, we need temporal compression of data similar to multimedia formats, that is, compressing several frames as a block.
• Integrity check of data blocks using hashes. (Version 1 only supports MD5 sums in the block headers, but future versions will support digital signatures in each block.)
• Digital signatures to make it possible to guarantee what program and user generated a particular trajectory. This is critical to ensure data integrity, for example, in distributed computing projects where users receive credits for the amount of simulations they produce.
• Possibility to store extended meta-data with full information about simulated systems and conditions.
• Random access, using file pointers to quickly locate frames, even when the frames are of nonconstant size due to high compression.
• Efficient parallel I/O.
There are many other features that could be implemented
in a trajectory, in particular a complete description of the
topology of the system, but our aim here is to create a great-
est common divisor that can be used in lots of programs with
very little work rather than aiming to describe every single
component of any simulation. To fulfill the above require-
ments, TNG has been developed along the following specifica-
tions as released in version 1 of the format; whereas a normal
developer should likely just use the public API and open
library, the specification is intended to make it possible to
reimplement support of the format from scratch.
Methods
Specifications
A TNG file is made up of a number of data blocks. Numerical
values for fields can be integers (64 bit), floats (32 bit), or doubles
(64 bit), and are stored in big- or little-endian byte order (constant
throughout the file with automatic conversion if the hardware
architecture changes). Some flags are also stored as a single
byte (8 bits). When creating a TNG file, the endianness defaults
to the endianness of the hardware creating the file. This is a
change from XTC, which always used network (big-endian) byte order. The
reason for this is that most current architectures are little
endian, and in benchmarking we realized that an unacceptably
large part of the I/O time both during writing and reading
was spent on entirely unnecessary endian conversions. TNG
still performs automatic transparent conversion to and from
the native endianness for all numerical fields, but only when
necessary. Strings are encoded as UTF-8 and limited to 1024
bytes, including the terminating null character. Empty strings
consist of only the terminating null character. This limit is
deliberate, as it makes it possible to use the format even in
languages that do not support runtime memory allocation,
and arbitrary-length strings can still be stored as an array of
strings. Each block contains an MD5 hash of the block contents
to verify the integrity of the data, which enables checks
against corruption during file transfer or hard disk failures.
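The encoding rules above can be illustrated with a short Python sketch. It packs a 64-bit integer in a chosen byte order and encodes a string under the 1024-byte limit. The function names (`pack_int64`, `unpack_int64`, `encode_tng_string`) and the choice to reject over-long strings rather than truncate them are assumptions made for illustration; they are not taken from the TNG library.

```python
import struct
import sys

MAX_STR_LEN = 1024  # limit stated in the text, including the terminating null


def _prefix(byte_order):
    # struct uses "<" for little endian and ">" for big endian
    return "<" if byte_order == "little" else ">"


def pack_int64(value, byte_order=sys.byteorder):
    """Pack a signed 64-bit integer; defaults to the native byte order."""
    return struct.pack(_prefix(byte_order) + "q", value)


def unpack_int64(data, byte_order=sys.byteorder):
    """Unpack 8 bytes written in the given byte order."""
    return struct.unpack(_prefix(byte_order) + "q", data)[0]


def encode_tng_string(text):
    """UTF-8 encode with a terminating null (hypothetical helper)."""
    raw = text.encode("utf-8") + b"\x00"
    if len(raw) > MAX_STR_LEN:
        raise ValueError("string exceeds 1024 bytes including terminator")
    return raw
```

A reader on a machine with the same byte order as the file can skip the conversion entirely, which is the saving the text describes; only a mismatch requires passing the non-native order to `unpack_int64`.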
Each block contains the following fields as a header:
• Size of the block header (integer).
• Size of the block contents (except header) (integer).
• Block identifier (integer).
• MD5 hash (16 characters).
• Block name (string).
• Version of the block with respect to the block identifier (integer).
• ID of alternative hash (integer) (if not using MD5).
• Length of alternative hash in bytes (integer).
• Alternative hash (number of bytes determined by the previous value).
• ID of digital signature (integer).
• Length of digital signature in bytes (integer).
• Digital signature (number of bytes determined by the previous value).
The header is directly followed by the block contents, which
vary from block to block. Please note that the current version
of the API does not support an alternative hash (only MD5) or
a digital signature. This will be added soon.
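As a rough illustration of the header layout, the following Python sketch writes and reads back the first six header fields and checks the MD5 hash. It assumes that the fields are stored consecutively in the order listed, that the hash is 16 raw bytes computed over the block contents only, that the name is null-terminated, and that the alternative hash and signature fields are absent. None of these layout details are confirmed by the text, and `build_block` and `parse_block` are hypothetical names, not TNG API functions.

```python
import hashlib
import struct


def build_block(block_id, name, version, contents, order="<"):
    """Serialize six header fields followed by the contents (sketch only)."""
    name_raw = name.encode("utf-8") + b"\x00"
    md5 = hashlib.md5(contents).digest()  # 16 bytes
    # three 64-bit ints + hash + name + one 64-bit int
    header_size = 8 + 8 + 8 + 16 + len(name_raw) + 8
    header = struct.pack(order + "qqq", header_size, len(contents), block_id)
    header += md5 + name_raw + struct.pack(order + "q", version)
    return header + contents


def parse_block(data, order="<"):
    """Parse a block made by build_block and verify its MD5 hash."""
    header_size, contents_size, block_id = struct.unpack_from(order + "qqq", data, 0)
    md5 = data[24:40]
    name_end = data.index(b"\x00", 40)  # position of the terminating null
    name = data[40:name_end].decode("utf-8")
    (version,) = struct.unpack_from(order + "q", data, name_end + 1)
    contents = data[header_size:header_size + contents_size]
    hash_ok = hashlib.md5(contents).digest() == md5
    return {"id": block_id, "name": name, "version": version,
            "contents": contents, "hash_ok": hash_ok}
```

Because the header stores its own size and the size of the contents, a reader that does not recognize a block identifier can still skip to the next block, which is what makes the format extensible.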
Description of blocks
There are two general kinds of blocks in a TNG file, namely
information, or metadata, blocks (see Table 2), and data blocks
(see Table 3). The information blocks describe the TNG file and
divide it into different sections, whereas the data blocks reflect
the state of the simulation, or of the particles, at a specific
point in time or in general. Initially, the most obvious way of
storing data would appear to be a single block for each time-
step, but to enable temporal compression of data, and to
more quickly access data from a specific frame, we have
designed the trajectory to be divided into frame sets, each
containing a number of frames or timesteps. The numbering
of frames is zero-based, meaning that the first frame is frame
index 0. The TNG format does not strictly define what an indi-
vidual frame has to be. From the file point-of-view, it does not
matter if the time between frames is set according to the data
with the highest output frequency or if there is, for example,
one frame per MD simulation step (which would mean that
not all frame indices are present in the trajectory). The latter is
recommended, but as long as the Time per frame in TRAJEC-
TORY FRAME SET blocks and the stride length in data blocks
are properly set, any scheme can be used for storing the data,
and it will always be possible to read back. A TRAJECTORY
FRAME SET block marks the beginning of a frame set, and all
subsequent blocks until the next TRAJECTORY FRAME SET
block are considered part of that frame set. Data blocks that
contain data that do not change during the simulation (frame-
independent data) should be placed before the first TRAJEC-
TORY FRAME SET block, whereas frame-dependent data blocks
are placed inside frame sets (i.e., after a TRAJECTORY FRAME
SET block). Data can be particle specific, such as POSITIONS,
VELOCITIES, or PARTIAL CHARGES or general, for example, BOX
SHAPE. Frame sets can also be divided by separate particle
mapping blocks, allowing parallel writing of data from distrib-
uted simulations where particles can move between nodes. In
that case, there are one or more PARTICLE MAPPING blocks in
the frame set, each followed by the data blocks related to
those particles. If parallel writing is not required, there is no
need for more than one PARTICLE MAPPING block, and usually
none whatsoever. It is also possible to change the number of
molecules from one frame set to another, which is necessary
to simulate grand canonical ensembles. In that case, the num-
ber of frames per frame set must be set to 1 (if the number of
particles can change every frame), which means that multi-
frame compression cannot be used. Figure 1 provides an over-
view of the file structure.
The GENERAL INFO block is the only mandatory block and
contains the following data fields:
• Name and version of the program used to perform the simulation (on file creation) (string).
• Name and version of the program used when last modifying the file (string).
• Name of the person who created the file (string).
• Name of the person who last modified the file (string).
Table 2. Information blocks containing general information about the file and also dividing the file into frame sets, that is, chunks of frames and their related data blocks.
Block name Block identifier Block description
GENERAL INFO 0x0000000000000000 Information about the file creation, internal file pointers to first/last frame sets, and so forth.
MOLECULES 0x0000000000000001 Descriptions of molecules, chains, residues and atoms.
TRAJECTORY FRAME SET 0x0000000000000002 Marks the beginning of a frame set (a sequence of frames).
PARTICLE MAPPING 0x0000000000000003 A mapping between the particle numbering inside the frame set and the molecule particle numbers.
Table 3. Data blocks that can contain general data or particle related data.
Block name Block identifier Block description
BOX SHAPE 0x0000000010000000 Dimensions of the periodic box (nine values per frame).
POSITIONS 0x0000000010000001 Particle coordinates (three values per particle and frame).
VELOCITIES 0x0000000010000002 Particle velocities (three values per particle and frame).
FORCES 0x0000000010000003 Forces on particles (three values per particle and frame).
PARTIAL CHARGES 0x0000000010000004 Partial charges of particles (one value per particle and frame).
FORMAL CHARGES 0x0000000010000005 Formal charges of particles (one value per particle and frame).
B-FACTORS 0x0000000010000006 B-factors (temperature factors) of particles (one value per particle and frame).
ANISOTROPIC B-FACTORS 0x0000000010000007 Anisotropic B-factors of particles (six or nine values per particle and frame).
OCCUPANCY 0x0000000010000008 Occupancy of particles (one value per particle and frame).
Block name is a descriptive name of the block, whereas the block identifier is a unique enumeration of the block.
• Name of computer/other info where the file was created (string).
• Name of computer/other info where the file was last modified (string).
• PGP signature of the user who created the file (string).
• PGP signature of the user who last modified the file (string).
• Name of the force field used to perform the simulation (string).
• Time of initial file creation, UTC time zone seconds (also called Unix seconds) since GMT 01-01-1970 00:00:00 (integer). The 64-bit representation makes sure that the format can be used for another 500 billion years.
• Time of completing the simulation, UTC time zone seconds since GMT 01-01-1970 00:00:00 (integer).
• Use variable number of atoms in frames (8 bit flag). If set to TRUE, the number of each molecule is specified in the TRAJECTORY FRAME SET block.
• Number of frames in each frame set (integer). This is the expected number of frames in each frame set, but it does not have to be constant. It is OK to have frame sets with fewer or more frames, for example, after concatenating multiple trajectory files. This avoids the need to recompress all data after a concatenation, but it means that searching for a specific frame might need a few more steps between frame sets. For simulations using a grand canonical ensemble, it is best to set this to 1 so that the number of atoms in the frame sets can be updated regularly.
• Pointer to the file position of the beginning of the first TRAJECTORY FRAME SET block (integer).
• Pointer to the file position of the beginning of the last TRAJECTORY FRAME SET block (integer). (Updated when finishing writing the trajectory file; otherwise set to -1.)
• Length of steps (number of “TRAJECTORY FRAME SET blocks”) for medium stride pointers (integer). By default, it is set to 100 “TRAJECTORY FRAME SET blocks.”
• Length of steps (number of “TRAJECTORY FRAME SET blocks”) for long stride pointers (integer). By default, it is set to 10 000 “TRAJECTORY FRAME SET blocks.”
• Exponential of unit used for distance measurements (integer). By default, it is set to -9 (i.e., nm).
The MOLECULES block contains a description of the mole-
cules in the system. The numbers of each molecule can
change during the simulation, but the composition must be
constant.
• Number of molecules (integer).
• For each molecule:
  • Molecule ID (integer).
  • Molecule name (string).
  • Quaternary structure, for example, one means monomeric, four means tetrameric, and so forth (integer).
  • Number of molecules of this kind; only used if not using “variable number of atoms” in the “general info block” (integer).
  • Number of chains in the molecule (integer).
  • Number of residues in the molecule (integer).
  • Number of atoms in the molecule (integer).
  • For each chain:
    • Chain ID (integer). Unique in each molecule.
    • Chain name (string).
    • Number of residues in the chain (integer).
    • For each residue:
      • Residue ID (integer). Unique in the chain (or molecule if there is no chain).
      • Residue name (string).
      • Number of atoms in the residue (integer).
      • For each atom (in the molecule if there is no residue):
        • Atom ID (integer). Unique in the molecule.
        • Atom name (string).
        • Atom type (string).
  • Number of bonds in the molecule (integer).
  • For each bond:
    • From Atom ID (integer).
    • To Atom ID (integer).
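The hierarchy described by the MOLECULES block can be mirrored in memory with a few nested record types. The Python sketch below is only one possible representation. All class and attribute names are illustrative assumptions, and the stored counts (chains, residues, atoms, bonds) are implied by list lengths rather than kept as separate fields.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Atom:
    atom_id: int      # unique in the molecule
    name: str
    atom_type: str


@dataclass
class Residue:
    residue_id: int   # unique in the chain
    name: str
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    chain_id: int     # unique in the molecule
    name: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Molecule:
    molecule_id: int
    name: str
    quaternary_structure: int  # 1 = monomeric, 4 = tetrameric, ...
    count: int                 # "number of molecules of this kind"
    chains: List[Chain] = field(default_factory=list)
    bonds: List[Tuple[int, int]] = field(default_factory=list)  # (from, to) atom IDs

    def n_atoms(self):
        return sum(len(r.atoms) for c in self.chains for r in c.residues)


def total_particles(molecules):
    """Particle count of the system when the molecule counts are fixed."""
    return sum(m.count * m.n_atoms() for m in molecules)
```

For example, a system of 1000 water molecules is described by a single `Molecule` entry with `count=1000` and three atoms, so the topology stays small regardless of system size.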
This first version of the TNG format is somewhat biomole-
cule centric, mainly because this is the area in which we have
Figure 1. Schematic overview of the TNG file structure. The blocks with a
dashed outline can be any number of data blocks. “Constant data” repre-
sents data blocks containing data that does not change during the trajec-
tory. “Variable data” contains data that is modified in the trajectory, for
example, particle positions. FRAME SET represents TRAJECTORY FRAME SET
blocks and PARTICLE MAP. corresponds to PARTICLE MAPPING blocks. (a)
and (b) show files without and with PARTICLE MAPPING blocks,
respectively.
both most experience and needs. This means that chains and
residues need to be recorded without names to, for example,
store crystals of atoms that do not use any residue or chain
information. However, this is a good example of the extensibil-
ity of the format: The MOLECULES block of future versions will
be made more general, but the API will handle different ver-
sions without user intervention.
TRAJECTORY FRAME SET blocks indicate the beginning of a
frame set and divide the trajectory data into smaller chunks.
The pointers enable fast access to any frame set, or frame, in a
large trajectory (see Fig. 2).
• Number of the first frame of the frame set (integer).
• Number of frames in the frame set (integer).
• If the Variable number of atoms flag in the GENERAL INFO block is set to TRUE: Array of integers specifying the count of each molecule type. The molecule types are listed in the MOLECULES block and should be listed in the same order here. This is used, for example, for simulations using a grand canonical ensemble (in which case the number of frames in each frame set should be 1).
• Pointer to the next TRAJECTORY FRAME SET block (integer).
• Pointer to the previous TRAJECTORY FRAME SET block (integer).
• Medium stride pointer to the next, for example, 100th TRAJECTORY FRAME SET block (integer). (Medium stride length specified in the GENERAL INFO block.)
• Medium stride pointer to the previous, for example, 100th TRAJECTORY FRAME SET block (integer).
• Long stride pointer to the next, for example, 10 000th TRAJECTORY FRAME SET block (integer). (Long stride length specified in the GENERAL INFO block.)
• Long stride pointer to the previous, for example, 10 000th TRAJECTORY FRAME SET block (integer).
• Time stamp (in seconds) of first frame in frame set (double).
• Time (in seconds) per frame (double).
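To see why the stride pointers speed up random access, the Python sketch below counts how many pointer hops of each kind are needed to move between two frame sets. The greedy strategy (long strides first, then medium, then single steps) and the function names are assumptions for illustration; the text does not prescribe a search strategy. `frame_set_index` additionally assumes that every frame set holds the expected number of frames, which the format does not guarantee.

```python
def plan_hops(current, target, medium=100, long=10_000):
    """Count pointer hops from frame set `current` to frame set `target`.

    Returns (n_long, n_medium, n_single, direction), where direction is
    +1 for "next" pointers and -1 for "previous" pointers.
    """
    distance = abs(target - current)
    direction = 1 if target >= current else -1
    n_long, rest = divmod(distance, long)
    n_medium, n_single = divmod(rest, medium)
    return n_long, n_medium, n_single, direction


def frame_set_index(frame, frames_per_set):
    """Frame set holding `frame`, assuming constant frame set size."""
    return frame // frames_per_set
```

With the default stride lengths, reaching frame set 25 317 from the start of the file takes 2 long, 53 medium, and 17 single hops, that is, 72 header reads instead of 25 317. When frame sets vary in size, a reader would also have to compare the first-frame numbers stored in the visited frame set blocks.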
PARTICLE MAPPING blocks contain a mapping between the
particle numbers in the molecular system and the order in
which the particles are written in the frame set. If no PARTICLE
MAPPING block is present in a frame set, the particle number-
ing is the same as the particle numbering in the molecular
system. If there is at least one PARTICLE MAPPING block in a
frame set, all particles with data stored in the frame set must
be present in a PARTICLE MAPPING block. This in turn makes it
possible to store only data of specific particles, for example,
charges in a protein but not for any water molecules.
• Number of first particle (particle number as stored in the molecular system, zero-based numbering) (integer).
• Number of particles in this particle mapping block (integer).
• Array of particle numbers: each value is the number of the particle in the molecular system corresponding to the particle number as stored in the trajectory (integer).
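The effect of the mapping can be sketched in Python as follows. The example assumes that entry i of a mapping array gives the molecular-system number of the i-th particle written after that PARTICLE MAPPING block; the function name and the in-memory representation are hypothetical.

```python
def scatter_to_system_order(n_particles, blocks):
    """Reassemble per-particle values into molecular-system order.

    `blocks` is a list of (mapping, values) pairs, one per PARTICLE
    MAPPING block, where mapping[i] is the molecular-system particle
    number of values[i]. Particles absent from every mapping stay None.
    """
    result = [None] * n_particles
    for mapping, values in blocks:
        if len(mapping) != len(values):
            raise ValueError("mapping and data lengths differ")
        for system_number, value in zip(mapping, values):
            result[system_number] = value
    return result
```

Each writing node can thus emit its own mapping block and data in whatever order it holds the particles, and a subset of particles (for instance only the protein) can be stored by simply omitting the others from all mappings.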
Data blocks are used for storing generic data. Frame-
dependent data blocks are located after the TRAJECTORY
FRAME SET block to which they belong. Frame- and particle-
dependent data blocks should follow the relevant particle
mapping block (if using any particle mapping block).
• Data type flag (8 bit flag). 0 = character/string data, 1 = integer data, 2 = float data (32 bit), and 3 = double data (64 bit).
• Dependency flag (8 bit flag). 1 = frame-dependent, 2 = particle-dependent. Can be combined, that is, 3 = frame- and particle-dependent.
• Sparse data flag to signify if not all frames in the frame sets have data entries in this data block, for example, energies and positions might be saved at different intervals meaning that at least one of them would be saved as sparse data (8 bit flag). Only present if the data are frame-dependent.
• Number of values (integer). If the data is frame-dependent, this is the number of values per frame. If the data is particle-dependent, this is the number of values per particle (per frame).
• ID of the CODEC used to compress the positions (integer).
• Multiplier for integers to obtain the appropriate floating point number, for compressed frames (double). Only present if the above CODEC id is >0 and if the data type is double or float.
• If using sparse data the following fields are required:
  • Number of first frame containing data (integer).
  • Number of frames between data points (integer).
Figure 2. Pointer structure of a TNG file. From the GENERAL INFO block,
there are pointers to the first and last frame set. In each TRAJECTORY
FRAME SET block, there are pointers to the next and previous frame sets
and also certain numbers of frame sets ahead and back (determined by
the length of steps of medium and long stride pointers, set in the GEN-
ERAL INFO block). The numbers indicate the frame set number.
• Particle-dependent data blocks contain the following fields:
  • Number of the first particle in the data block (integer). This must be the same as in the preceding PARTICLE MAPPING block, if present. This allows writing data starting from a certain atom in the molecular system.
  • Number of particles in the data block (integer). This must be the same as in the preceding PARTICLE MAPPING block, if present.
• Continuous field of data.
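A few of the data block fields can be interpreted with the short Python sketch below. It decodes the dependency flag, lists the frames that carry data in a sparse block, and converts stored integers back to floating point values with the multiplier. The helper names are hypothetical, and the sparse-frame function assumes that the "first frame containing data" is an absolute frame number lying inside the frame set.

```python
FRAME_DEPENDENT = 1     # dependency flag values from the field list
PARTICLE_DEPENDENT = 2  # can be combined: 3 = frame- and particle-dependent


def describe_dependency(flag):
    """Return (frame_dependent, particle_dependent) as booleans."""
    return bool(flag & FRAME_DEPENDENT), bool(flag & PARTICLE_DEPENDENT)


def frames_with_data(first_frame, stride, set_first_frame, set_n_frames):
    """Frame numbers inside one frame set that carry data in a sparse block."""
    end = set_first_frame + set_n_frames
    return list(range(first_frame, end, stride))


def dequantize(int_values, multiplier):
    """Recover floating point values from integers stored by a lossy codec."""
    return [v * multiplier for v in int_values]
```

For instance, a multiplier of 0.001 corresponds to positions stored with a precision of 0.001 length units, so the stored integer 1234 is read back as approximately 1.234.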
Data stored in data blocks (see Table 3) can be compressed.
The first temporal multiframe trajectory NG compression for-
mat (which we term TNG-MF1 to avoid confusion with the
container format)[20] can be used for efficiently compressing
positions and velocities. Zlib compression is included as a lossless
alternative that is not as efficient, but can be used
for compressing any kind of data.
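The lossless property of Zlib can be illustrated with Python's standard `zlib` module. This is only a sketch of the general principle; it does not reproduce how the TNG library lays out or compresses its data blocks, and the helper names and the sample energy values are made up for the example.

```python
import struct
import zlib


def compress_values(values):
    """Pack doubles as little-endian bytes and compress them with zlib."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(raw)


def decompress_values(blob):
    """Invert compress_values and return the original list of doubles."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical energy series; any kind of numerical data could be used.
energies = [-1234.5 + 0.25 * i for i in range(1000)]
blob = compress_values(energies)
restored = decompress_values(blob)
print(restored == energies)
```

Because the round trip returns exactly the original values, this kind of compression is suitable for data where no loss of precision is acceptable.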
In addition to the data block types mentioned in Table 3,
any number (limited only by the 64-bit block identifier) of cus-
tom data blocks can be added and used to store strings or
numerical data. To identify blocks, we use a hexadecimal 64-
bit representation, where the upper 32 bits denote the devel-
oper, or group, responsible for the data block type, and the
lower 32 bits the type of data. User IDs in the range 0x0–
0x0FFFFFFF are reserved for current and future internal TNG
user specifications. A second range of 0x10000000–0x1FFFFFFF
is reserved for program-specific, or other registered, users. An
official user ID in this range can be reserved by contacting the
authors, and in the future by registering through a web page
that can also be queried. This way it will be possible for any-
body to identify the author of an arbitrary block, and also
guarantee that registered user IDs will never clash. Finally, the
user ID range 0x20000000–0xFFFFFFFF is unofficial, freely avail-
able, and can be used by anybody without registering, but on
the other hand, it will not be possible to predict who the user
of a block is. Similarly, the data ID part of the block ID is
divided into four ranges. The range 0x0–0x0FFFFFFF denotes
reserved information blocks and the range 0x10000000–
0x1FFFFFFF reserved data blocks. For registered user IDs, all
blocks in this range should also be registered so they can be
queried. In contrast, the ranges 0x20000000–0x2FFFFFFF and
0x30000000 through 0x3FFFFFFF can be used freely for nonre-
gistered information and data blocks, respectively.
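The block identifier scheme can be sketched in Python as follows. The ranges are taken from the text above; the function names and range labels are hypothetical and are not part of the TNG API. Data IDs above 0x3FFFFFFF are not described in the text and are reported as such.

```python
USER_ID_RANGES = [
    (0x00000000, 0x0FFFFFFF, "reserved for TNG"),
    (0x10000000, 0x1FFFFFFF, "registered user"),
    (0x20000000, 0xFFFFFFFF, "unofficial user"),
]
DATA_ID_RANGES = [
    (0x00000000, 0x0FFFFFFF, "reserved information block"),
    (0x10000000, 0x1FFFFFFF, "reserved data block"),
    (0x20000000, 0x2FFFFFFF, "nonregistered information block"),
    (0x30000000, 0x3FFFFFFF, "nonregistered data block"),
]


def _lookup(value, ranges):
    """Return the label of the range that contains value."""
    for low, high, label in ranges:
        if low <= value <= high:
            return label
    return "not described in the text"


def make_block_id(user_id, data_id):
    """Combine a 32-bit user ID (upper half) and a 32-bit data ID (lower half)."""
    if not (0 <= user_id <= 0xFFFFFFFF and 0 <= data_id <= 0xFFFFFFFF):
        raise ValueError("user_id and data_id must each fit in 32 bits")
    return (user_id << 32) | data_id


def describe_block_id(block_id):
    """Split a 64-bit block ID and classify both halves."""
    user_id = block_id >> 32
    data_id = block_id & 0xFFFFFFFF
    return (user_id, data_id,
            _lookup(user_id, USER_ID_RANGES),
            _lookup(data_id, DATA_ID_RANGES))


block_id = make_block_id(0x20000001, 0x30000002)
print(hex(block_id), describe_block_id(block_id)[2:])
```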
Compression of trajectory data
The compression algorithms by Spangberg et al.[20] are included
in the TNG library. These compression algorithms (now called
TNG-MF1) work on the atomic positions and velocities. The sin-
gle or double precision required for conservation of energy
when performing the MD-simulations is usually unnecessarily
high for the properties determined directly from the trajectory,
such as distribution functions and correlation functions, and so
forth. Therefore, the compression algorithms use user-
configurable reduced precision, by quantizing the data into
integers. Atoms close in order in each trajectory frame are often
close in space, so the differences in coordinates are
often smaller than the absolute values of the coordinates. Also,
a given atom typically does not move very far between two
frames. In the case of velocities, the values do not change
much between two frames, given that frames are stored often.
Therefore, both spatial and temporal compression can be used
to reduce the size of the trajectory data. Although this is a very
efficient compression scheme when the properties of the data
are known, it is also orthogonal to the concepts used in general
data formats such as NetCDF or HDF5, which is one of the
main reasons for an MD-specific format.
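The idea of quantizing to reduced precision and then storing differences can be sketched as follows. This is a simplified Python illustration of the principle only, not the TNG-MF1 implementation; the function names and coordinate values are made up, and the precision of 0.001 nm is the value used in the benchmarks in this work.

```python
def quantize(values, precision):
    """Map floating point values to integers at the given precision."""
    return [round(v / precision) for v in values]


def delta_encode(integers):
    """Keep the first value and replace the rest by differences to the previous one."""
    return integers[:1] + [b - a for a, b in zip(integers, integers[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# x coordinates (nm) of four atoms that are neighbours in the atom order.
x = [1.234, 1.251, 1.302, 1.298]
q = quantize(x, 0.001)        # [1234, 1251, 1302, 1298]
deltas = delta_encode(q)      # [1234, 17, 51, -4]
print(q, deltas, delta_decode(deltas) == q)
```

The differences are much smaller in magnitude than the absolute values and can therefore be stored with fewer bits. The same encoding can be applied along the time axis, using the same atom in consecutive frames instead of consecutive atoms in one frame.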
There are four basic algorithms used to compress either abso-
lute values of coordinates or velocities or differences of coordi-
nates or velocities: Variable number of bits storage of individual
values, storing three values (x, y, and z) together as a single num-
ber with variable base, grouping several consecutive atoms close
in space together and coding the storage of them efficiently,
and finally, utilizing block-sorting algorithms. The actual com-
pression algorithms implemented in TNG-MF1 are combinations
of the above basic algorithms. The speeds of the algorithms are
quite different, and therefore they are sorted into different com-
pression levels, allowing the user to choose the appropriate level
of compression ratio and compression speed. The compression
algorithms to use for a given set of coordinates and velocities
are determined automatically, and the algorithm chosen is
returned to the caller, to allow the use of the same predeter-
mined algorithms on subsequent data-blocks. To the best of our
knowledge, TNG-MF1 is the most efficient molecular simulation
data compression format available to date. It is also possible to
compress any data block with other compression algorithms.
Currently, Zlib is the only additional algorithm supported by the
API, but we expect this to be extended in the future, for instance by
spectral recompression algorithms that can only be
used after the simulation has been completed.[21]
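Two of the basic ideas above, variable number of bits and storing three values as a single number with variable base, can be illustrated with a short Python sketch. It shows the principle only and is not the coding used in TNG-MF1; the function names are hypothetical, and the values are assumed to be non-negative (signed differences would first have to be mapped to non-negative integers).

```python
def bits_needed(max_value):
    """Number of bits required to store any integer in 0..max_value."""
    return max(1, max_value.bit_length())


def pack_triplet(x, y, z, base):
    """Store three integers in 0..base-1 as one number in 0..base**3-1."""
    if not all(0 <= v < base for v in (x, y, z)):
        raise ValueError("values must lie in 0..base-1")
    return (x * base + y) * base + z


def unpack_triplet(packed, base):
    """Invert pack_triplet."""
    packed, z = divmod(packed, base)
    x, y = divmod(packed, base)
    return x, y, z


base = 6                                  # each value lies in 0..5
separate_bits = 3 * bits_needed(base - 1) # three values stored one by one
packed_bits = bits_needed(base**3 - 1)    # the triplet stored as one number
packed = pack_triplet(3, 5, 2, base)
print(separate_bits, packed_bits, packed, unpack_triplet(packed, base))
```

With a base of 6, storing the three values separately takes 9 bits, whereas the packed triplet fits in 8 bits, because the base does not have to be a power of two.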
API Implementation
The TNG format is released as a standalone library with the main
API written in C. On top of this, there are thin C++ and FORTRAN 77
layers to allow access from a variety of programming
languages. Note that the FORTRAN implementation requires
CRAY pointers, which are technically not part of the FORTRAN 77
standard, but available in virtually all compilers anyhow. The API
is divided into a low-level and a high-level set of functions. These
two sets of functions can be referred to as a low-level and a
high-level API, but they can be used interchangeably. The low-
level API provides functions granting fine-grained control of the
data, whereas the functions of the high-level API are simplified
and use sensible default values to make it easier to use. All func-
tions of the API use a “tng_” prefix and in the high-level API the
prefix is further extended to “tng_util_.”
The routines in the TNG-MF1 compression library, which is
bundled with the TNG API, are prefixed “tng_compress_.” They
all perform memory to memory compression and uncompres-
sion. When compressing, it is necessary to specify whether
positions or velocities are compressed, as different algorithms
are effective for positions or velocities. The type of
compressed data is stored in the compressed data block, so
for uncompression only a single routine is necessary. The data
type to compress/uncompress can either be double precision,
single precision, or (prequantized) integer data.
Currently, the released GROMACS-4.6 uses the XTC and TRR
formats for storing trajectories, but from version 5.0 the TNG for-
mat will be used instead. In addition to making the API and
library available as open source, we have chosen the revised BSD
license to make it possible to link any code in the world with the
library without any restrictions on redistribution, but still encour-
age other teams to contribute new compression algorithms to
the library itself. We also provide converters to/from commonly
used file formats, such as AMBER NetCDF and PDB.
Results
Output benchmarks
To test the performance of the file format and the API, GROMACS
4.6.3 was modified to enable writing TNG files, mainly using the
high-level API. Four test systems of varying sizes were used, one of which was run
for two different simulation lengths:
• One molecule of ethanol was solvated by 640 water molecules. The simulation was run for 5 ns. The positions and the shape of the periodic box were saved every 10 ps.
• The RNase ZF-1A (2026 atoms) was solvated by approximately 4900 water molecules. The system charge was neutralized by six Cl− ions. The simulation was only run for 400 ps. Trajectory snapshots were saved to file every 200 fs, which is considerably more frequent than in most normal simulations of a biomolecular system, although, for example, velocity autocorrelation analysis might require even denser frames.
• N-acetylaladanamide (NAAA) in a POPC bilayer has previously been simulated by Murugan et al.[22] The system consisted of 17,417 atoms, including 3572 water molecules. In this study, the simulation time was 2 ns. Positions and the shape of the periodic box were saved every 4 ps.
• The Kv1.2 ion channel[23] was simulated for both 100 ps and 500 ps with 5 fs time steps and trajectory snapshots were saved every 1 ps. There were approximately 121,000 particles (including 8920 virtual sites), two thirds of which were water, in the system.
For all simulations, the stride lengths of the data blocks
were set to the writing frequency (in MD steps) of the simula-
tion, that is, 5000, 100, 2000, and 200 frames, respectively. The
number of frames per frame set was set to have 100 written
frames per frame set, that is, 100 times the writing frequency
of the simulation (500,000, 10,000, 200,000, and 20,000 frames
per frame set). The precision of the compressed coordinates in
the TNG file was set to 0.001 nm, which was also the precision
of the XTC compression.
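The relation between the stride lengths and the frame set sizes can be checked with a few lines of Python. The strides and output intervals are taken from the text; the variable names are illustrative. The time step is computed as output interval divided by stride; only the 5 fs time step of the Kv1.2 system is stated explicitly, so the other values are implied by the numbers rather than reported.

```python
WRITTEN_FRAMES_PER_FRAME_SET = 100

# system: (stride in MD steps, output interval in fs)
systems = {
    "Ethanol": (5000, 10_000),
    "RNase ZF-1A": (100, 200),
    "NAAA in POPC": (2000, 4_000),
    "Kv1.2": (200, 1_000),
}

frames_per_frame_set = {}
implied_time_step_fs = {}
for name, (stride, interval_fs) in systems.items():
    # One frame set spans 100 written frames, i.e. 100 strides of MD steps.
    frames_per_frame_set[name] = WRITTEN_FRAMES_PER_FRAME_SET * stride
    implied_time_step_fs[name] = interval_fs / stride
    print(name, frames_per_frame_set[name], implied_time_step_fs[name])
```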
For comparison, a standard GROMACS 4.6.3 version was
used with the same setups, but writing to a compressed XTC
file instead of a TNG file. For each system, the simulation was
performed three times and the average times are reported in
Table 4. The RNase ZF-1A TRR file was also converted to DCD,
NetCDF, and LAMMPS formats to compare the resulting file
size (see Fig. 3). Apparently, mdconvert[24] does not compress
Table 4. Comparison of times and file sizes using a modified version of GROMACS writing TNG files compared to a standard version of GROMACS writing XTC files.

| System | Output type | Total CPU time (s) | Trajectory writing time (s)[a] | % of time spent writing trajectory | Output file size (MB)[b] |
|---|---|---|---|---|---|
| Ethanol | TNG | 1046.4 | 0.8 | 0.07 | 3.3 |
| Ethanol | XTC | 1073.5 | 0.6 | 0.06 | 3.3 |
| RNase ZF-1A | TNG | 360.1 | 8.7 | 2.42 | 99.8 |
| RNase ZF-1A | XTC | 360.0 | 7.5 | 2.10 | 116.8 |
| NAAA in POPC membrane | TNG | 3046.0 | 3.1 | 0.10 | 29.2 |
| NAAA in POPC membrane | XTC | 3020.7 | 2.1 | 0.07 | 30.8 |
| Kv1.2 ion channel 100 ps | TNG | 530.4 | 6.4 | 1.21 | 74.8 |
| Kv1.2 ion channel 100 ps | XTC | 513.8 | 3.7 | 0.72 | 88.7 |
| Kv1.2 ion channel 500 ps | TNG | 2450.7 | 21.6 | 0.88 | 370.5 |
| Kv1.2 ion channel 500 ps | XTC | 2481.1 | 16.0 | 0.64 | 441.7 |

[a] Includes writing the TRR file. [b] The TRR file sizes were 11, 385, 100, 279, and 1391 MB, respectively. The reported times are averaged from three simulation runs. Because the simulation time varies from one simulation to another (load balancing), the column with the percentage of time spent writing is the most relevant one for comparing the time differences.
Figure 3. Comparison of file sizes of different file formats. mdconvert[24]
(from the MDTraj library) was used for converting the TRR file from the
RNase simulation to DCD and NetCDF, whereas VMD was used for convert-
ing to LAMMPS. The LAMMPS file in this case was 10.63 times larger than
the TNG file, whereas for TRR, DCD, and NetCDF the size factor to TNG was
3.86 and for XTC it was 1.17.
the NetCDF data, but TNG clearly performs well. The bench-
marks show that the molecular system has a large impact on
the efficiency of the TNG-MF1 compression. Even denser frame
storage or a smaller fraction of water could amplify the
advantage, but possibly at the cost of time.
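The size factors relative to TNG follow directly from the file sizes in Table 4 and its footnote. The short Python calculation below uses only those reported numbers; the variable names are illustrative.

```python
# system: (TNG size, XTC size, TRR size), all in MB, from Table 4
sizes_mb = {
    "Ethanol": (3.3, 3.3, 11),
    "RNase ZF-1A": (99.8, 116.8, 385),
    "NAAA in POPC": (29.2, 30.8, 100),
    "Kv1.2 100 ps": (74.8, 88.7, 279),
    "Kv1.2 500 ps": (370.5, 441.7, 1391),
}

# Size of XTC and TRR relative to TNG, rounded to two decimals.
ratios = {
    name: (round(xtc / tng, 2), round(trr / tng, 2))
    for name, (tng, xtc, trr) in sizes_mb.items()
}

for name, (xtc_ratio, trr_ratio) in ratios.items():
    print(f"{name}: XTC/TNG = {xtc_ratio}, TRR/TNG = {trr_ratio}")
```

For the RNase ZF-1A system this reproduces the factors 1.17 (XTC) and 3.86 (TRR) quoted in the caption of Figure 3, whereas for the small ethanol system the XTC and TNG files have the same size.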
Input benchmarks
There is not yet any tool that reads both XTC and TNG files.
This will be amended in the near future. To compare the reading
speeds of the trajectory files from the simulations,
tng_io_read_pos_util (from the example files in the TNG
library) was used to read TNG files and gmxdump, from GRO-
MACS 4.6.3, was used to read XTC and TRR files. Both these
tools read coordinates from the files as well as simulation box
shapes. The TNG reading tool does not output the simulation
box shape every frame, whereas gmxdump does that. The output
from both programs was directed to the null device to
reduce the influence of terminal output on the results. Each
file was read three times and the execution times were meas-
ured using the Debian GNU/Linux tool “time.” The best “real”
time output is reported in Table 5. In four of the five cases,
the TNG files were quicker to read and in one case slower.
There are currently no benchmarks comparing TNG reading
speeds to other formats than XTC and TRR, but it is expected
that reading other uncompressed formats (e.g., DCD) would
be comparable to reading TRR.
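A comparable timing procedure (three runs, output sent to the null device, best wall-clock time kept) could be scripted as in the following Python sketch. It is not the procedure used in this work, which relied on the "time" tool; the helper name is hypothetical, and the command shown is a placeholder that would have to be replaced by the actual invocation of the reading tool.

```python
import subprocess
import sys
import time


def best_wall_time(command, repeats=3):
    """Run command several times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard standard output so that terminal output does not affect the timing.
        subprocess.run(command, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Placeholder command: substitute the trajectory reading tool and its arguments.
    placeholder = [sys.executable, "-c", "print('frame data')"]
    print(f"best of 3: {best_wall_time(placeholder):.3f} s")
```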
API Examples
The Supporting Information contains a number of brief exam-
ples of writing and reading TNG files using the high- and low-
level APIs (Listings S1–S4).
Conclusions
This work presents both the development and full specifica-
tion of the first public release of the portable binary trajec-
tory file format TNG, along with an API to facilitate easy
access to data. The file format supports state-of-the-art com-
pression, by using the previously described trajectory NG
compression algorithm,[20] referred to as TNG-MF1. The file
format is flexible enough to ensure that custom data can
easily be stored. The small benchmarks performed here
show that writing a TNG file with compressed data blocks is
only slightly slower than writing an XTC file, but the
resulting file is clearly smaller for some of the systems
tested, more noticeably when saving frames often. Systems
consisting mainly of water are almost as well compressed
by XTC, whereas systems with a smaller fraction of water
are more efficiently compressed by TNG-MF1. Picking which
TNG-MF1 compression algorithm to use, which is only done
once, takes longer than compressing the data, which means
that writing very short TNG trajectories is not as efficient,
which can be seen by comparing the two simulation
lengths of the Kv1.2 system in Table 4. The data storage
saved by switching to this new file format can be consider-
able in most MD groups. Using a system with more atoms
and writing output less often, the time required for writing
trajectories (regardless of format) will be overshadowed by
the calculation times. The TNG file includes a description of
the molecular system, but that accounts for a negligible
portion of the file size. Because the Trajectory NG compres-
sion algorithm[20] is dependent on the number of frames in
each compressed block, it is possible that it could be made
more efficient by tuning the number of frames per frame
set in the TNG file or the selection of compression algo-
rithms used. Such optimization was beyond the scope of
this work, but the advantage of the format is that this type
of modification can easily be implemented in future versions
without requiring developers or users to alter their
API calls. No significant effort has been put into optimizing
the TNG API or the compression algorithm. It is possible
that future versions can be made faster than the current
version, but currently this is not a bottleneck.
Version 1.4 of the API has been released and is available
through the git repository of the project, git://git.gromacs.org/tng.git.
Future versions of the API will be fully backwards compatible
and the version numbering of the blocks in the TNG
file ensures that future modifications of the file format can be
handled gracefully.
Acknowledgment
Computational resources were provided by the Swedish
National Infrastructure for Computing (025/12-32).
Keywords: molecular dynamics simulation · file format · API · compression
How to cite this article: M. Lundborg, R. Apostolov, D. Spangberg, A. Gärdenäs, D. van der Spoel, E. Lindahl. J. Comput. Chem. 2014, 35, 260–269. DOI: 10.1002/jcc.23495
Additional Supporting Information may be found in the
online version of this article.
[1] List of Software for Molecular Mechanics Modeling. Available at: http://en.wikipedia.org/w/index.php?title=List_of_software_for_molecular_mechanics_modeling&oldid=564406774. Accessed on November 17, 2013.
[2] R. Salomon-Ferrer, D. Case, R. Walker, WIREs Comput. Mol. Sci. 2013, 3, 198.
Table 5. Time to read TNG, XTC, and TRR (uncompressed single precision) files from the five benchmark simulations.

| System | N particles | N frames | TNG (s) | XTC (s) | TRR (s) |
|---|---|---|---|---|---|
| Ethanol | 1929 | 501 | 1.4 | 1.0 | 1.0 |
| RNase ZF-1A | 16,816 | 2001 | 24.6 | 33.6 | 33.7 |
| NAAA in POPC | 17,417 | 501 | 6.6 | 8.7 | 8.8 |
| Kv1.2 100 ps | 121,449 | 201 | 21.2 | 26.0 | 26.4 |
| Kv1.2 500 ps | 121,449 | 1001 | 104.8 | 129.3 | 131.1 |
[3] The Amber Molecular Dynamics Package. Available at: http://ambermd.org/. Accessed on November 17, 2013.
[4] R. Rew, G. Davis, IEEE Comput. Graph. Appl. 1990, 10, 76.
[5] S. A. Brown, M. Folk, G. Goucher, R. Rew, Comput. Phys. 1993, 7, 304.
[6] Unidata | NetCDF. Available at: http://www.unidata.ucar.edu/software/netcdf/. Accessed on November 17, 2013.
[7] B. R. Brooks, C. L. Brooks, III, A. D. MacKerell Jr., L. Nilsson, R. J. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L. Caves, Q. Cui, A. R. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W. Im, K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, R. W. Pastor, C. B. Post, J. Z. Pu, M. Schaefer, B. Tidor, R. M. Venable, H. L. Woodcock, X. Wu, W. Yang, D. M. York, M. Karplus, J. Comput. Chem. 2009, 30, 1545.
[8] File formats. Available at: http://www.ks.uiuc.edu/Research/namd/2.6/ug/node13.html. Accessed on November 17, 2013.
[9] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, K. Schulten, J. Comput. Chem. 2005, 26, 1781.
[10] NAMD - Scalable Molecular Dynamics. Available at: http://www.ks.uiuc.edu/Research/namd/. Accessed on November 17, 2013.
[11] A. T. Brünger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 1998, 54, 905.
[12] K. J. Bowers, E. Chow, H. Xu, R. O. Dror, M. P. Eastwood, B. A. Gregersen, J. L. Klepeis, I. Kolossvary, M. A. Moraes, F. D. Sacerdoti, J. K. Salmon, Y. Shan, D. E. Shaw, In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM: New York, 2006.
[13] D. E. Shaw Research. Available at: http://www.deshawresearch.com/resources_desmond.html. Accessed on November 17, 2013.
[14] Desmond Users Guide, Section 13.1, Desmond Molecular Dynamics System, Version 3.4.0/0.7.2, D. E. Shaw Research: New York, 2013.
[15] B. Hess, C. Kutzner, D. van der Spoel, E. Lindahl, J. Chem. Theory Comput. 2008, 4, 435.
[16] S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, E. Lindahl, Bioinformatics 2013, 29, 845.
[17] D. G. Green, K. E. Meacham, M. Surridge, F. van Hoesel, H. J. C. Berendsen, In Methods and Techniques in Computational Chemistry: METECC-95; E. Clementi, G. Corongiu, Eds.; STEF: Cagliari, 1995; pp. 435–463.
[18] S. Plimpton, J. Comput. Phys. 1995, 117, 1.
[19] LAMMPS Molecular Dynamics Simulator. Available at: http://lammps.sandia.gov/. Accessed on November 17, 2013.
[20] D. Spangberg, D. S. D. Larsson, D. van der Spoel, J. Mol. Model. 2011, 17, 2669.
[21] T. Meyer, C. Ferrer-Costa, A. Pérez, M. Rueda, A. Bidon-Chanal, F. J. Luque, C. A. Laughton, M. Orozco, J. Chem. Theory Comput. 2006, 2, 251.
[22] N. A. Murugan, R. Apostolov, Z. Rinkevicius, J. Kongsted, E. Lindahl, H. Agren, J. Am. Chem. Soc. 2013, 135, 13590.
[23] P. Bjelkmar, P. S. Niemelä, I. Vattulainen, E. Lindahl, PLoS Comput. Biol. 2009, 5, e1000289.
[24] Command-Line Trajectory Conversion: mdconvert. Available at: http://mdtraj.s3.amazonaws.com/mdconvert.html. Accessed on November 17, 2013.
Received: 31 August 2013; Revised: 24 October 2013; Accepted: 1 November 2013; Published online on 20 November 2013