An efficient and extensible format, library, and API for binary trajectory data from molecular simulations


Magnus Lundborg,[a] Rossen Apostolov,[b] Daniel Spångberg,[c] Anders Gärdenäs,[d]

David van der Spoel,[d] and Erik Lindahl*[a,e]

Molecular dynamics simulation is an important application in

theoretical chemistry, and with the large high-performance com-

puting resources available today the programs also generate

huge amounts of output data. In particular in life sciences, with

complex biomolecules such as proteins, simulation projects reg-

ularly deal with several terabytes of data. Apart from the need

for more cost-efficient storage, it is increasingly important to be

able to archive data, secure the integrity against disk or file

transfer errors, to provide rapid access, and facilitate exchange of

data through open interfaces. There is already a whole range of

different formats used, but few if any of them (including our pre-

vious ones) fulfill all these goals. To address these shortcomings,

we present “Trajectory Next Generation” (TNG)—a flexible but

highly optimized and efficient file format designed with intero-

perability in mind. TNG both provides state-of-the-art multiframe

compression as well as a container framework that will make it

possible to extend it with new compression algorithms without

modifications in programs using it. TNG will be the new file for-

mat in the next major release of the GROMACS package, but it

has been implemented as a separate library and API with liberal

licensing to enable wide adoption both in academic and commercial

codes. © 2013 Wiley Periodicals, Inc.

DOI: 10.1002/jcc.23495

Introduction

Computer simulations constitute a major and powerful tool for

investigating the atomistic behavior of molecular systems, and

the rapid growth of computational power means many appli-

cations are generating more output data than ever. Both mas-

sively parallel single simulations on supercomputers and

distributed computing projects relying on ensemble modeling

can easily generate tens of terabytes of data for a single pro-

ject. In the past few decades, dozens of software packages

have been developed that implement methods such as molec-

ular dynamics (MD) or Monte Carlo for molecular simulations.

Because both these applications rely on inherently stochastic

processes to generate sufficient sampling of a complex system,

this data expansion is a natural consequence of advances in

the field, and if anything the growth rate is increasing. In

many cases, the transfer, storage, analysis, archival, and post

simulation manipulation of data has become just as challeng-

ing as the simulation itself. Both for our own molecular simula-

tion code and in the community, there is a shortage of

extensible file formats that both provide the highest compres-

sion possible, quick random access, all necessary information

contained in a single file that is easily exchanged, as well as

strong integrity checks and the ability to, for example, validate

data with modern digital signatures. There are many universal

data exchange formats that are highly flexible (and even preferable

in some cases), but molecular simulation data also have

very special requirements and lossy compression possibilities

that make more specific formats attractive. Development

and adoption of a common, well-designed standard for

data storage in particle simulations will thus bring great bene-

fits to all users and developers alike.

In this work, we present the specifications of a new strictly

specified format for storage of data obtained from molecular

simulations—Trajectory Next Generation (TNG). The standard

builds on a container-payload framework and is flexible, exten-

sible, optimized for parallel I/O, and multiframe compression,

and aims to address the shortcomings of existing formats such

as the XTC format previously used in GROningen MAchine for

Chemical Simulations (GROMACS). Both the TNG application

programming interface (API) and implementation are open

source and released under the revised BSD license, which in

[a] M. Lundborg, E. Lindahl

Department of Theoretical Physics and Swedish e-Science Research Center,

Royal Institute of Technology, Science for Life Laboratory, Box 1031, SE-171

21 Solna, Sweden

[b] R. Apostolov

PDC Center for High Performance Computing, Royal Institute of Technol-

ogy, Teknikringen 14, SE-100 44 Stockholm, Sweden and Science for Life

Laboratory, Box 1031, SE-171 21 Solna, Sweden

[c] D. Spångberg

Department of Chemistry – Ångström Laboratory, Uppsala Multidisciplinary

Center for Advanced Computational Methods (UPPMAX), Uppsala

University, Box 523, SE-751 20 Uppsala, Sweden

[d] A. Gärdenäs, D. van der Spoel

Department of Cell and Molecular Biology, Uppsala Center for Computa-

tional Chemistry, Uppsala University, Box 596, SE-751 24 Uppsala, Sweden

[e] E. Lindahl

Department of Biochemistry and Biophysics, Center for Biomembrane

Research, Stockholm University, 106 91, Stockholm, Sweden

E-mail: [email protected]

Contract grant sponsor: ERC award 209825; Contract grant sponsor:

ScalaLife project; Contract grant number: EU contract INFSO-RI-261523;

Contract grant sponsor: Swedish e-Science Research Center; Contract

grant sponsor: Swedish research council; Contract grant number: 2010-

491, 2010-5107


260 Journal of Computational Chemistry 2014, 35, 260–269 WWW.CHEMISTRYVIEWS.COM

SOFTWARE NEWS AND UPDATES WWW.C-CHEM.ORG

particular makes it possible to use as a linked library in com-

mercial codes without any requirements on the license of

those codes.

MD Applications and Their File Formats

The MD method is a rather intuitive direct implementation of

Newton’s equation of motion and is applicable to a wide range

of molecular systems, which has led to the emergence of many

software applications that implement the method. Wikipedia

lists over 40[1] packages, although the real number of existing

codes is likely to be much larger—most scientists in the field

have written their own implementation at some point. Many of

these applications adopt their own formats for storage of trajec-

tory data; this is natural as the codes have had slightly different

goals, and historically storage was never a significant problem,

but in our opinion none of them (including GROMACS) fulfill all

the criteria we need from a modern simple day-to-day file for-

mat. Below we give an overview of some of the most popular

applications, the formats they use and the reason why (in our

opinion) each format was not sufficient for our usage case.

AMBER formats

AMBER[2,3] (Assisted Model Building with Energy Refinement) is

both a software package and a collection of force fields for

molecular mechanics modeling. It is one of the most widely

used programs and many research groups build additional

functionality on top of the core distribution.

Trajectory data from AMBER simulations are stored in

NetCDF[4,5] format (Network Common Data Form) developed by

Unidata.[6] In fact, the format simply specifies a set of conventions

that have to be used along with the NetCDF libraries. Those libra-

ries can represent arbitrary array-based data and have bindings

for many languages such as C/C++, Fortran (F77 and F90), Java, and

Python. The NetCDF libraries are portable and the format extensi-

ble, and a huge advantage is that it makes it possible to read arbi-

trary subsets of data into many analysis programs that support

NetCDF. It also supports basic nonlossy compression of data

using Zlib. However, the NetCDF library is large; the source code

is in fact even larger than AMBER itself (see Table 1). A few mega-

bytes of source code is not a problem for storage or transfer

today, but with GROMACS’ requirements to run on nonstandard

platforms such as PlayStation 3 (natively, not Linux), Field-Programmable

Gate Arrays and future Application-Specific Integrated

Circuit chips, we might need to support every line of such

libraries ourselves, which is not realistic. The generality also makes

it difficult to use molecular simulation-specific lossy compression

without cancelling the advantages of NetCDF.

CHARMM formats

CHARMM[7] was one of the very first packages for molecular simu-

lation, and has very broad adoption with hundreds of modules

implemented. Trajectory data are stored in a DCD format—binary

FORTRAN files with atom coordinates (optionally velocities and

forces). The format does not use compression and does not sup-

port storage of additional data types. The files are not transport-

able between big-endian and little-endian computer

architectures, and require additional tools for conversion,[8] but

the DCD format, or slight variations of it, has been adopted by a

large number of codes, including, for example, the NAMD[9,10]

simulation package (focused on high-performance massively par-

allel simulation) and X-PLOR[11] (structure refinement). It is also

supported by a wide range of analysis and display programs.

Desmond formats

Desmond[12,13] is a software package developed to perform

high-speed MD simulations of biological systems on conven-

tional commodity clusters. Desmond stores trajectory data not in

a single file but in a collection of files each containing a number

of trajectory frames.[14] When the number of files is large, they

are organized in a directory structure. Metadata about the simu-

lation is stored in a separate metadata file. The topology of the

molecular system is saved separately. Various tools are used to

inspect and modify the data in the files as the internal binary

structure of the files is difficult to interpret. This is a very

powerful choice for the very largest simulations when the data

are only handled in the program itself and are not realistic to

compress on-the-fly. However, for more common usage cases, it

is a bit of a hurdle that there is no single file that can be trans-

ferred, and it is not trivial to adopt the Desmond format in other

applications as there is no open library/API for it.

GROMACS formats

GROMACS[15,16] is an MD package mainly used for biomolecu-

lar simulations. It focuses on achieving high performance and

portability across hardware systems, and for full disclosure it

should be noted that it is developed by our team. GROMACS

uses two kinds of formats for storage of output data. TRR is a

full-precision portable-binary data format, while XTC[17] is a

lossy compression format. The latter is available as part of the

xdrfile library. The XTC format is portable and offers very good

compression of the data. In fact, to the best of our knowledge,

it is more efficient than any other available format—if an XTC

file is compressed with gzip the file size increases by a small

fraction. However, it has a number of drawbacks such as no

possibility to store arbitrary user- or meta-data, no indexing

for fast searches (which is complicated when the size of

frames varies with lossy compression), the topology of the

Table 1. Approximate sizes of code bases of three major MD programs and some data or trajectory I/O libraries (note that NetCDF and HDF5 are obviously much more general than TNG).

Software   Version   Size (MB)
AMBER      11        4
GROMACS    4.6.3     11
NAMD       2.8       8
NetCDF               5
HDF5                 8
TNG        1.4       0.2

The size of AMBER includes only the compute engine sources; tests, benchmarks, etc. are excluded.


simulated system has to be read from a separate file, and

both the TRR and XTC formats were developed in the 1990s

and are limited to 2^32 particles.

LAMMPS formats

LAMMPS[18,19] (Large-scale Atomic/Molecular Massively Parallel

Simulator) is a classical MD code. It is designed to be highly

flexible and adaptable for simulations not only of biomolecular

systems but also polymers, solid-state materials, and coarse-

grained/mesoscopic systems, and supports a large number of

very special potentials. Due to the more universal nature of

LAMMPS compared to other MD packages, output data are

stored in ASCII text files with a very flexible format that allows

detailed description of arbitrary kinds of data. The data are

not compressed although gzipped files can be processed.

From our point-of-view, the LAMMPS file format is thus

impractical for storage of huge amounts of trajectory data.

Requirements

Because none of the present formats solved the needs we had

for a future standard file format, we have developed a new

container-type format named Trajectory Next Generation

(hereafter called TNG) that fulfills the following requirements:

• Fully architecture-independent, regarding both endianness and the ability to mix single/double precision trajectories and I/O libraries.
• It must be self-sufficient, that is, it should not require any other files for reading, and all the data should be contained in a single file for easy transport.
• Small footprint and high portability of the library and easy to bundle by third parties, or even compile built-in as part of an MD package.
• Built-in support for storage of different data types, for example, arbitrary vectors or floats.
• Custom data storage. The format should be extensible (say if a user wants to store distance restraints statistics or something else), and other versions of the library should be able to skip blocks they cannot interpret.
• Support for future compression algorithms. Inspired by current multimedia formats, we want a format container that is easy to read with a standard API, but the payload itself should be possible to alter under-the-hood as new compression algorithms are implemented.
• To improve over XTC, we need temporal compression of data similar to multimedia formats, that is, compressing several frames as a block.
• Integrity check of data blocks using hashes. (Version 1 only supports MD5 sums in the block headers, but future versions will support digital signatures in each block.)
• Digital signatures to make it possible to guarantee what program and user generated a particular trajectory. This is critical to ensure data integrity, for example, in distributed computing projects where users receive credits for the amount of simulations they produce.
• Possibility to store extended meta-data with full information about simulated systems and conditions.
• Random access, using file pointers to quickly locate frames, even when the frames are of nonconstant size due to high compression.
• Efficient parallel I/O.

There are many other features that could be implemented

in a trajectory, in particular a complete description of the

topology of the system, but our aim here is to create a great-

est common divisor that can be used in lots of programs with

very little work rather than aiming to describe every single

component of any simulation. To fulfill the above require-

ments, TNG has been developed along the following specifica-

tions as released in version 1 of the format; whereas a normal

developer should likely just use the public API and open

library, the specification is intended to make it possible to

reimplement support of the format from scratch.

Methods

Specifications

A TNG file is made up of a number of data blocks. Numerical

values for fields can be integers (64 bit), floats (32 bit), or doubles

(64 bit), and are stored in big- or little-endian byte order (constant

throughout the file with automatic conversion if the hardware

architecture changes). Some flags are also stored as a single

byte (8 bits). When creating a TNG file, the endianness defaults

to the endianness of the hardware creating the file. This is a

change from XTC that always used network (big) endian. The

reason for this is that most current architectures are little

endian, and in benchmarking we realized that an unacceptably

large part of the I/O time both during writing and reading

was spent on entirely unnecessary endian conversions. TNG

still performs automatic transparent conversion to and from

the native endianness for all numerical fields, but only when

necessary. Strings are encoded as UTF-8 and limited to 1024

bytes, including the terminating null character. Empty strings

consist of only the terminating null character. This limit is

deliberate, as it makes it possible to use the format even in

languages that do not support runtime memory allocation,

and arbitrary-length strings can still be stored as an array of

strings. Each block contains an MD5 hash of the block con-

tents to verify the integrity of the data—to enable checks

against corruptions during file transfer or hard disk failures.

Each block contains the following fields as a header:

• Size of the block header (integer).
• Size of the block contents (except header) (integer).
• Block identifier (integer).
• MD5 hash (16 characters).
• Block name (string).
• Version of the block with respect to the block identifier (integer).
• ID of alternative hash (integer) (if not using MD5).
• Length of alternative hash in bytes (integer).
• Alternative hash (number of bytes determined by the previous value).
• ID of digital signature (integer).
• Length of digital signature in bytes (integer).
• Digital signature (number of bytes determined by the previous value).

The header is directly followed by the block contents, which

vary from block to block. Please note that the current version

of the API does not support an alternative hash (only MD5) or

a digital signature. This will be added soon.
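The header layout listed above can be illustrated with a minimal Python sketch. It is not the TNG library: the function names `pack_string`, `pack_block`, and `unpack_block` are hypothetical, only the first six header fields are modeled (the alternative hash and digital signature fields are left out), and it assumes that the header size counts every header byte, including the two size fields themselves.

```python
import hashlib
import struct


def pack_string(text: str, limit: int = 1024) -> bytes:
    """Encode a string as null-terminated UTF-8, at most `limit` bytes in total."""
    raw = text.encode("utf-8") + b"\x00"
    if len(raw) > limit:
        raise ValueError("string exceeds the 1024-byte limit")
    return raw


def pack_block(block_id: int, name: str, version: int, contents: bytes,
               byte_order: str = "<") -> bytes:
    """Build header + contents; byte_order is '<' (little) or '>' (big endian)."""
    md5 = hashlib.md5(contents).digest()  # 16 raw bytes
    name_raw = pack_string(name)
    # Assumption: the header size includes all header fields.
    header_size = 8 + 8 + 8 + 16 + len(name_raw) + 8
    header = (struct.pack(byte_order + "qqq", header_size, len(contents), block_id)
              + md5 + name_raw + struct.pack(byte_order + "q", version))
    return header + contents


def unpack_block(buf: bytes, byte_order: str = "<") -> dict:
    """Parse a block written by pack_block and verify its MD5 hash."""
    header_size, contents_size, block_id = struct.unpack_from(byte_order + "qqq", buf, 0)
    md5 = buf[24:40]
    name_end = buf.index(b"\x00", 40)  # the block name is null-terminated
    name = buf[40:name_end].decode("utf-8")
    (version,) = struct.unpack_from(byte_order + "q", buf, name_end + 1)
    contents = buf[header_size:header_size + contents_size]
    return {"id": block_id, "name": name, "version": version,
            "contents": contents, "md5_ok": hashlib.md5(contents).digest() == md5}
```

A reader that does not recognize a block identifier could still skip the block, because the two size fields at the start of the header give the offset of the next block.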

Description of blocks

There are two general kinds of blocks in a TNG file, namely

information, or metadata, blocks (see Table 2), and data blocks

(see Table 3). The information blocks describe the TNG file and

divide it in different sections, whereas the data blocks reflect

the state of the simulation, or of the particles, at a specific

point in time or in general. Initially, the most obvious way of

storing data would appear to be a single block for each time-

step, but to enable temporal compression of data, and to

more quickly access data from a specific frame, we have

designed the trajectory to be divided into frame sets, each

containing a number of frames or timesteps. The numbering

of frames is zero-based, meaning that the first frame is frame

index 0. The TNG format does not strictly define what an indi-

vidual frame has to be. From the file point-of-view, it does not

matter if the time between frames is set according to the data

with the highest output frequency or if there is, for example,

one frame per MD simulation step (which would mean that

not all frame indices are present in the trajectory). The latter is

recommended, but as long as the Time per frame in TRAJEC-

TORY FRAME SET blocks and the stride length in data blocks

are properly set, any scheme can be used for storing the data,

and it will always be possible to read back. A TRAJECTORY

FRAME SET block marks the beginning of a frame set, and all

subsequent blocks until the next TRAJECTORY FRAME SET

block are considered part of that frame set. Data blocks that

contain data that do not change during the simulation (frame-

independent data) should be placed before the first TRAJEC-

TORY FRAME SET block, whereas frame-dependent data blocks

are placed inside frame sets (i.e., after a TRAJECTORY FRAME

SET block). Data can be particle specific, such as POSITIONS,

VELOCITIES, or PARTIAL CHARGES or general, for example, BOX

SHAPE. Frame sets can also be divided by separate particle

mapping blocks, allowing parallel writing of data from distrib-

uted simulations where particles can move between nodes. In

that case, there are one or more PARTICLE MAPPING blocks in

the frame set, each followed by the data blocks related to

those particles. If parallel writing is not required, there is no

need for more than one PARTICLE MAPPING block, and usually

none whatsoever. It is also possible to change the number of

molecules from one frame set to another, which is necessary

to simulate grand canonical ensembles. In that case, the num-

ber of frames per frame set must be set to 1 (if the number of

particles can change every frame), which means that multi-

frame compression cannot be used. Figure 1 provides an over-

view of the file structure.
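The ordering rules described above (frame-independent blocks first, then one TRAJECTORY FRAME SET block per frame set followed by its data blocks) can be sketched as a simple grouping pass. The function name `group_blocks` and the flat list of identifiers are assumptions made for illustration; a real reader would walk the file using the sizes stored in each block header.

```python
# Block identifiers taken from Tables 2 and 3.
GENERAL_INFO = 0x0
MOLECULES = 0x1
TRAJECTORY_FRAME_SET = 0x2
PARTICLE_MAPPING = 0x3
BOX_SHAPE = 0x10000000
POSITIONS = 0x10000001
PARTIAL_CHARGES = 0x10000004


def group_blocks(block_ids):
    """Split a flat sequence of block identifiers into the frame-independent
    prefix and a list of frame sets. Each frame set is the list of blocks that
    follow one TRAJECTORY FRAME SET block, up to the next such block."""
    constant, frame_sets = [], []
    for block_id in block_ids:
        if block_id == TRAJECTORY_FRAME_SET:
            frame_sets.append([])            # a new frame set starts here
        elif frame_sets:
            frame_sets[-1].append(block_id)  # belongs to the current frame set
        else:
            constant.append(block_id)        # placed before the first frame set
    return constant, frame_sets
```

For example, a file holding constant partial charges and two frame sets of box shapes and positions would be grouped into one prefix list and two frame-set lists.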

The GENERAL INFO block is the only mandatory block and

contains the following data fields:

• Name and version of the program used to perform the simulation (on file creation) (string).
• Name and version of the program used when last modifying the file (string).
• Name of the person who created the file (string).
• Name of the person who last modified the file (string).

Table 2. Information blocks containing general information about the file and also dividing the file into frame sets, that is, chunks of frames and their related data blocks.

Block name            Block identifier     Block description
GENERAL INFO          0x0000000000000000   Information about the file creation, internal file pointers to first/last frame sets, and so forth.
MOLECULES             0x0000000000000001   Descriptions of molecules, chains, residues and atoms.
TRAJECTORY FRAME SET  0x0000000000000002   Marks the beginning of a frame set (a sequence of frames).
PARTICLE MAPPING      0x0000000000000003   A mapping between the particle numbering inside the frame set and the molecule particle numbers.

Table 3. Data blocks that can contain general data or particle related data.

Block name             Block identifier     Block description
BOX SHAPE              0x0000000010000000   Dimensions of the periodic box (nine values per frame).
POSITIONS              0x0000000010000001   Particle coordinates (three values per particle and frame).
VELOCITIES             0x0000000010000002   Particle velocities (three values per particle and frame).
FORCES                 0x0000000010000003   Forces on particles (three values per particle and frame).
PARTIAL CHARGES        0x0000000010000004   Partial charges of particles (one value per particle and frame).
FORMAL CHARGES         0x0000000010000005   Formal charges of particles (one value per particle and frame).
B-FACTORS              0x0000000010000006   B-factors (temperature factors) of particles (one value per particle and frame).
ANISOTROPIC B-FACTORS  0x0000000010000007   Anisotropic B-factors of particles (six or nine values per particle and frame).
OCCUPANCY              0x0000000010000008   Occupancy of particles (one value per particle and frame).

Block name is a descriptive name of the block, whereas the block identifier is a unique enumeration of the block.




• Name of computer/other info where the file was created (string).
• Name of computer/other info where the file was last modified (string).
• PGP signature of the user who created the file (string).
• PGP signature of the user who last modified the file (string).
• Name of the force field used to perform the simulation (string).
• Time of initial file creation, UTC time zone seconds (also called Unix seconds) since GMT 01-01-1970 00:00:00 (integer). The 64-bit representation makes sure that the format can be used for another 500 billion years.
• Time of completing the simulation, UTC time zone seconds since GMT 01-01-1970 00:00:00 (integer).
• Use variable number of atoms in frames (8 bit flag). If set to TRUE, the number of each molecule is specified in the TRAJECTORY FRAME SET block.
• Number of frames in each frame set (integer). This is the expected number of frames in each frame set, but it does not have to be constant. It is OK to have frame sets with fewer or more frames, for example, after concatenating multiple trajectory files. This avoids the need to recompress all data after a concatenation, but it means that searching for a specific frame might need a few more steps between frame sets. For simulations using a grand canonical ensemble, it is best to set this to 1 so that the number of atoms in the frame sets can be updated regularly.
• Pointer to the file position of the beginning of the first TRAJECTORY FRAME SET block (integer).
• Pointer to the file position of the beginning of the last TRAJECTORY FRAME SET block (integer). (Updated when finishing writing the trajectory file; otherwise set to −1.)
• Length of steps (number of “TRAJECTORY FRAME SET blocks”) for medium stride pointers (integer). By default, it is set to 100 “TRAJECTORY FRAME SET blocks.”
• Length of steps (number of “TRAJECTORY FRAME SET blocks”) for long stride pointers (integer). By default, it is set to 10 000 “TRAJECTORY FRAME SET blocks.”
• Exponential of unit used for distance measurements (integer). By default, it is set to −9 (i.e., nm).
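As a small worked example of the distance-unit field, the sketch below converts a stored length between two unit exponents. It assumes the exponent is a power of ten applied to metres, so that nm corresponds to −9; the function name `convert_length` is hypothetical.

```python
def convert_length(value: float, from_exponent: int, to_exponent: int) -> float:
    """Convert a length stored in units of 10**from_exponent metres into
    units of 10**to_exponent metres (assumption: the base unit is the metre)."""
    return value * 10.0 ** (from_exponent - to_exponent)
```

With the default exponent of −9, a stored value of 1.5 (nm) corresponds to 15 in units with exponent −10 (ångström).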

The MOLECULES block contains a description of the mole-

cules in the system. The numbers of each molecule can

change during the simulation, but the composition must be

constant.

• Number of molecules (integer).
• For each molecule:
    • Molecule ID (integer).
    • Molecule name (string).
    • Quaternary structure, for example, one means monomeric, four means tetrameric, and so forth (integer).
    • Number of molecules of this kind; only used if not using “variable number of atoms” in the “general info block” (integer).
    • Number of chains in the molecule (integer).
    • Number of residues in the molecule (integer).
    • Number of atoms in the molecule (integer).
    • For each chain:
        • Chain ID (integer). Unique in each molecule.
        • Chain name (string).
        • Number of residues in the chain (integer).
        • For each residue:
            • Residue ID (integer). Unique in the chain (or in the molecule if there is no chain).
            • Residue name (string).
            • Number of atoms in the residue (integer).
            • For each atom (in the molecule if there is no residue):
                • Atom ID (integer). Unique in the molecule.
                • Atom name (string).
                • Atom type (string).
    • Number of bonds in the molecule (integer).
    • For each bond:
        • From Atom ID (integer).
        • To Atom ID (integer).
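The hierarchy of the MOLECULES block (molecule, chain, residue, atom, plus bonds per molecule) maps naturally onto nested records. The following sketch is only an in-memory illustration: the class and attribute names are hypothetical and do not come from the TNG API, and the helper `total_particles` assumes a fixed number of copies of each molecule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Atom:
    id: int          # unique in the molecule
    name: str
    type: str


@dataclass
class Residue:
    id: int          # unique in the chain
    name: str
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    id: int          # unique in the molecule
    name: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Molecule:
    id: int
    name: str
    quaternary_structure: int   # 1 = monomeric, 4 = tetrameric, ...
    count: int                  # number of molecules of this kind
    chains: List[Chain] = field(default_factory=list)
    bonds: List[Tuple[int, int]] = field(default_factory=list)  # (from atom ID, to atom ID)

    def n_atoms(self) -> int:
        """Number of atoms in one copy of the molecule."""
        return sum(len(r.atoms) for c in self.chains for r in c.residues)


def total_particles(molecules: List[Molecule]) -> int:
    """Total particle count of the system for a fixed composition."""
    return sum(m.count * m.n_atoms() for m in molecules)
```

A system of 1000 water molecules, each with one chain, one residue, three atoms, and two bonds, would thus describe 3000 particles with a single molecule record.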

This first version of the TNG format is somewhat biomole-

cule centric, mainly because this is the area in which we have

Figure 1. Schematic overview of the TNG file structure. The blocks with a

dashed outline can be any number of data blocks. “Constant data” repre-

sents data blocks containing data that does not change during the trajec-

tory. “Variable data” contains data that is modified in the trajectory, for

example, particle positions. FRAME SET represents TRAJECTORY FRAME SET

blocks and PARTICLE MAP. corresponds to PARTICLE MAPPING blocks. (a)

and (b) show files without and with PARTICLE MAPPING blocks,

respectively.



both most experience and needs. This means that chains and

residues need to be recorded without names to, for example,

store crystals of atoms that do not use any residue or chain

information. However, this is a good example of the extensibil-

ity of the format: The MOLECULES block of future versions will

be made more general, but the API will handle different ver-

sions without user intervention.

TRAJECTORY FRAME SET blocks indicate the beginning of a

frame set and divide the trajectory data into smaller chunks.

The pointers enable fast access to any frame set, or frame, in a

large trajectory (see Fig. 2).

• Number of the first frame of the frame set (integer).
• Number of frames in the frame set (integer).
• If the Variable number of atoms flag in the GENERAL INFO block is set to TRUE: Array of integers specifying the count of each molecule type. The molecule types are listed in the MOLECULES block and should be listed in the same order here. This is used for, for example, simulations using a grand canonical ensemble (in which case the number of frames in each frame set should be 1).
• Pointer to the next TRAJECTORY FRAME SET block (integer).
• Pointer to the previous TRAJECTORY FRAME SET block (integer).
• Medium stride pointer to the next, for example, 100th TRAJECTORY FRAME SET block (integer). (Medium stride length specified in the GENERAL INFO block.)
• Medium stride pointer to the previous, for example, 100th TRAJECTORY FRAME SET block (integer).
• Long stride pointer to the next, for example, 10 000th TRAJECTORY FRAME SET block (integer). (Long stride length specified in the GENERAL INFO block.)
• Long stride pointer to the previous, for example, 10 000th TRAJECTORY FRAME SET block (integer).
• Time stamp (in seconds) of first frame in frame set (double).
• Time (in seconds) per frame (double).
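To illustrate how the next, medium stride, and long stride pointers speed up random access, the sketch below models the frame sets as an in-memory list and follows the longest forward pointer that does not overshoot the requested frame. It is a simplified model under stated assumptions: the names `FrameSet` and `locate_frame_set` are hypothetical, file offsets are replaced by list indices, only forward pointers are used, and only followed pointers are counted as hops (a real reader would also have to read the header a pointer leads to before deciding whether to follow it).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameSet:
    first_frame: int   # number of the first frame of the frame set
    n_frames: int      # number of frames in the frame set


def locate_frame_set(frame_sets: List[FrameSet], target: int,
                     medium_stride: int = 100,
                     long_stride: int = 10_000) -> Tuple[int, int]:
    """Return (index of the frame set containing `target`, number of hops).
    Raises KeyError if no frame set contains the requested frame."""
    i, hops = 0, 0
    while True:
        fs = frame_sets[i]
        if fs.first_frame <= target < fs.first_frame + fs.n_frames:
            return i, hops
        # Try the longest stride first, falling back to shorter ones.
        for stride in (long_stride, medium_stride, 1):
            j = i + stride
            if j < len(frame_sets) and frame_sets[j].first_frame <= target:
                i, hops = j, hops + 1
                break
        else:
            raise KeyError(target)
```

In a trajectory of 25 000 frame sets with 100 frames each, frame 2 345 678 is reached in this model after 92 hops (2 long, 34 medium, 56 single) instead of 23 456 single steps.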

PARTICLE MAPPING blocks contain a mapping between the

particle numbers in the molecular system and the order in

which the particles are written in the frame set. If no PARTICLE

MAPPING block is present in a frame set, the particle number-

ing is the same as the particle numbering in the molecular

system. If there is at least one PARTICLE MAPPING block in a

frame set, all particles with data stored in the frame set must

be present in a PARTICLE MAPPING block. This in turn makes it

possible to store only data of specific particles, for example,

charges in a protein but not for any water molecules.

• Number of first particle (particle number as stored in the molecular system, zero-based numbering) (integer).
• Number of particles in this particle mapping block (integer).
• Array of particle numbers:
    • Each value is the number of the particle in the molecular system corresponding to the particle number as stored in the trajectory (integer).
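The effect of PARTICLE MAPPING blocks can be sketched as a scatter operation that puts per-particle data back into molecular-system order. This is an illustration only: the function name `merge_mapped_data` is hypothetical, each mapping is given as the plain array of system particle numbers in stored order, and the "number of first particle" field is not modeled.

```python
def merge_mapped_data(blocks, n_system_particles, missing=None):
    """Combine data written under several PARTICLE MAPPING blocks.

    `blocks` is a list of (mapping, values) pairs, where mapping[k] is the
    particle number in the molecular system of the k-th stored particle and
    values[k] is its data. Particles without data keep the `missing` value."""
    merged = [missing] * n_system_particles
    for mapping, values in blocks:
        if len(mapping) != len(values):
            raise ValueError("mapping and data must have the same length")
        for stored_index, system_index in enumerate(mapping):
            merged[system_index] = values[stored_index]
    return merged
```

For example, two nodes that each wrote half of a four-particle system in their own local order would be merged back into system order, and particles that appear in no mapping block (such as water molecules without stored charges) simply remain unset.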

Data blocks are used for storing generic data. Frame-

dependent data blocks are located after the TRAJECTORY

FRAME SET block to which they belong. Frame- and particle-

dependent data blocks should follow the relevant particle

mapping block (if using any particle mapping block).

• Data type flag (8 bit flag). 0 = character/string data, 1 = integer data, 2 = float data (32 bit), and 3 = double data (64 bit).
• Dependency flag (8 bit flag). 1 = frame-dependent, 2 = particle-dependent. Can be combined, that is, 3 = frame- and particle-dependent.
• Sparse data flag to signify if not all frames in the frame sets have data entries in this data block, for example, energies and positions might be saved at different intervals meaning that at least one of them would be saved as sparse data (8 bit flag). Only present if the data are frame-dependent.
• Number of values (integer). If the data is frame-dependent, this is the number of values per frame. If the data is particle-dependent, this is the number of values per particle (per frame).
• ID of the CODEC used to compress the positions (integer).
• Multiplier for integers to obtain the appropriate floating point number, for compressed frames (double). Only present if the above CODEC id is >0 and if the data type is double or float.
• If using sparse data the following fields are required:
    • Number of first frame containing data (integer).
    • Number of frames between data points (integer).

Figure 2. Pointer structure of a TNG file. From the GENERAL INFO block,

there are pointers to the first and last frame set. In each TRAJECTORY

FRAME SET block, there are pointers to the next and previous frame sets

and also certain numbers of frame sets ahead and back (determined by

the length of steps of medium and long stride pointers, set in the GEN-

ERAL INFO block). The numbers indicate the frame set number.




• Particle-dependent data blocks contain the following fields:
  ◦ Number of the first particle in the data block (integer). This must be the same as in the preceding PARTICLE MAPPING block, if present. This allows writing data starting from a certain atom in the molecular system.
  ◦ Number of particles in the data block (integer). This must be the same as in the preceding PARTICLE MAPPING block, if present.
• Continuous field of data.
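The conditional fields listed above can be illustrated with a minimal Python sketch. It is not the TNG library's reader or writer: the big-endian byte order, the 64-bit width chosen for "integer" fields, and the names DataBlockHeader, pack, and unpack are assumptions made for this example only. The sketch shows how the flags decide which optional fields follow.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# Flag values as listed in the text.
CHAR, INT, FLOAT, DOUBLE = 0, 1, 2, 3
FRAME_DEPENDENT, PARTICLE_DEPENDENT = 1, 2


@dataclass
class DataBlockHeader:
    """Hypothetical in-memory view of the data block fields (illustration only)."""
    datatype: int
    dependency: int
    n_values: int
    codec_id: int = 0
    sparse: bool = False
    multiplier: Optional[float] = None
    first_frame: Optional[int] = None
    frame_stride: Optional[int] = None
    first_particle: Optional[int] = None
    n_particles: Optional[int] = None

    def pack(self) -> bytes:
        frame_dep = bool(self.dependency & FRAME_DEPENDENT)
        sparse = frame_dep and self.sparse
        out = struct.pack(">BB", self.datatype, self.dependency)
        if frame_dep:  # sparse flag only exists for frame-dependent data
            out += struct.pack(">B", int(sparse))
        out += struct.pack(">qq", self.n_values, self.codec_id)
        if self.codec_id > 0 and self.datatype in (FLOAT, DOUBLE):
            out += struct.pack(">d", self.multiplier)
        if sparse:
            out += struct.pack(">qq", self.first_frame, self.frame_stride)
        if self.dependency & PARTICLE_DEPENDENT:
            out += struct.pack(">qq", self.first_particle, self.n_particles)
        return out

    @classmethod
    def unpack(cls, buf: bytes) -> "DataBlockHeader":
        pos = 0

        def take(fmt):
            nonlocal pos
            values = struct.unpack_from(fmt, buf, pos)
            pos += struct.calcsize(fmt)
            return values

        datatype, dependency = take(">BB")
        sparse = False
        if dependency & FRAME_DEPENDENT:
            sparse = bool(take(">B")[0])
        n_values, codec_id = take(">qq")
        header = cls(datatype, dependency, n_values, codec_id, sparse)
        if codec_id > 0 and datatype in (FLOAT, DOUBLE):
            (header.multiplier,) = take(">d")
        if sparse:
            header.first_frame, header.frame_stride = take(">qq")
        if dependency & PARTICLE_DEPENDENT:
            header.first_particle, header.n_particles = take(">qq")
        return header


# Compressed float positions, frame- and particle-dependent, stored sparsely.
example = DataBlockHeader(FLOAT, FRAME_DEPENDENT | PARTICLE_DEPENDENT, 3,
                          codec_id=1, sparse=True, multiplier=0.001,
                          first_frame=0, frame_stride=10,
                          first_particle=0, n_particles=1929)
raw = example.pack()
restored = DataBlockHeader.unpack(raw)
```

Because every optional field is guarded by the same flag test when writing and when reading, a reader can walk the header without any external schema.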

Data stored in data blocks (see Table 3) can be compressed. The first temporal multiframe trajectory NG compression format (which we term TNG-MF1 to avoid confusion with the container format)[20] can be used to compress positions and velocities efficiently. Zlib compression is included as a lossless alternative that is less efficient but can be used for compressing any kind of data.
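As a small illustration of what lossless means here, the following sketch round-trips a buffer of double precision values through Python's standard zlib module. The example values and the little-endian packing are assumptions for the demonstration; the TNG library itself calls zlib from C.

```python
import struct
import zlib

# Hypothetical per-frame energies; any numeric or string data would do.
energies = [-1234.5, -1234.25, -1233.75, -1234.0] * 250

raw = struct.pack(f"<{len(energies)}d", *energies)   # 8 bytes per value
packed = zlib.compress(raw)
unpacked = zlib.decompress(packed)

print(len(raw), "bytes raw,", len(packed), "bytes compressed")
```

The decompressed bytes are identical to the input, which is the property that makes this route usable for data where no precision may be lost.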

In addition to the data block types listed in Table 3, any number (limited only by the 64-bit block identifier) of custom data blocks can be added and used to store strings or numerical data. Blocks are identified by a 64-bit ID, written in hexadecimal, where the upper 32 bits (the user ID) denote the developer, or group, responsible for the data block type, and the lower 32 bits (the data ID) denote the type of data. User IDs in the range 0x0–0x0FFFFFFF are reserved for current and future internal TNG user specifications. A second range, 0x10000000–0x1FFFFFFF, is reserved for program-specific, or other registered, users. An official user ID in this range can be reserved by contacting the authors, and in the future by registering through a web page that can also be queried. This way anybody will be able to identify the author of an arbitrary block, and registered user IDs are guaranteed never to clash. Finally, the user ID range 0x20000000–0xFFFFFFFF is unofficial and freely available: it can be used by anybody without registering, but it will not be possible to tell who the user of such a block is. Similarly, the data ID part of the block ID is divided into four ranges. The range 0x0–0x0FFFFFFF denotes reserved information blocks and the range 0x10000000–0x1FFFFFFF reserved data blocks. For registered user IDs, all blocks in this range should also be registered so they can be queried. In contrast, the ranges 0x20000000–0x2FFFFFFF and 0x30000000–0x3FFFFFFF can be used freely for nonregistered information and data blocks, respectively.
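The block ID layout described above amounts to simple bit arithmetic. The sketch below composes and splits a 64-bit ID and maps each half onto the ranges given in the text. The function names and the returned labels are our own, chosen for illustration, and are not part of the TNG API.

```python
from typing import Tuple


def make_block_id(user_id: int, data_id: int) -> int:
    """Combine a 32-bit user ID (upper half) and a 32-bit data ID (lower half)."""
    if not (0 <= user_id <= 0xFFFFFFFF and 0 <= data_id <= 0xFFFFFFFF):
        raise ValueError("user_id and data_id must each fit in 32 bits")
    return (user_id << 32) | data_id


def split_block_id(block_id: int) -> Tuple[int, int]:
    """Return (user_id, data_id) from a 64-bit block ID."""
    return block_id >> 32, block_id & 0xFFFFFFFF


def classify_user_id(user_id: int) -> str:
    if user_id <= 0x0FFFFFFF:
        return "internal TNG"
    if user_id <= 0x1FFFFFFF:
        return "registered"
    return "unregistered"


def classify_data_id(data_id: int) -> str:
    if data_id <= 0x0FFFFFFF:
        return "reserved information block"
    if data_id <= 0x1FFFFFFF:
        return "reserved data block"
    if data_id <= 0x2FFFFFFF:
        return "nonregistered information block"
    if data_id <= 0x3FFFFFFF:
        return "nonregistered data block"
    return "outside the four ranges described in the text"


# A hypothetical registered user storing a custom, nonregistered data block.
block_id = make_block_id(0x10000001, 0x30000005)
user_id, data_id = split_block_id(block_id)
print(hex(block_id), classify_user_id(user_id), classify_data_id(data_id))
```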

Compression of trajectory data

The compression algorithms by Spangberg et al.[20] are included in the TNG library. These compression algorithms (now called TNG-MF1) work on the atomic positions and velocities. The single or double precision required for conservation of energy when performing MD simulations is usually unnecessarily high for properties determined directly from the trajectory, such as distribution functions and correlation functions. Therefore, the compression algorithms use a user-configurable reduced precision, quantizing the data into integers. Atoms close in order in each trajectory frame are often close in space, so the differences in coordinates are often smaller than the absolute values of the coordinates. Also, a given atom typically does not move very far between two frames. In the case of velocities, the values do not change much between two frames, provided that frames are stored often. Therefore, both spatial and temporal compression can be used to reduce the size of the trajectory data. Although this is a very efficient compression scheme when the properties of the data are known, it is also orthogonal to the concepts used in general data formats such as NetCDF or HDF5, which is one of the main reasons for an MD-specific format.
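The two ideas in this paragraph, quantization to a chosen precision and storing differences instead of absolute values, can be sketched in a few lines of Python. This is a toy illustration with made-up coordinates, not the TNG-MF1 implementation; the function names are hypothetical.

```python
def quantize(values, precision):
    """Map floats to integers in units of `precision` (this step is lossy)."""
    return [round(v / precision) for v in values]


def delta_encode(ints):
    """Keep the first value, then store successive differences."""
    return ints[:1] + [b - a for a, b in zip(ints, ints[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# x coordinate (nm) of one atom in four consecutive frames (invented numbers).
x_nm = [2.3512, 2.3498, 2.3531, 2.3550]
quantized = quantize(x_nm, precision=0.001)
deltas = delta_encode(quantized)

# Bits needed for the largest absolute value versus the largest difference.
bits_absolute = max(quantized).bit_length()
bits_delta = max(abs(d) for d in deltas[1:]).bit_length()
print(quantized, deltas, bits_absolute, bits_delta)
```

Only the quantization step discards information; the delta step is exactly reversible, and the small differences are what the later entropy-reducing stages exploit.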

There are four basic algorithms used to compress either absolute values of coordinates or velocities or differences of coordinates or velocities: storage of individual values with a variable number of bits; storing three values (x, y, and z) together as a single number with variable base; grouping several consecutive atoms that are close in space and coding their storage efficiently; and, finally, block-sorting algorithms. The actual compression algorithms implemented in TNG-MF1 are combinations of these basic algorithms. The speeds of the algorithms differ considerably, and they are therefore sorted into compression levels, allowing the user to choose an appropriate trade-off between compression ratio and compression speed. The compression algorithms to use for a given set of coordinates and velocities are determined automatically, and the algorithm chosen is returned to the caller, to allow the use of the same predetermined algorithms on subsequent data blocks. To the best of our knowledge, TNG-MF1 is the most efficient molecular simulation data compression format available to date. It is also possible to compress any data block with other compression algorithms. Currently, Zlib is the only additional algorithm supported by the API, but we expect this to be extended in the future, for instance with spectral recompression algorithms that can only be used after the simulation has been completed.[21]
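The second basic algorithm, storing x, y, and z together as one number in a variable base, can be illustrated as follows. This is a sketch of the general mixed-radix idea under the assumption that all three values are non-negative and smaller than a common base; the actual bit layout used by TNG-MF1 is not reproduced here.

```python
def pack_triplet(x: int, y: int, z: int, base: int) -> int:
    """Store three values in [0, base) as a single integer."""
    for value in (x, y, z):
        if not 0 <= value < base:
            raise ValueError("each value must satisfy 0 <= value < base")
    return (x * base + y) * base + z


def unpack_triplet(packed: int, base: int):
    """Recover (x, y, z) from a value produced by pack_triplet."""
    rest, z = divmod(packed, base)
    x, y = divmod(rest, base)
    return x, y, z


base = 10
packed = pack_triplet(3, 7, 9, base)

# Bits for one combined number versus three separately stored numbers.
bits_combined = (base ** 3 - 1).bit_length()
bits_separate = 3 * (base - 1).bit_length()
print(packed, unpack_triplet(packed, base), bits_combined, bits_separate)
```

When the base is not a power of two, the combined number wastes fewer bit patterns than three fixed-width fields, which is where the saving comes from.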

API Implementation

The TNG format is released as a standalone library with the main API written in C. On top of this, there are thin C++ and FORTRAN 77 layers to allow access from a variety of programming languages. Note that the FORTRAN implementation requires CRAY pointers, which are technically not part of the FORTRAN 77 standard but are available in virtually all compilers. The API is divided into a low-level and a high-level set of functions. These two sets can be referred to as a low-level and a high-level API, but they can be used interchangeably. The low-level API provides functions granting fine-grained control of the data, whereas the functions of the high-level API are simplified and use sensible default values to make them easier to use. All functions of the API use a “tng_” prefix, and in the high-level API the prefix is further extended to “tng_util_”.

The routines in the TNG-MF1 compression library, which is bundled with the TNG API, are prefixed “tng_compress_”. They all perform memory-to-memory compression and uncompression. When compressing, it is necessary to specify whether positions or velocities are compressed, as different algorithms are effective for each. The type of compressed data is stored in the compressed data block, so for uncompression only a single routine is necessary. The data to compress or uncompress can be double precision, single precision, or (prequantized) integer data.

Currently, the released GROMACS-4.6 uses the XTC and TRR

formats for storing trajectories, but from version 5.0 the TNG for-

mat will be used instead. In addition to making the API and

library available as open source, we have chosen the revised BSD

license to make it possible to link any code in the world with the

library without any restrictions on redistribution, but still encour-

age other teams to contribute new compression algorithms to

the library itself. We also provide converters to/from commonly

used file formats, such as AMBER NetCDF and PDB.

Results

Output benchmarks

To test the performance of the file format and the API, GROMACS 4.6.3 was modified to enable writing TNG files, mainly using the high-level API. Four test systems of varying sizes were used, one of which was run for two different simulation lengths:

• One molecule of ethanol was solvated by 640 water molecules. The simulation was run for 5 ns. The positions and the shape of the periodic box were saved every 10 ps.
• The RNase ZF-1A (2026 atoms) was solvated by approximately 4900 water molecules. The system charge was neutralized by six Cl⁻ ions. The simulation was run for only 400 ps. Trajectory snapshots were saved to file every 200 fs, which is considerably more frequent than in most simulations of biomolecular systems, although some analyses, for example velocity autocorrelation, might require even denser frames.
• N-acetylaladanamide (NAAA) in a POPC bilayer has previously been simulated by Murugan et al.[22] The system consisted of 17,417 atoms, including 3572 water molecules. In this study, the simulation time was 2 ns. Positions and the shape of the periodic box were saved every 4 ps.
• The Kv1.2 ion channel[23] was simulated for both 100 ps and 500 ps with 5 fs time steps, and trajectory snapshots were saved every 1 ps. The system contained approximately 121,000 particles (including 8920 virtual sites), two thirds of which were water.

For all simulations, the stride lengths of the data blocks

were set to the writing frequency (in MD steps) of the simula-

tion, that is, 5000, 100, 2000, and 200 frames, respectively. The

number of frames per frame set was set to have 100 written

frames per frame set, that is, 100 times the writing frequency

of the simulation (500,000, 10,000, 200,000, and 20,000 frames

per frame set). The precision of the compressed coordinates in

the TNG file was set to 0.001 nm, which was also the precision

of the XTC compression.
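The arithmetic behind these settings is shown in the short sketch below. The stride values are those quoted in the text; the helper names are hypothetical, and only the Kv1.2 time step (5 fs) is taken from the system descriptions, so the conversion from output interval to steps is demonstrated for that system alone.

```python
WRITTEN_FRAMES_PER_FRAME_SET = 100

# Writing frequency in MD steps for each system, as quoted in the text.
stride_steps = {
    "Ethanol": 5000,
    "RNase ZF-1A": 100,
    "NAAA in POPC": 2000,
    "Kv1.2": 200,
}


def steps_between_writes(write_interval_fs: int, time_step_fs: int) -> int:
    """Number of MD steps between two written frames."""
    steps, remainder = divmod(write_interval_fs, time_step_fs)
    if remainder:
        raise ValueError("output interval must be a multiple of the time step")
    return steps


def frame_set_length(stride: int, written_frames: int = WRITTEN_FRAMES_PER_FRAME_SET) -> int:
    """Frame set length (in MD steps) that holds `written_frames` written frames."""
    return stride * written_frames


kv12_stride = steps_between_writes(write_interval_fs=1000, time_step_fs=5)
frame_set_lengths = {name: frame_set_length(s) for name, s in stride_steps.items()}
print(kv12_stride, frame_set_lengths)
```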

For comparison, a standard GROMACS 4.6.3 version was used with the same setups, but writing to a compressed XTC file instead of a TNG file. For each system, the simulation was performed three times and the average times are reported in Table 4. The RNase ZF-1A TRR file was also converted to DCD, NetCDF, and LAMMPS formats to compare the resulting file size (see Fig. 3). Apparently, mdconvert[24] does not compress the NetCDF data, but TNG clearly performs well. The benchmarks show that the molecular system has a large impact on the efficiency of the TNG-MF1 compression. Even denser frame storage or a smaller fraction of water could amplify the advantage, but possibly at the cost of time.

Table 4. Comparison of times and file sizes using a modified version of GROMACS writing TNG files compared to a standard version of GROMACS writing XTC files.

System                  Output type  Total CPU time (s)  Trajectory writing time (s)[a]  % of time spent writing trajectory  Output file size (MB)[b]
Ethanol                 TNG          1046.4              0.8                             0.07                                3.3
Ethanol                 XTC          1073.5              0.6                             0.06                                3.3
RNase ZF-1A             TNG          360.1               8.7                             2.42                                99.8
RNase ZF-1A             XTC          360.0               7.5                             2.10                                116.8
NAAA in POPC membrane   TNG          3046.0              3.1                             0.10                                29.2
NAAA in POPC membrane   XTC          3020.7              2.1                             0.07                                30.8
Kv1.2 ion channel 100 ps  TNG        530.4               6.4                             1.21                                74.8
Kv1.2 ion channel 100 ps  XTC        513.8               3.7                             0.72                                88.7
Kv1.2 ion channel 500 ps  TNG        2450.7              21.6                            0.88                                370.5
Kv1.2 ion channel 500 ps  XTC        2481.1              16.0                            0.64                                441.7

[a] Includes writing the TRR file. [b] The TRR file sizes were 11, 385, 100, 279, and 1391 MB respectively. The reported times are averaged from three simulation runs. Because the simulation time varies from one simulation to another (load balancing), it is the percentage of time spent writing column that is most relevant for comparing the time differences.

Figure 3. Comparison of file sizes of different file formats. mdconvert[24] (from the MDTraj library) was used for converting the TRR file from the RNase simulation to DCD and NetCDF, whereas VMD was used for converting to LAMMPS. The LAMMPS file in this case was 10.63 times larger than the TNG file, whereas for TRR, DCD, and NetCDF the size factor to TNG was 3.86 and for XTC it was 1.17.
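The size factors quoted in the caption of Figure 3 follow directly from the file sizes in Table 4 and its footnote. The sketch below recomputes them for all five runs; the dictionary layout and the function name are ours, while the numbers are copied from the table.

```python
# Output file sizes in MB from Table 4 (TRR sizes from footnote [b]).
file_size_mb = {
    "Ethanol":      {"TNG": 3.3,   "XTC": 3.3,   "TRR": 11},
    "RNase ZF-1A":  {"TNG": 99.8,  "XTC": 116.8, "TRR": 385},
    "NAAA in POPC": {"TNG": 29.2,  "XTC": 30.8,  "TRR": 100},
    "Kv1.2 100 ps": {"TNG": 74.8,  "XTC": 88.7,  "TRR": 279},
    "Kv1.2 500 ps": {"TNG": 370.5, "XTC": 441.7, "TRR": 1391},
}


def size_factors(sizes):
    """Size of each format relative to the TNG file of the same run."""
    tng = sizes["TNG"]
    return {fmt: size / tng for fmt, size in sizes.items() if fmt != "TNG"}


factors = {system: size_factors(sizes) for system, sizes in file_size_mb.items()}

for system, f in factors.items():
    print(f"{system:13s} XTC/TNG = {f['XTC']:.2f}   TRR/TNG = {f['TRR']:.2f}")
```

For the RNase ZF-1A run this reproduces the factors 1.17 (XTC) and 3.86 (TRR) given in the figure caption, and it makes the system dependence visible: the water-dominated ethanol system shows no gain over XTC at all.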

Input benchmarks

There is not yet any tool that reads both XTC and TNG files; this will be amended in the near future. To compare the reading speeds of the trajectory files from the simulations, tng_io_read_pos_util (from the example files in the TNG library) was used to read TNG files and gmxdump, from GROMACS 4.6.3, was used to read XTC and TRR files. Both tools read coordinates as well as simulation box shapes from the files. The TNG reading tool does not output the simulation box shape every frame, whereas gmxdump does. The output from both programs was directed to the null device to reduce the influence of terminal output on the results. Each file was read three times and the execution times were measured using the Debian GNU/Linux tool “time”. The best “real” time is reported in Table 5. In four of the five cases, the TNG files were quicker to read, and in one case (ethanol, the smallest system) slower. There are currently no benchmarks comparing TNG reading speeds to formats other than XTC and TRR, but reading other uncompressed formats (e.g., DCD) is expected to be comparable to reading TRR.

API Examples

The Supporting Information contains a number of brief exam-

ples of writing and reading TNG files using the high- and low-

level APIs (Listings S1–S4).

Conclusions

This work presents both the development and full specification of the first public release of the portable binary trajectory file format TNG, along with an API to facilitate easy access to data. The file format supports state-of-the-art compression by using the previously described trajectory NG compression algorithm,[20] referred to as TNG-MF1. The file format is flexible enough to ensure that custom data can easily be stored. The small benchmarks performed here show that writing a TNG file with compressed data blocks is only slightly slower than writing an XTC file, but the resulting file is clearly smaller for some of the systems tested, most noticeably when frames are saved often. Systems consisting mainly of water are compressed almost as well by XTC, whereas systems with a smaller fraction of water are compressed more efficiently by TNG-MF1. Picking which TNG-MF1 compression algorithm to use, which is done only once, takes longer than compressing the data. Writing very short TNG trajectories is therefore less efficient, as can be seen by comparing the two simulation lengths of the Kv1.2 system in Table 4. The data storage saved by switching to this new file format can be considerable in most MD groups. For a system with more atoms that writes output less often, the time required for writing trajectories (regardless of format) will be overshadowed by the calculation times. The TNG file includes a description of the molecular system, but that accounts for a negligible portion of the file size. Because the trajectory NG compression algorithm[20] depends on the number of frames in each compressed block, it could possibly be made more efficient by tuning the number of frames per frame set in the TNG file or the selection of compression algorithms used. Such optimization was beyond the scope of this work, but an advantage of the format is that modifications of this type can easily be implemented in future versions without requiring developers or users to alter their API calls. No significant effort has been put into optimizing the TNG API or the compression algorithm. Future versions can possibly be made faster than the current version, but currently this is not a bottleneck.

Version 1.4 of the API has been released and is available through the git repository of the project, git://git.gromacs.org/tng.git. Future versions of the API will be fully backwards compatible, and the version numbering of the blocks in the TNG file ensures that future modifications of the file format can be handled gracefully.

Acknowledgment

Computational resources were provided by the Swedish

National Infrastructure for Computing (025/12-32).

Keywords: molecular dynamics simulation · file format · API · compression

How to cite this article: M. Lundborg, R. Apostolov, D. Spangberg, A. Gärdenäs, D. van der Spoel, E. Lindahl. J. Comput. Chem. 2014, 35, 260–269. DOI: 10.1002/jcc.23495

Additional Supporting Information may be found in the online version of this article.

[1] List of Software for Molecular Mechanics Modeling. Available at: http://en.wikipedia.org/w/index.php?title=List_of_software_for_molecular_mechanics_modeling&oldid=564406774. Accessed on November 17, 2013.

[2] R. Salomon-Ferrer, D. Case, R. Walker, WIREs Comput. Mol. Sci. 2013, 3,

198.

Table 5. Time to read TNG, XTC, and TRR (uncompressed single precision) files from the five benchmark simulations.

System          N particles   N frames   TNG (s)   XTC (s)   TRR (s)
Ethanol         1929          501        1.4       1.0       1.0
RNase ZF-1A     16,816        2001       24.6      33.6      33.7
NAAA in POPC    17,417        501        6.6       8.7       8.8
Kv1.2 100 ps    121,449       201        21.2      26.0      26.4
Kv1.2 500 ps    121,449       1001       104.8     129.3     131.1




[3] The Amber Molecular Dynamics Package. Available at: http://ambermd.org/. Accessed on November 17, 2013.

[4] R. Rew, G. Davis, IEEE Comput. Graph. Appl. 1990, 10, 76.

[5] S. A. Brown, M. Folk, G. Goucher, R. Rew, Comput. Phys. 1993, 7, 304.

[6] Unidata | NetCDF. Available at: http://www.unidata.ucar.edu/software/netcdf/. Accessed on November 17, 2013.

[7] B. R. Brooks, C. L. Brooks, III, A. D. MacKerell Jr., L. Nilsson, R. J. Petrella,

B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, A. Caflisch, L.

Caves, Q. Cui, A. R. Dinner, M. Feig, S. Fischer, J. Gao, M. Hodoscek, W.

Im, K. Kuczera, T. Lazaridis, J. Ma, V. Ovchinnikov, E. Paci, R. W. Pastor,

C. B. Post, J. Z. Pu, M. Schaefer, B. Tidor, R. M. Venable, H. L.

Woodcock, X. Wu, W. Yang, D. M. York, M. Karplus, J. Comput. Chem.

2009, 30, 1545.

[8] File formats. Available at: http://www.ks.uiuc.edu/Research/namd/2.6/

ug/node13.html. Accessed on November 17, 2013.

[9] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, K. Schulten, J. Comput. Chem. 2005, 26, 1781.

[10] NAMD—Scalable Molecular Dynamics. Available at: http://www.ks.uiuc.edu/Research/namd/. Accessed on November 17, 2013.

[11] A. T. Brünger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nilges, N. S. Pannu, R. J. Read, L. M. Rice, T. Simonson, G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 1998, 54, 905.

[12] K. J. Bowers, E. Chow, H. Xu, R. O. Dror, M. P. Eastwood, B. A.

Gregersen, J. L. Klepeis, I. Kolossvary, M. A. Moraes, F. D. Sacerdoti, J.

K. Salmon, Y. Shan, D. E. Shaw, In Proceedings of the 2006 ACM/IEEE

Conference on Supercomputing, ACM: New York, 2006.

[13] D. E. Shaw Research. Available at: http://www.deshawresearch.com/

resources_desmond.html. Accessed on November 17, 2013.

[14] Desmond Users Guide, Section 13.1, Desmond Molecular Dynamics

System, Version 3.4.0/0.7.2, D. E. Shaw Research: New York, 2013.

[15] B. Hess, C. Kutzner, D. van der Spoel, E. Lindahl, J. Chem. Theory Com-

put. 2008, 4, 435.

[16] S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, E. Lindahl, Bioinformatics 2013, 29, 845.

[17] D. G. Green, K. E. Meacham, M. Surridge, F. van Hoesel, H. J. C.

Berendsen, In Methods and Techniques in Computational Chemistry:

METECC-95; E. Clementi, G. Corongiu, Eds.; STEF: Cagliari, 1995; pp.

435–463.

[18] S. Plimpton, J. Comput. Phys. 1995, 117, 1.

[19] LAMMPS Molecular Dynamics Simulator. Available at: http://lammps.sandia.gov/. Accessed on November 17, 2013.

[20] D. Spangberg, D. S. D. Larsson, D. van der Spoel, J. Mol. Model. 2011,

17, 2669.

[21] T. Meyer, C. Ferrer-Costa, A. Pérez, M. Rueda, A. Bidon-Chanal, F. J. Luque, C. A. Laughton, M. Orozco, J. Chem. Theory Comput. 2006, 2, 251.

[22] N. A. Murugan, R. Apostolov, Z. Rinkevicius, J. Kongsted, E. Lindahl, H.

Agren, J. Am. Chem. Soc. 2013, 135, 13590.

[23] P. Bjelkmar, P. S. Niemelä, I. Vattulainen, E. Lindahl, PLoS Comput. Biol. 2009, 5, e1000289.

[24] Command-Line Trajectory Conversion: mdconvert. Available at: http://mdtraj.s3.amazonaws.com/mdconvert.html. Accessed on November 17, 2013.

Received: 31 August 2013; Revised: 24 October 2013; Accepted: 1 November 2013; Published online on 20 November 2013


