Tue 20 Mar 2007 P Coe : MonAliSA project Oxford Physics
Flexible data storage for minimal effort
A tale of two formats
Persuasion
• General problem to be addressed
• First use case - ATLAS FSI c. 1998
• ATLAS FSI data format
Widening the scope :
• Recent work – MonAliSA format 2006
Great Expectations
Want to use disk files to record...
• Data from experiments/simulations
• Prepared data ready for analysis
• Analysis results
Want to see the data (not unrelated)
• Plotting graphs is essential for analysis, publication etc
Laboratory data sources...
Data Acquisition
...measuring ambient conditions...
...and deliberately induced signals
Take instrument readings...
Laboratory data is usually ...
• A collection of one dimensional arrays
– Counts from an ADC - 16 bit integer
– Photon counts - 16 or 32 bit integer, etc
• Some arrays only have a single element
– Ambient relative humidity – 1 float
• Treating all data as if it were in arrays may be a presumptuous idea... but fruitful
Data increases with compound interest...
• Preparation of acquired data
– filtering, averaging, smoothing, cutting, etc
• Analysis of prepared data
– fitting, calculating FFT, time derivatives etc
• Data collation
– measurement to measurement trends, etc
Want to emphasise that it is all treated as one dimensional arrays of data :
e.g. FFT spectrum in two 1-d arrays of doubles
Let us not forget "annotation" Data
• Experimental set-up (which instruments, how and where connected etc)
• Other "one-off" parameters e.g. timestamp
• Version info for DAQ/analysis algorithms
• Seed parameters
• Other fit/preparation control parameters
• ...
(META?)
Software can deliver much more if annotations are included
An ideal data file format
• Holds data and meta data / related information
• Should be simple to write code to :
– find/read stored data of interest from the file
– write any stored data to the file so it can be identified
– append new data to the file, without disruption
• Handle (store/retrieve) data :
– flexibly (of any format, in any order)
– reliably (data should come back intact)
– robustly (absent data should not break the format)
– with language / platform independence
A database as a solution?
• A database in place of a file meets most requirements of a file format
• I have no database experience and did not want to be coupled to (tied down by) database related issues
• For example...is it easy to access the same data using different languages?
A pseudo-random quote from the web...
“17.1. Do You Really Need a Relational Database?
It is common for web developers to jump to the conclusion that they need an SQL-compliant RDBMS like Oracle, when in fact they have a rather small data set that could be organized as one table. Commercial RDBMSs are expensive as well as nontrivial to install and administer.”
ASCII text file : Less than ideal (1)
• Meta data
– some may be in column labels
– remaining meta data often poured out into the filename!
• How easy is it to ...
– ...find/read stored data of interest from the file?
Code for reading single lines/columns is reasonable
The price is a rigid file structure
– ...write any stored data to the file so it can be identified?
Code for writing ASCII columns is very simple
Data identity is based on assumptions about column order
– ...append new data to the file, without disruption?
Maybe possible in Perl but...
... in most languages it is easier to create a new file
ASCII text file : Less than ideal (2)
• What about handling data? :
– flexibly (of any format, in any order)
Data format is ASCII columns (or similar enough format) only!
Ordering flexibility is lost (unless only humans read)
– reliably (data should come back intact)
Rounding / formatting issues – places effort on user
– robustly (absent data should not break the format)
Empty columns do break the format
– with language / platform independence
Most languages allow ASCII I/O
Most applications (Excel, Origin, etc) read ASCII
Minor cross platform issues – very rarely fatal
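The "rounding / formatting issues" point can be illustrated with a minimal sketch. Python is used here only for brevity; it is not one of the languages named in the talk, and the "%.6f" column format is an assumed example of a typical ASCII output format. A double written as text with a fixed number of digits does not in general come back identical, whereas its 8-byte big-endian IEEE representation does.

```python
import struct

value = 0.1 + 0.2  # a double that needs 17 significant digits to print exactly

# ASCII route: an assumed "%.6f" column format silently rounds the value.
as_text = "%.6f" % value
from_text = float(as_text)

# Binary route: 8-byte big-endian IEEE 754 double, restored bit for bit.
as_bytes = struct.pack(">d", value)
from_bytes = struct.unpack(">d", as_bytes)[0]

print(from_text == value)   # False: digits were lost in formatting
print(from_bytes == value)  # True: exact round trip
```

With ASCII the user has to choose a format wide enough for every column; with a binary array the question does not arise.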
Can a binary file format help?
The answer depends greatly on the :
– format design
• flexibility is easily lost without careful forethought / revision
• simplicity :
The format should be “as simple as it needs to be, but no simpler” – paraphrasing Einstein (1933)
– format implementation
• design advantages are easily lost by the implementation
Two binary formats
• Old ATLAS format
– relies on ID codes to identify data
– has some naive file structures - too rigid
– implemented in LabVIEW (and ROOT!)
• New “MonAliSA” format
– relies on ID codes to identify data
– simpler and more flexible
– implemented in C, Java and LabVIEW
ATLAS demo FSI 1998-
• Experiments based on tuning a laser
• Timing of experiment has two "modes" based on rapid or slow laser tuning
• DAQ & analysis data file structures reflect this– alternate fast/slow periods in "blocks"
• All I/O software in National Instruments language LabVIEW (v4.x later v6.1 for analysis)
Binary file format (ATLAS FSI 1998-)
Simple structure at highest level
• A single, minimal sized file header
• Followed by N data blocks - rarely of equal size
[Figure: file layout. File Header (Version Number + Number of data blocks), then Data blocks 1 2 3 4 5. Diagram labels: "2 bytes", "1 byte", "location inside file"]
Inside a data block (roughly)
Block prefix : points to
a) start of first data label
b) start of next data block
also counts how many arrays stored
Header : most of the meta data
Array label : Identifies array (details on next slide)
1-d array BIG ENDIAN or IEEE
ATLAS FSI format - data array label
Label fields (from the diagram) :
• ID codes, e.g. 3.1
• Location at end of the data array (pointer to next label) : byte location inside file
• Instrument Channel ID
• Fixed length string labelling the array contents, e.g. “Long Ref raw data” + white space padding to fixed length
• Element type, e.g. unsigned 16 bit integer
• Number of array elements
Previous array runs up to the start of this label
Array attached to the end of the label
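A minimal sketch of such a label, in Python for brevity. The slide names the fields but not their widths or exact order, so the layout string, the 32-byte string length, the TYPE_U16 code and the channel number 7 are all assumptions made for illustration, not the ATLAS definition.

```python
import struct

# ASSUMED layout: category, subcategory, next-label pointer, channel ID,
# 32-byte label string, element type, number of elements (all big endian).
LABEL = struct.Struct(">HHIH32sBI")
TYPE_U16 = 1  # hypothetical element-type code for unsigned 16 bit integers

def pack_labelled_array(offset, category, subcategory, channel, text, samples):
    """Return label + array bytes for an array starting at file position `offset`."""
    data = struct.pack(">%dH" % len(samples), *samples)
    next_label = offset + LABEL.size + len(data)  # byte location just past the array
    name = text.encode("ascii").ljust(32)         # white space padding to fixed length
    label = LABEL.pack(category, subcategory, next_label, channel, name,
                       TYPE_U16, len(samples))
    return label + data

record = pack_labelled_array(0, 3, 1, 7, "Long Ref raw data", [10, 20, 30])
fields = LABEL.unpack_from(record)
print(fields[:3], fields[4].decode("ascii").rstrip())
```

Because the pointer is stored in the label, a reader can step over an array it does not want without knowing the element type.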
ID code examples from ATLAS
• 2 numbers in code (category, subcategory)
• Categories :
1. DAQ timing parameters
2. Thermometer / humidity : "environmental"
3. Reference Interferometer System
...
6. Grid line interferometers
7. (Reference) Phase
8. Sine fitting prepared data
...
ID code examples from ATLAS
• Subcategories arbitrarily assigned by hand
• In 3 (Reference Interferometer System)
– 3.1 Long Reference Interferometer raw data
– 3.3 Etalon raw data
– 3.129 Long reference data for laser 1
– 3.257 Long reference data for laser 2
How does ATLAS format work? 1) Writing data to the file
• Meta structures in place first (tedious)
• Each array with label : placed at end of file
• Any order permitted by unique array ID codes (inside a given block at least)
• Writing each data array involves :
– Preparing meta data for label
– Writing label, including pointer to end of the array
– Writing the array after the label
– Updating meta structures : end of block & no. of arrays
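The writing steps above can be sketched as follows, in Python for brevity. This is not the LabVIEW implementation: the field widths are assumptions, the block header and most label fields are left out, and an in-memory file stands in for a disk file. Only the order of operations follows the slide.

```python
import io
import struct

# ASSUMED widths; simplified structures for illustration only.
PREFIX = struct.Struct(">IH")   # (end-of-block pointer, number of arrays)
LABEL = struct.Struct(">HHII")  # (category, subcategory, next-label pointer, element count)

def start_block(f):
    """Put the meta structure in place first; return its file position."""
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    f.write(PREFIX.pack(pos + PREFIX.size, 0))
    return pos

def append_array(f, block_pos, category, subcategory, samples):
    """Append one labelled array of unsigned 16 bit integers to the last block."""
    f.seek(0, io.SEEK_END)
    start = f.tell()
    data = struct.pack(">%dH" % len(samples), *samples)
    end = start + LABEL.size + len(data)
    f.write(LABEL.pack(category, subcategory, end, len(samples)))  # label with pointer
    f.write(data)                                                  # array after the label
    # Update the meta structure: new end of block and array count.
    f.seek(block_pos)
    _, count = PREFIX.unpack(f.read(PREFIX.size))
    f.seek(block_pos)
    f.write(PREFIX.pack(end, count + 1))

f = io.BytesIO()
block = start_block(f)
append_array(f, block, 3, 1, [1, 2, 3])
append_array(f, block, 7, 1028, [4, 5])
print(PREFIX.unpack_from(f.getvalue(), block))
```

The final rewrite of the prefix is the step that ties appending to the last block: an earlier block's end pointer cannot move without shifting everything after it.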
How does ATLAS format work? 2) Reading data from the file
• Finding the correct data block is similar to finding array (below)
– block label and pointer to next block are in block prefix
• Find array : Seek array at (block, ID)
– (In the correct block) 1st array label easily found from prefix
– then iterate within the block...
• Read array label : Do the ID codes match the two required?
If YES read label & array at end of the label
If NO find next label using pointer in this label – continue iteration
• N.B. reading finds 1st matching instance (in block) only
– should have only been one matching instance written
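The search within a block can be sketched as a walk along the chain of label pointers, in Python for brevity. The field widths are assumptions, and the sketch places the first label directly after the prefix, whereas in the real format the prefix points to it. `build_block` is a hypothetical helper that only produces demo data.

```python
import struct

# ASSUMED widths; the slide describes only the pointer-following logic.
PREFIX = struct.Struct(">IH")   # (end-of-block pointer, number of arrays)
LABEL = struct.Struct(">HHII")  # (category, subcategory, next-label pointer, element count)

def find_array(buf, block_pos, category, subcategory):
    """Return the first array in the block whose ID codes match, else None."""
    _, n_arrays = PREFIX.unpack_from(buf, block_pos)
    pos = block_pos + PREFIX.size            # assumed position of the first label
    for _ in range(n_arrays):
        cat, sub, next_label, count = LABEL.unpack_from(buf, pos)
        if (cat, sub) == (category, subcategory):
            return list(struct.unpack_from(">%dH" % count, buf, pos + LABEL.size))
        pos = next_label                     # skip the array without reading it
    return None

def build_block(arrays):
    """Demo helper only: lay out labelled uint16 arrays in one block."""
    body = b""
    pos = PREFIX.size
    for (cat, sub), samples in arrays:
        data = struct.pack(">%dH" % len(samples), *samples)
        pos += LABEL.size + len(data)
        body += LABEL.pack(cat, sub, pos, len(samples)) + data
    return PREFIX.pack(pos, len(arrays)) + body

buf = build_block([((3, 1), [1, 2, 3]), ((7, 1028), [4, 5]), ((7, 1028), [9])])
print(find_array(buf, 0, 7, 1028))  # first matching instance only
print(find_array(buf, 0, 6, 2))     # absent data is reported, not fatal
```

The duplicate 7.1028 entry in the demo shows the N.B. above: a second instance written to the same block is never reached.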
Search pattern schematic
e.g. Looking for 7.1028 data in block 3
[Figure: data blocks 1 to 5, with the search path marked]
File header : "There are 5 blocks"
Block 1 prefix : "I am block 1", "Block 2 starts here"
Block 2 prefix : "I am block 2", "Block 3 starts here"
Block 3 prefix : "I am block 3", "There are 26 arrays in this block", "My first data label starts here"
Labels compared with required 7.1028 ID code
Matching label : "I label 7.1028 data", followed by the array of interest
ATLAS format - review (1)
Ideal : “Holds data and meta data / related information”
Meta data storage was useful but on the down side was...
Scattered
– some in the label, some in the header
Inflexible
– No easy way to augment meta data
– All block header sections had to be complete or left out
Also, block headers
– added a large effort overhead to writing new software
– made innovation in other areas painful / tedious
– not all block header meta data was used, some never
ATLAS format - review (2)
Ideal : “Should be simple to write code to :”
– “find/read stored data of interest from the file”
Very simple, small, stable I/O routines library
ID codes stored in one place
easy to maintain, easy to use
– “write any stored data to the file so it can be identified”
Same success with same I/O routines and ID codes
– “append new data to the file, without disruption”
Only possible to append to last block of the file
ATLAS format - review (3)
Ideal : "Handle (store/retrieve) data"
– flexibly (of any format, in any order)
Storage order flexible (within a block)
All required numerical formats supported (string as bytes)
– reliably (data should come back intact)
Never reported any data errors in 9 years
– robustly (absent data should not break the format)
Absent data does not break the format
Earlier caveat about meta data applies
– with language / platform independence
Never fully tested this point in the ATLAS format
Was beyond the scope of the implementation
Mostly used LabVIEW – format worked across LabVIEW versions
T. Kohno wrote a file reader for ROOT, no problems reported
MonAliSA 2006 : A new format
• Want to read/write files from different operating systems
– C and Java for DAQ, analysis, simulation
– Run C inside LabVIEW on Windows XP (DAQ)
– Most analysis / simulation on Linux
– Java work with LiCAS
Broadening the scope
Once you have a cross platform file format :
• Want to offer to ATLAS, LiCAS, etc...
– Saves duplicating "reinvented wheels"
– Same I/O software for each group
• Hence same basic format / file structure
– Using ID codes for data finding
• ID code range expanded from 2 numbers to 5
• Each group will want control over their own ID codes / software versions
Feeding lessons / requirements into the new format (1)...
• Kept the Labelled Data arrays
– Data labels retain ID codes, strings, pointers
– Data labels drop meta data (instrument ID)
• Removed data blocks
– same DoF recorded with "instance" label element
– now possible to append any data array to the end of the file
[Figure: OLD: arrays stored in blocks (block 1, block 2, block 5). NEW: standalone arrays with instance labels (1st, 2nd, 3rd, 5th)]
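A minimal sketch of how an "instance" element in the label can replace the block structure, in Python for brevity. The label layout, the ID code values and the policy of numbering instances automatically in order of arrival are all assumptions; the talk says only that the instance records the same degree of freedom the blocks used to.

```python
import struct

# ASSUMED label layout: five 16 bit ID codes, instance, next-label pointer,
# element count. Real field widths are not given in the talk.
LABEL = struct.Struct(">5HHII")

def scan(buf):
    """Yield (ids, instance, data_pos, count) for every label in the buffer."""
    pos = 0
    while pos < len(buf):
        *ids, instance, next_label, count = LABEL.unpack_from(buf, pos)
        yield tuple(ids), instance, pos + LABEL.size, count
        pos = next_label

def append(buf, ids, samples):
    """Append an array of doubles at the end; instance = last instance + 1."""
    instance = 1 + max((inst for i, inst, _, _ in scan(buf) if i == tuple(ids)),
                       default=0)
    data = struct.pack(">%dd" % len(samples), *samples)
    end = len(buf) + LABEL.size + len(data)
    return buf + LABEL.pack(*ids, instance, end, len(samples)) + data

def read(buf, ids, instance):
    """Return the array stored under (ids, instance), or None if absent."""
    for i, inst, data_pos, count in scan(buf):
        if i == tuple(ids) and inst == instance:
            return list(struct.unpack_from(">%dd" % count, buf, data_pos))
    return None

TEMPERATURE = (2, 1, 0, 0, 1)  # hypothetical ID code
LASER = (2, 3, 0, 0, 7)        # hypothetical ID code
buf = b""
buf = append(buf, TEMPERATURE, [20.5])
buf = append(buf, LASER, [0.1, 0.2])
buf = append(buf, TEMPERATURE, [20.7])
print(read(buf, TEMPERATURE, 2))
```

Every append goes to the end of the file and nothing already written is touched, which is the property the old block structure could not offer.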
Feeding lessons / requirements into the new format (2)...
• Removed almost all header structures
– Meta data stored in arrays like other data
• Very small remaining file header holds
– file compatibility information
• Group ID (e.g. 2 = MonAliSA)
• Format version ID (file / header structures change)
• ID codes look up table version
This needs further explanation
New format : Simpler file structure
File Header (13 bytes)
• Group ID
• Software ID
• ID codes version
• File format version
• File lock
• Number of arrays in file
Array Label
• ID codes
• Instance count
• Pointer to next label
• Data type
• Number of array elements
• Error detection checksum
• Variable length, label string
One dimensional array
• byte (can be general purpose)
• 16,32,64 bit integers (big endian)
• IEEE 32 bit float, 64 bit double
KEY TO TEXT COLOURS
RED : Centralised - set by protocol definitions
BLUE : "Group" specific - managed by a "Group"
File / array specific
Immutable
Mutable
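A minimal sketch of this structure, in Python for brevity. The slide gives the field names and the 13 byte header total but not the individual widths, so the split below is an assumption chosen only to add up to 13. The label layout, the type code and the use of CRC-32 as the checksum are likewise assumptions, since the talk does not name the error detection algorithm.

```python
import struct
import zlib

# ASSUMED widths: group, software, ID codes version, file format version,
# file lock, number of arrays. Chosen only to total the stated 13 bytes.
HEADER = struct.Struct(">HHHHBI")
assert HEADER.size == 13

# ASSUMED label layout: 5 ID codes, instance, next-label pointer, data type,
# element count, checksum, label-string length (string bytes follow).
LABEL = struct.Struct(">5HHIBIIH")
TYPE_DOUBLE = 6  # hypothetical type code for IEEE 64 bit double

def pack_array(offset, ids, instance, text, samples):
    """Return label + label string + big-endian doubles for one array."""
    data = struct.pack(">%dd" % len(samples), *samples)
    name = text.encode("ascii")
    end = offset + LABEL.size + len(name) + len(data)
    checksum = zlib.crc32(data)  # stand-in for the unspecified checksum
    return LABEL.pack(*ids, instance, end, TYPE_DOUBLE, len(samples),
                      checksum, len(name)) + name + data

def unpack_array(buf, offset):
    """Read one labelled array, verifying its checksum."""
    *ids, instance, end, dtype, count, checksum, name_len = LABEL.unpack_from(buf, offset)
    name_pos = offset + LABEL.size
    data_pos = name_pos + name_len
    if zlib.crc32(buf[data_pos:end]) != checksum:
        raise ValueError("array data failed its checksum")
    name = buf[name_pos:data_pos].decode("ascii")
    values = list(struct.unpack_from(">%dd" % count, buf, data_pos))
    return tuple(ids), instance, name, values

# Group ID 2 = MonAliSA is quoted in the talk; the other values are made up.
header = HEADER.pack(2, 1, 1, 1, 0, 1)
file_bytes = header + pack_array(HEADER.size, (2, 3, 0, 0, 7), 1,
                                 "FFT amplitude", [0.5, 0.25])
print(unpack_array(file_bytes, HEADER.size))
```

Storing the string length in the label is one way to make the label string variable length while keeping the rest of the label at a fixed size.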
Obvious questions
• Is reading / writing data with the new format similar to the old? – ANSWER YES!
• Why does the new, simpler format appear to be so complicated?
• What is all this about "central" and "group" management?
• How much longer will this talk go on for?
All because of ID codes ...
• For the 1998 - ATLAS format :
– ID codes were created and used
• by 1 or 2 programmers
• in the one set of software
• written on one machine
• in one language
• Keeping ID codes unique and distinct was simple enough
• For the 2007 – MonAliSA format :
– ID codes could be created and used by
• any users
• for any software they require
• written on any number of machines
• in any language (although so far only C, LabVIEW, Java are possible)
– ID code clashes need to be prevented
ID code management : Central
Any project / person producing software
• Issued with a copy of GIACoNDE
– a "Group" level ID code management tool
– written in Java (for platform independence)
– has group ID hardwired as a constant
Any Binary File Reading software
• Can check for matching group ID in file header
ID code management : Group
Uses GIACoNDE tool
• Present state - Beta version
• Creates ID codes with 5 parts
• Publishes ID code template
– ID codes represented by named constants
– in C header files
– in Java interface files
– together with writing
• group ID
• software ID
• ID code template version
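A sketch of what publishing an ID code template as named constants might involve, in Python for brevity. GIACoNDE itself is written in Java and its output format is not shown in the talk, so every name, number and the layout of the generated C header below are hypothetical. The uniqueness check illustrates the kind of clash prevention such a tool exists for.

```python
# All names and numbers are hypothetical; only "group ID 2 = MonAliSA"
# is a value quoted in the talk.
GROUP_ID = 2
SOFTWARE_ID = 1
TEMPLATE_VERSION = 1

ID_CODES = {
    "AMBIENT_HUMIDITY": (2, 1, 0, 0, 1),
    "LASER_FFT_AMPLITUDE": (2, 3, 0, 0, 7),
}

def c_header(codes):
    """Render the ID code table as C preprocessor constants."""
    lines = [
        "#define GROUP_ID %d" % GROUP_ID,
        "#define SOFTWARE_ID %d" % SOFTWARE_ID,
        "#define ID_TEMPLATE_VERSION %d" % TEMPLATE_VERSION,
    ]
    seen = {}
    for name, parts in sorted(codes.items()):
        if len(parts) != 5:
            raise ValueError("%s: ID codes have 5 parts" % name)
        if parts in seen:
            raise ValueError("%s clashes with %s" % (name, seen[parts]))
        seen[parts] = name
        lines.append("#define %s {%s}" % (name, ", ".join(map(str, parts))))
    return "\n".join(lines) + "\n"

print(c_header(ID_CODES))
```

Generating the C and Java constants from one table keeps the languages in step, so that code in either language refers to the same array by the same name.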
Future outlook
• GIACoNDE is close to completion
– already produces useable output
– some polishing still to be done
• File I/O libraries already written and tested
– written in C, LabVIEW, Java
– files written by one language can be read by another
– other groups encouraged to use the software
– Java I/O will be added to LiCAS framework
Not just suitable for lab data...
For example :
Plan to use binary I/O in the next version of the 3 player game "Austerlitz" for saving the state of play