
TRIFLE

An efficient way to serialize tree-structured information

by Daniil Moerman

Document History

| Release ID | Release Date | Modifications |
| --- | --- | --- |
| 1.00 | 19.03.2013 | This is the first version of the document. |

All rights reserved

Copyright by Daniil Moerman

Germany, 2013


Table of contents

1 Abstract
2 Terminology
  2.1 Overview
  2.2 Tree
  2.3 Root Branch
  2.4 Branch
  2.5 Leaf
  2.6 Association
  2.7 Association Key
  2.8 Tree Projection
3 Format Specification
  3.1 Overview
  3.2 VINT
  3.3 VINT32
  3.4 VINT64
  3.5 TRIFLE-Tree
  3.6 Format Byte
  3.7 Header Section
  3.8 Branch Descriptor
  3.9 Leaf-Entry
  3.10 Branch-Entry
  3.11 Payload Section
  3.12 Checksum
  3.13 Well known Type-Identifier
4 License
5 About the Author
6 Example of a TRIFLE-Tree
7 Where can TRIFLE help me?
8 Visual Overview of TRIFLE


1 Abstract

This document specifies a method for serializing tree-structured data very efficiently. In this document, efficiency covers the following aspects:

• Reduce the amount of data that has to be stored or transferred to obtain a complete serialization of the tree.

• Reduce the complexity of parsing, so that serialization and deserialization can be done very fast.

• Reduce the time an implementer needs to implement the method.

• Allow access to the received information without waiting until every single byte of the serialization has reached the target system.

I have tried to illustrate everything that could sound complicated, or that could be interpreted in more than one way, with diagrams and tables, so that implementing the required methods becomes a fairly easy task, and not only for computer scientists.

2 Terminology

This section introduces all the terminology needed to understand this document.

2.1 Overview

A TRIFLE consists of a small number of substructures, each of which can in turn consist of further substructures. The format is both hierarchical and recursive, which makes its definition and processing very simple. Before we can map the elements of a real tree to a TRIFLE, however, we have to understand the basic structure of a tree. The following diagram shows all the elements involved.


2.2 Tree

By a tree we mean a structure that stores information about the branches and leaves it contains and about their connections to each other. A connection associates one branch with another branch or with exactly one leaf. A particular association is identified by an association key.

2.3 Root Branch

The root branch is the topmost branch of a tree and the only branch on that level: a single tree never has more than one root branch. Every non-trivial tree has exactly one root branch.

2.4 Branch

A branch is a node within a tree structure with which further branches or leaves can be associated. The number of associated elements can range from 0 to n, where n is a non-negative integer. A non-trivial branch has at least one element associated with it.

2.5 Leaf

A leaf is a node within a tree structure that is always associated with exactly one branch of the tree. It stores the actual data you want to transfer. That data can be any kind of digital information, such as integers, arrays or documents. So that the content of a leaf can be read properly, a type identifier must be stored with the leaf.

2.6 Association

An association is a connection between two elements of a tree. It can exist either between a branch and a leaf or between a branch and another branch. Every association must have an association key, so that it can later be addressed unambiguously.

2.7 Association Key

An association key is a Byte-Array that makes an association between two elements of a tree unique. A given association key does not have to be unique within the whole tree, but all association keys that identify associations within one branch have to be unique. In other words, the same association key must not be used for two associations within the same branch.

2.8 Tree Projection

A tree projection is a vector of all leaves found within a single tree. Once all branches and leaves of the tree have been sorted in a defined order, the resulting vector is unique. From then on, only the associations and the branches have to be provided in addition to have all the data needed to reconstruct the tree in its original shape.

3 Format Specification

This section describes the structure of all the elements a TRIFLE consists of. Knowing it is essential for anyone who has to write a parser that can handle a TRIFLE-Tree.

3.1 Overview

This chapter is divided into two parts. The first part shows how every format-specific integer value is encoded using the so-called “VINT” representation. The second part shows every format element of TRIFLE byte by byte, together with the definition of every section and element.

3.2 VINT

As already mentioned, VINT, which stands for “Variable Integer”, is used within TRIFLE as an efficient way to represent non-negative integer values. One use of such values is to store the length of a leaf's content. Another is to use an integer as a type identifier, where the allowed integer values represent specific types. Both uses have to cover different ranges of integer values. A normal fixed-size integer format would mostly waste memory that is filled with zeros (remember that a simple integer value has a size of 4 Bytes and a long integer value a size of 8 Bytes). VINT allows the size of the representation to adapt to the value it actually has to store. The VINT format comes in two versions, VINT32 and VINT64, which differ in the interval of values they can represent. Let's look at the numerical boundaries of the VINT versions compared with the fixed-size types:

| Type | Minimum | Maximum |
| --- | --- | --- |
| unsigned integer (int32) | 0 | 4.294.967.295 |
| unsigned long integer (int64) | 0 | 18.446.744.073.709.551.615 |
| VINT32 | 0 | 1.077.952.575 |
| VINT64 | 0 | 2.314.885.530.818.453.535 |

Every VINT number consists of 2 or 3 range bits (2 for the four ranges of VINT32, 3 for the eight ranges of VINT64) and a remainder that stores an unsigned integer in Big-Endian style. To calculate the value stored in a VINT you only need to know the range in use, which you obtain by reading the range bits. You then add the remainder to the minimal value the current range can represent, so the remainder can be seen as an offset into the chosen range. Both processes, encoding and decoding VINTs, can be implemented using bit-shift operations and basic arithmetic.

3.3 VINT32

VINT32 has 4 ranges. The boundaries of these ranges are:

| Range Name | Minimum | Maximum |
| --- | --- | --- |
| RANGE 0 | 0 | 63 |
| RANGE 1 | 64 | 16.447 |
| RANGE 2 | 16.448 | 4.210.751 |
| RANGE 3 | 4.210.752 | 1.077.952.575 |
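The decoding rule from section 3.2 (read the range bits, then add the offset to the range minimum) can be sketched for VINT32 as follows. This is only an illustration under stated assumptions: the exact bit layout is defined by the diagram, so the sketch assumes that the two range bits are the most significant bits of the first byte and that range k occupies k + 1 bytes, which matches the widths implied by the boundaries above (6, 14, 22 and 30 offset bits). The function names are hypothetical.

```python
# Assumed layout: the two most significant bits of the first byte hold the
# range number, and range k occupies k + 1 bytes in Big-Endian order.
VINT32_RANGE_MIN = (0, 64, 16_448, 4_210_752)
VINT32_MAX = 1_077_952_575


def encode_vint32(value: int) -> bytes:
    """Encode a non-negative integer as VINT32 (hypothetical helper)."""
    if not 0 <= value <= VINT32_MAX:
        raise ValueError("value outside the VINT32 interval")
    # Pick the highest range whose minimum does not exceed the value.
    rng = max(k for k, low in enumerate(VINT32_RANGE_MIN) if low <= value)
    offset = value - VINT32_RANGE_MIN[rng]
    size = rng + 1
    # Range bits go in front of the offset bits.
    word = (rng << (8 * size - 2)) | offset
    return word.to_bytes(size, "big")


def decode_vint32(data: bytes, pos: int = 0) -> tuple:
    """Decode one VINT32 starting at pos; return (value, next position)."""
    rng = data[pos] >> 6
    size = rng + 1
    if pos + size > len(data):
        raise ValueError("truncated VINT32")
    word = int.from_bytes(data[pos:pos + size], "big")
    offset = word & ((1 << (8 * size - 2)) - 1)  # strip the range bits
    return VINT32_RANGE_MIN[rng] + offset, pos + size
```

Because every range starts where the previous one ends, each value has exactly one encoding, and small values such as key lengths fit into a single byte.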

The exact structure of a VINT32 value in each range is shown in the following diagram:


3.4 VINT64

VINT64 has 8 ranges. The boundaries of these ranges are:

| Range Name | Minimum | Maximum |
| --- | --- | --- |
| RANGE 0 | 0 | 31 |
| RANGE 1 | 32 | 8.223 |
| RANGE 2 | 8.224 | 2.105.375 |
| RANGE 3 | 2.105.376 | 538.976.287 |
| RANGE 4 | 538.976.288 | 137.977.929.759 |
| RANGE 5 | 137.977.929.760 | 35.322.350.018.591 |
| RANGE 6 | 35.322.350.018.592 | 9.042.521.604.759.583 |
| RANGE 7 | 9.042.521.604.759.584 | 2.314.885.530.818.453.535 |
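The range boundaries follow a regular pattern, which the sketch below derives instead of hard-coding. It assumes, as the widths of the ranges suggest, that range k occupies k + 1 bytes, that the range bits take 2 (VINT32) or 3 (VINT64) of those bits, and that each range starts directly after the previous one ends. The function name `vint_ranges` is hypothetical.

```python
def vint_ranges(range_bits: int) -> list:
    """Return (minimum, maximum) for every range of a VINT variant.

    range_bits is 2 for VINT32 and 3 for VINT64 (assumed layout:
    range k uses k + 1 bytes, minus the range bits, for the offset).
    """
    bounds = []
    low = 0
    for rng in range(1 << range_bits):
        offset_bits = 8 * (rng + 1) - range_bits
        high = low + (1 << offset_bits) - 1
        bounds.append((low, high))
        low = high + 1  # the next range continues without a gap
    return bounds


# Print the VINT64 table; pass 2 instead of 3 for VINT32.
for rng, (low, high) in enumerate(vint_ranges(3)):
    print(f"RANGE {rng}: {low} ... {high}")
```

A decoder only needs the minimum of each range, so a lookup table of eight constants is enough for VINT64.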

The exact structure of a VINT64 value in each range is shown in the following diagram:


3.5 TRIFLE-Tree

A complete TRIFLE-Tree consists of at most 4 logical sections:

| Section Name | FB | HS | PS | CS |
| --- | --- | --- | --- | --- |
| Content | Format-Byte | Header-Section | Payload-Section | Checksum |

Whether the last section, the checksum, is present is decided by a flag inside the Format-Byte (CHA, see section 3.6). A TRIFLE-Tree can therefore also consist of only 3 sections instead of 4.

3.6 Format Byte

The Format-Byte defines properties of the serialization that a parser has to evaluate in order to parse the TRIFLE-Tree correctly. The Format-Byte has the following structure:

| Bit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Field | TOB | ALI | WORDSZ | WORDSZ | BYO | BIO | CHA | CHA |

The fields have the following meaning:

TOB (TRIFLE Original Flag): This flag states whether the TRIFLE-Tree is formed exactly as defined here. If a modified version of this specification was used for the serialization, this flag should be set to “0”. Otherwise it should be set to “1”.

ALI (Aligned Flag): This flag states whether the payload inside the Payload-Section is aligned. If the flag is set, every written payload size must be rounded up to a multiple of the word size defined in WORDSZ to obtain the size of the region the payload occupies. Example: for a Payload-Size of 42 Bytes and WORDSZ = 32 Bit, the transferred payload occupies 44 Bytes, of which the last 2 Bytes are filled with zeros and do not store any information.

WORDSZ (Word Size): These two bits define the word size that was used to generate the serialization. This is important for the alignment calculation mentioned above as well as for interpreting some payload types correctly. For example, a payload of type ARRAY_INT stores values that have the size of 1 word: 4 Bytes with 32-Bit words and 8 Bytes with 64-Bit words. The possible combinations have the following meaning:

| Bit 3 | Bit 4 | Meaning |
| --- | --- | --- |
| 0 | 0 | Word size equals 8 Bit |
| 0 | 1 | Word size equals 16 Bit |
| 1 | 0 | Word size equals 32 Bit |
| 1 | 1 | Word size equals 64 Bit |

BYO (Byte-Order): This flag is important for some payload types. For ARRAY_INT, for instance, it defines the order of the bytes within each integer value. If the flag is set, Big-Endian style is used. If it is not set, Little-Endian style is used. ATTENTION: This flag has no effect on VINT values, as they are always encoded in Big-Endian style.

BIO (Bit-Endianness): This flag defines the bit endianness within the whole message. Normally this flag should be set, which signals that the most significant bit comes first and the least significant bit last within a byte. In the inverted case, the flag should be unset.

CHA (Checksum Algorithm): These two bits decide which checksum is appended to the end of the serialization. The possible combinations have the following meaning:

| Bit 7 | Bit 8 | Meaning |
| --- | --- | --- |
| 0 | 0 | No checksum at all |
| 0 | 1 | CRC |
| 1 | 0 | MD5 |
| 1 | 1 | User-defined algorithm |
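Packing and unpacking the Format-Byte can be sketched as below. The sketch assumes that bit 1 of the table is the most significant bit of the byte, which agrees with the value 0xFE derived in the example of section 6. The class name, the field names and the string labels for the checksum algorithms are hypothetical.

```python
from dataclasses import dataclass

WORD_SIZES = (8, 16, 32, 64)                 # indexed by the two WORDSZ bits
CHECKSUMS = ("none", "crc", "md5", "user")   # indexed by the two CHA bits


@dataclass
class FormatByte:
    original: bool          # TOB
    aligned: bool           # ALI
    word_bits: int          # WORDSZ, one of 8, 16, 32, 64
    big_endian_bytes: bool  # BYO
    msb_first_bits: bool    # BIO
    checksum: str           # CHA, one of CHECKSUMS

    def pack(self) -> int:
        # Assumption: bit 1 (TOB) is the most significant bit of the byte.
        return (
            (self.original << 7)
            | (self.aligned << 6)
            | (WORD_SIZES.index(self.word_bits) << 4)
            | (self.big_endian_bytes << 3)
            | (self.msb_first_bits << 2)
            | CHECKSUMS.index(self.checksum)
        )

    @classmethod
    def unpack(cls, byte: int) -> "FormatByte":
        return cls(
            original=bool(byte & 0x80),
            aligned=bool(byte & 0x40),
            word_bits=WORD_SIZES[(byte >> 4) & 0x03],
            big_endian_bytes=bool(byte & 0x08),
            msb_first_bits=bool(byte & 0x04),
            checksum=CHECKSUMS[byte & 0x03],
        )
```

With this bit order, an original, aligned, 64-Bit, Big-Endian serialization with an MD5 checksum packs to 0xFE.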


3.7 Header Section

The Header-Section does not exist as a separate structure; it is a notion that helps to picture where all the metadata needed for a complete serialization is located. Syntactically, the Header-Section consists of exactly one Branch Descriptor, the Branch Descriptor of the Root Branch. As this descriptor is defined recursively, it ends up describing the whole tree.

3.8 Branch Descriptor

A Branch Descriptor describes the content of a single branch. It lists the associated branches with their association keys, and the associated leaves with their association keys and the sizes of their payloads in bytes.

The Branch Descriptor consists of 4 sections:

| Section Name | LC | BC | LDA | BDA |
| --- | --- | --- | --- | --- |
| Content | Leaf-Counter | Branch-Counter | Leaf-Entry-Array | Branch-Entry-Array |

The LC (Leaf-Counter) is a VINT32 value that stores the number of leaves associated with this branch.

The BC (Branch-Counter) is a VINT32 value that stores the number of branches associated with this branch.

The LDA (Leaf-Entry-Array) is an array of Leaf-Entries written one after another without gaps in between. The number of entries is given by the LC value.

The BDA (Branch-Entry-Array) is an array of Branch-Entries written one after another without gaps in between. The number of entries is given by the BC value.


3.9 Leaf-Entry

A Leaf-Entry describes one single leaf and consists of 4 sections:

| Section Name | KL | KEY | PS | TID |
| --- | --- | --- | --- | --- |
| Content | Key Length | Association Key | Payload-Size | Type-Identifier |

The KL (Key Length) is a VINT32 value that stores the size of the following KEY section in bytes.

The KEY (Association-Key) is a Byte-Array that stores the Association-Key.

The PS (Payload-Size) is a VINT64 value that stores the size of the payload in bytes.

ATTENTION: This size has to be recalculated if the Aligned Flag is set in the Format-Byte. In that case the PS value stores the size of the payload itself, not the size of the region the payload occupies within the Payload-Section. The difference between the two values is padded with zeros at the end of the region.

The TID (Type-Identifier) is a VINT32 value that stores the type number of the payload's type (for the well-known Type-Identifiers see section 3.13 of this document).

3.10 Branch-Entry

A Branch-Entry describes the content of a single branch. In contrast to a plain Branch-Descriptor, it is used only on levels below the Root Branch. It is a composition of an Association Key, as seen in the Leaf-Entry definition, and the Branch-Descriptor itself. This makes the definition of the Branch-Descriptor recursive. The recursion terminates when the deepest branch has been reached, because the Branch-Counter is zero there.

| Section Name | KL | KEY | BD |
| --- | --- | --- | --- |
| Content | Key Length | Association Key | Branch-Descriptor |

The KL (Key Length) and the KEY (Association-Key) have the same type and meaning as in the Leaf-Entry (see section 3.9).

The BD (Branch-Descriptor) is a Branch-Descriptor (see section 3.8) that describes the associated branch.
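The recursive definition of sections 3.8 to 3.10 translates almost directly into a recursive parser. The following is a minimal sketch, not a reference implementation. It assumes the VINT layout used throughout these sketches (range bits in the most significant bits of the first byte, range k occupying k + 1 bytes) and returns the tree as nested dictionaries, which is an arbitrary choice. All function names are hypothetical.

```python
def decode_vint(data: bytes, pos: int, range_bits: int) -> tuple:
    """Decode one VINT (range_bits=2: VINT32, 3: VINT64); return (value, next pos)."""
    rng = data[pos] >> (8 - range_bits)
    size = rng + 1
    if pos + size > len(data):
        raise ValueError("truncated VINT")
    word = int.from_bytes(data[pos:pos + size], "big")
    offset = word & ((1 << (8 * size - range_bits)) - 1)
    # The minimum of range k is the number of values covered by ranges 0..k-1.
    low = sum(1 << (8 * (k + 1) - range_bits) for k in range(rng))
    return low + offset, pos + size


def read_key(data: bytes, pos: int) -> tuple:
    length, pos = decode_vint(data, pos, 2)              # KL
    return bytes(data[pos:pos + length]), pos + length   # KEY


def parse_branch_descriptor(data: bytes, pos: int = 0) -> tuple:
    """Parse one Branch Descriptor; return (branch, position after it)."""
    leaf_count, pos = decode_vint(data, pos, 2)          # LC
    branch_count, pos = decode_vint(data, pos, 2)        # BC
    leaves, branches = {}, {}
    for _ in range(leaf_count):                          # Leaf-Entry-Array
        key, pos = read_key(data, pos)
        size, pos = decode_vint(data, pos, 3)            # PS (VINT64)
        type_id, pos = decode_vint(data, pos, 2)         # TID
        leaves[key] = (size, type_id)
    for _ in range(branch_count):                        # Branch-Entry-Array
        key, pos = read_key(data, pos)
        branches[key], pos = parse_branch_descriptor(data, pos)
    return {"leaves": leaves, "branches": branches}, pos


# A root branch with a single 2-byte leaf "A" of type 0x02 (ARRAY_BYTE).
tree, end = parse_branch_descriptor(bytes([0x01, 0x00, 0x01, 0x41, 0x02, 0x02]))
```

Since Python dictionaries keep insertion order, iterating over the leaves of the result depth-first yields them in the order in which their entries appear in the header.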

3.11 Payload Section

The Payload Section is one large Byte-Array in which the payloads are written one after another without gaps in between. The information in the Header Section makes it possible to reconstruct the boundaries of the payload entries and their position in the hierarchy of the serialized tree.

ATTENTION: Do not forget about the alignment, and recalculate the given payload size of every single payload element where necessary. If you forget this, your leaves will receive wrong content.
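The alignment rule can be made concrete with a short sketch. It rounds each payload size up to a multiple of the word size when the Aligned Flag is set and derives where each payload starts and ends inside the Payload-Section. The function names are hypothetical, and the sizes are assumed to be given in header order.

```python
def padded_size(payload_size: int, word_bits: int, aligned: bool) -> int:
    """Size of the region a payload occupies within the Payload-Section."""
    if not aligned:
        return payload_size
    word_bytes = word_bits // 8
    # Round up to the next multiple of the word size in bytes.
    return -(-payload_size // word_bytes) * word_bytes


def payload_offsets(sizes, word_bits: int, aligned: bool) -> tuple:
    """Return ([(start, end) of each payload], total length of the section)."""
    offsets, pos = [], 0
    for size in sizes:
        offsets.append((pos, pos + size))   # end excludes the zero padding
        pos += padded_size(size, word_bits, aligned)
    return offsets, pos
```

For the example from section 3.6, `padded_size(42, 32, True)` yields 44: the payload itself is 42 Bytes long and is followed by 2 zero Bytes.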

3.12 Checksum

The Checksum is a value of fixed size that allows you to determine whether any transmission errors occurred on the way between the source of the TRIFLE-Tree and you. Whether a checksum is really needed on this level depends on the application and the environment you work in. Normally there is no need for a checksum, because TRIFLE structures are transferred on such a high layer that the information has already passed 3 checksums on the layers below.

Depending on the value of the CHA bits within the Format-Byte (see section 3.6), the checksum has the following format:

| Bit 7 | Bit 8 | Meaning |
| --- | --- | --- |
| 0 | 0 | No checksum at all. |
| 0 | 1 | A CRC32 checksum is appended. It has a size of 32 bits and uses the polynomial 0x04C11DB7, written out below. |
| 1 | 0 | An MD5 checksum is appended (128 bits). |
| 1 | 1 | A user-defined algorithm is used. In this case you have to define your own format and validation algorithm. A library that implements TRIFLE should provide an API that allows its users to plug in their own checksum validator. |

Written out mathematically, the CRC32 polynomial 0x04C11DB7 is:

$$x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^{8} + x^{7} + x^{5} + x^{4} + x^{2} + x + 1$$
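Appending the checksum could look like the following sketch, which uses Python's standard `zlib` and `hashlib` modules. Several points are assumptions rather than part of this specification: that the checksum covers all preceding sections (Format-Byte, header and payload), that the CRC value is stored in Big-Endian order, and that the common CRC-32 variant computed by `zlib.crc32` is meant (it is based on the polynomial 0x04C11DB7, but also fixes bit reflection and initial and final values, which this document does not define). The function name and the algorithm labels are hypothetical.

```python
import hashlib
import zlib


def append_checksum(message: bytes, algorithm: str, user_checksum=None) -> bytes:
    """Return message plus checksum; algorithm mirrors the four CHA values."""
    if algorithm == "none":
        return message
    if algorithm == "crc":
        # Assumption: common CRC-32 variant, stored as 4 bytes, Big-Endian.
        return message + zlib.crc32(message).to_bytes(4, "big")
    if algorithm == "md5":
        return message + hashlib.md5(message).digest()   # 16 bytes
    if algorithm == "user":
        if user_checksum is None:
            raise ValueError("a user-defined checksum function is required")
        return message + user_checksum(message)
    raise ValueError(f"unknown checksum algorithm: {algorithm}")
```

Passing the user-defined algorithm as a callable is one way to provide the extension point that the last row of the table asks a library to offer.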


3.13 Well known Type-Identifier

The following Type-Identifiers are considered well-known and should be used only with the meaning given here. The rest of the VINT32 number space can be used freely. If you are looking for a primitive data type instead of an array type, treat a primitive value as an array of that type with a length of 1.

| Type-Identifier | Type-Label | Element format |
| --- | --- | --- |
| 0x01 | ARRAY_BOOLEAN | 8 bits per element; true = 0xAA, false = 0x55 |
| 0x02 | ARRAY_BYTE | 8 bits per element |
| 0x03 | ARRAY_SHORT | size of one element = 0.5 x WORD |
| 0x04 | ARRAY_INT | size of one element = WORD |
| 0x05 | ARRAY_LONG | size of one element = 2 x WORD |
| 0x06 | ARRAY_FLOAT | IEEE 754 single precision, 32 bits per element |
| 0x07 | ARRAY_DOUBLE | IEEE 754 double precision, 64 bits per element |
| 0x08 | ARRAY_CHAR | UTF-16, Big-Endian |
| 0x09 | ZIP | DEFLATE |
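To split a leaf's payload into elements, a parser needs the element size, which for some types depends on the word size from the Format-Byte. The sketch below derives it from the table above. The function name is hypothetical, and two choices are assumptions: ARRAY_CHAR is counted in 16-bit UTF-16 code units, and ZIP is reported as having no fixed element size.

```python
def element_bytes(type_id: int, word_bits: int):
    """Size in bytes of one element of a well-known type, or None if not fixed."""
    fixed_bits = {0x01: 8, 0x02: 8, 0x06: 32, 0x07: 64, 0x08: 16}
    if type_id in fixed_bits:
        return fixed_bits[type_id] // 8
    if type_id == 0x03:            # ARRAY_SHORT: 0.5 x WORD
        bits = word_bits // 2
    elif type_id == 0x04:          # ARRAY_INT: 1 x WORD
        bits = word_bits
    elif type_id == 0x05:          # ARRAY_LONG: 2 x WORD
        bits = 2 * word_bits
    else:                          # ZIP and user-defined types
        return None
    if bits % 8:
        # e.g. ARRAY_SHORT with 8-Bit words would need half a byte per element
        raise ValueError("element size is not a whole number of bytes")
    return bits // 8
```

With 64-Bit words an ARRAY_INT element takes 8 Bytes, so a leaf with a Payload-Size of 24 holds three integers.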

4 License

You are free to use the technology presented here for any kind of application, as long as the application is not intended to harm any living creature directly or indirectly. Using this technology is completely free of charge. The only restriction is that you have to mention the name of this technology (TRIFLE) in every technical specification of a product that uses TRIFLE, in internal as well as in published specifications.


5 About the Author

My name is Daniil Moerman. I am currently (2013) studying Computer Science in Germany and am interested in many scientific areas.

If you want to contact me to get further information about TRIFLE, I recommend using the following e-mail address:

[email protected]

(I accept e-mails written in German, English or Russian.)

6 Example of a TRIFLE-Tree

Let's assume you have the following tree structure, which you want to serialize to a byte stream that meets the requirements of the TRIFLE definition in this document.

┌ Root Branch
├--- Branch : ABC
├------ Leaf : XYZ [type:ARRAY_FLOAT] {0.678,0.445,9.6556}
├------ Leaf : SyMbOL [type:ARRAY_INT] {1,566,87654}
├------ Leaf : Param01 [type:ARRAY_CHAR] {a,b,c,d,e}
├--- Branch : Next_Branch
├------ Leaf : Truth [type:ARRAY_BOOLEAN] {true,false,false}

Word Size = 64 Bit
Byte-Endianness = Big-Endian
Bit-Endianness = Big-Endian
Checksum-Algorithm = MD5
Checksum Wished = YES
Aligned = YES

I will now calculate the resulting byte sequence step by step. Let us first look at the settings listed directly above, which specify the format of the TRIFLE structure, and form the appropriate Format-Byte from them. According to section 3.6, which defines the structure of the Format-Byte, I get the following value:

11111110 = 0xFE

The second step is to calculate the Header-Section, which is a little more difficult. That is why I first write the binary string in human-readable form before encoding it as a sequence of bytes:


{ROOT [LeafNr=0] [BranchNr=2]
  {Branch:ABC [LeafNr=3] [BranchNr=0]
    [Leaf:XYZ Type:ARRAY_FLOAT Size:12]
    [Leaf:SyMbOL Type:ARRAY_INT Size:24]
    [Leaf:Param01 Type:ARRAY_CHAR Size:10]}
  {Branch:Next_Branch [LeafNr=1] [BranchNr=0]
    [Leaf:Truth Type:ARRAY_BOOLEAN Size:3]}}

Now we replace all the noted elements with the appropriate TRIFLE elements:

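Root Branch (LC = 0, BC = 2):
0x00 0x02

Branch-Entry "ABC" (KL = 3, KEY, LC = 3, BC = 0):
0x03 0x41 0x42 0x43 0x03 0x00

Leaf-Entry "XYZ" (KL = 3, KEY, PS = 12, TID = 0x06):
0x03 0x58 0x59 0x5A 0x0C 0x06

Leaf-Entry "SyMbOL" (KL = 6, KEY, PS = 24, TID = 0x04):
0x06 0x53 0x79 0x4D 0x62 0x4F 0x4C 0x18 0x04

Leaf-Entry "Param01" (KL = 7, KEY, PS = 10, TID = 0x08):
0x07 0x50 0x61 0x72 0x61 0x6D 0x30 0x31 0x0A 0x08

Branch-Entry "Next_Branch" (KL = 11, KEY, LC = 1, BC = 0):
0x0B 0x4E 0x65 0x78 0x74 0x5F 0x42 0x72 0x61 0x6E 0x63 0x68 0x01 0x00

Leaf-Entry "Truth" (KL = 5, KEY, PS = 3, TID = 0x01):
0x05 0x54 0x72 0x75 0x74 0x68 0x03 0x01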

This was the most complicated part. Writing a recursive algorithm for this task is very easy in both directions, serialization and deserialization. Now we write the Payload-Section. First the human-readable form:

{Payload : XYZ} {Payload : SyMbOL} {Payload : Param01} {Payload : Truth}

And now the binary form:

XYZ (12 bytes of data, 4 bytes of padding):
0x3F 0x2D 0x91 0x68 0x3E 0xE3 0xD7 0x0A 0x41 0x1A 0x7D 0x56 0x00 0x00 0x00 0x00

SyMbOL (24 bytes of data, no padding):
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x36 0x00 0x00 0x00 0x00 0x00 0x01 0x56 0x66

Param01 (10 bytes of data, 6 bytes of padding):
0x00 0x61 0x00 0x62 0x00 0x63 0x00 0x64 0x00 0x65 0x00 0x00 0x00 0x00 0x00 0x00

Truth (3 bytes of data, 5 bytes of padding):
0xAA 0x55 0x55 0x00 0x00 0x00 0x00 0x00

Now we only have to calculate the MD5 hash value of the whole binary message we just created. This can be done with an existing library. For the content in this example the MD5 value is:

0xB6 0x97 0xF5 0x2F 0x27 0x05 0x3E 0xC0 0xC2 0x78 0x48 0xF1 0xA3 0x82 0x5B 0x9B


After concatenating all four regions we get a complete TRIFLE-Tree serialization.

I hope this document and the example are enough to convey the requirements a TRIFLE implementation has to meet, and I would really like to see at least as many implementations as there are programming languages.

7 Where can TRIFLE help me?

I see the following applications as possible domains for the TRIFLE specification:

• RPC

• Storing tree structures

• Interprocess Communication


8 Visual Overview of TRIFLE
