Introduction to Memoria

The Application-Specific Data Structures Toolkit

Motivation I

● Memory is cheap, but fast memory will always be expensive and limited in size.

● Random access to DRAM is relatively slow and hasn't improved significantly for decades (it takes about 50 ns).

● Sequential access is 10 to 1000 times faster.

● Memory is hierarchical: the faster the memory, the smaller its size.

● This limitation is physical (the speed of light is finite).

● We need to fit as much information into the limited memory as possible.

● And we need data structures that exploit fast sequential access instead of slow random access.


Motivation II

● Most data structures for main memory haven't changed in decades. They are still based on directly mapped linked structures whose links are represented with memory pointers.

● Pointer operations have O(1) theoretical complexity in the flat RAM model, but in hierarchical memory this is no longer true.

● Moreover, pointers consume too much memory, especially on 64-bit architectures. STL std::set<BigInt> consumes 40 bytes for every tree node (Linux, GCC 4.6), which is 5 times the size of the data (sizeof(BigInt) = 8 bytes). An std::set<BigInt> containing 1M elements will consume 40 MiB of memory.

● The situation is even worse for the in-memory representation of structured documents (XML, HTML, ODF, etc.), where the document representation consumes 100+ times more memory than the raw data (check Firefox's memory usage).

● Linked-list-based data structures are simple and flexible; they just do their job. But how well do they perform on modern hardware? Let's compare the performance of two different implementations of a balanced search tree, one of the most fundamental data structures.


Balanced Search Trees

● Let's consider a balanced partial-sums tree representing an ordered sequence of numbers (this structure is used in Memoria).

● Instead of implementing it with the well-known linked nodes, let's pack it into an array: PackedTree and PackedSet.

● We are not limiting ourselves to the binary case: PackedTree is a multiary tree.

● Let's check its performance under various conditions and compare it with std::set<> (which is based on a binary red-black tree).
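To make the layout concrete, here is a minimal sketch of a partial-sums search over an array-packed structure. It is an illustration only, with a single index level and an assumed fanout of 16; the real PackedTree is a full multiary tree:

    // Minimal illustration of array-packed partial-sums search (not the
    // actual PackedTree code). One index level summarizes blocks of the
    // leaf array; lookups scan sequentially instead of chasing pointers.
    #include <cstdint>
    #include <vector>

    class PackedSumSearch {
        static constexpr size_t Fanout = 16;  // assumed fanout
        std::vector<int64_t> leaves_;         // the ordered sequence
        std::vector<int64_t> sums_;           // one sum per block of leaves

    public:
        explicit PackedSumSearch(std::vector<int64_t> values)
            : leaves_(std::move(values)),
              sums_((leaves_.size() + Fanout - 1) / Fanout, 0) {
            for (size_t i = 0; i < leaves_.size(); ++i)
                sums_[i / Fanout] += leaves_[i];  // build the index level
        }

        // Index of the leaf where the running prefix sum first reaches
        // `target`: a sequential scan of the index, then of one block.
        size_t find(int64_t target) const {
            size_t block = 0;
            while (block < sums_.size() && target > sums_[block])
                target -= sums_[block++];
            size_t i = block * Fanout;
            while (i < leaves_.size() && target > leaves_[i])
                target -= leaves_[i++];
            return i;
        }
    };

The sequential scans are exactly the cache-friendly part: within a node the next comparison touches the adjacent memory location, so only the descent between levels pays the random access cost.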


PackedTree Performance Analysis

● The benchmark PC has an Intel Q9400 CPU @ 3 GHz with 2x3 MB of L2 cache and 8 GB of DDR2 RAM @ 750 MHz, running Fedora Core 16 with GCC 4.6.3.

● When the packed tree fits into the CPU cache, trees with low fanout perform better.

● But when the data resides in main memory, trees with high fanout perform better (except for the 64-child one).

● This is because a balanced tree with high fanout has fewer levels, which means fewer random access operations; the search for the next child within a node is sequential and fast.

● DRAM has high latency, and the CPU loses hundreds of cycles on every cache miss (at 3 GHz, a 50 ns DRAM access costs about 150 cycles). These lost cycles are a hidden reserve of performance for data structures.

● The linked-list-based STL std::set<> is much faster than PackedTree when the data structure fits into the CPU cache, but PackedTree is much faster otherwise.

● This is despite the fact that the packed tree performs far more arithmetic operations to find a single element of a sequence than the pointer-based red-black tree inside std::set<>.


Motivation III

● The situation is even worse with external memory, which is so slow at random access that every external I/O operation must be taken into account.

● No one size fits all here. Efficient data structures for external memory are always application specific or workload specific (think NoSQL).

● This is true not only for external memory: intelligent data processing requires complex data structures such as searchable sequences and bit vectors.

● Giant volumes of hot structured data need succinct and compressed versions of data structures.

● This is why application-specific data structures are the black magic of computer science: each of them incorporates a significant amount of hardcore knowledge.

● The only way to accelerate progress in this area is solution sharing. To make this possible we need a development framework, or toolkit, for data structures in generalized hierarchical memory (from CPU registers to data grids).


What is Memoria?

● Memoria is to data structures what Qt is to GUIs and LLVM is to program code. It separates the logical representation of data structures from their physical representation in hierarchical memory. The only universal way to achieve high performance is to implicitly reorganize the physical data layout according to workload-specific access patterns.

● The word is Latin and can be translated as "memory". Memoria was the term for the aspects involving memory in Western classical rhetoric.

● Memoria is written in C++ and relies heavily on template metaprogramming to construct data structures from basic building blocks. It provides an STL-like API for client code. GCC 4.6 and Clang 3.0 are supported.

● It uses a modular design with separation of concerns. Physical memory block management is isolated behind the Allocator interface.

● The default Allocator implementation (SmallInMemoryAllocator) provides serialization of the allocator's state to a stream and copy-on-write-based transactions.

● The project core provides a templated balanced search tree, as well as several basic data structures built on top of it: Map, Set, Vector, and VectorMap.
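To give a feel for the STL-like API, here is a hypothetical usage sketch; the header path, the constructor signature, and the commit() call are assumptions for illustration, not the project's confirmed interface:

    // Hypothetical usage sketch; names and signatures are assumed.
    #include <memoria/memoria.hpp>  // assumed umbrella header

    using namespace memoria;

    int main() {
        // Containers live inside an allocator that manages memory blocks
        // and provides copy-on-write transactions.
        SmallInMemoryAllocator allocator;

        Set<BigInt> set(allocator);      // balanced-tree-backed ordered set
        for (BigInt v = 0; v < 1000; v++)
            set.insert(v);

        BigInt sum = 0;
        for (auto it = set.begin(); it != set.end(); ++it)
            sum += *it;                  // STL-like ordered iteration

        allocator.commit();              // assumed transaction boundary
        return sum == 499500 ? 0 : 1;    // 0 + 1 + ... + 999
    }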


The Hidden Reserve of Performance

● The fundamental data structure of Memoria is a balanced tree of size-limited array-packed trees: something like a B-tree, but with many differences...

● Memoria's balanced tree is generalized by design and can be customized into various types of balanced search trees.

● It is transactional and abstracted from the physical memory manager (the Allocator).

● So a huge amount of computational work is performed whenever the balanced tree is read or written.

● But, as benchmarks show, Memoria's Set<> is only about 3 times slower than the lightweight PackedSet when most of the tree is in RAM.

● And in this case it is even faster than STL std::set<>.

● Even for a linear ordered scan. See the benchmark charts...


Memoria Update Performance

● A cool feature of the Memoria balanced search tree is batch update operations.

● If multiple updates are performed in a batch, some of the computational work can be shared among the individual update operations.

● There are two kinds of batching: batch insertions/deletions and transactional grouping.

● There is no upper limit on batch or transaction size.

● In our benchmarks, the random insertion rate went from about 450K/sec for single inserts to 20M/sec for 128-element batches, and finally reached more than 80M/sec for batches of 10K+ elements.

● Placing several updates in one transaction does not bring such a giant performance improvement: only 3-5 times. See the benchmark charts...
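A toy model of why batch insertion amortizes cost (an illustration, not Memoria code): inserting k consecutive elements into a packed leaf shifts the tail once for the whole batch instead of once per element, and the index above the leaf can be updated once per batch:

    // Toy model of batch insertion into a packed leaf (not Memoria code).
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Leaf {
        std::vector<int64_t> data;

        // One tail shift for the whole batch. A real tree would also
        // propagate a single (+count, +sum) update up the index path
        // per batch, instead of one update per element.
        void insert_batch(std::size_t pos, const int64_t* batch, std::size_t k) {
            data.insert(data.begin() + pos, batch, batch + k);
        }
    };

    int main() {
        Leaf leaf{{1, 2, 9}};
        const int64_t batch[] = {3, 4, 5};
        leaf.insert_batch(2, batch, 3);   // leaf.data is now {1, 2, 3, 4, 5, 9}
    }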


Update Rate Limits

● How fast can updates be? Memoria uses 4K memory blocks for search tree nodes by default; variable block sizes are supported.

● Let's memmove() data within randomly selected 4K blocks at randomly selected indexes, emulating insertions into blocks (a sketch follows this list).

● For arrays that don't fit in the CPU cache we get about 2M moves/sec and about 4 GB/sec of memory throughput (on our Q9400 benchmark PC).

● So 2M insertions per second is an upper bound on the random individual insertion rate of our balanced tree.

● From the previous benchmark, 450K insertions/sec is not very far from this limit.

● The Memoria core data structure does not introduce significant performance overhead over the hardware memory level.

● And there is still room for improvement.
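The benchmark can be reproduced with a sketch along these lines (block count, element type, and iteration count are assumptions):

    // Sketch of the memmove micro-benchmark described above.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <random>
    #include <vector>

    int main() {
        constexpr size_t BlockElems = 4096 / sizeof(int64_t); // 4K blocks
        constexpr size_t NumBlocks  = 1 << 16;  // 256 MiB total, exceeds cache
        std::vector<int64_t> memory(BlockElems * NumBlocks, 0);

        std::mt19937_64 rng(42);
        std::uniform_int_distribution<size_t> pick_block(0, NumBlocks - 1);
        std::uniform_int_distribution<size_t> pick_index(0, BlockElems - 2);

        for (size_t op = 0; op < 1000000; ++op) {
            int64_t* block = &memory[pick_block(rng) * BlockElems];
            size_t idx = pick_index(rng);
            // Shift the tail of the block right by one element,
            // exactly as an insertion into the block would.
            std::memmove(block + idx + 1, block + idx,
                         (BlockElems - idx - 1) * sizeof(int64_t));
            block[idx] = (int64_t)op;  // place the "inserted" element
        }
        return 0;
    }

Timing the loop yields the moves/sec and bytes/sec figures quoted above.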


Dynamic Vector

● Dynamic Vector is based on a partial-sums key-value map where the key is the offset of a data block in the vector's index space and the value is the ID of that block (a toy model is sketched after this list).

● The partial-sums tree provides insert/remove/access operations with O(log N) worst-case complexity.

● It is not a replacement for std::vector<> when fast random access is required.

● But sequential read throughput is limited only by main memory. The benchmark shows that it comes quite close to this limit (4 GB/sec) even for 4K blocks.

● Random access throughput for 4K blocks comes close to 1.5 GB/sec, which is very high in absolute numbers :)

● Random access read performance for 128-byte blocks is about 850K ops/sec. See the benchmark charts...
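A minimal model of the lookup, using std::map as a stand-in for the partial-sums tree (illustrative only; the real structure keeps blocks inside the balanced tree):

    // Toy model of Dynamic Vector block lookup (not Memoria code).
    #include <cstdint>
    #include <map>
    #include <vector>

    struct DynVectorModel {
        // key = starting offset of a data block in the vector's index
        // space; value = the block itself (standing in for a block ID).
        std::map<int64_t, std::vector<uint8_t>> blocks;

        uint8_t read(int64_t pos) const {
            auto it = blocks.upper_bound(pos); // first block starting after pos
            --it;                              // the block containing pos
            return it->second[pos - it->first];
        }
    };

Both the offset search and the in-block access are logarithmic, matching the O(log N) complexity bullet above.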


Vector Update Performance

● Insertions are slower than reads, mainly because data must be moved within data blocks and the index tree must be updated for the new data.

● In absolute numbers: while sequential read comes close to 4 GB/sec, sequential append reaches only 1.2 GB/sec on our benchmark PC.

● Random insert memory throughput for 16K blocks comes close to 600 MB/sec, which is much more than current HDDs/SSDs are able to consume.

● Random insert performance for 128-byte blocks is about 550K writes/sec, not far from the 2M/sec practical limit.

● Vector does not introduce significant overhead over the hardware memory level for random insertions.

● See the benchmark charts...


VectorMap

● VectorMap is a mapping from BigInt (64-bit) keys to regions of a dynamic vector.

● It is a combination of a dynamic vector and a set of (Key, Offset) pairs represented with a two-key partial-sums tree (a toy model is sketched after this list).

● It is relatively succinct: only 8 or 16 bytes of overhead per entry (not counting internal search tree nodes). For 256-byte values the total overhead is less than 10%.

● 300K random reads/sec and 200K random writes/sec for 128-byte values.

● 3.8/1.2 GB/sec memory read/write throughput for 256 KB values.

● Up to 2/0.3 GB/sec sequential read/write throughput for 256-byte values.

● It also supports batch updates.
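A toy model of the VectorMap layout (with std::map standing in for the two-key partial-sums tree; not Memoria code): each key maps to the (offset, size) region of its value inside one shared dynamic vector:

    // Toy model of VectorMap (not Memoria code).
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct VectorMapModel {
        std::map<int64_t, std::pair<int64_t, int64_t>> index; // key -> (offset, size)
        std::vector<uint8_t> data;                            // all values, concatenated

        // Append-only put for brevity; real inserts into the middle would
        // shift the following offsets via the partial-sums tree.
        void put(int64_t key, const std::vector<uint8_t>& value) {
            index[key] = {(int64_t)data.size(), (int64_t)value.size()};
            data.insert(data.end(), value.begin(), value.end());
        }

        std::vector<uint8_t> get(int64_t key) const {
            auto [offset, size] = index.at(key);
            return {data.begin() + offset, data.begin() + offset + size};
        }
    };

In the real structure the index entries are packed, which is where the 8 or 16 bytes per entry figure comes from; the std::map here is only for clarity.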


Roadmap

● Better alignment with modern theoretical results for balanced trees (cache-oblivious structures, etc.).

● Multithreading with MVCC-like conflict resolution.

● Native integration with various virtual machines built on top of the LLVM JIT.

● Extending core support for external memory.

● Shared memory support for allocators to share data structures between processes.

● Variable block size support.

● Dynamic bit vector with rank()/select() operations.

● LOUDS/DFUDS succinct trees.

● Generic searchable sequences of 1- to 8-bit symbols on top of Vector/VectorMap.

● Multiary Wavelet Tree and searchable sequences over arbitrary alphabets.

● Full-text search indexes for NLP and other applications.

● Succinct graph representation.

● Etc...



Memoria is still in active development, but your contributions are welcome.

If you are interested, please follow us on Bitbucket:

http://bitbucket.org/vsmirnov/memoria

For any questions, feel free to contact me: Victor Smirnov <[email protected]>