PDQCollections: A Data-Parallel Programming Model and Library for Associative Containers
Maneesh Varshney Vishwa Goudar
Technical Report #130004
Computer Science Department
University of California, Los Angeles
April, 2013
ABSTRACT

Associative containers are content-addressable data structures, such as maps, ordered maps, multimaps and sets, that are widely employed in a variety of computational problems. In this paper, we explore a parallel programming paradigm for data-centric computations involving associative data. We present PDQCollections, a novel set of data structures, coupled with a computation model (which we refer to as the Split-Replicate-Merge model), that can efficiently exploit parallelism in multi-core as well as distributed environments, for in-memory as well as large datasets. The distinguishing characteristics of our programming model are data structures that inherently encapsulate parallelism and a computation model that transforms the problem of parallelization into that of defining an addition operator for value data types. The PDQ design offers fundamental benefits over traditional data structures: with memory-bound workloads, PDQ avoids locks and other forms of synchronization, and significantly outperforms lock-based data structures; with larger disk-bound workloads, PDQ does not use caches and avoids random disk access, and thus significantly outperforms traditional disk-backed structures; and with distributed workloads, PDQ accesses remote data resources sequentially and only once, and significantly outperforms distributed data structures. We highlight the distinguishing capabilities of the PDQ library with several applications drawn from a variety of fields in Computer Science, including machine learning, data mining, graph processing, relational processing of structured data and incremental processing of log data.
1. INTRODUCTION

Associative containers, also commonly referred to as hashes,
hash tables, dictionaries or maps, are a cornerstone of programming template libraries. They provide standardized, robust and easy-to-use mechanisms to efficiently store and access key-value mappings. In this paper, we explore data-centric programming problems where the input, the output or both are associative containers. Such computations are commonly employed in the fields of document processing, data mining, data analytics and machine learning, statistical analysis, log analysis, natural language processing, indexing and so on. In particular, we seek a data-parallel programming framework for associative data, where the parallelism can scale from multi-core to distributed environments, the data can scale from in-memory to disk-backed to distributed storage, and the programming paradigm is as close as possible to natural sequential programming patterns.
Data parallelism with associative containers differs from that with index-based data structures such as arrays and matrices. The most familiar parallel programming paradigm for the latter is the parallel for loop, as exemplified in OpenMP, Intel's Threading Building Blocks (TBB) and Microsoft's .NET Task Parallel Library (TPL). In this model, the index range of the for loop is partitioned and each partition of the range is assigned to a separate thread. However, this paradigm requires that the input to the computation (which has to be partitioned for parallelization) be index-addressable. Secondly, as the outputs generated by each thread must be protected against concurrent modification, each thread must write data to non-overlapping memory regions or use critical sections or concurrency-safe containers. Maps are content-addressable, and consequently can serve as neither input nor output in a parallel for computation. Furthermore, in the data-parallel programming context, synchronization locks are frequently contended, leading to significant locking overhead, which we confirmed in our benchmarks. Finally, these libraries are typically meant for shared-memory, memory-bound workloads, and offer little support for, or perform poorly with, distributed systems and persistent data stores.
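To make the locking overhead concrete, the following sketch (illustrative only; this is a generic lock-based pattern, not PDQ code, and all names are ours) shows a word count in which a partitioned "parallel for" over the input forces every thread to funnel its updates through one shared concurrency-safe container:

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative only: the lock-based pattern discussed in the text, not PDQ code.
// A word count where a partitioned "parallel for" over the input forces all
// threads to synchronize on a single shared associative container.
public class SharedMapCount {
    public static Map<String, Integer> count(List<String> words, int nThreads) {
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[nThreads];
        int chunk = (words.size() + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(words.size(), lo + chunk);
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    // Every update contends on the shared map's internal locks.
                    counts.merge(words.get(i), 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counts;
    }
}
```

Under high update rates, the contention on the shared map is exactly the overhead that our benchmarks measure.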
In our study of data-parallel programming paradigms, we did not discover any library or framework that can: (a) operate on associative containers, (b) execute in shared-memory multi-threaded as well as distributed contexts, (c) support data sizes that scale from in-memory to disk-backed, and (d) offer parallelization constructs that are as close as possible to the natural sequential and object-oriented style of programming. Towards this last point, we note that the widely acclaimed Map-Reduce model, owing to the functional nature of its programming framework, does not provide programmers with the familiar abstract data type of associative containers.
In this paper, we present PDQCollections,¹ a novel set of data structures, coupled with a computation model, for exploiting data parallelism in associative data sets. PDQCollections is a comprehensive library of collection classes that implement the native associative container interfaces: map, multimap, ordered map, ordered multimap, set, sorted set and others. We have also proposed a computation model, which we refer to as the Split-Replicate-Merge (SRM) model, that transparently and efficiently supports and exploits parallelism in multi-core as well as distributed environments, over data scales that range from memory-bound to disk-backed. We have shown an equivalence between the SRM and Map-Reduce (MR) models by mutual reducibility; that is, any problem that can be solved by one model can be solved by the other.

¹PDQ could stand for Processes Data Quickly
The distinguishing characteristic of our programming model is encapsulating the parallelism within the data structure implementations (the PDQCollections classes) rather than the program code. Object-oriented programming encourages encapsulating distinguishing characteristics within the implementation of objects, while providing familiar interfaces to the programmer. For example, BerkeleyDB's StoredMap abstracts the knowledge that the data is backed on disk, while providing the familiar Map interface; GUI libraries hide platform-specific characteristics behind the implementation, while providing the same widget interface; and Remote Method Invocation hides the fact that the objects are remotely located. Traditional approaches to data-parallel programming, however, have instead relied on modifying the computation code, either by extending the grammar of a language (e.g. the parallel for loop) or by enforcing alternate paradigms (e.g. functional programming in Map-Reduce).
By encapsulating parallelism within the data structures, the programming model cleanly separates the code for the actual computation from the code for parallelization. Consider the analogy of semaphores vs. monitors: semaphores require introducing the logic of concurrency within the code, as opposed to monitors, where all such logic is captured separately within the monitor object. Similarly, with our programming model, the logic of parallelism is not interspersed within the code; rather, it is expressed separately. As a concrete example, we shall discuss the Frequent Pattern (FP) growth algorithm in Section 5.2, where the input data is processed to generate an FP-tree data structure. Traditional methods of parallelism would require modifying this tree-generation code by embedding parallel constructs within it. With our model, the programmer only needs to separately define how to add two trees. We believe our design choice leads to a cleaner separation of functionality, resulting in a flexible and robust system architecture.
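The idea that parallelization reduces to defining an addition operator for the value type can be sketched as follows (a hypothetical illustration in our own names, not the PDQ API): given a "+" for values, two containers built independently by different workers merge key-by-key.

```java
import java.util.*;
import java.util.function.BinaryOperator;

// Hypothetical sketch (our names, not the PDQ API): once the programmer
// supplies an addition operator for the value type, two maps built
// independently can be combined with a generic key-by-key merge.
public class ValueMerge {
    public static <K, V> Map<K, V> merge(Map<K, V> a, Map<K, V> b,
                                         BinaryOperator<V> plus) {
        Map<K, V> out = new HashMap<>(a);
        for (Map.Entry<K, V> e : b.entrySet()) {
            // Keys present in both maps have their values added;
            // keys present in only one map are copied through unchanged.
            out.merge(e.getKey(), e.getValue(), plus);
        }
        return out;
    }
}
```

For the FP-growth example, V would be the FP-tree type and `plus` the tree-addition routine; that routine is the only parallelism-related code the programmer writes.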
The SRM computation model employs a divide-and-conquer strategy for parallelization. We have proposed a novel shared-nothing strategy for merging associative containers, which obviates the need for any explicit or underlying synchronization, concurrent access or transactional processing. By avoiding locking overheads, PDQ containers significantly outperform lock-based data structures such as ConcurrentMaps. The same strategy, when applied to disk-bound workloads, yields a cache-free design and ensures that data is always read from and written to disk sequentially. By avoiding cache-miss penalties and random disk seek overheads, PDQ containers significantly outperform traditional disk-backed data structures such as BerkeleyDB.
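The shared-nothing strategy can be sketched as follows (a hedged reading of the Split-Replicate-Merge idea in our own names, not the library's code): the input is split, each worker fills its own private replica of the container with no synchronization whatsoever, and the replicas are merged afterwards using the addition operator on values.

```java
import java.util.*;

// Hedged sketch of the Split-Replicate-Merge strategy as we read it from the
// text (class and method names are ours, not the PDQ API): each worker fills
// its own private map lock-free, and the replicas are merged afterwards.
public class SrmCount {
    public static Map<String, Integer> count(List<String> words, int nThreads) {
        List<Map<String, Integer>> replicas = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        int chunk = (words.size() + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            Map<String, Integer> local = new HashMap<>(); // shared-nothing replica
            replicas.add(local);
            int lo = t * chunk, hi = Math.min(words.size(), lo + chunk);
            Thread w = new Thread(() -> {
                for (int i = lo; i < hi; i++)                    // split: each worker
                    local.merge(words.get(i), 1, Integer::sum);  // sees its own partition
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        // Merge: fold the replicas together with the value-addition operator.
        // This is the only step that touches combined state, and it runs
        // strictly after all workers have finished, so no locks are needed.
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> r : replicas)
            for (Map.Entry<String, Integer> e : r.entrySet())
                result.merge(e.getKey(), e.getValue(), Integer::sum);
        return result;
    }
}
```

Because no container is ever shared between running workers, there is nothing to lock; the same structure maps onto disk-backed replicas (merged by sequential reads) and distributed replicas.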
The PDQ collection classes are versatile in managing data over a wide range of scales. Initially, the data is stored in-memory, but as the size of the container grows, the data is transparently and automatically spilled over to disk, where it is stored in a format most efficient for further processing. The disk-backed form of the container objects can store data on multiple disks, if present, to improve disk I/O speeds. They have a distributed implementation as well, where the data can be flexibly stored at a common location (e.g. SAN, NFS) or in distributed storage.
The rest of the paper is organized as follows: in Section 2 we provide a formal description of the programming model, and in Section 3 we discuss our implementation. In Section 4, we conduct a comparative design and performance analysis against concurrent data structures (for memory-bound computation), BerkeleyDB's StoredMap (for disk-backed computation) and Hazelcast (for distributed computation), highlighting the salient characteristics of our programming model that explain how PDQ significantly outperforms these traditional data structures. We illustrate several applications of PDQ in different fields of Computer Science in Section 5, and conclude in Section 6.
2. DESIGN

In this section, we provide a formal description of the programming model, including the data model, which formalizes the problems that