PDQCollections: A Data-Parallel Programming Model and Library for Associative Containers
Maneesh Varshney Vishwa Goudar
Technical Report #130004
Computer Science Department
University of California, Los Angeles
April, 2013
ABSTRACT
Associative containers are content-addressable data structures, such as maps, ordered maps, multimaps and sets, that are widely employed in a variety of computational problems. In this paper, we explore a parallel programming paradigm for data-centric computations involving associative data. We present PDQCollections, a novel set of data structures coupled with a computation model (which we refer to as the Split-Replicate-Merge model), that can efficiently exploit parallelism in multi-core as well as distributed environments, for in-memory as well as large datasets. The distinguishing characteristics of our programming model are the design of data structures that inherently encapsulate parallelism and a computation model that transforms the problem of parallelization into that of defining an addition operator for value data types. The PDQ design offers fundamental benefits over traditional data structures: with memory-bound workloads, PDQ avoids locks and other forms of synchronization, and is able to significantly outperform lock-based data structures; with larger disk-bound workloads, PDQ does not use caches and avoids random disk access, and thus significantly outperforms traditional disk-backed structures; and with distributed workloads, PDQ accesses remote data resources sequentially and only once, and is able to significantly outperform distributed data structures. We highlight the distinguishing capabilities of the PDQ library with several applications drawn from a variety of fields in Computer Science, including machine learning, data mining, graph processing, relational processing of structured data and incremental processing of log data.
1. INTRODUCTION
Associative containers, also commonly referred to as hashes, hash tables, dictionaries or maps, are a cornerstone of programming template libraries. They provide standardized, robust and easy-to-use mechanisms to efficiently store and access key-value mappings. In this paper, we explore data-centric programming problems where either the input or the output or both are associative containers. Such computations are commonly employed in the fields of document processing, data mining, data analytics and machine learning, statistical analysis, log analysis, natural language processing, indexing and so on. In particular, we seek a data-parallel programming framework for associative data, where the parallelism can scale from multi-core to distributed environments, the data can scale from in-memory to disk-backed to distributed storage, and the programming paradigm is as close as possible to the natural sequential programming patterns.
The problems of data parallelism with associative containers are unlike those with index-based data structures such as arrays and matrices. The most familiar parallel programming paradigm for the latter is the parallel for loop, as exemplified in OpenMP, Intel's Threading Building Blocks (TBB) and Microsoft's .NET Task Parallel Library (TPL). In this model, the index range of the for loop is partitioned and each partition of the range is assigned to a separate thread. However, this paradigm requires that the input to the computation (which has to be partitioned for parallelization) be index-addressable. Secondly, as the outputs generated by each thread must be protected against concurrent modifications, each thread must write data in non-overlapping memory regions or use critical sections or concurrent-safe containers. Maps are content-addressable, and consequently can serve as neither input nor output in a parallel for computation. Furthermore, in the data-parallel programming context, the synchronization locks are often contended, leading to a significant locking overhead, which we confirmed in our benchmarks. Finally, these libraries are typically meant for shared-memory and memory-bound workloads, and offer little support for, or perform poorly with, distributed systems and persistent data stores.
In our study of data-parallel programming paradigms, we did not discover any library or framework that can: (a) operate on associative containers, (b) execute in shared-memory multi-threaded as well as distributed contexts, (c) support data size that scales from in-memory to disk-backed, and (d) have parallelization constructs that are as close as possible to the natural sequential and object-oriented style of programming. Towards this last point, we note that the widely acclaimed Map-Reduce model, owing to the functional nature of the programming framework, does not provide to the programmers the familiar Abstract Data Type of associative containers.
In this paper, we present PDQCollections¹, a novel set of data structures, coupled with a computation model, for exploiting data parallelism in associative data sets. PDQCollections is a comprehensive library of collection classes that implement the native associative container interfaces: map, multimap, ordered map, ordered multimap, set, sorted set and others. We have also proposed a computation model, which we refer to as the Split-Replicate-Merge (SRM) model, that transparently and efficiently supports and exploits parallelism in multi-core as well as distributed environments, over data scales that range from memory-bound to disk-backed. We have shown an equivalence between the SRM and the map-reduce (MR) model by mutual reducibility, that is, any problem that can be solved by one model can be solved by the other.

¹ PDQ could stand for Processes Data Quickly.
The distinguishing characteristic of our programming model is encapsulating the parallelism within the data structure implementations (the PDQCollections classes) rather than the program code. Object-oriented programming encourages encapsulating the distinguishing characteristics within the implementation of the objects, while providing familiar interfaces to the programmer. For example, BerkeleyDB's StoredMap abstracts the knowledge that the data is backed on disk, while providing the familiar Map interface; GUI libraries hide the platform-specific characteristics behind the implementation, while providing the same widget interface; and Remote Method Invocation hides the fact that the objects are remotely located. However, traditional approaches for data-parallel programming have instead relied on modifying the computation code, either by extending the grammar of a language (e.g. the parallel for loop), or by enforcing alternate paradigms (e.g. functional programming in Map-Reduce).
By encapsulating parallelism within the data structures, the programming model is able to cleanly separate the code for actual computation from the code for parallelization. Consider the analogy of semaphores vs. monitors: semaphores require introducing the logic of concurrency within the code, as opposed to monitors, where all logic is captured separately within the monitor object. Similarly, with our programming model, the logic of parallelism is not interspersed within the code; rather, it is expressed separately. As a concrete example, we shall discuss the Frequent Pattern (FP) growth algorithm in Section 5.2, where the input data is processed to generate an FP tree data structure. The traditional methods of parallelism would require modifying this tree-generation code by embedding parallel constructs within the code. With our model, the programmer only needs to separately define how to add two trees. We believe our design choice leads to a cleaner separation of functionality, resulting in a flexible and robust system architecture.
The SRM computation model employs the divide-and-conquer strategy for parallelization. We have proposed a novel shared-nothing strategy for merging associative containers, which obviates the need for any explicit or underlying synchronization, concurrent access or transactional processing. By avoiding locking overheads, PDQ containers significantly outperform locking-based data structures, such as ConcurrentMaps. The same strategy, when applied to disk-bound workloads, is a cache-free design and ensures that data is always read from and written to disk sequentially. By avoiding cache-miss penalties and random disk seek overheads, PDQ containers significantly outperform traditional disk-backed data structures, such as BerkeleyDB.
The PDQ collection classes are versatile in managing data over a wide range of scale. Initially, the data is stored in memory, but as the size of the container grows, the data is transparently and automatically spilled over to the disk, where it is stored in a format most efficient for further processing. The disk-backed form of the container objects has the capability to store data on multiple disks, if present, to improve the disk I/O speeds. They have a distributed implementation as well, where the data can be flexibly stored at a common location (e.g. SAN, NFS) or in distributed storage.
The rest of the paper is organized as follows: in Section 2 we provide a formal description of the programming model and in Section 3 we discuss our implementation. We conduct a comparative design and performance analysis in Section 4 against concurrent data structures (for memory-bound computation), BerkeleyDB's StoredMap (for disk-backed computation) and Hazelcast (for distributed computation) to highlight the salient characteristics of our programming model that explain how PDQ significantly outperforms these traditional data structures. We illustrate several applications of PDQ in different fields of Computer Science in Section 5 and conclude in Section 6.
2. DESIGN
In this section, we provide a formal description of the programming model, including the data model, which formalizes the problems that can be parallelized with our model; the Split-Replicate-Merge computation model, which describes the strategy for parallelism; and the programming APIs. We also show an equivalence with the Map-Reduce programming model.
2.1 Data Model: Mergeable Maps
This section formalizes the data-centric problems that can be parallelized using our model. We begin by defining an associative container as a mapping function from key space K to value space V:
$$A : K \rightarrow V$$
Different variants of associative containers can be interpreted by adapting the semantics of the mapping function. For example, the container is a multimap if the value space V is a set of collections of values, and a set if V is the Boolean space. Similarly, the container is an array if the key space K is a range of natural numbers, and a matrix if K is a set of tuples. Other complex containers, such as graphs, can be interpreted similarly.
Next we define a computation as a function with input in the domain I (which may be an associative container) that produces as output an associative container A:
$$C_{seq} : I \rightarrow A$$
We have explicitly formulated the computation to be a sequential function. For the sake of simplicity, we have also assumed only one input and one output for this computation. The extension to the general case is trivial, and will be discussed later.
Our objective is to develop a data-parallel programming paradigm for this computation that simultaneously satisfies the following requirements:
1. the computation can be parallelized and scaled across multiple cores and processors, in shared-memory concurrent as well as distributed processing contexts.
2. the parallel computation can handle input and output data that scales from memory-bound to disk-backed to distributed storage.
3. both forms of scaling (processing and data) can be achieved by using the unmodified sequential form of the computation.
We claim that the above requirements can be satisfied for the class of computation problems with inherent parallelism that can be formulated as follows:
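If $I_1 \subseteq I$ and $I_2 \subseteq I$, such that $I_1 \cap I_2 = \emptyset$, and $C(I_1) \rightarrow A_1$, $C(I_2) \rightarrow A_2$, then the following must be equivalent:

$$C(I_1 \cup I_2) \equiv A_1 \uplus_{merge} A_2$$

where the merge operator $\uplus_{merge}$ is defined as:

$$(A_1 \uplus_{merge} A_2)(k) = \begin{cases} A_1(k) & \text{if } k \in A_1,\ k \notin A_2 \\ A_2(k) & \text{if } k \notin A_1,\ k \in A_2 \\ A_1(k) \oplus A_2(k) & \text{otherwise} \end{cases}$$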
Here, $\oplus$ is some user-defined add operator, applied to the two values when a key is present in both containers. It is required, however, that this add operator be both commutative and associative.
Intuitively, the above formulation specifies that the inherent parallelism in the computation is such that if partial results were obtained by processing partial inputs, then the result of the combined inputs can be produced by merging the partial outputs.
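The merge of two partial outputs can be sketched in a few lines of Python. This is only an illustration under our own assumptions: plain dictionaries stand in for associative containers, and the names `merge` and `add` are hypothetical and not part of the PDQ library.

```python
from typing import Callable, Dict, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def merge(a1: Dict[K, V], a2: Dict[K, V], add: Callable[[V, V], V]) -> Dict[K, V]:
    """Merge two partial outputs. `add` is the user-defined add operator,
    assumed to be commutative and associative."""
    merged = dict(a1)  # keys present only in a1 keep a1's value
    for key, value in a2.items():
        if key in merged:
            merged[key] = add(merged[key], value)  # key in both: combine values
        else:
            merged[key] = value  # key present only in a2
    return merged


a1 = {"x": 1, "y": 2}
a2 = {"y": 10, "z": 20}
result = merge(a1, a2, lambda u, v: u + v)
print(result)  # {'x': 1, 'y': 12, 'z': 20}
```

Because the add operator is commutative, swapping the two arguments of `merge` yields the same mapping, and neither input container is modified.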
We shall later prove that this formulation of parallelism is equivalent to the map-reduce programming model, that is, any problem that can be solved by map-reduce can be formulated in this manner. Let us now review some examples to illustrate this model of parallelism in action.
Word Count: Given a document, find the number of times each word appears in the document. The input is a list of words, and the output is a simple map of strings to integers. This problem can be parallelized since we can partition the input list into sublists, execute the word count program separately on each sublist, and merge the generated output maps by defining the add operator $\oplus$ as integer addition.
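As a rough sketch of the word count example, the following Python code partitions a word list, runs the same sequential counting function on each sublist, and merges the partial maps with integer addition. The function names (`word_count`, `split`, `merge_counts`) and the sequential execution of the partitions are our own simplifications for illustration; they are not the PDQ API.

```python
from functools import reduce


def word_count(words):
    """Unmodified sequential computation: list of words -> map of word to count."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def split(items, m):
    """Partition a list into at most m contiguous sublists."""
    size = max(1, -(-len(items) // m))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_counts(a1, a2):
    """Merge two partial outputs, using integer addition as the add operator."""
    merged = dict(a1)
    for word, count in a2.items():
        merged[word] = merged.get(word, 0) + count
    return merged


words = "the quick brown fox jumps over the lazy dog the end".split()
partials = [word_count(part) for part in split(words, 3)]
total = reduce(merge_counts, partials, {})

# Merging the partial outputs gives the same map as one sequential pass.
assert total == word_count(words)
```

In a real deployment each call to `word_count` would run on its own thread or machine; the sketch runs them one after another only to keep the example short.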
Inverted Index: Given a map of keys to lists of words (the input), generate a map of words to the list of keys to which they belong (the output). Although we have not yet discussed how the maps are partitioned, for now we imagine that the key space is partitioned, sub-maps are created with the key-value mappings of each partition, and the program to invert indices is executed separately on each sub-map. If we define the add operator $\oplus$ as list concatenation, the merged output will indeed be the inverted index of the original map.
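The inverted index example can be sketched in the same way. The code below is a minimal illustration with hypothetical names (`invert`, `merge_index`) and a hand-made partition of the key space. Note that list concatenation is commutative only if the key lists are treated as unordered, so the final check compares sorted lists.

```python
def invert(doc_words):
    """Sequential computation: map of key -> list of words
    becomes map of word -> list of keys."""
    index = {}
    for key, words in doc_words.items():
        for word in words:
            index.setdefault(word, []).append(key)
    return index


def merge_index(a1, a2):
    """Merge two partial indices, using list concatenation as the add operator."""
    merged = {word: list(keys) for word, keys in a1.items()}
    for word, keys in a2.items():
        merged[word] = merged.get(word, []) + list(keys)
    return merged


docs = {
    "d1": ["apple", "banana"],
    "d2": ["banana", "cherry"],
    "d3": ["apple", "cherry"],
}

# Partition the key space into two sub-maps (chosen by hand for illustration).
part1 = {k: docs[k] for k in ("d1", "d2")}
part2 = {k: docs[k] for k in ("d3",)}

merged = merge_index(invert(part1), invert(part2))
full = invert(docs)

# Equal up to the order of keys inside each list.
assert {w: sorted(ks) for w, ks in merged.items()} == {w: sorted(ks) for w, ks in full.items()}
```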
Finding Average: Find the average of a list of numbers. Since the average of averages is not, in general, the average, we need to define a custom data type with two fields, sum and count. We also define the add operator $\oplus$ for this data type, which adds the two fields component-wise. The output of the computation is a singleton container, which contains only one element. The computation, then, takes as input a sublist of numbers and computes the result into this data type. These partial results are added together to produce the final result.
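A possible shape for the custom data type is sketched below. The class name `SumCount`, the dictionary key `"avg"` used for the singleton container, and the helper functions are all assumptions made for illustration. The example also shows numerically why averaging the partial averages would give a different answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SumCount:
    """Custom value type holding a running sum and a count."""
    total: float = 0.0
    count: int = 0

    def __add__(self, other):
        # The add operator: add the two fields component-wise.
        return SumCount(self.total + other.total, self.count + other.count)

    def average(self):
        # Assumes count > 0.
        return self.total / self.count


def summarize(numbers):
    """Sequential computation over a (sub)list -> singleton container."""
    return {"avg": SumCount(sum(numbers), len(numbers))}


def merge(a1, a2):
    """Merge two containers, applying + to values whose key is in both."""
    merged = dict(a1)
    for key, value in a2.items():
        merged[key] = merged[key] + value if key in merged else value
    return merged


part1, part2 = [1, 2], [3, 4, 10]
merged = merge(summarize(part1), summarize(part2))
true_average = merged["avg"].average()  # (1 + 2 + 3 + 4 + 10) / 5

# Averaging the two partial averages weights the sublists equally, which is wrong.
naive = (sum(part1) / len(part1) + sum(part2) / len(part2)) / 2
assert true_average == 4.0
assert naive != true_average
```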
2.2 Computation Model: Split, Replicate, Merge
In this section we describe our parallel computation model, called the Split-Replicate-Merge model, which exploits parallelism for the class of problems identified in the previous section. As before, for the sake of simplicity we assume a computation with a single input and a single output. The model executes as follows: the input is first split into M partitions. The number of splits is at least equal to the number of processing units available, although it can be greater if it is desired that each partition be of some manageable size (for example, if each partition has to be processed entirely within memory). Next, the output data structure is replicated M times. Each replica is an initially empty data structure of the same type as the output...