
Page 1

Data-Intensive Computing with MapReduce

Jimmy Lin University of Maryland

Thursday, April 11, 2013

Session 11: Beyond MapReduce

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Page 2

Today’s Agenda

• Making Hadoop more efficient

• Tweaking the MapReduce programming model

• Beyond MapReduce

Page 3

Source: Wikipedia (Tortoise)

Hadoop is slow...

Page 4

A Major Step Backwards?

• MapReduce is a step backward in database access:
  - Schemas are good
  - Separation of the schema from the application is good
  - High-level access languages are good

• MapReduce is a poor implementation
  - Brute force and only brute force (no indexes, for example)

• MapReduce is not novel

• MapReduce is missing features
  - Bulk loader, indexing, updates, transactions…

• MapReduce is incompatible with DBMS tools

Source: Blog post by DeWitt and Stonebraker

Page 5

Hadoop vs. Databases: Grep

[Figure 4: Grep Task Results – 535MB/node Data Set (1–100 nodes; y-axis in seconds; Vertica, Hadoop)]

[Figure 5: Grep Task Results – 1TB/cluster Data Set (25–100 nodes; y-axis in seconds; Vertica, Hadoop)]

two figures. In Figure 4, the two parallel databases perform about the same, more than a factor of two faster than Hadoop. But in Figure 5, both DBMS-X and Hadoop perform more than a factor of two slower than Vertica. The reason is that the amount of data processing varies substantially between the two experiments. For the results in Figure 4, very little data is being processed (535MB/node). This causes Hadoop’s non-insignificant start-up costs to become the limiting factor in its performance. As will be described in Section 5.1.2, for short-running queries (i.e., queries that take less than a minute), Hadoop’s start-up costs can dominate the execution time. In our observations, we found that it takes 10–25 seconds before all Map tasks have been started and are running at full speed across the nodes in the cluster. Furthermore, as the total number of allocated Map tasks increases, there is additional overhead required for the central job tracker to coordinate node activities. Hence, this fixed overhead increases slightly as more nodes are added to the cluster; for longer data processing tasks, as shown in Figure 5, this fixed cost is dwarfed by the time to complete the required processing.

The upper segments of each Hadoop bar in the graphs represent the execution time of the additional MR job to combine the output into a single file. Since we ran this as a separate MapReduce job, these segments consume a larger percentage of overall time in Figure 4, as the fixed start-up overhead cost again dominates the work needed to perform the rest of the task. Even though the Grep task is selective, the results in Figure 5 show how this combine phase can still take hundreds of seconds due to the need to open and combine many small output files. Each Map instance produces its output in a separate HDFS file, and thus even though each file is small, there are many Map tasks and therefore many files on each node.

For the 1TB/cluster data set experiments, Figure 5 shows that all systems executed the task on twice as many nodes in nearly half the amount of time, as one would expect since the total amount of data was held constant across nodes for this experiment. Hadoop and DBMS-X perform approximately the same, since Hadoop’s start-up cost is amortized across the increased amount of data processing for this experiment. However, the results clearly show that Vertica outperforms both DBMS-X and Hadoop. We attribute this to Vertica’s aggressive use of data compression (see Section 5.1.3), which becomes more effective as more data is stored per node.

4.3 Analytical Tasks

To explore more complex uses of both types of systems, we developed four tasks related to HTML document processing. We first generate a collection of random HTML documents, similar to that which a web crawler might find. Each node is assigned a set of 600,000 unique HTML documents, each with a unique URL. In each document, we randomly generate links to other pages set using a Zipfian distribution.

We also generated two additional data sets meant to model log files of HTTP server traffic. These data sets consist of values derived from the HTML documents as well as several randomly generated attributes. The schema of these three tables is as follows:

CREATE TABLE Documents (
  url VARCHAR(100) PRIMARY KEY,
  contents TEXT );

CREATE TABLE Rankings (
  pageURL VARCHAR(100) PRIMARY KEY,
  pageRank INT,
  avgDuration INT );

CREATE TABLE UserVisits (
  sourceIP VARCHAR(16),
  destURL VARCHAR(100),
  visitDate DATE,
  adRevenue FLOAT,
  userAgent VARCHAR(64),
  countryCode VARCHAR(3),
  languageCode VARCHAR(6),
  searchWord VARCHAR(32),
  duration INT );

Our data generator created unique files with 155 million UserVisits records (20GB/node) and 18 million Rankings records (1GB/node) on each node. The visitDate, adRevenue, and sourceIP fields are picked uniformly at random from specific ranges. All other fields are picked uniformly from sampling real-world data sets. Each data file is stored on each node as a column-delimited text file.

4.3.1 Data Loading

We now describe the procedures for loading the UserVisits and Rankings data sets. For reasons to be discussed in Section 4.3.5, only Hadoop needs to directly load the Documents files into its internal storage system. DBMS-X and Vertica both execute a UDF that processes the Documents on each node at runtime and loads the data into a temporary table. We account for the overhead of this approach in the benchmark times, rather than in the load times. Therefore, we do not provide results for loading this data set.

Hadoop: Unlike the Grep task’s data set, which was uploaded directly into HDFS unaltered, the UserVisits and Rankings data sets needed to be modified so that the first and second columns are separated by a tab delimiter and all other fields in each line are separated by a unique field delimiter. Because there are no schemas in the MR model, in order to access the different attributes at run time, the Map and Reduce functions in each task must manually split the value by the delimiter character into an array of strings.

We wrote a custom data loader executed in parallel on each node to read in each line of the data sets, prepare the data as needed, and then write the tuple into a plain text file in HDFS. Loading the data sets in this manner was roughly three times slower than using the command-line utility, but did not require us to write custom

SELECT * FROM Data WHERE field LIKE '%XYZ%';

Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

Page 6

Hadoop vs. Databases: Select

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;

[Figure 6: Selection Task Results (1–100 nodes; y-axis in seconds; Vertica, Hadoop; smallest bars annotated 0.3, 0.8, 1.8, 4.7, 12.4)]

input handlers in Hadoop; the MR programs are able to use Hadoop’s KeyValueTextInputFormat interface on the data files to automatically split lines of text files into key/value pairs by the tab delimiter. Again, we found that other data format options, such as SequenceFileInputFormat or custom Writable tuples, resulted in both slower load and execution times.

DBMS-X: We used the same loading procedures for DBMS-X as discussed in Section 4.2. The Rankings table was hash partitioned across the cluster on pageURL and the data on each node was sorted by pageRank. Likewise, the UserVisits table was hash partitioned on destinationURL and sorted by visitDate on each node.

Vertica: Similar to DBMS-X, Vertica used the same bulk load commands discussed in Section 4.2 and sorted the UserVisits and Rankings tables by the visitDate and pageRank columns, respectively.

Results & Discussion: Since the results of loading the UserVisits and Rankings data sets are similar, we only provide the results for loading the larger UserVisits data in Figure 3. Just as with loading the Grep 535MB/node data set (Figure 1), the loading times for each system increase in proportion to the number of nodes used.

4.3.2 Selection Task

The Selection task is a lightweight filter to find the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold. For our experiments, we set this threshold parameter to 10, which yields approximately 36,000 records per data file on each node.

SQL Commands: The DBMSs execute the selection task using the following simple SQL statement:

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;

MapReduce Program: The MR program uses only a single Map function that splits the input value based on the field delimiter and outputs the record’s pageURL and pageRank as a new key/value pair if its pageRank is above the threshold. This task does not require a Reduce function, since each pageURL in the Rankings data set is unique across all nodes.
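A minimal sketch of what such a Map-only job could look like in Hadoop's Java API (the field delimiter and field positions below are illustrative assumptions; the paper's actual code is not shown):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (pageURL, pageRank) for Rankings records whose pageRank exceeds the threshold.
public class SelectionMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int THRESHOLD = 10;   // the paper's parameter X

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // No schema: manually split the delimited line into fields
    // (assumed order: pageURL, pageRank, avgDuration; assumed '|' delimiter).
    String[] fields = line.toString().split("\\|");
    int pageRank = Integer.parseInt(fields[1]);
    if (pageRank > THRESHOLD) {
      context.write(new Text(fields[0]), new IntWritable(pageRank));
    }
  }
}

Setting the number of reduce tasks to zero in the job driver makes this a pure map-only job, matching the description above.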

Results & Discussion: As was discussed in the Grep task, the results from this experiment, shown in Figure 6, demonstrate again that the parallel DBMSs outperform Hadoop by a rather significant factor across all cluster scaling levels. Although the relative performance of all systems degrades as both the number of nodes and the total amount of data increase, Hadoop is most affected. For example, there is almost a 50% difference in the execution time between the 1 node and 10 node experiments. This is again due to Hadoop’s increased start-up costs as more nodes are added to the cluster, which take up a proportionately larger fraction of total query time for short-running queries.

Another important reason why the parallel DBMSs are able to outperform Hadoop is that both Vertica and DBMS-X use an index on the pageRank column and store the Rankings table already sorted by pageRank. Thus, executing this query is trivial. It should also be noted that although Vertica’s absolute times remain low, its relative performance degrades as the number of nodes increases. This is in spite of the fact that each node still executes the query in the same amount of time (about 170ms). But because the nodes finish executing the query so quickly, the system becomes flooded with control messages from too many nodes, which then takes a longer time for the system to process. Vertica uses a reliable message layer for query dissemination and commit protocol processing [4], which we believe has considerable overhead when more than a few dozen nodes are involved in the query.

4.3.3 Aggregation Task

Our next task requires each system to calculate the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column. We also ran a variant of this query where we grouped by the seven-character prefix of the sourceIP column to measure the effect of reducing the total number of groups on query performance. We designed this task to measure the performance of parallel analytics on a single read-only table, where nodes need to exchange intermediate data with one another in order to compute the final value. Regardless of the number of nodes in the cluster, this task always produces 2.5 million records (53 MB); the variant query produces 2,000 records (24KB).

SQL Commands: The SQL command to calculate the total adRevenue is straightforward:

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

The variant query is:

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);

MapReduce Program: Unlike the previous tasks, the MR program for this task consists of both a Map and Reduce function. The Map function first splits the input value by the field delimiter, and then outputs the sourceIP field (given as the input key) and the adRevenue field as a new key/value pair. For the variant query, only the first seven characters (representing the first two octets, each stored as three digits) of the sourceIP are used. These two Map functions share the same Reduce function that simply adds together all of the adRevenue values for each sourceIP and then outputs the prefix and revenue total. We also used MR’s Combine feature to perform the pre-aggregation before data is transmitted to the Reduce instances, improving the first query’s execution time by a factor of two [8].
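A sketch of that Map/Reduce pair for the non-variant query (the field layout is again an assumption; the same reducer class can be registered as the combiner to get the pre-aggregation mentioned above):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AdRevenueAggregation {

  // Emits (sourceIP, adRevenue) for each UserVisits record.
  public static class AggMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed field order: sourceIP, destURL, visitDate, adRevenue, ...
      String[] fields = line.toString().split("\\|");
      context.write(new Text(fields[0]),
                    new DoubleWritable(Double.parseDouble(fields[3])));
    }
  }

  // Sums adRevenue per sourceIP; also usable as the combiner.
  public static class AggReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text sourceIP, Iterable<DoubleWritable> revenues,
                          Context context) throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable r : revenues) {
        total += r.get();
      }
      context.write(sourceIP, new DoubleWritable(total));
    }
  }
}

In the job driver, registering AggReducer as the combiner (job.setCombinerClass) corresponds to the Combine feature credited above with the factor-of-two improvement.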

Results & Discussion: The results of the aggregation task experiment in Figures 7 and 8 show once again that the two DBMSs outperform Hadoop. The DBMSs execute these queries by having each node scan its local table, extract the sourceIP and adRevenue fields, and perform a local group by. These local groups are then merged at

Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

Page 7

[Figure 7: Aggregation Task Results (2.5 million Groups) – 1–100 nodes; y-axis in seconds; Vertica, Hadoop]

[Figure 8: Aggregation Task Results (2,000 Groups) – 1–100 nodes; y-axis in seconds; Vertica, Hadoop]

the query coordinator, which outputs results to the user. The results in Figure 7 illustrate that the two DBMSs perform about the same for a large number of groups, as their runtime is dominated by the cost to transmit the large number of local groups and merge them at the coordinator. For the experiments using fewer nodes, Vertica performs somewhat better, since it has to read less data (since it can directly access the sourceIP and adRevenue columns), but it becomes slightly slower as more nodes are used.

Based on the results in Figure 8, it is more advantageous to use a column-store system when processing fewer groups for this task. This is because the two columns accessed (sourceIP and adRevenue) consist of only 20 bytes out of the more than 200 bytes per UserVisits tuple, and therefore there are relatively few groups that need to be merged, so communication costs are much lower than in the non-variant plan. Vertica is thus able to outperform the other two systems by not reading unused parts of the UserVisits tuples.

Note that the execution times for all systems are roughly consistent for any number of nodes (modulo Vertica’s slight slowdown as the number of nodes increases). Since this benchmark task requires the system to scan through the entire data set, the run time is always bounded by the constant sequential scan performance and network repartitioning costs for each node.

4.3.4 Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets. In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range. Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval. We use the week of January 15-22, 2000 in our experiments, which matches approximately 134,000 records in the UserVisits table.

The salient aspect of this task is that it must consume two different data sets and join them together in order to find pairs of Rankings and UserVisits records with matching values for pageURL and destURL. This task stresses each system using fairly complex operations over a large amount of data. The performance results are also a good indication of how well the DBMS’s query optimizer produces efficient join plans.

SQL Commands: In contrast to the complexity of the MR program described below, the DBMSs need only two fairly simple queries to complete the task. The first statement creates a temporary table and uses it to store the output of the SELECT statement that performs the join of UserVisits and Rankings and computes the aggregates. Once this table is populated, it is then trivial to use a second query to output the record with the largest totalRevenue field.

SELECT INTO Temp sourceIP,
       AVG(pageRank) AS avgPageRank,
       SUM(adRevenue) AS totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank
FROM Temp
ORDER BY totalRevenue DESC LIMIT 1;

MapReduce Program: Because the MR model does not have an inherent ability to join two or more disparate data sets, the MR program that implements the join task must be broken out into three separate phases. Each of these phases is implemented as a single MR program in Hadoop, but does not begin executing until the previous phase is complete.

Phase 1 – The first phase filters UserVisits records that are outside the desired date range and then joins the qualifying records with records from the Rankings file. The MR program is initially given all of the UserVisits and Rankings data files as input.

Map Function: For each key/value input pair, we determine its record type by counting the number of fields produced when splitting the value on the delimiter. If it is a UserVisits record, we apply the filter based on the date range predicate. These qualifying records are emitted with composite keys of the form (destURL, K1), where K1 indicates that it is a UserVisits record. All Rankings records are emitted with composite keys of the form (pageURL, K2), where K2 indicates that it is a Rankings record. These output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.

Reduce Function: The input to the Reduce function is a single sorted run of records in URL order. For each URL, we divide its values into two sets based on the tag component of the composite key. The function then forms the cross product of the two sets to complete the join and outputs a new key/value pair with the sourceIP as the key and the tuple (pageURL, pageRank, adRevenue) as the value.
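A compressed sketch of Phase 1 in Hadoop's Java API. For brevity it carries the record-type tag inside the value rather than in a composite key (the description above instead uses a composite (URL, tag) key with a custom partitioner), and field positions and the '|' delimiter are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinPhase1 {

  public static class TagMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\\|");
      if (f.length == 9) {                      // UserVisits record (9 fields assumed)
        String visitDate = f[2];
        if (visitDate.compareTo("2000-01-15") >= 0
            && visitDate.compareTo("2000-01-22") <= 0) {
          // destURL is the join key; tag the value as a visit.
          context.write(new Text(f[1]), new Text("V|" + f[0] + "|" + f[3]));
        }
      } else {                                  // Rankings record (pageURL, pageRank, ...)
        context.write(new Text(f[0]), new Text("R|" + f[1]));
      }
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> tagged, Context context)
        throws IOException, InterruptedException {
      String pageRank = null;
      List<String[]> visits = new ArrayList<String[]>();
      for (Text t : tagged) {
        String[] parts = t.toString().split("\\|");
        if (parts[0].equals("R")) {
          pageRank = parts[1];
        } else {
          visits.add(new String[] {parts[1], parts[2]});   // sourceIP, adRevenue
        }
      }
      if (pageRank == null) {
        return;                                 // URL never ranked: no join output
      }
      for (String[] v : visits) {
        // Output keyed by sourceIP: (pageURL, pageRank, adRevenue)
        context.write(new Text(v[0]),
                      new Text(url.toString() + "|" + pageRank + "|" + v[1]));
      }
    }
  }
}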

Phase 2 – The next phase computes the total adRevenue and average pageRank based on the sourceIP of records generated in Phase 1. This phase uses a Reduce function in order to gather all of the

Hadoop vs. Databases: Aggregation

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

Page 8

Hadoop vs. Databases: Join

[Figure 9: Join Task Results (1–100 nodes; y-axis in seconds; Vertica, DBMS-X, Hadoop; smallest bars annotated 21.5, 28.2, 31.3, 36.1, 85.0 and 15.7, 28.0, 29.2, 29.4, 31.9)]

[Figure 10: UDF Aggregation Task Results (1–100 nodes; y-axis in seconds; Vertica, Hadoop)]

records for a particular sourceIP on a single node. We use the identity Map function in the Hadoop API to supply records directly to the split process [1, 8].

Reduce Function: For each sourceIP, the function adds up the adRevenue and computes the average pageRank, retaining the one with the maximum total ad revenue. Each Reduce instance outputs a single record with sourceIP as the key and the value as a tuple of the form (avgPageRank, totalRevenue).

Phase 3 – In the final phase, we again only need to define a single Reduce function that uses the output from the previous phase to produce the record with the largest total adRevenue. We only execute one instance of the Reduce function on a single node to scan all the records from Phase 2 and find the target record.

Reduce Function: The function processes each key/value pair and keeps track of the record with the largest totalRevenue field. Because the Hadoop API does not easily expose the total number of records that a Reduce instance will process, there is no way for the Reduce function to know that it is processing the last record. Therefore, we override the closing callback method in our Reduce implementation so that the MR program outputs the largest record right before it exits.
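In Hadoop's newer Java API the "closing callback" is the reducer's cleanup() method; here is a sketch of such a Phase 3 reducer (the value layout is assumed from the Phase 2 description above):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Tracks the sourceIP with the largest totalRevenue and emits it at the end.
public class MaxRevenueReducer extends Reducer<Text, Text, Text, Text> {

  private String bestSourceIP = null;
  private String bestValue = null;
  private double bestRevenue = Double.NEGATIVE_INFINITY;

  @Override
  protected void reduce(Text sourceIP, Iterable<Text> values, Context context) {
    for (Text value : values) {
      // Assumed value layout from Phase 2: avgPageRank|totalRevenue
      String[] parts = value.toString().split("\\|");
      double totalRevenue = Double.parseDouble(parts[1]);
      if (totalRevenue > bestRevenue) {
        bestRevenue = totalRevenue;
        bestSourceIP = sourceIP.toString();
        bestValue = value.toString();
      }
    }
  }

  // Called once after the last reduce() call: the "closing callback"
  // where the single largest record is finally written out.
  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (bestSourceIP != null) {
      context.write(new Text(bestSourceIP), new Text(bestValue));
    }
  }
}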

Results & Discussion: The performance results for this task are displayed in Figure 9. We had to slightly change the SQL used in the 100-node experiments for Vertica due to an optimizer bug in the system, which is why there is an increase in the execution time for Vertica going from 50 to 100 nodes. But even with this increase, it is clear that this task results in the biggest performance difference between Hadoop and the parallel database systems. The reason for this disparity is two-fold.

First, despite the increased complexity of the query, the performance of Hadoop is yet again limited by the speed with which the large UserVisits table (20GB/node) can be read off disk. The MR program has to perform a complete table scan, while the parallel database systems were able to take advantage of clustered indexes on UserVisits.visitDate to significantly reduce the amount of data that needed to be read. When breaking down the costs of the different parts of the Hadoop query, we found that regardless of the number of nodes in the cluster, phase 2 and phase 3 took on average 24.3 seconds and 12.7 seconds, respectively. In contrast, phase 1, which contains the Map task that reads in the UserVisits and Rankings tables, takes an average of 1434.7 seconds to complete. Interestingly, it takes approximately 600 seconds of raw I/O to read the UserVisits and Rankings tables off of disk and then another 300 seconds to split, parse, and deserialize the various attributes. Thus, the CPU overhead needed to parse these tables on the fly is the limiting factor for Hadoop.

Second, the parallel DBMSs are able to take advantage of the fact that both the UserVisits and the Rankings tables are partitioned by the join key. This means that both systems are able to do the join locally on each node, without any network overhead of repartitioning before the join. Thus, they simply have to do a local hash join between the Rankings table and a selective part of the UserVisits table on each node, with a trivial ORDER BY clause across nodes.

4.3.5 UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset, a task that is often used as a component of PageRank calculations. Specifically, for this task, the systems must read each document file and search for all the URLs that appear in the contents. The systems must then, for each unique URL, count the number of unique pages that reference that particular URL across the entire set of files. It is this type of task that MR is believed to be commonly used for.

We make two adjustments for this task in order to make processing easier in Hadoop. First, we allow the aggregate to include self-references, as it is non-trivial for a Map function to discover the name of the input file it is processing. Second, on each node we concatenate the HTML documents into larger files when storing them in HDFS. We found this improved Hadoop’s performance by a factor of two and helped avoid memory issues with the central HDFS master when a large number of files are stored in the system.

SQL Commands: To perform this task in a parallel DBMS requires a user-defined function F that parses the contents of each record in the Documents table and emits URLs into the database. This function can be written in a general-purpose language and is effectively identical to the Map program discussed below. With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count:

SELECT INTO Temp F(contents) FROM Documents;
SELECT url, SUM(value) FROM Temp GROUP BY url;

Despite the simplicity of this proposed UDF, we found that in practice it was difficult to implement in the DBMSs.

For DBMS-X, we translated the MR program used in Hadoop into an equivalent C program that uses the POSIX regular expression library to search for links in the document. For each URL found in the document contents, the UDF returns a new tuple (URL,

Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

Page 9

Source: Wikipedia (Tortoise)

Why?

Page 10

Schemas are a good idea!

• Parsing fields out of flat text files is slow

• Schemas define a contract, decoupling logical from physical

Page 11

Thrift

• Originally developed by Facebook, now an Apache project

• Provides a DDL with numerous language bindings
  - Compact binary encoding of typed structs
  - Fields can be marked as optional or required
  - Compiler automatically generates code for manipulating messages

• Provides RPC mechanisms for service definitions

• Alternatives include protobufs and Avro

Page 12

Thrift

struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
}

struct Location {
  1: required double latitude;
  2: required double longitude;
}
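A minimal sketch of how the generated code might be used from Java (assuming the Tweet and Location classes have been generated by the Thrift compiler from the IDL above, and using Thrift's standard TSerializer/TDeserializer):

import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class TweetExample {
  public static void main(String[] args) throws TException {
    // Populate a struct using the setters generated by the Thrift compiler.
    Tweet tweet = new Tweet();
    tweet.setUserId(42);
    tweet.setUserName("jimmy");
    tweet.setText("hello world");          // loc is optional and left unset

    // Compact binary encoding: these bytes could be stored, for example,
    // as the value of a SequenceFile record in HDFS.
    TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
    byte[] bytes = serializer.serialize(tweet);

    // Reading it back: no hand-written parsing of delimited text.
    TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
    Tweet copy = new Tweet();
    deserializer.deserialize(copy, bytes);
  }
}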

Page 13

Row vs. Column Stores

[Diagram: the same four records R1–R4 laid out contiguously record-by-record in a row store vs. attribute-by-attribute in a column store]

Page 14

Row vs. Column Stores

• Row stores
  - Easy to modify a record
  - Might read unnecessary data when processing

• Column stores
  - Only read necessary data when processing
  - Tuple writes require multiple accesses

Page 15

OLTP/OLAP Architecture

[Diagram: OLTP system → ETL (Extract, Transform, and Load) → OLAP system]

Page 16

Advantages of Column Stores

• Read efficiency

• Better compression

• Vectorized processing

• Opportunities to operate directly on compressed data
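To make the last two points concrete, here is a small illustrative sketch (not tied to any particular column-store engine): a column stored as run-length-encoded (value, run length) pairs can be aggregated directly on its compressed representation, and the resulting tight loop over primitive arrays is also friendly to vectorized execution.

public class RleColumnSum {

  // Sum a run-length-encoded integer column directly on its compressed
  // representation: values[i] repeats lengths[i] times.
  static long sum(int[] values, int[] lengths) {
    long total = 0;
    for (int i = 0; i < values.length; i++) {
      // One multiply per run instead of one add per row.
      total += (long) values[i] * lengths[i];
    }
    return total;
  }

  public static void main(String[] args) {
    // Column [5, 5, 5, 7, 7, 9] encoded as three runs.
    int[] values  = {5, 7, 9};
    int[] lengths = {3, 2, 1};
    System.out.println(sum(values, lengths));  // prints 38
  }
}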

Page 17

Why not in Hadoop?

Source: He et al. (2011) RCFile: A Fast and Space-Efficient Data Placement Structure in MapReduce-based Warehouse Systems. ICDE.

No reason why not

limitation would not help our goal of fast query processing for a huge amount of disk scans on massively growing data sets.

3) Limited by the page-level data manipulation inside a traditional DBMS engine, PAX uses a fixed page as the basic unit of data record organization. With such a fixed size, PAX would not efficiently store data sets with a highly-diverse range of data resource types of different sizes in large data processing systems, such as the one in Facebook.

III. THE DESIGN AND IMPLEMENTATION OF RCFILE

In this section, we present RCFile (Record Columnar File), a data placement structure designed for MapReduce-based data warehouse systems, such as Hive. RCFile applies the concept of “first horizontally-partition, then vertically-partition” from PAX. It combines the advantages of both row-store and column-store. First, as row-store, RCFile guarantees that data in the same row are located in the same node, thus it has low cost of tuple reconstruction. Second, as column-store, RCFile can exploit a column-wise data compression and skip unnecessary column reads.

A. Data Layout and Compression

RCFile is designed and implemented on top of the Hadoop Distributed File System (HDFS). As demonstrated in the example shown in Figure 3, RCFile has the following data layout to store a table:

1) According to the HDFS structure, a table can have multiple HDFS blocks.

2) In each HDFS block, RCFile organizes records with the basic unit of a row group. That is to say, all the records stored in an HDFS block are partitioned into row groups. For a table, all row groups have the same size. Depending on the row group size and the HDFS block size, an HDFS block can have only one or multiple row groups.

[Fig. 3: An example to demonstrate the data layout of RCFile in an HDFS block.]

3) A row group contains three sections. The first section is a sync marker that is placed in the beginning of the row group. The sync marker is mainly used to separate two continuous row groups in an HDFS block. The second section is a metadata header for the row group. The metadata header stores the information items on how many records are in this row group, how many bytes are in each column, and how many bytes are in each field in a column. The third section is the table data section that is actually a column-store. In this section, all the fields in the same column are stored continuously together. For example, as shown in Figure 3, the section first stores all fields in column A, and then all fields in column B, and so on.

We now introduce how data is compressed in RCFile. In each row group, the metadata header section and the table data section are compressed independently as follows.

• First, for the whole metadata header section, RCFile uses the RLE (Run Length Encoding) algorithm to compress data. Since all the values of the field lengths in the same column are continuously stored in this section, the RLE algorithm can find long runs of repeated data values, especially for fixed field lengths.

• Second, the table data section is not compressed as a whole unit. Rather, each column is independently compressed with the Gzip compression algorithm. RCFile uses the heavy-weight Gzip algorithm in order to get better compression ratios than other light-weight algorithms. For example, the RLE algorithm is not used since the column data is not already sorted. In addition, due to the lazy decompression technology to be discussed next, RCFile does not need to decompress all the columns when processing a row group. Thus, the relatively high decompression overhead of the Gzip algorithm can be reduced.

Though currently RCFile uses the same algorithm for all columns in the table data section, it allows us to use different algorithms to compress different columns. One future work related to the RCFile project is to automatically select the best compression algorithm for each column according to its data type and data distribution.
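As a rough illustration of per-column compression (a sketch using plain java.util.zip, not RCFile's actual writer), each column's bytes within a row group can be compressed independently, which is what later lets a reader decompress only the columns a query actually touches:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class PerColumnGzip {

  // Gzip a single column's serialized bytes, independently of other columns.
  static byte[] compressColumn(byte[] columnBytes) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(bos);
    gz.write(columnBytes);
    gz.close();
    return bos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // Two toy "columns" from the same row group, compressed separately;
    // with lazy decompression a query touching only column A never pays
    // the cost of inflating column B.
    byte[] columnA = "cnn.com\nbbc.com\nflickr.com\n".getBytes(StandardCharsets.UTF_8);
    byte[] columnB = "0.9\n0.8\n0.7\n".getBytes(StandardCharsets.UTF_8);
    System.out.println(compressColumn(columnA).length);
    System.out.println(compressColumn(columnB).length);
  }
}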

B. Data Appending

RCFile does not allow arbitrary data writing operations. Only an appending interface is provided for data writing in RCFile because the underlying HDFS currently only supports data writes to the end of a file. The method of data appending in RCFile is summarized as follows.

1) RCFile creates and maintains an in-memory column holder for each column. When a record is appended, all its fields will be scattered, and each field will be appended into its corresponding column holder. In addition, RCFile will record corresponding metadata of each field in the metadata header.

2) RCFile provides two parameters to control how many records can be buffered in memory before they are

RCFile

Page 18

“Trojan Layouts”

Source: Jindal et al. (2011) Trojan Data Layouts: Right Shoes for a Running Elephant. SoCC.

for future distributed systems as emphasized in the conclusion section of the ten-year best paper award of Surajit Chaudhuri [10].

1.3 Contributions

Trojan Layouts are inspired by PAX in the sense that we only change the internal organization of a data block and not among data blocks. However, we considerably depart from PAX as we can: (i) co-locate attributes together according to query workloads, (ii) use different Trojan Layouts for different data block replicas, and (iii) in a special case, mimic fractured mirrors: having the best from both PAX and Row Layouts. In summary, we make the following key contributions:

1. We propose a column grouping algorithm in which we first (i) determine column groups using a novel interestingness measure, which denotes how well a set of attributes speeds up most or all queries in a workload; and then (ii) pack the column groups in order to maximize the total interestingness of data blocks. We use this algorithm as a basis to determine the Trojan Layout of data blocks in HDFS. It is worth noting that even if we focus on MapReduce in this paper, one can use our column grouping algorithm in other domains as well.

2. We exploit default HDFS data replication to create a different Trojan Layout per data block replica. For this, we first show how to apply our column grouping algorithm for query grouping as well, i.e. for clustering queries in a workload according to their access patterns. We then map each resulting query group to one data block replica so as to compute the Trojan Layout for such a replica.

3. We present Trojan HDFS, a (per-replica) Trojan Layout aware HDFS. At data upload time, Trojan HDFS automatically transforms data block replicas into their corresponding Trojan Layouts; it hides all messy details from the user. Thereafter, Trojan HDFS keeps track of Trojan Layouts for each data block replica. With Trojan HDFS, neither the MapReduce processing pipeline nor the MapReduce interface are changed at all.

4. We evaluate Trojan Layouts using three real-world workloads: TPC-H, Star Schema Benchmark (SSB), and Sloan Digital Sky Survey (SDSS). The results demonstrate that Trojan Layouts allow MapReduce jobs to read data up to a factor of 4.8 faster than Row Layout and up to a factor of 3.5 faster than PAX Layout.

2. OVERVIEW

We propose Trojan Layouts as our solution to decrease the waiting time of data-intensive jobs when accessing data from HDFS. The core idea of Trojan Layouts is to internally organize data blocks into column groups according to the workload. Our approach has three phases: (1) compute the Trojan Layout for each data block replica, (2) create the computed Trojan Layouts in HDFS, and (3) access the existing Trojan Layouts. From the user perspective, the data analysis workflow remains the same: upload the input data and run the query workload exactly as before.

Given a query workload W, at upload time we determine the Trojan Layout for each data block replica. We then store each data block replica in its respective Trojan Layout. We illustrate the core idea of Trojan Layouts in Figure 3. A data block using a Trojan Layout is composed of Header metadata and a set of column groups (see Replica 1 in data node 1). The header contains the number of attributes stored in a data block and attribute pointers.

[Figure 3: Per-replica Trojan Layouts in HDFS – three replicas of data block 42, each with a different column grouping of attributes A–H, stored on data nodes 1–3; each block holds a header (# attributes, pointers) followed by column groups, and the job tracker routes Q1 & Q3, Q2, and Q4 of query workload W to different replicas]

An attribute pointer points to the beginning of the column group that contains that attribute. For instance, in Replica 1, the first attribute pointer (the red arrow) points to Column Group 2, which contains attribute A. The second attribute pointer would then point to Column Group 1, and so on. Each column group in turn contains a set of attributes in row-fashion, e.g. Column Group 1 in Replica 1 has tuples containing attributes B and D.

At query time, we transparently adapt an incoming MapReduce job to query the data block replica that minimizes the data access time. Then, we route the map tasks of the MapReduce job to the data nodes storing such data block replicas. For example, in Figure 3, the map tasks of Q4 are routed to data node 1, while the map tasks of Q1 and Q3 are routed to data node 2 and those of Q2 are routed to data node 3. Notice that in case the scheduler cannot route a map task to the best data block replica, e.g. because the data node storing such a replica is busy, the scheduler transparently falls back to other Trojan Layouts. An important feature of our approach is that we keep the Hadoop MapReduce interface intact by providing the right itemize function in the Hadoop plan [14]. As in normal MapReduce, users only care about map and reduce functions.

The salient features of our approach are as follows:

• Invisibility. We create and access Trojan Layouts in a way that is invisible to users.

• Non-Invasive. We do not change the outside HDFS unit, i.e. the data block, but rather change the internal representation of a data block. As a result, we do not require any change in the MapReduce processing pipeline for accessing Trojan Layouts.

• Seamless Query Processing. We seamlessly wrap the input data to a given MapReduce job. Obviously, we do this for the data block replica that minimizes the time for reading required data.

• Rich DBMS features support. One can easily enrich Trojan Layouts with standard DBMS optimizations, such as indexing, partitioning, and co-partitioning.

• Per-Replica Trojan Layout. We exploit existing data block replication in HDFS to create a different Trojan Layout for each replica so as to better fit the workload.

We provide the details on how we compute Trojan Layouts in Section 3. Then, in Section 4, we describe how we enable HDFS to store each data block replica in a different Trojan Layout in Trojan HDFS. In the same Section 4, we discuss how we physically create Trojan Layouts. In addition, we discuss the query processing and scheduling aspects of our approach.


Page 19

Key Ideas

• Separate logical from physical

• Preserve HDFS block structure

• Hide physical storage layout behind InputFormats
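A minimal sketch of what that separation looks like in a Hadoop job driver (assuming Hadoop 2's org.apache.hadoop.mapreduce API): the map and reduce code only ever sees logical key/value records, and swapping the physical layout is a one-line change to the InputFormat binding.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LayoutAgnosticDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "layout-agnostic job");
    job.setJarByClass(LayoutAgnosticDriver.class);

    // Only this line binds the job to a physical layout: replace
    // TextInputFormat with an RCFile- or Thrift-backed InputFormat
    // (hypothetical choices) without touching any map/reduce code.
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // No mapper/reducer set: the default identity pass-through is enough
    // to show that the processing pipeline is unchanged by the layout.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}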

Page 20

Hadoop vs. Databases: Grep (Revisited)

[Figure 4: Grep Task Results – 535MB/node Data Set (1–100 nodes; y-axis in seconds; Vertica, Hadoop)]

[Figure 5: Grep Task Results – 1TB/cluster Data Set (25–100 nodes; y-axis in seconds; Vertica, Hadoop)]


SELECT * FROM Data WHERE field LIKE '%XYZ%';

Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

Page 21

Hadoop vs. Databases: Select (Revisited)

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;

[Figure 6: Selection Task Results (1–100 nodes; y-axis in seconds; Vertica, Hadoop; smallest bars annotated 0.3, 0.8, 1.8, 4.7, 12.4)]


Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

Page 22

Final consideration: Cost

Source: www.flickr.com/photos/gnusinn/3080378658/

Page 23

Indexes are a good thing!

Source: Wikipedia (Card Catalog)

Page 24

Why not in Hadoop?

• Non-invasive: requires no changes to Hadoop infrastructure

• Useful for speeding up selections and joins

• Index building itself can be performed using MapReduce (see the sketch below)

No reason why not

Source: Dittrich et al. (2010) Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). VLDB.
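As a rough sketch of that last point (a generic sorted-index build over delimited text, not the Trojan Index of Hadoop++ specifically): a map pass emits (attribute value, record location) pairs, and the framework's sort during the shuffle yields index entries in attribute order, which a reducer then simply writes out.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only logic for an index build over delimited Rankings records:
// emit (pageRank, byte offset of the record).  The shuffle's sort-by-key
// produces entries in pageRank order; the (identity) reducer writes them
// out, yielding a coarse sorted index that later jobs can probe instead
// of scanning the full data set.
public class IndexBuildMapper
    extends Mapper<LongWritable, Text, IntWritable, LongWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\|");   // assumed '|' delimiter
    int pageRank = Integer.parseInt(fields[1]);        // assumed field order
    // Index entry: attribute value -> where the record lives.
    context.write(new IntWritable(pageRank), offset);
  }
}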

Page 25

A Major Step Backwards?

• MapReduce is a step backward in database access:
  - Schemas are good
  - Separation of the schema from the application is good
  - High-level access languages are good

• MapReduce is a poor implementation
  - Brute force and only brute force (no indexes, for example)

• MapReduce is not novel

• MapReduce is missing features
  - Bulk loader, indexing, updates, transactions…

• MapReduce is incompatible with DBMS tools

✔ ✔

Source: Blog post by DeWitt and Stonebraker

?

?

Page 26

Pig!

Source: Wikipedia (Pig)

Page 27

Pig: Example

Visits:
  User   Url          Time
  Amy    cnn.com      8:00
  Amy    bbc.com      10:00
  Amy    flickr.com   10:05
  Fred   cnn.com      12:00

Url Info:
  Url          Category   PageRank
  cnn.com      News       0.9
  bbc.com      News       0.8
  flickr.com   Photos     0.7
  espn.com     Sports     0.9

Task: Find the top 10 most visited pages in each category

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Page 28

Pig Query Plan

[Dataflow: Load Visits → Group by url → Foreach url generate count, together with Load Url Info, feeding into Join on url → Group by category → Foreach category generate top10(urls)]

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Page 29

Pig Script

visits = load '/data/visits' as (user, url, time);

gVisits = group visits by url;

visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;

topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Page 30

Pig Script in Hadoop

[Diagram: the same dataflow – Load Visits → Group by url → Foreach url generate count, joined with Load Url Info on url, then Group by category → Foreach category generate top10(urls) – compiled into a chain of three MapReduce jobs (Map1/Reduce1, Map2/Reduce2, Map3/Reduce3)]

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Page 31: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pig: Basics

¢  Sequence of statements manipulating relations (aliases)

¢  Data model

l  atoms l  tuples

l  bags

l  maps

l  json

Page 32: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pig: Common Operations

¢  LOAD: load data

¢  FOREACH … GENERATE: per tuple processing

¢  FILTER: discard unwanted tuples

¢  GROUP/COGROUP: group tuples

¢  JOIN: relational join

Page 33: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pig: GROUPing

(1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3)

A = LOAD 'myfile.txt’ AS (f1: int, f2: int, f3: int);

X = GROUP A BY f1;

(1, {(1, 2, 3)}) (4, {(4, 2, 1), (4, 3, 3)}) (7, {(7, 2, 5)}) (8, {(8, 3, 4), (8, 4, 3)})

Page 34: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pig: COGROUPing

A: (1, 2, 3) (4, 2, 1) (8, 3, 4) (4, 3, 3) (7, 2, 5) (8, 4, 3)

B: (2, 4) (8, 9) (1, 3) (2, 7) (2, 9) (4, 6) (4, 9)

X = COGROUP A BY f1, B BY $0;

(1, {(1, 2, 3)}, {(1, 3)}) (2, {}, {(2, 4), (2, 7), (2, 9)}) (4, {(4, 2, 1), (4, 3, 3)}, {(4, 6),(4, 9)}) (7, {(7, 2, 5)}, {}) (8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})

Page 35: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pig UDFs

¢  User-defined functions: l  Java l  Python

l  JavaScript

l  Ruby

¢  UDFs make Pig arbitrarily extensible l  Express “core” computations in UDFs

l  Take advantage of Pig as glue code for scale-out plumbing

Page 36: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

previous_pagerank = LOAD ‘$docs_in’ USING PigStorage() AS (url: chararray, pagerank: float, links:{link: (url: chararray)}); outbound_pagerank = FOREACH previous_pagerank GENERATE pagerank / COUNT(links) AS pagerank, FLATTEN(links) AS to_url; new_pagerank = FOREACH ( COGROUP outbound_pagerank BY to_url, previous_pagerank BY url INNER ) GENERATE group AS url, (1 – $d) + $d * SUM(outbound_pagerank.pagerank) AS pagerank, FLATTEN(previous_pagerank.links) AS links; STORE new_pagerank INTO ‘$docs_out’ USING PigStorage();

PageRank in Pig

From: http://techblug.wordpress.com/2011/07/29/pagerank-implementation-in-pig/

Page 37: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

#!/usr/bin/python from org.apache.pig.scripting import * P = Pig.compile(""" Pig part goes here """) params = { ‘d’: ‘0.5’, ‘docs_in’: ‘data/pagerank_data_simple’ } for i in range(10): out = "out/pagerank_data_" + str(i + 1) params["docs_out"] = out Pig.fs("rmr " + out) stats = P.bind(params).runSingle() if not stats.isSuccessful(): raise ‘failed’ params["docs_in"] = out

Oh, the iterative part too…

From: http://techblug.wordpress.com/2011/07/29/pagerank-implementation-in-pig/

Page 38: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

vs.

Why write MapReduce code in Java ever?

Page 39: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Tweaking Hadoop

Source: Wikipedia (Chisel)

Page 40: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Hadoop + DBs = HadoopDB

¢  Why not have the best of both worlds? l  Parallel databases focused on performance l  Hadoop focused on scalability, flexibility, fault tolerance

¢  Key ideas: l  Co-locate a RDBMS on every slave node

l  To the extent possible, “push down” operations into the DB

Source: Abouzeid et al. (2009) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB.

Page 41: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HadoopDB Architecture

MapReduce best meets the fault tolerance and ability to operate inheterogeneous environment properties. It achieves fault toleranceby detecting and reassigning Map tasks of failed nodes to othernodes in the cluster (preferably nodes with replicas of the input Mapdata). It achieves the ability to operate in a heterogeneous environ-ment via redundant task execution. Tasks that are taking a long timeto complete on slow nodes get redundantly executed on other nodesthat have completed their assigned tasks. The time to complete thetask becomes equal to the time for the fastest node to complete theredundantly executed task. By breaking tasks into small, granulartasks, the effect of faults and “straggler” nodes can be minimized.

MapReduce has a flexible query interface; Map and Reduce func-tions are just arbitrary computations written in a general-purposelanguage. Therefore, it is possible for each task to do anything onits input, just as long as its output follows the conventions definedby the model. In general, most MapReduce-based systems (such asHadoop, which directly implements the systems-level details of theMapReduce paper) do not accept declarative SQL. However, thereare some exceptions (such as Hive).

As shown in previous work, the biggest issue with MapReduceis performance [23]. By not requiring the user to first model andload data before processing, many of the performance enhancingtools listed above that are used by database systems are not possible.Traditional business data analytical processing, that have standardreports and many repeated queries, is particularly, poorly suited forthe one-time query processing model of MapReduce.

Ideally, the fault tolerance and ability to operate in heterogeneousenvironment properties of MapReduce could be combined with theperformance of parallel databases systems. In the following sec-tions, we will describe our attempt to build such a hybrid system.

5. HADOOPDBIn this section, we describe the design of HadoopDB. The goal of

this design is to achieve all of the properties described in Section 3.The basic idea behind HadoopDB is to connect multiple single-

node database systems using Hadoop as the task coordinator andnetwork communication layer. Queries are parallelized acrossnodes using the MapReduce framework; however, as much ofthe single node query work as possible is pushed inside of thecorresponding node databases. HadoopDB achieves fault toleranceand the ability to operate in heterogeneous environments byinheriting the scheduling and job tracking implementation fromHadoop, yet it achieves the performance of parallel databases bydoing much of the query processing inside of the database engine.

5.1 Hadoop Implementation BackgroundAt the heart of HadoopDB is the Hadoop framework. Hadoop

consits of two layers: (i) a data storage layer or the Hadoop Dis-tributed File System (HDFS) and (ii) a data processing layer or theMapReduce Framework.

HDFS is a block-structured file system managed by a centralNameNode. Individual files are broken into blocks of a fixed sizeand distributed across multiple DataNodes in the cluster. TheNameNode maintains metadata about the size and location ofblocks and their replicas.

The MapReduce Framework follows a simple master-slave ar-chitecture. The master is a single JobTracker and the slaves orworker nodes are TaskTrackers. The JobTracker handles the run-time scheduling of MapReduce jobs and maintains information oneach TaskTracker’s load and available resources. Each job is bro-ken down into Map tasks based on the number of data blocks thatrequire processing, and Reduce tasks. The JobTracker assigns tasksto TaskTrackers based on locality and load balancing. It achieves

SMS Planner

SQL Query

MapReduce Job

Master node

Hadoop core

MapReduce FrameworkHDFS

NameNode JobTracker

InputFormat Implementations

Ca

talo

g

Da

ta

Lo

ad

er

Node 1

TaskTracker

DataNodeDatabase

Node 2

TaskTracker

DataNodeDatabase

Node n

TaskTracker

DataNodeDatabase

Database Connector

MapReduce Job

Task with InputFormat

Figure 1: The Architecture of HadoopDB

locality by matching a TaskTracker to Map tasks that process datalocal to it. It load-balances by ensuring all available TaskTrackersare assigned tasks. TaskTrackers regularly update the JobTrackerwith their status through heartbeat messages.

The InputFormat library represents the interface between thestorage and processing layers. InputFormat implementations parsetext/binary files (or connect to arbitrary data sources) and transformthe data into key-value pairs that Map tasks can process. Hadoopprovides several InputFormat implementations including one thatallows a single JDBC-compliant database to be accessed by alltasks in one job in a given cluster.

5.2 HadoopDB’s ComponentsHadoopDB extends the Hadoop framework (see Fig. 1) by pro-

viding the following four components:

5.2.1 Database ConnectorThe Database Connector is the interface between independent

database systems residing on nodes in the cluster and TaskTrack-ers. It extends Hadoop’s InputFormat class and is part of the Input-Format Implementations library. Each MapReduce job supplies theConnector with an SQL query and connection parameters such as:which JDBC driver to use, query fetch size and other query tuningparameters. The Connector connects to the database, executes theSQL query and returns results as key-value pairs. The Connectorcould theoretically connect to any JDBC-compliant database thatresides in the cluster. However, different databases require differentread query optimizations. We implemented connectors for MySQLand PostgreSQL. In the future we plan to integrate other databasesincluding open-source column-store databases such as MonetDBand InfoBright. By extending Hadoop’s InputFormat, we integrateseamlessly with Hadoop’s MapReduce Framework. To the frame-work, the databases are data sources similar to data blocks in HDFS.

5.2.2 CatalogThe catalog maintains metainformation about the databases. This

includes the following: (i) connection parameters such as databaselocation, driver class and credentials, (ii) metadata such as datasets contained in the cluster, replica locations, and data partition-ing properties.

The current implementation of the HadoopDB catalog stores itsmetainformation as an XML file in HDFS. This file is accessed bythe JobTracker and TaskTrackers to retrieve information necessary

Source: Abouzeid et al. (2009) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB.

Page 42: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HadoopDB: Query Plans

Source: Abouzeid et al. (2009) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB.

to schedule tasks and process data needed by a query. In the future,we plan to deploy the catalog as a separate service that would workin a way similar to Hadoop’s NameNode.

5.2.3 Data LoaderThe Data Loader is responsible for (i) globally repartitioning data

on a given partition key upon loading, (ii) breaking apart singlenode data into multiple smaller partitions or chunks and (iii) finallybulk-loading the single-node databases with the chunks.

The Data Loader consists of two main components: GlobalHasher and Local Hasher. The Global Hasher executes a custom-made MapReduce job over Hadoop that reads in raw data filesstored in HDFS and repartitions them into as many parts as thenumber of nodes in the cluster. The repartitioning job does notincur the sorting overhead of typical MapReduce jobs.

The Local Hasher then copies a partition from HDFS into thelocal file system of each node and secondarily partitions the file intosmaller sized chunks based on the maximum chunk size setting.

The hashing functions used by both the Global Hasher and theLocal Hasher differ to ensure chunks are of a uniform size. Theyalso differ from Hadoop’s default hash-partitioning function to en-sure better load balancing when executing MapReduce jobs overthe data.

5.2.4 SQL to MapReduce to SQL (SMS) PlannerHadoopDB provides a parallel database front-end to data analysts

enabling them to process SQL queries.The SMS planner extends Hive [11]. Hive transforms HiveQL, a

variant of SQL, into MapReduce jobs that connect to tables storedas files in HDFS. The MapReduce jobs consist of DAGs of rela-tional operators (such as filter, select (project), join, aggregation)that operate as iterators: each operator forwards a data tuple to thenext operator after processing it. Since each table is stored as aseparate file in HDFS, Hive assumes no collocation of tables onnodes. Therefore, operations that involve multiple tables usuallyrequire most of the processing to occur in the Reduce phase ofa MapReduce job. This assumption does not completely hold inHadoopDB as some tables are collocated and if partitioned on thesame attribute, the join operation can be pushed entirely into thedatabase layer.

To understand how we extended Hive for SMS as well as the dif-ferences between Hive and SMS, we first describe how Hive createsan executable MapReduce job for a simple GroupBy-Aggregationquery. Then, we describe how we modify the execution plan forHadoopDB by pushing most of the query processing logic into thedatabase layer.

Consider the following query:SELECT YEAR(saleDate), SUM(revenue)FROM sales GROUP BY YEAR(saleDate);

Hive processes the above SQL query in a series of phases:(1) The parser transforms the query into an Abstract Syntax Tree.(2) The Semantic Analyzer connects to Hive’s internal catalog,

the MetaStore, to retrieve the schema of the sales table. It alsopopulates different data structures with meta information such asthe Deserializer and InputFormat classes required to scan the tableand extract the necessary fields.

(3) The logical plan generator then creates a DAG of relationaloperators, the query plan.

(4) The optimizer restructures the query plan to create a moreoptimized plan. For example, it pushes filter operators closer to thetable scan operators. A key function of the optimizer is to break upthe plan into Map or Reduce phases. In particular, it adds a Repar-tition operator, also known as a Reduce Sink operator, before Joinor GroupBy operators. These operators mark the Map and Reduce

TableScanOperator

sales

SelectOperator

expr:[Col[YEAR(saleDate)],Col

[revenue]]

[0:string,1:double]

GroupByOperator

aggr:[Sum(1)]keys:[Col[0]]mode:

hash

ReduceSinkOperator

PartitionCols:Col[0]

GroupByOperator

aggr:[Sum(1)]keys:[Col[0]]mode:

mergepartial

FileSinkOperator

file=annual_sales_revenue

SelectOperator

expr:[Col[0],Col[1]]

Reduce

Phase

Map

Phase

TableScanOperator

SELECTYEAR(saleDate),SUM(revenue)

FROMsalesGROUPBYYEAR(saleDate)

[0:string,1:double]

ReduceSinkOperator

PartitionCols:Col[0]

GroupByOperator

aggr:[Sum(1)]keys:[Col[0]]mode:

mergepartial

FileSinkOperator

file=annual_sales_revenue

SelectOperator

expr:[Col[0],Col[1]]

Reduce

Phase

Map

Phase

TableScanOperator

SELECTYEAR(saleDate),SUM(revenue)

FROMsalesGROUPBYYEAR(saleDate)

[0:string,1:double]

FileSinkOperator

file=annual_sales_revenue

Map

Phase

Only

(b)

(a)

(c)

Figure 2: (a) MapReduce job generated by Hive (b)MapReduce job generated by SMS assuming sales is par-titioned by YEAR(saleDate). This feature is still unsup-ported (c) MapReduce job generated by SMS assumingno partitioning of sales

phases of a query plan. The Hive optimizer is a simple, naı̈ve, rule-based optimizer. It does not use cost-based optimization techniques.Therefore, it does not always generate efficient query plans. This isanother advantage of pushing as much as possible of the query pro-cessing logic into DBMSs that have more sophisticated, adaptive orcost-based optimizers.

(5) Finally, the physical plan generator converts the logical queryplan into a physical plan executable by one or more MapReducejobs. The first and every other Reduce Sink operator marks a tran-sition from a Map phase to a Reduce phase of a MapReduce job andthe remaining Reduce Sink operators mark the start of new MapRe-duce jobs. The above SQL query results in a single MapReducejob with the physical query plan illustrated in Fig. 2(a). The boxesstand for the operators and the arrows represent the flow of data.

(6) Each DAG enclosed within a MapReduce job is serializedinto an XML plan. The Hive driver then executes a Hadoop job.The job reads the XML plan and creates all the necessary operatorobjects that scan data from a table in HDFS, and parse and processone tuple at a time.

The SMS planner modifies Hive. In particular we intercept thenormal Hive flow in two main areas:

(i) Before any query execution, we update the MetaStore withreferences to our database tables. Hive allows tables to exist exter-nally, outside HDFS. The HadoopDB catalog, Section 5.2.2, pro-vides information about the table schemas and required Deserial-izer and InputFormat classes to the MetaStore. We implementedthese specialized classes.

(ii) After the physical query plan generation and before the ex-ecution of the MapReduce jobs, we perform two passes over thephysical plan. In the first pass, we retrieve data fields that are actu-ally processed by the plan and we determine the partitioning keysused by the Reduce Sink (Repartition) operators. In the secondpass, we traverse the DAG bottom-up from table scan operators tothe output or File Sink operator. All operators until the first repar-tition operator with a partitioning key different from the database’skey are converted into one or more SQL queries and pushed intothe database layer. SMS uses a rule-based SQL generator to recre-ate SQL from the relational operators. The query processing logicthat could be pushed into the database layer ranges from none (each

SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

Page 43: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

MapReduce Sucks: Iterative Algorithms

¢  Java verbosity

¢  Hadoop task startup time

¢  Stragglers

¢  Needless data shuffling

¢  Checkpointing at each iteration

Page 44: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HaLoop: MapReduce + Iteration

Ri+1 = R0 [ (Ri on L)Programming Model

HaLoop: Efficient Iterative Data Processingon Large Clusters

Yingyi Bu⇤ Bill Howe Magdalena Balazinska Michael D. ErnstDepartment of Computer Science and Engineering

University of Washington, Seattle, WA, U.S.A.

[email protected], {billhowe, magda, mernst}@cs.washington.edu

ABSTRACTThe growing demand for large-scale data mining and data anal-ysis applications has led both industry and academia to designnew types of highly scalable data-intensive computing platforms.MapReduce and Dryad are two popular platforms in which thedataflow takes the form of a directed acyclic graph of operators.These platforms lack built-in support for iterative programs, whicharise naturally in many applications including data mining, webranking, graph analysis, model fitting, and so on. This paperpresents HaLoop, a modified version of the Hadoop MapReduceframework that is designed to serve these applications. HaLoopnot only extends MapReduce with programming support for it-erative applications, it also dramatically improves their efficiencyby making the task scheduler loop-aware and by adding variouscaching mechanisms. We evaluated HaLoop on real queries andreal datasets. Compared with Hadoop, on average, HaLoop reducesquery runtimes by 1.85, and shuffles only 4% of the data betweenmappers and reducers.

1. INTRODUCTIONThe need for highly scalable parallel data processing platforms

is rising due to an explosion in the number of massive-scale data-intensive applications both in industry (e.g., web-data analysis,click-stream analysis, network-monitoring log analysis) and in thesciences (e.g., analysis of data produced by massive-scale simula-tions, sensor deployments, high-throughput lab equipment).

MapReduce [4] is a well-known framework for programmingcommodity computer clusters to perform large-scale data process-ing in a single pass. A MapReduce cluster can scale to thousandsof nodes in a fault-tolerant manner. Although parallel database sys-tems [5] may also serve these data analysis applications, they canbe expensive, difficult to administer, and lack fault-tolerance forlong-running queries [16]. Hadoop [7], an open-source MapRe-duce implementation, has been adopted by Yahoo!, Facebook, andother companies for large-scale data analysis. With the MapReduceframework, programmers can parallelize their applications simplyby implementing a map function and a reduce function to transform

⇤Work was done while the author was at University of Washington, Seattle. Currentaffiliation: Yingyi Bu - University of California, Irvine.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. Articles from this volume were presented at The36th International Conference on Very Large Data Bases, September 13-17,2010, Singapore.Proceedings of the VLDB Endowment, Vol. 3, No. 1Copyright 2010 VLDB Endowment 2150-8097/10/09... $ 10.00.

url rankwww.a.com 1.0www.b.com 1.0www.c.com 1.0www.d.com 1.0www.e.com 1.0

url source url destwww.a.com www.b.comwww.a.com www.c.comwww.c.com www.a.comwww.e.com www.d.comwww.d.com www.b.comwww.c.com www.e.comwww.e.com www.c.comwww.a.com www.d.com

(a) Initial Rank Table R0 (b) Linkage Table L

MR

1

8>>><

>>>:

T1 = Ri 1url=url source

L

T2 = �url,rank, rank

COUNT(url dest)!new rank

(T1)

T3 = T2 1url=url source

L

MR

2

nRi+1 = �

url dest!url,SUM(new rank)!rank

(T3)

url rankwww.a.com 2.13www.b.com 3.89www.c.com 2.60www.d.com 2.60www.e.com 2.13

(c) Loop Body (d) Rank Table R3

Figure 1: PageRank example

and aggregate their data, respectively. Many algorithms naturallyfit into the MapReduce model, such as word counting, equi-joinqueries, and inverted list construction [4].

However, many data analysis techniques require iterative com-putations, including PageRank [15], HITS (Hypertext-InducedTopic Search) [11], recursive relational queries [3], clustering,neural-network analysis, social network analysis, and network traf-fic analysis. These techniques have a common trait: data are pro-cessed iteratively until the computation satisfies a convergence orstopping condition. The MapReduce framework does not directlysupport these iterative data analysis applications. Instead, program-mers must implement iterative programs by manually issuing mul-tiple MapReduce jobs and orchestrating their execution using adriver program [12].

There are two key problems with manually orchestrating an iter-ative program in MapReduce. The first problem is that even thoughmuch of the data may be unchanged from iteration to iteration, thedata must be re-loaded and re-processed at each iteration, wastingI/O, network bandwidth, and CPU resources. The second prob-lem is that the termination condition may involve detecting whena fixpoint has been reached — i.e., when the application’s outputdoes not change for two consecutive iterations. This condition mayitself require an extra MapReduce job on each iteration, again in-curring overhead in terms of scheduling extra tasks, reading extradata from disk, and moving data across the network. To illustratethese problems, consider the following two examples.

EXAMPLE 1. (PageRank) PageRank is a link analysis algo-rithm that assigns weights (ranks) to each vertex in a graph byiteratively computing the weight of each vertex based on the weightof its inbound neighbors. In the relational algebra, the PageRankalgorithm can be expressed as a join followed by an update with

Source: Bu et al. (2010) HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB.

Page 45: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HaLoop Architecture

Task Queue

.

.

.

Task21 Task22 Task23

Task31 Task32 Task33

Task11 Task12 Task13�

Identical to Hadoop New in HaLoop

Local communication Remote communication

Modified from Hadoop

Figure 3: The HaLoop framework, a variant of HadoopMapReduce framework.

1, job 2, and job 3. Each job has three tasks running concurrentlyon slave nodes.

In order to accommodate the requirements of iterative data anal-ysis applications, we made several changes to the basic HadoopMapReduce framework. First, HaLoop exposes a new applicationprogramming interface to users that simplifies the expression ofiterative MapReduce programs (Section 2.2). Second, HaLoop’smaster node contains a new loop control module that repeatedlystarts new map-reduce steps that compose the loop body, untila user-specified stopping condition is met (Section 2.2). Third,HaLoop uses a new task scheduler for iterative applications thatleverages data locality in these applications (Section 3). Fourth,HaLoop caches and indexes application data on slave nodes (Sec-tion 4). As shown in Figure 3, HaLoop relies on the same filesystem and has the same task queue structure as Hadoop, but thetask scheduler and task tracker modules are modified, and the loopcontrol, caching, and indexing modules are new. The task trackernot only manages task execution, but also manages caches and in-dices on the slave node, and redirects each task’s cache and indexaccesses to local file system.

2.2 Programming ModelThe PageRank and descendant query examples are representative

of the types of iterative programs that HaLoop supports. Here, wepresent the general form of the recursive programs we support anda detailed API.

The iterative programs that HaLoop supports can be distilled intothe following core construct:

Ri+1

= R0

[ (Ri

./ L)

where R0

is an initial result and L is an invariant relation. Aprogram in this form terminates when a fixpoint is reached —when the result does not change from one iteration to the next, i.e.R

i+1

= Ri

. This formulation is sufficient to express a broad classof recursive programs.1

1SQL (ANSI SQL 2003, ISO/IEC 9075-2:2003) queries using theWITH clause can also express a variety of iterative applications, in-cluding complex analytics that are not typically implemented inSQL such as k-means and PageRank; see Section 9.5.

A fixpoint is typically defined by exact equality between iter-ations, but HaLoop also supports the concept of an approximatefixpoint, where the computation terminates when either the differ-ence between two consecutive iterations is less than a user-specifiedthreshold, or the maximum number of iterations has been reached.Both kinds of approximate fixpoints are useful for expressing con-vergence conditions in machine learning and complex analytics.For example, for PageRank, it is common to either use a user-specified convergence threshold ✏ [15] or a fixed number of iter-ations as the loop termination condition.

Although our recursive formulation describes the class of iter-ative programs we intend to support, this work does not developa high-level declarative language for expressing recursive queries.Rather, we focus on providing an efficient foundation API for it-erative MapReduce programs; we posit that a variety of high-levellanguages (e.g., Datalog) could be implemented on this foundation.

To write a HaLoop program, a programmer specifies the loopbody (as one or more map-reduce pairs) and optionally specifiesa termination condition and loop-invariant data. We now discussHaLoop’s API (see Figure 16 in the appendix for a summary). Mapand Reduce are similar to standard MapReduce and are required;the rest of the API is new and is optional.

To specify the loop body, the programmer constructs a multi-stepMapReduce job, using the following functions:

• Map transforms an input hkey, valuei tuple into intermediatehin key, in valuei tuples.

• Reduce processes intermediate tuples sharing the same in key,to produce hout key, out valuei tuples. The interface containsa new parameter for cached invariant values associated with thein key.

• AddMap and AddReduce express a loop body that consists ofmore than one MapReduce step. AddMap (AddReduce) asso-ciates a Map (Reduce) function with an integer indicating theorder of the step.

HaLoop defaults to testing for equality from one iteration to thenext to determine when to terminate the computation. To specify anapproximate fixpoint termination condition, the programmer usesthe following functions.

• SetFixedPointThreshold sets a bound on the distance be-tween one iteration and the next. If the threshold is exceeded,then the approximate fixpoint has not yet been reached, and thecomputation continues.

• The ResultDistance function calculates the distance betweentwo out value sets sharing the same out key. One out value set v

i

is from the reducer output of the current iteration, and the otherout value set v

i�1

is from the previous iteration’s reducer output.The distance between the reducer outputs of the current iterationi and the last iteration i � 1 is the sum of ResultDistance onevery key. (It is straightforward to support additional aggrega-tions besides sum.)

• SetMaxNumOfIterations provides further control of the looptermination condition. HaLoop terminates a job if the maxi-mum number of iterations has been executed, regardless of thedistance between the current and previous iteration’s outputs.SetMaxNumOfIterations can also be used to implement asimple for-loop.

To specify and control inputs, the programmer uses:

• SetIterationInput associates an input source with a specificiteration, since the input files to different iterations may be dif-ferent. For example, in Example 1, at each iteration i + 1, theinput is R

i

[ L.

Source: Bu et al. (2010) HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB.

Page 46: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HaLoop: Loop Aware Scheduling

MapReduce

Stop?

Map Reduce Map Reduce

ApplicationYes

No

Map function

Reduce function

Stop condition

Job Job

HaLoop

Stop?

Map Reduce Map Reduce

No

ApplicationMap function

Reduce function

Stop condition

Yes

Job

submit

Figure 4: Boundary between an iterative application and theframework (HaLoop vs. Hadoop). HaLoop knows and controlsthe loop, while Hadoop only knows jobs with one map-reducepair.

Figure 5: A schedule exhibiting inter-iteration locality. Tasksprocessing the same inputs on consecutive iterations are sched-uled to the same physical nodes.

• AddStepInput associates an additional input source with an in-termediate map-reduce pair in the loop body. The output of pre-ceding map-reduce pair is always in the input of the next map-reduce pair.

• AddInvariantTable specifies an input table (an HDFS file)that is loop-invariant. During job execution, HaLoop will cachethis table on cluster nodes.

This programming interface is sufficient to express a variety ofiterative applications. The appendix sketches the implementationof PageRank (Section 9.2), descendant query (Section 9.3), and k-means (Section 9.4) using this programming interface. Figure 4shows the difference between HaLoop and Hadoop, from the appli-cation’s perspective: in HaLoop, a user program specifies loop set-tings and the framework controls the loop execution, but in Hadoop,it is the application’s responsibility to control the loops.

3. LOOP-AWARE TASK SCHEDULINGThis section introduces the HaLoop task scheduler. The sched-

uler provides potentially better schedules for iterative programsthan Hadoop’s scheduler. Sections 3.1 and 3.2 illustrate the desiredschedules and scheduling algorithm respectively.

3.1 Inter-Iteration LocalityThe high-level goal of HaLoop’s scheduler is to place on the

same physical machines those map and reduce tasks that occur indifferent iterations but access the same data. With this approach,data can more easily be cached and re-used between iterations. Forexample, Figure 5 is a sample schedule for the join step (MR1 inFigure 1(c)) of the PageRank application from Example 1. Thereare two iterations and three slave nodes involved in the job.

The scheduling of iteration 1 is no different than in Hadoop. Inthe join step of the first iteration, the input tables are L and R0.Three map tasks are executed, each of which loads a part of one orthe other input data file (a.k.a., a file split). As in ordinary Hadoop,the mapper output key (the join attribute in this example) is hashedto determine the reduce task to which it should be assigned. Then,

three reduce tasks are executed, each of which loads a partition ofthe collective mapper output. In Figure 5, reducer R00 processesmapper output keys whose hash value is 0, reducer R10 processeskeys with hash value 1, and reducer R20 processes keys with hashvalue 2.

The scheduling of the join step of iteration 2 can take advantageof inter-iteration locality: the task (either mapper or reducer) thatprocesses a specific data partition D is scheduled on the physicalnode where D was processed in iteration 1. Note that the two fileinputs to the join step in iteration 2 are L and R1.

The schedule in Figure 5 provides the feasibility to reuse loop-invariant data from past iterations. Because L is loop-invariant,mappers M01 and M11 would compute identical results to M00

and M10. There is no need to re-compute these mapper outputs,nor to communicate them to the reducers. In iteration 1, if reducerinput partitions 0, 1, and 2 are stored on nodes n3, n1, and n2

respectively, then in iteration 2, L need not be loaded, processedor shuffled again. In that case, in iteration 2, only one mapperM21 for R1-split0 needs to be launched, and thus the three reducerswill only copy intermediate data from M21. With this strategy, thereducer input is no different, but it now comes from two sources:the output of the mappers (as usual) and the local disk.

We refer to the property of the schedule in Figure 5 as inter-iteration locality. Let d be a file split (mapper input partition) or areducer input partition2, and let T i

d be a task consuming d in itera-tion i. Then we say that a schedule exhibits inter-iteration localityif for all i > 1, T i

d and T i�1d are assigned to the same physical node

if T i�1d exists.

The goal of task scheduling in HaLoop is to achieve inter-iteration locality. To achieve this goal, the only restriction is thatHaLoop requires that the number of reduce tasks should be invari-ant across iterations, so that the hash function assigning mapperoutputs to reducer nodes remains unchanged.

3.2 Scheduling AlgorithmHaLoop’s scheduler keeps track of the data partitions processed

by each map and reduce task on each physical machine, and it usesthat information to schedule subsequent tasks taking inter-iterationlocality into account.

More specifically, the HaLoop scheduler works as follows. Uponreceiving a heartbeat from a slave node, the master node tries toassign the slave node an unassigned task that uses data cached onthat node. To support this assignment, the master node maintains amapping from each slave node to the data partitions that this nodeprocessed in the previous iteration. If the slave node already has afull load, the master re-assigns its tasks to a nearby slave node.

Figure 6 gives pseudocode for the scheduling algorithm. Beforeeach iteration, previous is set to current, and then current isset to a new empty HashMap object. In a job’s first iteration, theschedule is exactly the same as that produced by Hadoop (line 2).After scheduling, the master remembers the association betweendata and node (lines 3 and 13). In later iterations, the schedulertries to retain previous data-node associations (lines 11 and 12). Ifthe associations can no longer hold due to the load, the master nodewill associate the data with another node (lines 6–8).

4. CACHING AND INDEXINGThanks to the inter-iteration locality offered by the task sched-

uler, access to a particular loop-invariant data partition is usually2Mapper input partitions are represented by an input file URL plusan offset and length; reducer input partitions are represented by aninteger hash value. Two partitions are assumed to be equal if theirrepresentations are equal.

Source: Bu et al. (2010) HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB.

Page 47: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HaLoop: Optimizations

¢  Loop-aware scheduling

¢  Caching

l  Reducer input for invariant data l  Reducer output speeding up convergence checks

Source: Bu et al. (2010) HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB.

Page 48: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

HaLoop: Performance

HaLoop Hadoop0

100

200

300

400

500

Runn

ing T

ime (

s)

Configuration

Reduce Shuffle

(a) Overall Performance (b) Join Step (c) Cost Distribution (d) Shuffled Bytes

Figure 9: PageRank Performance: HaLoop vs. Hadoop (Livejournal Dataset, 50 nodes)

HaLoop Hadoop0

200

400

600

800

1000

Runn

ing T

ime (

s)

Configuration

Reduce Shuffle

(a) Overall Performance (b) Join Step (c) Cost Distribution (d) Shuffled Bytes

Figure 10: PageRank Performance: HaLoop vs. Hadoop (Freebase Dataset, 90 nodes)

HaLoop Hadoop0

500

1000

1500

2000

Runn

ing T

ime (

s)

Configuration

Reduce Shuffle

(a) Overall Performance (b) Join Step (c) Cost Distribution (d) Shuffled Bytes

Figure 11: Descendant Query Performance: HaLoop vs. Hadoop (Triples Dataset, 90 nodes)

(a) Overall Performance (b) Join Step (c) Cost Distribution (d) Shuffled Bytes

Figure 12: Descendant Query Performance: HaLoop vs. Hadoop (Livejournal Dataset, 50 nodes)

each iteration, there are few new records produced, so the join’sselectivity on F is very low. Thus the cost becomes negligible.By contrast, for PageRank, the index does not help much, becausethe selectivity is high. For the descendants query on Livejournal(Figure 12), in iteration>3, the index does not help either, becausethe selectivity becomes high.

I/O in Shuffle Phase of Join Step. To tell how much shufflingI/O is saved, we compared the amount of shuffled data in the joinstep of each iteration. Since HaLoop caches loop-invariant data, theoverhead of shuffling these invariant data are completely avoided.These savings contribute an important part of the overall perfor-mance improvement. Figure 9(d), Figure 10(d), Figure 11(d), andFigure 12(d) plot the sizes of shuffled data. On average, HaLoop’sjoin step shuffles 4% as much data as Hadoop’s does.

5.2 Evaluation of Reducer Output CacheThis experiment shares the same hardware and dataset as the re-

ducer input cache experiments. To see how effective HaLoop’s re-ducer output cache is, we compared the cost of fixpoint evaluationin each iteration. Since descendant query has a trivial fixpoint eval-uation step that only requires testing to see if a file is empty, we run

the PageRank implementation in Section 9.2 on Livejournal andFreebase. In the Hadoop implementation, the fixpoint evaluation isimplemented by an extra MapReduce job. On average, comparedwith Hadoop, HaLoop reduces the cost of this step to 40%, by tak-ing advantage of the reducer output cache and a built-in distributedfixpoint evaluation. Figure 13(a) and (b) shows the time spent onfixpoint evaluation in each iteration.

5.3 Evaluation of Mapper Input CacheSince the mapper input cache aims to reduce data transportation

between slave nodes but we do not know the disk I/O implemen-tations of EC2 virtual machines, this suite of experiments uses an8-node physical machine cluster. PageRank and descendant querycannot utilize the mapper input cache because their inputs changefrom iteration to iteration. Thus, the application used in the eval-uation is the k-means clustering algorithm. We used two real-world Astronomy datasets (multi-dimensional tuples): cosmo-dark(46GB) and cosmo-gas (54GB). Detailed hardware and dataset de-scriptions are in Section 9.6. We vary the number of total iterations,and plot the algorithm running time in Figure 14. The mapper lo-cality rate is around 95% since there are not concurrent jobs in our

Source: Bu et al. (2010) HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB.

Page 49: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Beyond MapReduce

Source: Wikipedia (Waste container)

Page 50: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pregel: Computational Model

¢  Based on Bulk Synchronous Parallel (BSP) l  Computational units encoded in a directed graph l  Computation proceeds in a series of supersteps

l  Message passing architecture

¢  Each vertex, at each superstep: l  Receives messages directed at it from previous superstep

l  Executes a user-defined function (modifying state)

l  Emits messages to other vertices (for the next superstep)

¢  Termination:

l  A vertex can choose to deactivate itself l  Is “woken up” if new messages received

l  Computation halts when all vertices are inactive

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 51: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pregel

superstep t

superstep t+1

superstep t+2

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 52: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pregel: Implementation

¢  Master-Slave architecture l  Vertices are hash partitioned (by default) and assigned to workers l  Everything happens in memory

¢  Processing cycle: l  Master tells all workers to advance a single superstep

l  Worker delivers messages from previous superstep, executing vertex computation

l  Messages sent asynchronously (in batches)

l  Worker notifies master of number of active vertices

¢  Fault tolerance

l  Checkpointing l  Heartbeat/revert

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 53: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pregel: PageRank

class PageRankVertex : public Vertex<double, void, double> { public: virtual void Compute(MessageIterator* msgs) { if (superstep() >= 1) { double sum = 0; for (; !msgs->Done(); msgs->Next()) sum += msgs->Value(); *MutableValue() = 0.15 / NumVertices() + 0.85 * sum; } if (superstep() < 30) { const int64 n = GetOutEdgeIterator().size(); SendMessageToAllNeighbors(GetValue() / n); } else { VoteToHalt(); } } };

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 54: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pregel: SSSP

class ShortestPathVertex : public Vertex<int, int, int> { void Compute(MessageIterator* msgs) { int mindist = IsSource(vertex_id()) ? 0 : INF; for (; !msgs->Done(); msgs->Next()) mindist = min(mindist, msgs->Value()); if (mindist < GetValue()) { *MutableValue() = mindist; OutEdgeIterator iter = GetOutEdgeIterator(); for (; !iter.Done(); iter.Next()) SendMessageTo(iter.Target(), mindist + iter.GetValue()); } VoteToHalt(); } };

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 55: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Pregel: Combiners

class MinIntCombiner : public Combiner<int> { virtual void Combine(MessageIterator* msgs) { int mindist = INF; for (; !msgs->Done(); msgs->Next()) mindist = min(mindist, msgs->Value()); Output("combined_source", mindist); } };

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 56: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes
Page 57: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Load  Visits  

Group  by  url  

Foreach  url  generate  count   Load  Url  Info  

Join  on  url  

Group  by  category  

Foreach  category  generate  top10(urls)  

Dataflows

Map1  

Reduce1  Map2  

Reduce2  Map3  

Reduce3  

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Page 58: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Dataflows

¢  Pig targets a dataflow at the Hadoop query execution engine l  Everything broken down into maps and reduces

¢  Surely, there’s a richer set of graph operators you can build!

Page 59: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Dryad: Graph Operators

AS = A^n

nA A

BS = B^n

nB B

E = (AS >= C >= BS)

n

n

C

A A

B B

E || (AS >= BS)

n

n

C

A A

B B

(A>=C>=D>=B) || (A>=F>=B)

D

CF

A

B

A A

BB

AS >> BS

n

n B

C D

(B >= C) || (B >= D)AS >= BS

A A

nB B

n

a c d e

f g h

b

Figure 3: The operators of the graph description language. Circles are vertices and arrows are graph edges. A triangle at the bottom of avertex indicates an input and one at the top indicates an output. Boxes (a) and (b) demonstrate cloning individual vertices using the ^ operator.The two standard connection operations are pointwise composition using >= shown in (c) and complete bipartite composition using >> shown in(d). (e) illustrates a merge using ||. The second line of the figure shows more complex patterns. The merge in (g) makes use of a “subroutine”from (f) and demonstrates a bypass operation. For example, each A vertex might output a summary of its input to C which aggregates themand forwards the global statistics to every B. Together the B vertices can then distribute the original dataset (received from A) into balancedpartitions. An asymmetric fork/join is shown in (h).

3.3 Merging two graphsThe final operation in the language is ||, which merges

two graphs. C = A || B creates a new graph:

C = !VA "! VB, EA # EB, IA #! IB, OA #! OB$

where, in contrast to the composition operations, it is notrequired that A and B be disjoint. VA "! VB is the con-catenation of VA and VB with duplicates removed from thesecond sequence. IA #! IB means the union of A and B’s in-puts, minus any vertex that has an incoming edge followingthe merge (and similarly for the output case). If a vertexis contained in VA % VB its input and output edges are con-catenated so that the edges in EA occur first (with lowerport numbers). This simplification forbids certain graphswith “crossover” edges, however we have not found this re-striction to be a problem in practice. The invariant that themerged graph be acyclic is enforced by a run-time check.

The merge operation is extremely powerful and makes iteasy to construct typical patterns of communication such asfork/join and bypass as shown in Figures 3(f)–(h). It alsoprovides the mechanism for assembling a graph “by hand”from a collection of vertices and edges. So for example, atree with four vertices a, b, c, and d might be constructedas G = (a>=b) || (b>=c) || (b>=d).

The graph builder program to construct the query graphin Figure 2 is shown in Figure 4.

3.4 Channel typesBy default each channel is implemented using a tempo-

rary file: the producer writes to disk (typically on its localcomputer) and the consumer reads from that file.

In many cases multiple vertices will fit within the re-sources of a single computer so it makes sense to executethem all within the same process. The graph language hasan “encapsulation” command that takes a graph G and re-turns a new vertex vG. When vG is run as a vertex pro-gram, the job manager passes it a serialization of G as aninvocation parameter, and it runs all the vertices of G si-multaneously within the same process, connected by edgesimplemented using shared-memory FIFOs. While it wouldalways be possible to write a custom vertex program withthe same semantics as G, allowing encapsulation makes it ef-ficient to combine simple library vertices at the graph layerrather than re-implementing their functionality as a newvertex program.

Sometimes it is desirable to place two vertices in the sameprocess even though they cannot be collapsed into a singlegraph vertex from the perspective of the scheduler. For ex-ample, in Figure 2 the performance can be improved byplacing the first D vertex in the same process as the firstfour M and S vertices and thus avoiding some disk I/O,however the S vertices cannot be started until all of the Dvertices complete.

When creating a set of graph edges, the user can option-ally specify the transport protocol to be used. The availableprotocols are listed in Table 1. Vertices that are connectedusing shared-memory channels are executed within a singleprocess, though they are individually started as their inputsbecome available and individually report completion.

Because the dataflow graph is acyclic, scheduling dead-lock is impossible when all channels are either written totemporary files or use shared-memory FIFOs hidden within

Source: Isard et al. (2007) Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys.

Page 60: Data-Intensive Computing with MapReduce - GitHub …lintool.github.io/UMD-courses/bigdata-2013-Spring/slides/session11.pdf · Hadoop vs. Databases: Grep" 1 Nodes 10 Nodes 25 Nodes

Dryad: Architecture

At run time each channel is used to transport a finite se-quence of structured items. This channel abstraction hasseveral concrete implementations that use shared memory,TCP pipes, or files temporarily persisted in a file system.As far as the program in each vertex is concerned, channelsproduce and consume heap objects that inherit from a basetype. This means that a vertex program reads and writes itsdata in the same way regardless of whether a channel seri-alizes its data to bu!ers on a disk or TCP stream, or passesobject pointers directly via shared memory. The Dryad sys-tem does not include any native data model for serializa-tion and the concrete type of an item is left entirely up toapplications, which can supply their own serialization anddeserialization routines. This decision allows us to supportapplications that operate directly on existing data includ-ing exported SQL tables and textual log files. In practicemost applications use one of a small set of library item typesthat we supply such as newline-terminated text strings andtuples of base types.

A schematic of the Dryad system organization is shownin Figure 1. A Dryad job is coordinated by a process calledthe “job manager” (denoted JM in the figure) that runseither within the cluster or on a user’s workstation withnetwork access to the cluster. The job manager containsthe application-specific code to construct the job’s commu-nication graph along with library code to schedule the workacross the available resources. All data is sent directly be-tween vertices and thus the job manager is only responsiblefor control decisions and is not a bottleneck for any datatransfers.

Files, FIFO, NetworkJob schedule Data plane

Control plane

D D DNS

V V V

JM

Figure 1: The Dryad system organization. The job manager (JM)consults the name server (NS) to discover the list of available com-puters. It maintains the job graph and schedules running vertices (V)as computers become available using the daemon (D) as a proxy.Vertices exchange data through files, TCP pipes, or shared-memorychannels. The shaded bar indicates the vertices in the job that arecurrently running.

The cluster has a name server (NS) that can be used toenumerate all the available computers. The name serveralso exposes the position of each computer within the net-work topology so that scheduling decisions can take accountof locality. There is a simple daemon (D) running on eachcomputer in the cluster that is responsible for creating pro-cesses on behalf of the job manager. The first time a vertex(V) is executed on a computer its binary is sent from the jobmanager to the daemon and subsequently it is executed froma cache. The daemon acts as a proxy so that the job man-ager can communicate with the remote vertices and monitorthe state of the computation and how much data has been

read and written on its channels. It is straightforward to runa name server and a set of daemons on a user workstationto simulate a cluster and thus run an entire job locally whiledebugging.

A simple task scheduler is used to queue batch jobs. Weuse a distributed storage system, not described here, thatshares with the Google File System [21] the property thatlarge files can be broken into small pieces that are replicatedand distributed across the local disks of the cluster comput-ers. Dryad also supports the use of NTFS for accessing filesdirectly on local computers, which can be convenient forsmall clusters with low management overhead.

2.1 An example SQL queryIn this section, we describe a concrete example of a Dryad

application that will be further developed throughout the re-mainder of the paper. The task we have chosen is representa-tive of a new class of eScience applications, where scientificinvestigation is performed by processing large amounts ofdata available in digital form [24]. The database that weuse is derived from the Sloan Digital Sky Survey (SDSS),available online at http://skyserver.sdss.org.

We chose the most time consuming query (Q18) from apublished study based on this database [23]. The task is toidentify a “gravitational lens” e!ect: it finds all the objectsin the database that have neighboring objects within 30 arcseconds such that at least one of the neighbors has a colorsimilar to the primary object’s color. The query can beexpressed in SQL as:

select distinct p.objIDfrom photoObjAll pjoin neighbors n — call this join “X”on p.objID = n.objIDand n.objID < n.neighborObjIDand p.mode = 1

join photoObjAll l — call this join “Y”on l.objid = n.neighborObjIDand l.mode = 1and abs((p.u-p.g)-(l.u-l.g))<0.05and abs((p.g-p.r)-(l.g-l.r))<0.05and abs((p.r-p.i)-(l.r-l.i))<0.05and abs((p.i-p.z)-(l.i-l.z))<0.05

There are two tables involved. The first, photoObjAll, has 354,254,163 records, one for each identified astronomical object, keyed by a unique identifier objID. These records also include the object's color, as a magnitude (logarithmic brightness) in five bands: u, g, r, i and z. The second table, neighbors, has 2,803,165,372 records, one for each object located within 30 arc seconds of another object. The mode predicates in the query select only "primary" objects. The < predicate eliminates duplication caused by the neighbors relationship being symmetric. The outputs of joins "X" and "Y" are 932,820,679 and 83,798 records respectively, and the final hash emits 83,050 records.

The query uses only a few columns from the tables (the complete photoObjAll table contains 2 KBytes per record). When executed by SQLServer the query uses an index on photoObjAll keyed by objID with additional columns for mode, u, g, r, i and z, and an index on neighbors keyed by objID with an additional neighborObjID column. SQLServer reads just these indexes, leaving the remainder of the tables' data resting quietly on disk. (In our experimental setup we in fact omitted unused columns from the table, to avoid transporting the entire multi-terabyte database across

Source: Isard et al. (2007) Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys.


Dryad: Cool Tricks

¢  Channel: abstraction for vertex-to-vertex communication
l  File
l  TCP pipe

l  Shared memory

¢  Runtime graph refinement
l  Size of input is not known until runtime

l  Automatically rewrite graph based on invariant properties

Source: Isard et al. (2007) Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys.


Dryad: Sample Program

the country.) For the equivalent Dryad computation we extracted these indexes into two binary files, "ugriz.bin" and "neighbors.bin," each sorted in the same order as the indexes. The "ugriz.bin" file has 36-byte records, totaling 11.8 GBytes; "neighbors.bin" has 16-byte records, totaling 41.8 GBytes. The output of join "X" totals 31.3 GBytes, the output of join "Y" is 655 KBytes and the final output is 649 KBytes.

Figure 2: The communication graph for an SQL query. Details are in Section 2.1.

We mapped the query to the Dryad computation shown in Figure 2. Both data files are partitioned into n approximately equal parts (that we call U1 through Un and N1 through Nn) by objID ranges, and we use custom C++ item objects for each data record in the graph. The vertices Xi (for 1 ≤ i ≤ n) implement join "X" by taking their partitioned Ui and Ni inputs and merging them (keyed on objID and filtered by the < expression and p.mode=1) to produce records containing objID, neighborObjID, and the color columns corresponding to objID. The D vertices distribute their output records to the M vertices, partitioning by neighborObjID using a range partitioning function four times finer than that used for the input files. The number four was chosen so that four pipelines will execute in parallel on each computer, because our computers have four processors each. The M vertices perform a non-deterministic merge of their inputs and the S vertices sort on neighborObjID using an in-memory Quicksort. The output records from S4i−3 . . . S4i (for i = 1 through n) are fed into Yi where they are merged with another read of Ui to implement join "Y". This join is keyed on objID (from U) = neighborObjID (from S), and is filtered by the remainder of the predicate, thus matching the colors. The outputs of the Y vertices are merged into a hash table at the H vertex to implement the distinct keyword in the query. Finally, an enumeration of this hash table delivers the result. Later in the paper we include more details about the implementation of this Dryad program.
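To get a feel for the size of this graph (an illustrative calculation based on the graph builder program shown later, not a number reported in the paper): with n = 10 input partitions the plan instantiates 10 X, 10 D, 40 M, 40 S, and 10 Y vertices plus the single H vertex, i.e., 111 vertices in total, with each group of four S vertices feeding one Y vertex.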

3. DESCRIBING A DRYAD GRAPH

We have designed a simple language that makes it easy to specify commonly-occurring communication idioms. It is currently "embedded" for convenience in C++ as a library using a mixture of method calls and operator overloading.

Graphs are constructed by combining simpler subgraphs using a small set of operations shown in Figure 3. All of the operations preserve the property that the resulting graph is acyclic. The basic object in the language is a graph:

G = "VG, EG, IG, OG#.

G contains a sequence of vertices VG, a set of directed edges EG, and two sets IG ⊆ VG and OG ⊆ VG that "tag" some of the vertices as being inputs and outputs respectively. No graph can contain a directed edge entering an input vertex in IG, nor one leaving an output vertex in OG, and these tags are used below in composition operations. The input and output edges of a vertex are ordered so an edge connects specific "ports" on a pair of vertices, and a given pair of vertices may be connected by multiple edges.

3.1 Creating new vertices

The Dryad libraries define a C++ base class from which all vertex programs inherit. Each such program has a textual name (which is unique within an application) and a static "factory" that knows how to construct it. A graph vertex is created by calling the appropriate static program factory. Any required vertex-specific parameters can be set at this point by calling methods on the program object. These parameters are then marshaled along with the unique vertex name to form a simple closure that can be sent to a remote process for execution.

A singleton graph is generated from a vertex v as G = ⟨(v), ∅, {v}, {v}⟩. A graph can be cloned into a new graph containing k copies of its structure using the ^ operator, where C = G^k is defined as:

C = ⟨VG^1 ⊕ ··· ⊕ VG^k, EG^1 ∪ ··· ∪ EG^k, IG^1 ∪ ··· ∪ IG^k, OG^1 ∪ ··· ∪ OG^k⟩.

Here G^n = ⟨VG^n, EG^n, IG^n, OG^n⟩ is a "clone" of G containing copies of all of G's vertices and edges, ⊕ denotes sequence concatenation, and each cloned vertex inherits the type and parameters of its corresponding vertex in G.
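As a small worked example of this definition (ours, not the paper's): if G is the singleton graph ⟨(v), ∅, {v}, {v}⟩, then G^3 is, up to renaming, ⟨(v1, v2, v3), ∅, {v1, v2, v3}, {v1, v2, v3}⟩: three independent copies of v, each tagged as both an input and an output, with no edges between them.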

3.2 Adding graph edges

New edges are created by applying a composition operation to two existing graphs. There is a family of compositions all sharing the same basic structure: C = A ∘ B creates a new graph:

C = "VA & VB, EA ' EB ' Enew, IA, OB#

where C contains the union of all the vertices and edges in A and B, with A's inputs and B's outputs. In addition, directed edges Enew are introduced between vertices in OA and IB. VA and VB are enforced to be disjoint at run time, and since A and B are both acyclic, C is also.

Compositions differ in the set of edges Enew that they add into the graph. We define two standard compositions:

• A >= B forms a pointwise composition as shown in Fig-ure 3(c). If |OA| ) |IB | then a single outgoing edgeis created from each of A’s outputs. The edges areassigned in round-robin to B’s inputs. Some of thevertices in IB may end up with more than one incom-ing edge. If |IB| > |OA|, a single incoming edge iscreated to each of B’s inputs, assigned in round-robinfrom A’s outputs.

• A >> B forms the complete bipartite graph betweenOA and IB and is shown in Figure 3(d).

We allow the user to extend the language by implementing new composition operations.

GraphBuilder XSet = moduleX^N;
GraphBuilder DSet = moduleD^N;
GraphBuilder MSet = moduleM^(N*4);
GraphBuilder SSet = moduleS^(N*4);
GraphBuilder YSet = moduleY^N;
GraphBuilder HSet = moduleH^1;

GraphBuilder XInputs = (ugriz1 >= XSet) || (neighbor >= XSet);
GraphBuilder YInputs = ugriz2 >= YSet;

GraphBuilder XToY = XSet >= DSet >> MSet >= SSet;
for (i = 0; i < N*4; ++i)
{
    XToY = XToY || (SSet.GetVertex(i) >= YSet.GetVertex(i/4));
}

GraphBuilder YToH = YSet >= HSet;
GraphBuilder HOutputs = HSet >= output;

GraphBuilder final = XInputs || YInputs || XToY || YToH || HOutputs;

Figure 4: An example graph builder program. The communication graph generated by this program is shown in Figure 2.
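Reading this program against the definitions above: each module^k expression clones a vertex type with the ^ operator (N copies of X, D and Y, 4N copies of M and S, and a single H); ugriz1 >= XSet and neighbor >= XSet attach the partitioned inputs pointwise; DSet >> MSet forms the complete bipartite connection over which the D vertices distribute records into the four-times-finer partitions; and the loop wires each group of four S vertices into its Y vertex, matching the S4i−3 . . . S4i description in Section 2.1. (The || operator, which merges two graphs, is defined in a part of the paper not excerpted here.)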

Channel protocol   | Discussion
File (the default) | Preserved after vertex execution until the job completes.
TCP pipe           | Requires no disk accesses, but both end-point vertices must be scheduled to run at the same time.
Shared-memory FIFO | Extremely low communication cost, but end-point vertices must run within the same process.

Table 1: Channel types.

encapsulated acyclic subgraphs. However, allowing the developer to use pipes and "visible" FIFOs can cause deadlocks. Any connected component of vertices communicating using pipes or FIFOs must all be scheduled in processes that are concurrently executing, but this becomes impossible if the system runs out of available computers in the cluster. This breaks the abstraction that the user need not know the physical resources of the system when writing the application. We believe that it is a worthwhile trade-off, since, as reported in our experiments in Section 6, the resulting performance gains can be substantial. Note also that the system could always avoid deadlock by "downgrading" a pipe channel to a temporary file, at the expense of introducing an unexpected performance cliff.

3.5 Job inputs and outputs

Large input files are typically partitioned and distributed across the computers of the cluster. It is therefore natural to group a logical input into a graph G = ⟨VP, ∅, ∅, VP⟩ where VP is a sequence of "virtual" vertices corresponding to the partitions of the input. Similarly on job completion a set of output partitions can be logically concatenated to form a single named distributed file. An application will generally interrogate its input graphs to read the number of partitions at run time and automatically generate the appropriately replicated graph.
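Concretely (our illustration of the definition above, not an example from the paper): a logical input split into three partitions P1, P2, P3 is the graph ⟨(P1, P2, P3), ∅, ∅, {P1, P2, P3}⟩; it has no edges and no input tags, and all three partition vertices are tagged as outputs so that downstream subgraphs can be composed onto them with >= or >>, as ugriz1 >= XSet does in Figure 4.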

3.6 Job Stages

When the graph is constructed every vertex is placed in a "stage" to simplify job management. The stage topology can be seen as a "skeleton" or summary of the overall job, and the stage topology of our example Skyserver query application is shown in Figure 5. Each distinct type of vertex is grouped into a separate stage. Most stages are connected using the >= operator, while D is connected to M using the >> operator. The skeleton is used as a guide for generating summaries when monitoring a job, and can also be exploited by the automatic optimizations described in Section 5.2.

4. WRITING A VERTEX PROGRAM

Figure 5: The stages of the Dryad computation from Figure 2. Section 3.6 has details.

The primary APIs for writing a Dryad vertex program are exposed through C++ base classes and objects. It was a design requirement for Dryad vertices to be able to incorporate legacy source and libraries, so we deliberately avoided adopting any Dryad-specific language or sandboxing restrictions. Most of the existing code that we anticipate integrating into vertices is written in C++, but it is straightforward to implement API wrappers so that developers can write vertices in other languages, for example C#. There is also significant value for some domains in being able to run unmodified legacy executables in vertices, and so we support this as explained in Section 4.2 below.

4.1 Vertex execution

Dryad includes a runtime library that is responsible for setting up and executing vertices as part of a distributed computation. As outlined in Section 3.1 the runtime receives a closure from the job manager describing the vertex to be run, and URIs describing the input and output channels to connect to it. There is currently no type-checking for channels and the vertex must be able to determine, either statically or from the invocation parameters, the types of the items that it is expected to read and write

Source: Isard et al. (2007) Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys.


DryadLINQ

¢  LINQ = Language INtegrated Query
l  .NET constructs for combining imperative and declarative programming

¢  Developers write in DryadLINQ
l  Program compiled into computations that run on Dryad

Sound familiar?

Source: Yu et al. (2008) DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. OSDI.


DryadLINQ: Word Count

PartitionedTable<LineRecord> inputTable = PartitionedTable.Get<LineRecord>(uri); IQueryable<string> words = inputTable.SelectMany(x => x.line.Split(' ')); IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x); IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count())); IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count); IQueryable<Pair> top = ordered.Take(k);

Compare:

a = load 'file.txt' as (text: chararray);
b = foreach a generate flatten(TOKENIZE(text)) as term;
c = group b by term;
d = foreach c generate group as term, COUNT(b) as count;
store d into 'cnt';

Compare and contrast…
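For comparison with the DryadLINQ and Pig versions above, here is the same word count written directly against the Hadoop MapReduce Java API. This is a sketch of the standard example (reproduced from memory, not taken from these slides), and it omits the final sort/top-k step:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (term, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each term (also usable as a combiner)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input, e.g. file.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, e.g. cnt/
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The contrast is the slide's point: the six DryadLINQ statements and five Pig statements express the same dataflow that takes roughly fifty lines of hand-written mapper, reducer, and driver code.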


Storm: Real-Time Data Processing


Storm

¢  Provides an abstraction for real-time computation:
l  Spouts represent (infinite) source of tuples
l  Bolts consume tuples and emit new tuples

l  A topology determines how spouts and bolts are connected (see the sketch below)

¢  Runtime handles:
l  Process management

l  Guaranteed message delivery

l  Fault-tolerance
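To make these abstractions concrete, here is a minimal word-count topology sketched against Storm's Java API of this era (backtype.storm.*). The spout, bolts, and all names are our own illustrative choices based on the standard Storm word-count example, not code from the slides:

import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

  // Spout: an (infinite) source of sentence tuples
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"the quick brown fox", "jumped over the lazy dog"};
    private int i = 0;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(sentences[i++ % sentences.length]));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: consumes sentence tuples and emits one tuple per word
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      for (String word : tuple.getString(0).split(" ")) {
        collector.emit(new Values(word));
      }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Bolt: maintains running counts per word and emits updated counts
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      counts.put(word, count == null ? 1 : count + 1);
      collector.emit(new Values(word, counts.get(word)));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) {
    // The topology wires spouts and bolts together
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 2);
    builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");
    builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));

    // Run in-process for testing; a real deployment would use StormSubmitter instead
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", new Config(), builder.createTopology());
  }
}

Note the grouping choices: shuffleGrouping spreads sentences randomly across the split tasks, while fieldsGrouping on "word" routes every occurrence of a given word to the same counting task.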


Hadoop Cluster Architecture

[Architecture diagram: a job submission node runs the jobtracker, the namenode runs the namenode daemon, and each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.]


YARN

¢  Hadoop limitations:
l  Can only run MapReduce
l  What if we want to run other distributed frameworks?

¢  YARN = Yet-Another-Resource-Negotiator
l  Provides an API for developing generic distributed applications

l  Handles scheduling and resource requests

l  MapReduce (MR2) is one such application in YARN
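As a concrete illustration of the last point (our example, not from the slides): in Hadoop 2.x, running MapReduce as a YARN application rather than on the classic jobtracker is selected by a single property in mapred-site.xml:

<!-- mapred-site.xml: run MapReduce jobs on YARN (MR2) -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>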


YARN: Architecture


Today’s Agenda

¢  Making Hadoop more efficient

¢  Tweaking the MapReduce programming model

¢  Beyond MapReduce


Source: Wikipedia (Japanese rock garden)

Questions?