
Pig vs Hive: Benchmarking High Level Query Languages

Benjamin Jakobus
IBM, Ireland

Dr. Peter McBrien
Imperial College London, UK

Abstract

This article presents benchmarking results1 of two benchmark sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline to compare major Pig Latin releases). The second set was obtained by applying the TPC-H benchmarks.

The two benchmarks showed conflicting results: the first indicated that Pig outperformed Hive on most operations; interestingly, however, the TPC-H results provide evidence that Hive is significantly faster than Pig. The article analyzes the two benchmarks, concluding with a set of differences and a justification of the results.

The article presumes that the reader has a basic knowledge of Hadoop and big data. (It is not intended as an introduction to Hadoop, Pig or Hive.)

1 The results stem from 2013, when the author spent a year at Imperial College London.


About the authors

Benjamin Jakobus graduated with a BSc in Computer Science from University College Cork in 2011, after which he co-founded an Irish start-up. He returned to university one year later and graduated with an MSc in Advanced Computing from Imperial College London in 2013. Since graduating, he has worked as a Software Engineer at IBM Dublin (SWG, Collaboration Solutions). This article is based on his Master's thesis, developed under the supervision of Dr. Peter McBrien.

Dr. Peter McBrien graduated with a BA in Computer Science from Cambridge University in 1986. After some time working at Racal and ICL, he joined the Department of Computing at Imperial College as an RA in 1989, working on the Tempora Esprit Project. He obtained his PhD, Implementing Graph Rewriting By Graph Rewriting, in 1992, under the supervision of Chris Hankin. In 1994, he joined the Department of Computing at King's College London as a lecturer, and returned to the Department of Computing at Imperial College in August 1999 as a lecturer. He has since been promoted to Senior Lecturer.


Acknowledgements

The authors would like to thank Yu Liu, PhD student at Imperial College London, who, over the course of the past year, helped us with the technical problems that we encountered.


1 Introduction

Despite Hadoop's popularity, users find it cumbersome to develop Map-Reduce (MR) programs directly. To simplify the task, high-level scripting languages such as Pig Latin and HiveQL have emerged. Users are often faced with the question of whether to use Pig or Hive. At the time of writing, no up-to-date scientific studies exist to help them answer this question. In addition, the performance differences between Pig and Hive are not well understood: literature examining these differences is scarce.
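To give a flavour of what these languages offer, the following Pig Latin sketch expresses a filter, group and aggregate computation that would otherwise require a hand-written MR job. It is illustrative only: the file name and field names are hypothetical, not taken from the benchmarks below.

    -- Load a tab-delimited file, declaring a schema inline.
    students = LOAD 'students.txt' AS (name:chararray, age:int, gpa:float);
    -- Keep only adults, group them by age, and compute the average GPA per group.
    adults  = FILTER students BY age >= 18;
    by_age  = GROUP adults BY age;
    avg_gpa = FOREACH by_age GENERATE group AS age, AVG(adults.gpa);
    STORE avg_gpa INTO 'output';

Pig compiles each such script into one or more MR jobs; the equivalent hand-written Java MR program is typically an order of magnitude longer, which is precisely the convenience these languages trade against raw performance.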

The article presents benchmarking results2 of two benchmark sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07. The second set was obtained by applying the TPC-H benchmarks. These test cases consist of 22 distinct queries, each of which exhibits the same (or a higher) degree of complexity as is typically found in real-world industry scenarios, with varying query parameters and various types of access.

Whilst existing literature[7][10][6][4][11][13][8][12] addresses some of these questions, it suffers from the following shortcomings:

    1. The most recent Apache Pig benchmarks stem from 2009.

2. None of the cited literature examines how operations scale over different datasets.

3. Hadoop benchmarks were performed on clusters of 100 nodes or fewer. (Hadoop was designed to run on clusters containing thousands of nodes, so small-scale performance analysis may not do it justice.) Naturally, the same argument can be applied against the benchmarking results presented in this article.

4. The literature fails to indicate the communication overhead required by the various database management systems. (Again, this article does not address this concern; rather, it reports benchmark measurements taken at runtime.)

2 The results stem from 2013, when the author spent a year at Imperial College London.


2 Background: Benchmarking High-level Query Languages

To date there exist several publications comparing the performance of Pig, HiveQL and other high-level query languages (HLQLs). In 2011, Stewart et al[13] compared Pig, HiveQL and JAQL using runtime metrics, according to how well each language scales, and according to how much shorter queries really are in comparison to using the Hadoop Java API directly. Using a 32-node Beowulf cluster, Stewart et al found that:

HiveQL scaled best (both up and out), and Java was only slightly faster, having the best runtime performance overall. Java also had better scale-up performance than Pig.

    Pig is the most succinct and compact language of those compared.

Pig Latin and HiveQL are not Turing complete.

Pig and JAQL scaled the same, except when using joins: Pig significantly outperformed JAQL in that regard.

Pig and Hive are optimised for skewed key distributions and outperform hand-coded Java MR jobs in that regard.

Hive's performance advantage over Pig is further supported by Apache's Hive performance benchmarks[10].

Moussa[11] from the University of Tunis applied the TPC-H benchmark to compare the Oracle SQL engine to Pig. It was found that the SQL engine greatly outperformed Pig (whereby joins using Pig stood out as particularly slow). Again, Apache's own benchmarks[10] confirm this: when executing a join, Hadoop took 470 seconds, Hive took 471 seconds and Pig took 764 seconds (Hive took 0.2% more time than Hadoop, whilst Pig took 63% more time than Hadoop). Moussa used a dataset of 1.1GB.
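For reference, the join being measured is a single statement in Pig Latin. The sketch below is hedged: the file names and schemas are illustrative, not those used in [10] or [11].

    -- Two tab-delimited inputs sharing a 'name' key.
    a = LOAD 'left.txt'  AS (name:chararray, age:int, gpa:float);
    b = LOAD 'right.txt' AS (name:chararray, marks:int);
    -- Equi-join on the shared key. Pig compiles this statement into one or
    -- more MR jobs; the generated plan is the source of the overhead
    -- measured relative to hand-coded Hadoop.
    j = JOIN a BY name, b BY name;
    STORE j INTO 'join_output';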

While studying the performance of Pig using large astrophysical datasets, Loebman et al[12] also found that a relational database management system outperforms Pig joins. In general, their experiments show that relational database management systems (RDBMSs) performed better than Hadoop, and that relational databases especially stood out in terms of memory management (although that was to be expected, given that NoSQL systems are designed to deal with unstructured rather than structured data). As acknowledged by the authors, it should be noted that no more than 8 nodes were used throughout the experiment; Hadoop, however, is designed to be used with hundreds if not thousands of nodes. Work by Schatzle et al further underpins this argument: in 2011 the authors proposed PigSPARQL (a framework for translating SPARQL queries to Pig Latin) based on the reasoning that "for scenarios, which can be characterized by first extracting information from a huge data set, second by transforming and loading the extracted data into a different format, cluster-based parallelism seems to outperform parallel databases"[4]. Their reasoning is based on [5] and [6]; however, the authors of [6] acknowledge that they cannot verify the claim that Hadoop would have outperformed the parallel database systems if only it had more nodes. That is, having benchmarked Hadoop's MapReduce with 100 nodes against two parallel database systems, they found that both systems outperformed Hadoop:

First, at the scale of the experiments we conducted, both parallel database systems displayed a significant performance advantage over Hadoop MR in executing a variety of data intensive analysis benchmarks. Averaged across all five tasks at 100 nodes, DBMS-X was 3.2 times faster than MR and Vertica was 2.3 times faster than DBMS-X. While we cannot verify this claim, we believe that the systems would have the same relative performance on 1,000 nodes (the largest Teradata configuration is less than 100 nodes managing over four petabytes of data).

    3 Running the Apache Benchmark

The experiment follows in the footsteps of the Pig benchmarks3 published by the Apache Foundation on 11/07/07[7]. Their objective was to have baseline numbers to compare against before making major changes to the system.

3 With the exception of co-grouping.

    3.1 Test Data

We decided to benchmark the execution of load, arithmetic, group, join and filter operations on 6 datasets (as opposed to just two):

    Dataset size 1: 30,000 records (772KB)

    Dataset size 2: 300,000 records (6.4MB)

Dataset size 3: 3,000,000 records (63MB)

Dataset size 4: 30 million records (628MB)

    Dataset size 5: 300 million records (6.2GB)

    Dataset size 6: 3 billion records (62GB)

That is, our datasets scale linearly, whereby the size of dataset n equates to 3,000 * 10^n records.

A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows:

name - string
marks - integer
gpa - float

The data was generated using the generate_data.pl perl script available for download on the Apache website[7], and produced tab-delimited text files with the following schema:

name - string
age - integer
gpa - float
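As an indication of how this generated data is consumed, the following Pig Latin sketch loads a dataset with the schema above and applies one plausible form of the benchmarked arithmetic operation; the file and output names are hypothetical, and the actual benchmark scripts may differ.

    -- Main dataset: tab-delimited (name, age, gpa), as produced by generate_data.pl.
    data = LOAD 'dataset_3000000' AS (name:chararray, age:int, gpa:float);
    -- A simple arithmetic projection of the kind the benchmark times.
    arith = FOREACH data GENERATE age * 2, gpa + 1.0f;
    STORE arith INTO 'arithmetic_output';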

It should be noted that the experiment differs slightly from the original in that the original used only two datasets, of 200 million records (200MB) and 10 thousand records (10KB), whereas our experiment consists of six separate datasets with a scaling factor of 10 (i.e. 30,000 records, 300,000 records, etc.).

    3.2 Test Setup

The benchmarks were run on a cluster consisting of 6 nodes (1 dedicated to Name Node and Jo