
Pig vs Hive: Benchmarking High Level Query Languages

Benjamin Jakobus
IBM, Ireland

Dr. Peter McBrien
Imperial College London, UK

Abstract

This article presents benchmarking results1 of two benchmark sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline to compare major Pig Latin releases). The second set was obtained by applying the TPC-H benchmarks.

The two benchmarks showed conflicting results: the first indicated that Pig outperformed Hive on most operations; interestingly, however, the TPC-H results provide evidence that Hive is significantly faster than Pig. The article analyzes the two benchmarks, concluding with a set of differences and a justification of the results.

The article presumes that the reader has a basic knowledge of Hadoop and big data. (It is not intended as an introduction to Hadoop, Pig or Hive.)

1 The results stem from 2013, when the author spent a year at Imperial College London.


About the authors

Benjamin Jakobus graduated with a BSc in Computer Science from University College Cork in 2011, after which he co-founded an Irish start-up. He returned to university one year later and graduated with an MSc in Advanced Computing from Imperial College London in 2013. Since graduating, he has worked as a Software Engineer at IBM Dublin (SWG, Collaboration Solutions). This article is based on his Master's thesis, developed under the supervision of Dr. Peter McBrien.

Dr. Peter McBrien graduated with a BA in Computer Science from Cambridge University in 1986. After some time working at Racal and ICL, he joined the Department of Computing at Imperial College as an RA in 1989, working on the Tempora Esprit Project. He obtained his PhD, Implementing Graph Rewriting By Graph Rewriting, in 1992, under the supervision of Chris Hankin. In 1994, he joined the Department of Computing at King's College London as a lecturer, and returned to the Department of Computing at Imperial College in August 1999 as a lecturer. He has since been promoted to Senior Lecturer.


Acknowledgements

The authors would like to thank Yu Liu, PhD student at Imperial College London, who, over the course of the past year, helped us with the technical problems that we encountered.


1 Introduction

Despite Hadoop's popularity, users find it cumbersome to develop Map-Reduce (MR) programs directly. To simplify the task, high-level scripting languages such as Pig Latin and HiveQL have emerged. Users are often faced with the question of whether to use Pig or Hive. At the time of writing, no up-to-date scientific studies exist to help them answer this question. In addition, the performance differences between Pig and Hive are not well understood: literature examining these differences is scarce.
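To give a flavour of what these languages offer, the following Pig Latin sketch expresses a filter, group and aggregate computation that would otherwise require a hand-written MR job. It is illustrative only: the file name and field names are hypothetical, not taken from the benchmarks below.

    -- Load a tab-delimited file, declaring a schema inline.
    students = LOAD 'students.txt' AS (name:chararray, age:int, gpa:float);
    -- Keep only adults, group them by age, and compute the average GPA per group.
    adults  = FILTER students BY age >= 18;
    by_age  = GROUP adults BY age;
    avg_gpa = FOREACH by_age GENERATE group AS age, AVG(adults.gpa);
    STORE avg_gpa INTO 'output';

Pig compiles each such script into one or more MR jobs; the equivalent hand-written Java MR program is typically an order of magnitude longer, which is precisely the convenience these languages trade against raw performance.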

The article presents benchmarking results2 of two benchmark sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07. The second set was obtained by applying the TPC-H benchmarks. These test cases consist of 22 distinct queries, each of which exhibits the same (or a higher) degree of complexity as is typically found in real-world industry scenarios, with varying query parameters and various types of access.

Whilst existing literature[7][10][6][4][11][13][8][12] addresses some of these questions, it suffers from the following shortcomings:

    1. The most recent Apache Pig benchmarks stem from 2009.

2. None of the cited literature examines how operations scale over different datasets.

3. Hadoop benchmarks were performed on clusters of 100 nodes or fewer. (Hadoop was designed to run on clusters containing thousands of nodes, so small-scale performance analysis may not do it justice.) Naturally, the same argument can be applied against the benchmarking results presented in this article.

4. The literature fails to indicate the communication overhead required by the various database management systems. (Again, this article does not address this concern; rather, it reports benchmark measurements taken at runtime.)

2 The results stem from 2013, when the author spent a year at Imperial College London.


2 Background: Benchmarking High-level Query Languages

To date there exist several publications comparing the performance of Pig, HiveQL and other high-level query languages (HLQLs). In 2011, Stewart et al[13] compared Pig, HiveQL and JAQL using runtime metrics, according to how well each language scales, and according to how much shorter queries really are in comparison to using the Hadoop Java API directly. Using a 32-node Beowulf cluster, Stewart et al found that:

HiveQL scaled best (both up and out), and Java was only slightly faster, having the best runtime performance overall. Java also had better scale-up performance than Pig.

    Pig is the most succinct and compact language of those compared.

Pig Latin and HiveQL are not Turing complete.

Pig and JAQL scaled the same, except when using joins: Pig significantly outperformed JAQL in that regard.

Pig and Hive are optimised for skewed key distributions and outperform hand-coded Java MR jobs in that regard.

Hive's performance advantage over Pig is further supported by Apache's Hive performance benchmarks[10].

Moussa[11] from the University of Tunis applied the TPC-H benchmark to compare the Oracle SQL engine to Pig. It was found that the SQL engine greatly outperformed Pig (whereby joins using Pig stood out as particularly slow). Again, Apache's own benchmarks[10] confirm this: when executing a join, Hadoop took 470 seconds, Hive took 471 seconds and Pig took 764 seconds (Hive took 0.2% more time than Hadoop, whilst Pig took 63% more time than Hadoop). Moussa used a dataset of 1.1GB.
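For reference, the join being measured is a single statement in Pig Latin. The sketch below is hedged: the file names and schemas are illustrative, not those used in [10] or [11].

    -- Two tab-delimited inputs sharing a 'name' key.
    a = LOAD 'left.txt'  AS (name:chararray, age:int, gpa:float);
    b = LOAD 'right.txt' AS (name:chararray, marks:int);
    -- Equi-join on the shared key. Pig compiles this statement into one or
    -- more MR jobs; the generated plan is the source of the overhead
    -- measured relative to hand-coded Hadoop.
    j = JOIN a BY name, b BY name;
    STORE j INTO 'join_output';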

While studying the performance of Pig using large astrophysical datasets, Loebman et al[12] also found that a relational database management system outperforms Pig joins. In general, their experiments show that relational database management systems (RDBMSs) performed better than Hadoop, and that relational databases especially stood out in terms of memory management (although that was to be expected, given that NoSQL systems are designed to deal with unstructured rather than structured data). As acknowledged by the authors, it should be noted that no more than 8 nodes were used throughout the experiment; Hadoop, however, is designed to be used with hundreds if not thousands of nodes. Work by Schatzle et al further underpins this argument: in 2011 the authors proposed PigSPARQL (a framework for translating SPARQL queries to Pig Latin) based on the reasoning that "for scenarios, which can be characterized by first extracting information from a huge data set, second by transforming and loading the extracted data into a different format, cluster-based parallelism seems to outperform parallel databases"[4]. Their reasoning is based on [5] and [6]; however, the authors of [6] acknowledge that they cannot verify the claim that Hadoop would have outperformed the parallel database systems if only it had more nodes. That is, having benchmarked Hadoop's MapReduce with 100 nodes against two parallel database systems, they found that both systems outperformed Hadoop:

First, at the scale of the experiments we conducted, both parallel database systems displayed a significant performance advantage over Hadoop MR in executing a variety of data intensive analysis benchmarks. Averaged across all five tasks at 100 nodes, DBMS-X was 3.2 times faster than MR and Vertica was 2.3 times faster than DBMS-X. While we cannot verify this claim, we believe that the systems would have the same relative performance on 1,000 nodes (the largest Teradata configuration is less than 100 nodes managing over four petabytes of data).

    3 Running the Apache Benchmark

The experiment follows in the footsteps of the Pig benchmarks3 published by the Apache Foundation on 11/07/07[7]. Their objective was to have baseline numbers to compare against before making major changes to the system.

3 With the exception of co-grouping.

    3.1 Test Data

We decided to benchmark the execution of load, arithmetic, group, join and filter operations on 6 datasets (as opposed to just two):

    Dataset size 1: 30,000 records (772KB)

    Dataset size 2: 300,000 records (6.4MB)

Dataset size 3: 3,000,000 records (63MB)

Dataset size 4: 30 million records (628MB)

    Dataset size 5: 300 million records (6.2GB)

    Dataset size 6: 3 billion records (62GB)

That is, our datasets scale linearly, whereby the size of dataset n equates to 3,000 * 10^n records.

A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows:

name - string
marks - integer
gpa - float

The data was generated using the generate_data.pl perl script available for download on the Apache website[7], and produced tab-delimited text files with the following schema:

name - string
age - integer
gpa - float
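As an indication of how this generated data is consumed, the following Pig Latin sketch loads a dataset with the schema above and applies one plausible form of the benchmarked arithmetic operation; the file and output names are hypothetical, and the actual benchmark scripts may differ.

    -- Main dataset: tab-delimited (name, age, gpa), as produced by generate_data.pl.
    data = LOAD 'dataset_3000000' AS (name:chararray, age:int, gpa:float);
    -- A simple arithmetic projection of the kind the benchmark times.
    arith = FOREACH data GENERATE age * 2, gpa + 1.0f;
    STORE arith INTO 'arithmetic_output';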

It should be noted that the experiment differs slightly from the original in that the original used only two datasets, of 200 million records (200MB) and 10 thousand records (10KB), whereas our experiment consists of six separate datasets with a scaling factor of 10 (i.e. 30,000 records, 300,000 records, etc.).

    3.2 Test Setup

The benchmarks were run on a cluster consisting of 6 nodes (1 dedicated to Name Node and Jo