Hive and Pig

  • Hive and Pig!

    Juliana Freire
    New York University

    Some slides from J. Lin

  • Need for High-Level Languages!

    Hadoop is great for large-data processing!
    But writing Java programs for everything is verbose and slow.
    Not everyone wants to (or can) write Java code.

    Solution: develop higher-level data processing languages.
    Hive: HQL is like SQL.
    Pig: Pig Latin is a bit like Perl.

  • Hive and Pig!

    Hive: data warehousing application in Hadoop.
    Query language is HQL, a variant of SQL.
    Tables are stored on HDFS as flat files.
    Developed by Facebook, now open source.

    Pig: large-scale data processing system.
    Scripts are written in Pig Latin, a dataflow language.
    Developed by Yahoo!, now open source.
    Roughly 1/3 of all Yahoo! internal jobs.

    Common idea: provide a higher-level language to facilitate large-data processing.
    The higher-level language compiles down to Hadoop jobs.
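
    For instance, a one-line HiveQL aggregation compiles down to a complete map-reduce job, with the projection done on the map side and the count on the reduce side. A minimal sketch (the pageviews table and its columns are hypothetical, not from the slides):

    -- Hypothetical table: pageviews(url STRING, ts STRING)
    SELECT url, COUNT(1) AS views
    FROM pageviews
    GROUP BY url;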

  • Hive: Background!

    Started at Facebook.
    Data was collected by nightly cron jobs into an Oracle DB; ETL via hand-coded Python.
    Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that.

    Source: cc-licensed slide by Cloudera

    [Diagram from slide: OLTP, ETL (Extract, Transform, and Load), OLAP, Hadoop]

  • Hive Components!

    Shell: allows interactive queries
    Driver: session handles, fetch, execute
    Compiler: parse, plan, optimize
    Execution engine: DAG of stages (MR, HDFS, metadata); see the EXPLAIN sketch below
    Metastore: schema, location in HDFS, SerDe

    Source: cc-licensed slide by Cloudera
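
    The plan produced by the compiler can be inspected from the shell with EXPLAIN, which prints the DAG of stages the execution engine will run. A minimal sketch (the pageviews table is hypothetical):

    EXPLAIN
    SELECT url, COUNT(1)
    FROM pageviews
    GROUP BY url;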

    Custom map and reduce scripts read rows from standard input and write out rows to standard output. This flexibility does come at a cost of converting rows from and to strings.

    We omit more details due to lack of space. For a complete description of HiveQL see the language manual [5].
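
    Hive exposes this row-streaming interface through constructs like TRANSFORM (and the MAP/REDUCE forms used below). A minimal sketch, assuming a hypothetical table t(line STRING) and using /bin/cat as an identity script: each row is serialized to a string on the script's standard input, and its standard output is parsed back into typed rows.

    SELECT TRANSFORM (line)
    USING '/bin/cat' AS (line STRING)
    FROM t;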

    2.3 Running Example: StatusMeme

    We now present a highly simplified application, StatusMeme, inspired by Facebook Lexicon [6]. When Facebook users update their status, the updates are logged into flat files in an NFS directory /logs/status_updates which are rotated every day. We load this data into Hive on a daily basis into a table status_updates(userid int, status string, ds string)

    using a load statement like the one below.

    LOAD DATA LOCAL INPATH '/logs/status_updates'
    INTO TABLE status_updates
    PARTITION (ds='2009-03-20')

    Each status update record contains the user identifier (userid), the actual status string (status), and the date (ds) when the status update occurred. This table is partitioned on the ds column. Detailed user profile information, like the gender of the user and the school the user is attending, is available in the profiles(userid int, school string, gender int) table.
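
    The paper does not show the corresponding table definitions; a plausible HiveQL sketch follows (note that in Hive the partition column ds is declared separately from the data columns):

    CREATE TABLE status_updates (userid INT, status STRING)
    PARTITIONED BY (ds STRING);

    CREATE TABLE profiles (userid INT, school STRING, gender INT);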

    We first want to compute daily statistics on the frequency of status updates based on gender and the school the user attends. The following multi-table insert statement generates the daily counts of status updates by school (into school_summary(school string, cnt int, ds string)) and gender (into gender_summary(gender int, cnt int, ds string)) using a single scan of the join of the status_updates and profiles tables. Note that the output tables are also partitioned on the ds column, and HiveQL allows users to insert query results into a specific partition of the output table.

    FROM (SELECT a.status, b.school, b.gender
          FROM status_updates a JOIN profiles b
               ON (a.userid = b.userid AND
                   a.ds = '2009-03-20')
         ) subq1
    INSERT OVERWRITE TABLE gender_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
    INSERT OVERWRITE TABLE school_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.school, COUNT(1) GROUP BY subq1.school
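
    Once the inserts complete, each day's counts can be read back with a scan that touches only the matching partition, e.g. (a usage sketch, not from the paper):

    SELECT school, cnt
    FROM school_summary
    WHERE ds = '2009-03-20';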

    Next, we want to display the ten most popular memes per school as determined by status updates by users who attend that school. We now show how this computation can be done using HiveQL's map-reduce constructs. We parse the result of the join between the status_updates and profiles tables by plugging in a custom Python mapper script meme-extractor.py, which uses sophisticated natural language processing techniques to extract memes from status strings. Since Hive does not yet support the rank aggregation function, the top 10 memes per school can then be computed by a simple custom Python reduce script top10.py. (The DISTRIBUTE BY and SORT BY clauses below control how the grouped rows are partitioned and ordered across reducers before top10.py consumes them.)

    REDUCE subq2.school, subq2.meme, subq2.cnt
    USING 'top10.py' AS (school, meme, cnt)
    FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
          FROM (MAP b.school, a.status
                USING 'meme-extractor.py' AS (school, meme)
                FROM status_updates a JOIN profiles b
                     ON (a.userid = b.userid)
               ) subq1
          GROUP BY subq1.school, subq1.meme
          DISTRIBUTE BY school, meme
          SORT BY school, meme, cnt DESC
         ) subq2;

    Figure 1: Hive Architecture

    3. HIVE ARCHITECTURE

    Figure 1 shows the major components of Hive and its interactions with Hadoop. The main components of Hive are:

    External Interfaces - Hive provides both user interfaces like command line (CLI) and web UI, and application programming interfaces (API) like JDBC and ODBC.

    The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift [8] is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.

    The Metastore is the system catalog. All other components of Hive interact with the metastore. For more details see Section 3.1.

    The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution. On receiving the HiveQL statement from the Thrift server or other interfaces, it creates a session handle which is later used to keep track of statistics like execution time and number of output rows.

  • Data Model!

    Tables: analogous to tables in an RDBMS.
    Typed columns (int, float, string, boolean).
    Structs: {a INT; b INT}.
    Also lists, arrays, and maps (for JSON-like data).

    Partitions: for example, range-partition tables by date.

    Buckets: hash partitions within ranges (useful for sampling and join optimization); see the sketch below.

    Source: cc-licensed slide by Cloudera
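
    A sketch tying these together (the page_views table and its columns are hypothetical): struct, map, and array columns, a date partition, and hash buckets on userid.

    CREATE TABLE page_views (
      userid INT,
      viewer STRUCT<age: INT, gender: INT>,
      props  MAP<STRING, STRING>,
      urls   ARRAY<STRING>
    )
    PARTITIONED BY (ds STRING)
    CLUSTERED BY (userid) INTO 32 BUCKETS;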

    Hive - A Warehousing Solution Over a Map-Reduce Framework

    Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy

    Facebook Data Infrastructure Team

    1. INTRODUCTION

    The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.

    In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization. In Facebook, the Hive warehouse contains several thousand tables with over 700 terabytes of data and is being used extensively for both reporting and ad-hoc analyses by more than 100 users.

    The rest of the paper is organized as follows. Section 2 describes the Hive data model and the HiveQL language with an example. Section 3 describes the Hive system architecture and an overview of the query life cycle. Section 4 provides a walk-through of the demonstration. We conclude with future work in Section 5.

    2. HIVE DATABASE

    2.1 Data Model

    Data in Hive is organized into:

    Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory. The data in a table is serialized and stored in files within that directory. Users can associate tables with the serialization format of the underlying data. Hive provides builtin serialization formats which exploit compression and lazy de-serialization. Users can also add support for new data formats by defining custom serialize and de-serialize methods (called SerDes) written in Java. The serialization format of each table is stored in the system catalog and is automatically used by Hive during query compilation and execution. Hive also supports external tables on data stored in HDFS, NFS or local directories.
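
    As an illustration of these serialization options (the table name, delimiter, and path are hypothetical), a delimited external table over existing HDFS files can be declared as:

    CREATE EXTERNAL TABLE raw_logs (line STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw_logs';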


    Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US will be stored in files within the directory /wh/T/ds=20090101/ctry=US.
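
    One practical consequence is partition pruning: a query that filters on the partition columns only reads the matching sub-directories rather than the whole table. A sketch using T, ds, and ctry from the text above:

    SELECT *
    FROM T
    WHERE ds = '20090101' AND ctry = 'US';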