Hive – SQL on top of Hadoop. Content Background Concepts Hive Architecture Examples

Hive – SQL on top of Hadoop

Content

Background

Concepts

Hive Architecture

Examples

Background

Version

◦ 2008: Apache Hive
◦ 2009/4/29: Stable version 0.3.0
◦ 2013/1/11: 0.10.0
◦ 2014/11/12: 0.14.0
◦ 2015/2/4: 1.0.0 (0.14.1)
◦ 2015/5/18: 1.2.0

Concepts

Map-Reduce and SQL

Map-Reduce
◦ Map-Reduce is scalable

SQL
◦ SQL has a huge user base
◦ SQL is easy to code

Solution: Combine SQL and Map-Reduce
◦ Hive on top of Hadoop (open source)
◦ Aster Data (proprietary)
◦ Greenplum (proprietary)

What is Hive

A database/data warehouse on top of Hadoop
◦ Rich data types
◦ Efficient implementations of SQL on top of Map-Reduce

Allows users to access Hive data without using Hive.

Supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem.

Provides an SQL-like language called HiveQL with schema on read.

Converts queries to Map-Reduce, Apache Tez, or Spark jobs.

What Hive Is NOT

Hive aims to provide acceptable (but not optimal) latency for interactive data browsing.

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).

Data Units

Databases
◦ Namespaces that keep tables and other data units free of naming conflicts.

Tables
◦ Homogeneous units of data which share the same schema. An example is a page_views table, where each row consists of the following columns (schema):
◦ timestamp - of INT type; a Unix timestamp of when the page was viewed.
◦ userid - of BIGINT type; identifies the user who viewed the page.
◦ page_url - of STRING type; captures the location of the page.
◦ referrer_url - of STRING type; captures the location of the page from which the user arrived at the current page.
◦ IP - of STRING type; captures the IP address from which the page request was made.

Data Units

Partitions
◦ Each table can have one or more partition keys, which determine how the data is stored.
◦ Apart from being storage units, partitions also allow the user to efficiently identify the rows that satisfy certain criteria.
◦ For example, a table can have a date partition of type STRING and a country partition of type STRING. Each unique combination of partition key values defines a partition of the table.
◦ create table partition_test (userid int, page_url string, refer_url string, IP string) partitioned by (date string, country string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
◦ alter table partition_test add partition (date='2009-12-23', country='US');
◦ HDFS: /user/hive/warehouse/partition_test/date=2009-12-23/country=US
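The mapping from partition key values to a storage directory can be sketched in a few lines of Python. This is only an illustration of the layout shown in the HDFS path above; the helper name `partition_path` is hypothetical and is not part of Hive.

```python
def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory of one partition: one key=value level per partition key."""
    levels = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([warehouse_dir.rstrip("/"), table] + levels)


# Partition keys are listed in the order they were declared on the table.
path = partition_path(
    "/user/hive/warehouse",
    "partition_test",
    [("date", "2009-12-23"), ("country", "US")],
)
print(path)
```

Because every partition lives in its own directory, a query that filters on the partition keys only needs to read the matching directories.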

Data Units

Buckets
◦ Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table.
◦ For example, the page_views table may be bucketed by userid, which is one of the columns of the page_views table other than the partition columns. Buckets can be used to sample the data efficiently.
◦ HDFS: /user/hive/warehouse/partition_test/date=2009-12-23/country=US/part_00001
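A minimal Python sketch of hash bucketing follows. It assumes that the hash of an integer key is the integer itself, which is a simplification made for illustration; `bucket_for` is a hypothetical helper, and the user ids are made up.

```python
def bucket_for(userid, num_buckets):
    # Assumption: the hash of an integer key is the integer itself.
    return userid % num_buckets


# Group a few sample user ids into 32 buckets.
rows_by_bucket = {}
for userid in [111, 222, 143]:
    rows_by_bucket.setdefault(bucket_for(userid, 32), []).append(userid)

print(rows_by_bucket)
```

All rows with the same userid land in the same bucket, which is what makes reading a subset of the buckets a usable sample.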

Type System

Primitive Types
◦ Integers
◦ TINYINT, SMALLINT, INT, BIGINT
◦ Boolean
◦ BOOLEAN
◦ Floating point numbers
◦ FLOAT, DOUBLE
◦ String type
◦ STRING

Complex Types
◦ Structs, Maps, Arrays

Built-in Operators and Functions

Built-in Operators
◦ Relational Operators
◦ =, !=, <, <=, >, >=, IS NULL, IS NOT NULL, LIKE, RLIKE/REGEXP
◦ Arithmetic Operators
◦ +, -, *, /, %, &, |, ^, ~
◦ Logical Operators
◦ AND/&&, OR/||, NOT/!
◦ Operators on Complex Types
◦ A[n], M[key], S.x

Built-in Functions
◦ Basic
◦ round, floor, ceil
◦ rand
◦ concat, substr, upper/ucase, lower/lcase, trim/ltrim/rtrim, regexp_replace
◦ size
◦ cast, from_unixtime, to_date, year, month, day, get_json_object
◦ Aggregation
◦ count, sum, avg, min, max

Usage and Examples

Creating Tables
◦ CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, friends ARRAY<BIGINT>, properties MAP<STRING, STRING>, ip STRING COMMENT 'IP Address of the User')
◦ COMMENT 'This is the page view table'
◦ PARTITIONED BY(date STRING, country STRING)
◦ CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
◦ ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
◦ STORED AS SEQUENCEFILE;

Browsing Tables and Partitions
◦ SHOW TABLES;
◦ SHOW TABLES 'page.*';
◦ SHOW PARTITIONS page_view;
◦ DESCRIBE page_view;
◦ DESCRIBE EXTENDED page_view;

Usage and Examples

Altering Tables
◦ ALTER TABLE old RENAME TO new;
◦ ALTER TABLE old REPLACE COLUMNS (c1 TYPE, …);
◦ ALTER TABLE old ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING COMMENT 'def val');

Dropping Tables and Partitions
◦ DROP TABLE pv_users;
◦ ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08');

Loading Data

Method 1
◦ CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING, country STRING)
◦ COMMENT 'This is the staging page view table'
◦ ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
◦ STORED AS TEXTFILE
◦ LOCATION '/user/data/staging/page_view';
◦ hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
◦ FROM page_view_stg pvs
◦ INSERT OVERWRITE TABLE page_view PARTITION(date='2008-06-08', country='US')
◦ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US';

Loading Data

Method 2 (load from the local file system)
◦ LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION (date='2008-06-08', country='US');

Method 3 (load from HDFS)
◦ LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION (date='2008-06-08', country='US');

Querying and Inserting Data

Simple Query
◦ Insert
◦ INSERT OVERWRITE TABLE user_active
◦ SELECT user.*
◦ FROM user
◦ WHERE user.active = 1;
◦ Select
◦ SELECT user.*
◦ FROM user
◦ WHERE user.active = 1;

Partition Based Query
◦ INSERT OVERWRITE TABLE xyz_com_page_views
◦ SELECT page_views.*
◦ FROM page_views
◦ WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND page_views.referrer_url like '%xyz.com';

Joins
◦ INSERT OVERWRITE TABLE pv_friends
◦ SELECT pv.*, u.gender, u.age, f.friends
◦ FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
◦ WHERE pv.date = '2008-03-03';


Aggregations
◦ Allowed (all DISTINCT aggregations use the same column)
◦ INSERT OVERWRITE TABLE pv_gender_agg
◦ SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
◦ FROM pv_users
◦ GROUP BY pv_users.gender;
◦ Not allowed (DISTINCT aggregations on different columns)
◦ INSERT OVERWRITE TABLE pv_gender_agg
◦ SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
◦ FROM pv_users
◦ GROUP BY pv_users.gender;


Multi Table/File Inserts
◦ FROM pv_users
◦ INSERT OVERWRITE TABLE pv_gender_sum
◦ SELECT pv_users.gender, count(DISTINCT pv_users.userid)
◦ GROUP BY pv_users.gender
◦ INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
◦ SELECT pv_users.age, count(DISTINCT pv_users.userid)
◦ GROUP BY pv_users.age;

Dynamic-Partition Insert
◦ FROM page_view_stg pvs
◦ INSERT OVERWRITE TABLE page_view PARTITION(date='2008-06-08', country)
◦ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;


Sampling
◦ Choose the 3rd bucket out of 32 buckets
◦ INSERT OVERWRITE TABLE pv_gender_sum_sample
◦ SELECT pv_gender_sum.*
◦ FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
◦ Choose the 3rd and 19th buckets out of 32 buckets
◦ TABLESAMPLE(BUCKET 3 OUT OF 16)
◦ Choose half of the 3rd bucket
◦ TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)
◦ The buckets are numbered starting from 0
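The bucket selection in the first two sampling examples can be reproduced with a short Python sketch. It rests on one reading that is consistent with those examples, namely that x in BUCKET x OUT OF y is a 1-based ordinal while bucket indices start at 0; this is an assumption, not a statement of Hive's implementation. The helper `sampled_buckets` is hypothetical and only covers the case where y divides the number of buckets (so it does not model the "half of a bucket" case with y = 64).

```python
def sampled_buckets(x, y, num_buckets):
    """Return the ordinals (1st, 2nd, ...) of the buckets picked by BUCKET x OUT OF y."""
    if num_buckets % y != 0:
        raise ValueError("this sketch only handles y that divides num_buckets")
    # Bucket indices start at 0; a bucket is picked when index mod y equals x - 1.
    return [index + 1 for index in range(num_buckets) if index % y == x - 1]


print(sampled_buckets(3, 32, 32))  # 3rd bucket only
print(sampled_buckets(3, 16, 32))  # 3rd and 19th buckets
```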

Hive Architecture

1. User issues SQL query

2. Hive parses and plans the query

3. Query is converted to Map-Reduce

4. Map-Reduce is run by Hadoop

Driver components: Compiler, Optimizer, Executor

Services

CLI
◦ Command Line Interface

HiveServer
◦ Allows a remote client to submit requests to Hive
◦ Exposes a Thrift interface
◦ Thrift is a framework for scalable cross-language services development; it combines a software stack with a code generation engine to build services that work efficiently across different languages
◦ HiveServer cannot handle concurrent requests from more than one client; it was removed starting in Hive 0.14.1 (1.0.0)
◦ HiveServer2 is a rewrite of HiveServer that addresses these problems, available starting with Hive 0.11.0

HWI
◦ Hive Web Interface

Driver

Compiler
◦ Parser
◦ Semantic Analyzer
◦ Logical Plan Generator
◦ Logical Optimizer
◦ Physical Plan Generator
◦ Physical Optimizer

Optimizer

Executor

Hive Architecture

SerDe

Built-in SerDes
◦ Avro
◦ ORC
◦ RegEx
◦ Thrift
◦ Parquet
◦ CSV

Third-party SerDes
◦ jsonserde.jar

Examples – SQL to Map-Reduce

Join

SQL:
◦ INSERT INTO TABLE pv_users
◦ SELECT pv.pageid, u.age
◦ FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view
pageid  userid  time
1       111     09:08:01
2       111     09:08:13
1       222     09:08:14

user
userid  age  gender
111     25   female
222     32   male

pv_users (= page_view X user)
pageid  age
1       25
2       25
1       32

Join – in Map Reduce

Input: page_view
pageid  userid  time
1       111     09:08:01
2       111     09:08:13
1       222     09:08:14

Input: user
userid  age  gender
111     25   female
222     32   male

Map (key = userid, value = <table tag, column value>)

From page_view:
key  value
111  <1, 1>
111  <1, 2>
222  <1, 1>

From user:
key  value
111  <2, 25>
222  <2, 32>

Shuffle Sort

Reducer 1:
key  value
111  <1, 1>
111  <1, 2>
111  <2, 25>

Reducer 2:
key  value
222  <1, 1>
222  <2, 32>

Reduce

Reducer 1 output (pv_users):
pageid  age
1       25
2       25

Reducer 2 output (pv_users):
pageid  age
1       32
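The join walkthrough above can be replayed with a small in-memory Python sketch. It only mimics the three phases on the slide's sample rows; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative and do not correspond to Hive or Hadoop APIs.

```python
from collections import defaultdict

page_view = [(1, 111, "09:08:01"), (2, 111, "09:08:13"), (1, 222, "09:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]


def map_phase():
    # Emit (join key, <table tag, column value>): tag 1 = page_view, tag 2 = user.
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)
    for userid, age, _gender in user:
        yield userid, (2, age)


def shuffle(pairs):
    # Group values by key and sort them, so each key's values arrive together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_phase(groups):
    # For each userid, pair every pageid (tag 1) with every age (tag 2).
    for _userid, values in groups.items():
        pageids = [payload for tag, payload in values if tag == 1]
        ages = [payload for tag, payload in values if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age


pv_users = list(reduce_phase(shuffle(map_phase())))
print(pv_users)
```

The table tag in each value is what lets the reducer tell page_view rows from user rows once they share a key.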

Group By

SQL:
◦ INSERT INTO TABLE pageid_age_sum
◦ SELECT pageid, age, count(1)
◦ FROM pv_users
◦ GROUP BY pageid, age;

pv_users
pageid  age
1       25
2       25
1       32
2       25

pageid_age_sum
pageid  age  count
1       25   1
2       25   2
1       32   1

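Group By – in Map Reduce

Input: pv_users (split 1)
pageid  age
1       25
2       25

Input: pv_users (split 2)
pageid  age
1       32
2       25

Map (key = <pageid, age>, value = 1)

From split 1:
key      value
<1, 25>  1
<2, 25>  1

From split 2:
key      value
<1, 32>  1
<2, 25>  1

Shuffle Sort

Reducer 1:
key      value
<1, 25>  1
<1, 32>  1

Reducer 2:
key      value
<2, 25>  1
<2, 25>  1

Reduce

Reducer 1 output (pageid_age_sum):
pageid  age  count
1       25   1
1       32   1

Reducer 2 output (pageid_age_sum):
pageid  age  count
2       25   2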
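A minimal Python sketch of the same group-by flow follows. Routing keys to reducers by pageid is an assumption chosen so that the reducer layout matches the slide; the names `map_split`, `partition`, and `run` are hypothetical.

```python
from collections import defaultdict

# The two input splits of pv_users, as (pageid, age) rows.
splits = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]


def map_split(rows):
    # Emit (<pageid, age>, 1) for every row.
    return [((pageid, age), 1) for pageid, age in rows]


def partition(key, num_reducers):
    # Assumption: route by pageid so the layout matches the slide.
    return (key[0] - 1) % num_reducers


def run(num_reducers=2):
    # Shuffle and reduce in one step: each reducer sums the 1s per key.
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for rows in splits:
        for key, one in map_split(rows):
            reducers[partition(key, num_reducers)][key] += one
    return [
        sorted((pageid, age, count) for (pageid, age), count in reducer.items())
        for reducer in reducers
    ]


result = run()
print(result)
```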

Group By with Distinct

SQL:
◦ SELECT pageid, COUNT(DISTINCT userid)
◦ FROM page_view GROUP BY pageid

page_view
pageid  userid  time
1       111     09:08:01
2       111     09:08:13
1       222     09:08:14
2       111     09:08:20

result
pageid  count_distinct_userid
1       2
2       1

Group By with Distinct – in Map Reduce

Input: page_view (split 1)
pageid  userid  time
1       111     09:08:01
2       111     09:08:13

Input: page_view (split 2)
pageid  userid  time
1       222     09:08:14
2       111     09:08:20

Shuffle Sort (key = <pageid, userid>)

Reducer 1:
key
<1, 111>
<1, 222>

Reducer 2:
key
<2, 111>
<2, 111>

Reduce

Reducer 1 output (result):
pageid  count_distinct_userid
1       2

Reducer 2 output (result):
pageid  count_distinct_userid
2       1
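The distinct count can be sketched in Python as follows. The idea illustrated is that once <pageid, userid> keys are sorted, a reducer can count distinct users by counting key changes. Partitioning by pageid is an assumption made to match the slide, and all function names are hypothetical.

```python
splits = [
    [(1, 111, "09:08:01"), (2, 111, "09:08:13")],
    [(1, 222, "09:08:14"), (2, 111, "09:08:20")],
]


def map_split(rows):
    # Emit the composite key <pageid, userid>; no value is needed.
    return [(pageid, userid) for pageid, userid, _time in rows]


def shuffle_sort(keys, num_reducers=2):
    # Assumption: partition by pageid so one reducer sees all users of a page.
    buckets = [[] for _ in range(num_reducers)]
    for pageid, userid in keys:
        buckets[(pageid - 1) % num_reducers].append((pageid, userid))
    return [sorted(bucket) for bucket in buckets]


def reduce_distinct(sorted_keys):
    # Duplicates are adjacent after sorting, so count only key changes.
    counts = {}
    previous = None
    for key in sorted_keys:
        if key != previous:
            counts[key[0]] = counts.get(key[0], 0) + 1
        previous = key
    return counts


keys = [key for rows in splits for key in map_split(rows)]
result = [reduce_distinct(bucket) for bucket in shuffle_sort(keys)]
print(result)
```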

Order By

SQL:
◦ SELECT * FROM page_view
◦ ORDER BY time;

page_view (input)
pageid  userid  time
2       111     09:08:13
1       111     09:08:01
2       111     09:08:20
1       222     09:08:14

page_view (ordered by time)
pageid  userid  time
1       111     09:08:01
2       111     09:08:13
1       222     09:08:14
2       111     09:08:20

Order By – in Map Reduce

Input: page_view (split 1)
pageid  userid  time
2       111     09:08:13
1       111     09:08:01

Input: page_view (split 2)
pageid  userid  time
2       111     09:08:20
1       222     09:08:14

Shuffle Sort

key       value
<1, 111>  09:08:01
<2, 111>  09:08:13

key       value
<1, 222>  09:08:14
<2, 111>  09:08:20

Reduce

Output: page_view (ordered by time)
pageid  userid  time
1       111     09:08:01
2       111     09:08:13
1       222     09:08:14
2       111     09:08:20

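Sort By – in Map Reduce

Input: page_view (split 1)
pageid  userid  time
2       111     09:08:13
1       111     09:08:01

Input: page_view (split 2)
pageid  userid  time
2       111     09:08:20
1       222     09:08:14

Shuffle Sort

Reducer 1:
key       value
<1, 111>  09:08:01
<1, 222>  09:08:14

Reducer 2:
key       value
<2, 111>  09:08:13
<2, 111>  09:08:20

Reduce

Reducer 1 output (page_view):
pageid  userid  time
1       111     09:08:01
1       222     09:08:14

Reducer 2 output (page_view):
pageid  userid  time
2       111     09:08:13
2       111     09:08:20

Each reducer's output is sorted, but there is no total order across reducers.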
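The difference between ORDER BY and SORT BY can be made concrete with a small Python sketch over the same page_view rows. The routing of rows to reducers by pageid is an assumption chosen to match the slide, and `order_by` and `sort_by` are hypothetical helpers rather than Hive functions.

```python
page_view = [
    (2, 111, "09:08:13"),
    (1, 111, "09:08:01"),
    (2, 111, "09:08:20"),
    (1, 222, "09:08:14"),
]


def by_time(row):
    return row[2]


def order_by(rows):
    # A single reducer sees every row, so the output has a total order.
    return sorted(rows, key=by_time)


def sort_by(rows, num_reducers=2):
    # Assumption: rows are routed to reducers by pageid, as on the slide.
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[(row[0] - 1) % num_reducers].append(row)
    # Each reducer sorts only the rows it received.
    return [sorted(reducer, key=by_time) for reducer in reducers]


print(order_by(page_view))
print(sort_by(page_view))
```

Concatenating the per-reducer outputs of `sort_by` does not give a globally sorted result, which is the trade-off for being able to use more than one reducer.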

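Merge Sequential Map Reduce Jobs

SQL:
◦ SELECT ……
◦ FROM (a join b on a.key = b.key) join c on a.key = c.key

Both joins use the same key, so the two sequential Map-Reduce jobs can be merged into one.

a
key  av
1    111

b
key  bv
1    222

Map Reduce (job 1): a join b
key  av   bv
1    111  222

c
key  cv
1    333

Map Reduce (job 2): (a join b) join c
key  av   bv   cv
1    111  222  333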
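A short Python sketch of the merged form: when every join uses the same key, one shuffle on that key can bring the rows of all three tables together, so a single pass produces the final rows. The function name `single_job_join` is hypothetical and the code is an in-memory illustration, not Hive's plan.

```python
from collections import defaultdict

a = [(1, 111)]
b = [(1, 222)]
c = [(1, 333)]


def single_job_join(a_rows, b_rows, c_rows):
    # One grouping step on the shared key, with one value list per table.
    groups = defaultdict(lambda: ([], [], []))
    for tag, table in enumerate((a_rows, b_rows, c_rows)):
        for key, value in table:
            groups[key][tag].append(value)
    # Emit every combination of a, b and c values that share a key.
    return [
        (key, av, bv, cv)
        for key, (avs, bvs, cvs) in sorted(groups.items())
        for av in avs
        for bv in bvs
        for cv in cvs
    ]


joined = single_job_join(a, b, c)
print(joined)
```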

Share Common Read Operations

Extended SQL:
◦ FROM pv_users
◦ INSERT INTO TABLE pv_pageid_sum
◦ SELECT pageid, count(1)
◦ GROUP BY pageid
◦ INSERT INTO TABLE pv_age_sum
◦ SELECT age, count(1)
◦ GROUP BY age;

pv_users (read once)
pageid  age
1       25
2       32

Map Reduce output: pv_pageid_sum
pageid  count
1       1
2       1

Map Reduce output: pv_age_sum
age  count
25   1
32   1
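The benefit of the multi-insert form is that the source table is scanned once for both aggregations. A minimal Python sketch of that idea, using the slide's sample rows and a hypothetical `multi_insert` helper:

```python
from collections import Counter

pv_users = [(1, 25), (2, 32)]


def multi_insert(rows):
    pageid_sum, age_sum = Counter(), Counter()
    # A single pass over the input feeds both aggregations.
    for pageid, age in rows:
        pageid_sum[pageid] += 1
        age_sum[age] += 1
    return sorted(pageid_sum.items()), sorted(age_sum.items())


pv_pageid_sum, pv_age_sum = multi_insert(pv_users)
print(pv_pageid_sum)
print(pv_age_sum)
```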
