©2012, Cognizant Data Warehouse and Query Language for Hadoop August 2013 By Someshwar Kale

Learning Apache HIVE - Data Warehouse and Query Language for Hadoop

HIVE

Data Warehousing Solution built on top of Hadoop

Provides a SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are the target audience

Early Hive development work started at Facebook in 2007.

Today, Facebook counts 29% of its employees (and growing!) as Hive users.

https://www.facebook.com/note.php?note_id=114588058858

Today Hive is a top-level Apache project (originally a Hadoop subproject)
– http://hive.apache.org

Hive Provides

• Ability to bring structure to various data formats

• Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data

• Access to files on various data stores such as HDFS and HBase

Hive

Hive does NOT provide low-latency or real-time queries.

Even querying small amounts of data may take minutes.

Designed for scalability and ease of use rather than low-latency responses.

Hive

Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster.
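To see this translation for a particular statement, a query can be prefixed with HiveQL's EXPLAIN keyword, which prints the plan (including the stages that run as MapReduce jobs) instead of executing it. A minimal sketch, assuming the posts table that is loaded elsewhere in these slides already exists. The `--` comment lines are intended for a script file run with `hive -f`, since older CLIs may not accept them when pasted interactively.

```sql
-- Print the execution plan instead of running the query.
-- "posts" is assumed to exist in the current database
EXPLAIN
SELECT count(*) FROM posts;
```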

Hive Metastore

To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
– Packaged with Derby, a lightweight embedded SQL database
– The default Derby-based metastore is good for evaluation and testing
– The schema is not shared between users, as each user has their own instance of embedded Derby
– Stored in the metastore_db directory, which resides in the directory that Hive was started from

• Can easily switch to another SQL installation such as MySQL

Metastore Deployment Modes: Embedded Mode

Default metastore deployment mode for CDH.

Both the database and the metastore service run embedded in the main HiveServer process

Both are started for you when you start the HiveServer process.

Supports only one active user at a time and is not certified for production use.

Metastore Deployment Modes: Local Mode

Hive metastore service runs in the same process as the main HiveServer process.

The metastore database runs in a separate process, and can be on a separate host.

The embedded metastore service communicates with the metastore database over JDBC.

Metastore Deployment Modes: Remote Mode

Hive Architecture

Hive Interface Options

• Command Line Interface (CLI)
– Used exclusively in these slides

• Hive Web Interface
– https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface

• Java Database Connectivity (JDBC)
– https://cwiki.apache.org/confluence/display/Hive/HiveClient

• Beeline for HiveServer2 (new in CDH4)
– http://sqlline.sourceforge.net/#manual

Data Types

[cts318692@aster4 ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/cts318692/hive_job_log_cts318692_201308071622_2005272769.txt
hive>

Launch the Hive Command Line Interface (CLI)

Location of the session’s log file

hive> !cat data/user-posts.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive>

Local commands can be executed within the CLI by placing the command between ! and ;

Complex Data Types

Check physical storage of hive

[cts318692@aster4 ~]$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true

This is the location where Hive stores its data.

Creating DataBase

hive> CREATE DATABASE IF NOT EXISTS som COMMENT 'my database'
    > LOCATION '/user/cts318692/someshwar/hivestore/'
    > WITH DBPROPERTIES ('creator'='someshwar kale','date'='2013-06-08');
OK
Time taken: 0.046 seconds

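• IF NOT EXISTS: suppresses the error Hive would otherwise raise if the database already exists

• som: the database name. Hive opens the default database when you open a new session

• LOCATION: overrides the default ‘/user/hive/warehouse’ location for the new directory. This is the physical storage for the som database

• DBPROPERTIES: database properties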
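Once the database exists, a session can inspect what was recorded for it and switch to it. A short sketch using the som database created above; these are standard HiveQL commands, and their output depends on the installation.

```sql
-- List databases whose names start with "so"
SHOW DATABASES LIKE 'so*';

-- Show the comment, location, and DBPROPERTIES recorded for som
DESCRIBE DATABASE EXTENDED som;

-- Make som the working database for the rest of the session
USE som;
```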

Exploring Data

STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>

For complex data types: maps, arrays, and structs

field

Creating Table

• Element separator for complex data types (maps, arrays, structs)

• Separator between a map key and its value, e.g. ‘key’ ^C ‘value’ (\003 = Ctrl-C = ^C)

• Column separator definition
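The slide's own statement did not survive in the text, so the following is only a sketch of a table definition that uses all three separators. The table name som.employees and the STRUCT come from the surrounding slides; the other columns are assumptions for illustration, and \001 and \002 are Hive's default field and collection separators.

```sql
-- Columns other than address are assumed for illustration
CREATE TABLE IF NOT EXISTS som.employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  -- column separator
  FIELDS TERMINATED BY '\001'
  -- separator between elements of an array, map, or struct
  COLLECTION ITEMS TERMINATED BY '\002'
  -- separator between a map key and its value
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```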

hive> DESCRIBE FORMATTED som.employees;

Creating External Table

CREATE TABLE ... LIKE

If you omit the EXTERNAL keyword and the original table is external, the new table will also be external.

If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.
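The last case can be sketched as follows: the source table is assumed to be the managed table som.employees, while the new table name and the HDFS path are hypothetical. Only the schema is copied, not the data.

```sql
-- EXTERNAL makes the copy external even though the source is managed.
-- LOCATION is optional and shown here only for illustration
CREATE EXTERNAL TABLE IF NOT EXISTS som.employees_ext
LIKE som.employees
LOCATION '/user/cts318692/someshwar/employees_ext';
```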

Select Clause

Describe External Table

Dropping DataBase and Table

By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.
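Both routes might look like this. The table name employees_ext is a hypothetical example; som is the database created earlier in these slides. Note that dropping a managed table deletes its data, while dropping an external table removes only the metadata.

```sql
-- Drop a single table. IF EXISTS avoids an error when it is absent
DROP TABLE IF EXISTS som.employees_ext;

-- Drop the database together with every table it still contains
DROP DATABASE IF EXISTS som CASCADE;
```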

Partitions

To increase performance, Hive has the capability to partition data
– The values of the partition columns divide a table into segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes but not as granular

Partitions have to be properly created by users
– When inserting data, you must specify a partition

At query time, whenever appropriate, Hive will automatically filter out partitions

Creating Partitioned Table

Partition the table based on the values of country and state
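A minimal sketch of such a table, assuming a hypothetical table name employees_part, two illustrative data columns, and comma-separated input files. The partition columns are declared in PARTITIONED BY rather than in the regular column list.

```sql
-- country and state become partition columns and are
-- stored as directory names, not inside the data files
CREATE TABLE IF NOT EXISTS som.employees_part (
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```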

Loading data to table

LOAD DATA LOCAL ... copies the local data to the final location in the distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves the data to the final location.

The PARTITION clause is necessary if the table into which we are loading the data is partitioned. This is known as static partitioning, as we provide the partition values in the query.

Partitions are physically stored under separate directories.
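A static-partition load might look like the sketch below. The local file path and the table employees_part (partitioned by country and state) are hypothetical; with these values the data would land in a country=US/state=CA subdirectory of the table's directory.

```sql
USE som;

-- Static partitioning: the partition values are spelled out in the statement
LOAD DATA LOCAL INPATH 'data/employees-us-ca.txt'
OVERWRITE INTO TABLE employees_part
PARTITION (country = 'US', state = 'CA');

-- List the partitions that now exist for the table
SHOW PARTITIONS employees_part;
```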

Schema Violations

hive> LOAD DATA LOCAL INPATH
    > 'data/user-posts-inconsistentFormat.txt'
    > OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds

hive> select * from posts;
OK
user1 Funny Story 1343182026191
user2 Cool Deal NULL
user4 Interesting Post 1343182154633
user5 Yet Another Blog 13431839394
Time taken: 0.136 seconds

NULL is set for any value that violates the pre-defined schema.

External Partitioned Tables

There is no difference in syntax

• When a partition column is specified in the WHERE clause, entire directories/partitions can be ignored
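For example, a query that filters on the partition columns is written like any other query. The sketch below assumes a hypothetical table employees_part partitioned by country and state.

```sql
USE som;

-- country and state are partition columns, so only the
-- country=US/state=CA directory has to be read
SELECT name, salary
FROM employees_part
WHERE country = 'US' AND state = 'CA';
```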

Bucketing

• Break data into a set of buckets based on a hash function of a "bucket column"
– Capability to execute queries on a random subset of the data

• Hive doesn’t automatically enforce bucketing
– The user is required to specify the number of buckets by setting the number of reducers

hive> set mapred.reduce.tasks = 256;
OR
hive> set hive.enforce.bucketing = true;

Either manually set the number of reducers to be the number of buckets, or use ‘hive.enforce.bucketing’, which will set it on your behalf.

Create and Use Table with Buckets
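The slide's statements are not preserved in the text, so this is a sketch under assumptions: the table names posts_bucketed and posts_staging and their columns are hypothetical, and the bucket count of 256 matches the reducer setting used in the bucketing notes.

```sql
USE som;

-- Rows are assigned to one of 256 buckets by hashing user_id
CREATE TABLE IF NOT EXISTS posts_bucketed (
  user_id   STRING,
  post      STRING,
  post_time BIGINT
)
CLUSTERED BY (user_id) INTO 256 BUCKETS;

-- Let Hive pick the reducer count that matches the bucket count
SET hive.enforce.bucketing = true;

-- posts_staging is an assumed non-bucketed table with the same columns
INSERT OVERWRITE TABLE posts_bucketed
SELECT user_id, post, post_time FROM posts_staging;

-- Sample the data by reading a single bucket
SELECT * FROM posts_bucketed TABLESAMPLE (BUCKET 1 OUT OF 256 ON user_id);
```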

ALTER TABLE

Partition columns are not deleted
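The ALTER TABLE statements from these slides are not preserved, so the following sketch only illustrates common forms. It assumes a hypothetical table employees_part partitioned by country and state; the added column is also hypothetical. REPLACE COLUMNS restates only the regular columns, so the partition columns remain, which seems to be the point of the note above.

```sql
USE som;

-- Register a new, empty partition
ALTER TABLE employees_part
ADD IF NOT EXISTS PARTITION (country = 'US', state = 'NY');

-- Append a column, then rename it
ALTER TABLE employees_part ADD COLUMNS (hire_date STRING);
ALTER TABLE employees_part CHANGE COLUMN hire_date hired_on STRING;

-- Redefine the regular columns. country and state are untouched
ALTER TABLE employees_part REPLACE COLUMNS (name STRING, salary FLOAT);

-- Remove a partition. For a managed table this also deletes its data
ALTER TABLE employees_part
DROP IF EXISTS PARTITION (country = 'US', state = 'NY');
```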

Inserting Data into Tables from Queries

Dynamic Partition Inserts
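In contrast to static partitioning, a dynamic partition insert lets Hive derive the partition values from the query results. The slide's example is not preserved, so this is a sketch: employees_part (partitioned by country and state) and employees_staging are hypothetical tables.

```sql
USE som;

-- Enable dynamic partitioning (the default differs between Hive versions).
-- nonstrict mode allows every partition column to be dynamic
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition columns come last in the SELECT list,
-- in the order given in the PARTITION clause
INSERT OVERWRITE TABLE employees_part
PARTITION (country, state)
SELECT name, salary, country, state
FROM employees_staging;
```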

Exporting Data

Functions

Table generating functions

Return zero to many rows, one row for each element of the input array.

When a UDTF is used directly in the SELECT clause, it must be the only expression there.
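A minimal sketch with the built-in explode() function, assuming som.employees has an ARRAY<STRING> column named subordinates (an assumption; the slide's own query is not preserved).

```sql
-- One output row per element of the subordinates array.
-- The UDTF is the only expression in the SELECT list
SELECT explode(subordinates) AS subordinate
FROM som.employees;
```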

LIMIT clause

CASE … WHEN … THEN Statements

WHERE and GROUP BY ... HAVING clause

Joins

Outer Join

Points to remember

Only equality joins are allowed.

More than two tables can be joined in the same query, e.g.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

is a valid join.

Hive converts a join over multiple tables into a single map/reduce job if, for every table, the same column is used in the join clauses:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

is converted into a single map/reduce job because only the key1 column of b is involved in the join, whereas

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

is converted into two map/reduce jobs because the key1 column from b is used in the first join condition and the key2 column from b is used in the second one.

ORDER BY and SORT BY

ORDER BY uses a single reducer to sort the data, which may take an unacceptably long time to execute for larger data sets.

Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted.
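The difference can be sketched with two queries over som.employees, assuming it has name and salary columns. DISTRIBUTE BY is added to the second query to control which reducer each row goes to; without it, rows are spread across reducers arbitrarily.

```sql
-- Total ordering: all rows pass through one reducer
SELECT name, salary
FROM som.employees
ORDER BY salary DESC;

-- Local ordering: each reducer sorts only its own rows.
-- DISTRIBUTE BY sends rows with the same name to the same reducer
SELECT name, salary
FROM som.employees
DISTRIBUTE BY name
SORT BY name ASC, salary DESC;
```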

Casting

What happens if a salary value is not a valid string for a floating-point number? In this case, Hive returns NULL.
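A sketch of such a cast, assuming a hypothetical table employees_raw in which salary is stored as a STRING.

```sql
-- Rows whose salary cannot be parsed yield NULL from the cast,
-- so the comparison is not true and those rows are filtered out
SELECT name, salary
FROM employees_raw
WHERE cast(salary AS FLOAT) < 100000.0;
```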

UNION ALL and Nested select

Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position.
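A sketch that combines both ideas, assuming the tables employees and employees2 each have name and salary columns of matching types. In the Hive 0.10 release used in these slides, UNION ALL has to appear inside a nested SELECT.

```sql
-- Both branches produce (name, salary) with the same types.
-- The outer query filters the combined rows
SELECT u.name, u.salary
FROM (
  SELECT e1.name AS name, e1.salary AS salary FROM employees e1
  UNION ALL
  SELECT e2.name AS name, e2.salary AS salary FROM employees2 e2
) u
WHERE u.salary > 80000;
```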

View

• Creating a view is similar to writing a function in a programming language: a query is given a name so that it can be reused.

• Views are virtual: they store no data.
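A minimal sketch, assuming a table employees with name and salary columns in the som database; the view name is hypothetical.

```sql
USE som;

-- Name a frequently used query. No data is stored for the view
CREATE VIEW IF NOT EXISTS well_paid_employees AS
SELECT name, salary
FROM employees
WHERE salary > 80000;

-- The view is queried like a table
SELECT name FROM well_paid_employees LIMIT 10;
```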

Lateral view

Lateral view is used in conjunction with user-defined table generating functions such as explode().

A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.

Syntax: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Lateral view Example
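The slide's example is not preserved in the text; the sketch below assumes som.employees has a name column and an ARRAY<STRING> column named subordinates. Unlike a bare UDTF in the SELECT clause, a lateral view allows other columns to be selected alongside the exploded values.

```sql
-- Pair each employee with each of their subordinates.
-- subs is the table alias and subordinate the column alias
SELECT e.name, subs.subordinate
FROM som.employees e
LATERAL VIEW explode(e.subordinates) subs AS subordinate;
```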

UDF

Hive uses reflection to find methods named evaluate whose parameters match the arguments used in the HiveQL function call.

Hive can work with both the Hadoop Writables and the Java primitives, but it’s recommended to work with the Writables since they can be reused.

The argument types used in the HiveQL call must match the parameter types of an evaluate method. The return type is the one declared by that method and may differ from the argument types.
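On the HiveQL side, a compiled UDF is registered and called roughly as sketched below. The jar path, class name, and function name are all hypothetical placeholders; the class is assumed to extend Hive's UDF base class and to define an evaluate method that accepts a string.

```sql
-- Make the jar containing the UDF class available to the session
ADD JAR /path/to/my-hive-udfs.jar;

-- Bind a function name to the class for the current session only
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';

-- Hive looks up the evaluate method that matches a STRING argument
SELECT to_upper(name) FROM som.employees LIMIT 5;
```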

UDF vs. GenericUDF

BETWEEN operator

hive> select name,salary from employees2 where salary between 80000 and 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
....
OK
John Doe 100000.0
John Doe 100000.0
Mary Smith 80000.0
Mary Smith 80000.0
Time taken: 14.39 seconds

Both values (lower and upper) are inclusive.

HiveServer2

As of CDH4.1, you can deploy HiveServer2, an improved version of HiveServer that supports a new Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency.

There is also a new CLI for HiveServer2 named Beeline.

HiveServer2
– Connection URL: jdbc:hive2://<host>:<port>
– Driver class: org.apache.hive.jdbc.HiveDriver

HiveServer1
– Connection URL: jdbc:hive://<host>:<port>
– Driver class: org.apache.hadoop.hive.jdbc.HiveDriver

BEELINE

$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>

Connecting database using properties file


Thank You