View
1.850
Download
5
Category
Preview:
DESCRIPTION
Learn Apache HIve Step By Step
Citation preview
Apache Hive Tutorials HQL – Hive Query Language
Prashant Kumar Pandey
Learn more with us at http://www.thriveschool.blogspot.in/
We have learned creating hive tables and loading hive data into those tables. Next step should
be executing hive queries on those tables. I am not going to explain basics of SQL as I assume
you are already familiar with SQL for relational databases. If you do not know SQL, step back
and learn SQL basics before you start leaning hive query language. I am assuming that you have
gone through all of my previous articles on this blog so I will not cover those concepts which I
have already explained in other articles. If you are directly landing at this page first time, I
recommend you to start reading from beginning to learn hive. I have followed a logical
sequence for hive learning in this blog.
Editing HQL on hive CLI is troublesome. So we will write our HQL in a file and execute script
from Hive CLI, in case of any errors or changes, we go back to our script and change it there and
execute it again.
If you have already executed HQL on CLI, you must have noticed that Hive CLI throws lot of
messages before printing query results. I feel little lost in those messages and want to disable it
and see only query output on the screen. To do this we have to start hive in silent mode which
you can do using hive –S to start hive CLI instead of using only hive command.
Let’s test it. I have a file named q.hql which I have stored in my home directory. It has a simple
select statement as shown in below screen.
Now, I will start hive in silent mode using hive –S and execute this file using hive CLI as shown in
below screen.
Ok, now you know how to start hive in silent mode and how to execute hql from a source file.
Hive query language is also known as HQL. It mostly resembles with SQL syntaxes of MYSQL
database but there are lots of differences. I need a suitable table, loaded with appropriate data
to demonstrate HQL. For this purpose, I created an employee 100+ employee records. I have
shared data file for your reference and download here.
First step is to create an external table and map this data. I will not load it further into managed
table and omit partitioning as our objective is to focus on HQL. You have already learned
creating a partitioned managed table and loading data into it from external table. If you want,
you can do it and execute all HQLs demonstrated here on your partitioned hive managed
tables. You should not expect and differences in final output.
Learn more with us at http://www.thriveschool.blogspot.in/
Command to create and external table is shown below.
You need to execute this command in hive CLI and copy data file in HDFS at location /user/pkp/hql
before you start executing hql statements. Our table employee is mostly simple except three columns
contacts, address and job_history.
Contacts column is a map which means it can store multiple key-value pairs like office phone, email,
mobile etc. we will see it in more detail when we execute hql statements on them.
Address column is a struct with three members as city, street and zip. We will see how we can
manipulate them in hql.
job_history is the most complex one. It’s an array which means it can contain multiple values starting
with array index 0. Array and maps are similar in hive as they both can store multiple values. Difference
is how they represent values. Array will use index starting from 0 whereas map will use key to identify
value. If you don’t get it now, don’t worry, we you will get it well with examples shown below.
To make this array more complex and to showcase that hive support nesting of complex data types, I
made it an array of struc. We will see how we can work with nested data types.
Let’s execute first and most simple hql statement, selecting only columns with primitive data type i.e.
string and int in our case.
And output is as below.
Learn more with us at http://www.thriveschool.blogspot.in/
Simple, no explanation required. Working with primitive data types is simple and most of your SQL skills
will work here. Complex data types like maps, structure and array are the one we will focus and learn
how to work with them.
Let’s select a map type column i.e. contacts for these two records an output is shown below.
You will notice following.
1. Hive returns them using JSON notation. You will see it same for other complex data types aslo.
2. There are three map-key pairs for each record. These are office, email and mobile.
3. This clarifies why I mentioned maps are similar to array because we can see that single column
contacts actually contains three values. Does it not appear like an array of three elements?
4. How do I access an individual element of a map?
Let’s do it.
Okay, I used array notation to access single element but instead of using index number, I used key. I
have used where clause in above example just to emphasize that your SQL skills will work mostly in
hql. I will not mention such things further and you should notice it yourself.
Let’s select a struct column.
Learn more with us at http://www.thriveschool.blogspot.in/
How do I access an individual element of a struct? Check below.
Let’s select an array column.
Did you notice that there are two elements in the array? An entire record is enclosed in [ ]. Array
elements are enclosed in { }. If you look at the output there are two sets of { } separated by comma,
each representing one array element. Let’s select only first element of the array in next screen.
Now you understand that there were two elements in the array and looking at only one element is
simple as using array notation with index value.
Each element of the array is again a struct of three members as you can see in the above screen.
Let’s select only one member of first element of the array.
I hope by now you learned how to access complex columns in hive and how to access if they are
nested. I will leave you at this stage to explore your SQL skills in HQL.
Learn more with us at http://www.thriveschool.blogspot.in/
Few notes before we conclude this post.
You can use where clause in HQL as you might be using in SQL. Most of the operators are same like =, !=,
<, IS NULL, IS NOT NULL LIKE etc. You can also use mathematical functions like round, floor, ceil, pow,
sqrt etc. You can use aggregate functions like count, sum, avg, min, max etc. HQL also supports having
clause along with aggregate functions. HQL also supports string functions like length, reverse, upper,
lower, substr, trim, concat etc. You can use CASE WHEN THEN clause also. You can use sub queries like
you use in SQL. You can use order by and union as well.
Joins are also supported with some limitations (only equijoins with AND, no OR is supported). Hive goes
left to right when joining multiple tables and WHERE clauses are evaluated after joins are performed.
In summary, most of you SQL skills will work here. Let me show another example of nested query syntax
supported in HQL which you may find different than some of the RDBMS.
There are lot of other things to cover in HQL which we will cover individually later. Keep learning.
Recommended