Hive + HCatalog By Amru Eliwat CS 157B at San Jose State University www.linkedin.com/in/amrue/


Big Data ETL with Hive and HCatalog, using the public StackOverflow dataset. Includes installation instructions.


Page 1: Hive + HCatalog

Hive + HCatalog
By Amru Eliwat

CS 157B at San Jose State University
www.linkedin.com/in/amrue/

Page 2: Hive + HCatalog

Agenda

• What is Hive?
• What is HCatalog?
  - Using it with Hive
• Setting up Hive + HCat locally
• Setting up Hive + HCat in a Virtual Machine
• Demo
  - Loading data into HCat manually
  - Loading data into HCat using Hive
  - Basic Hive queries

Total time: Approximately 30 minutes

Page 3: Hive + HCatalog

Hive

• “Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.”

Page 4: Hive + HCatalog

Hive

• Runs SQL-like queries using HiveQL, which are implicitly converted into map-reduce jobs.

• Because of HiveQL’s declarative nature, Hive excels at ad hoc analysis.
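For example, a short HiveQL aggregation (the table and column names here are hypothetical, not from the demo) is all it takes to express what would otherwise be a hand-written map-reduce job:

```sql
-- Hypothetical table of posts with a tag column.
-- Hive compiles this into map (group by tag) and reduce (count, sort) stages.
SELECT tag, COUNT(*) AS num_posts
FROM posts
GROUP BY tag
ORDER BY num_posts DESC
LIMIT 10;
```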

Page 5: Hive + HCatalog

Hive

• Hive acts on metadata in the Hive metastore.

• Metadata is stored in an Apache Derby database by default, but a MySQL database can be used instead.

- When using the default Derby database, only one process can connect to the metastore at a time, so this setup is suitable only for testing.
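Switching the metastore to MySQL comes down to overriding a few JDBC connection properties in hive-site.xml; a sketch, where the host, database name, and credentials are placeholders you would replace with your own:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```

The MySQL JDBC driver jar must also be placed on Hive’s classpath (e.g. in $HIVE_HOME/lib).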

• Although the metastore presents the data in tables, the data itself lives in HDFS and queries are converted into map-reduce jobs, so we do not get the efficiency and optimization of an RDBMS.

Page 6: Hive + HCatalog

Hive

• Peter Jamack, at the IBM developerWorks blog, asks: Hive for ETL or ELT?

• You can extract, transform, then load your data with Hive, but Jamack suggests it is better to extract, load, then transform with Hive.

• Hive works better for some types of data than others.

“Obviously, choosing between adopting an ELT or ETL philosophy requires thought. This decision can account for more than 70 percent of the planning time required for many data warehouse, master data management, and other database projects.”

Page 7: Hive + HCatalog

HCatalog

• “Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid.”

Page 8: Hive + HCatalog

HCatalog

• HCatalog presents a relational view of data. Data is stored in tables, and these tables can be placed into databases.

• Hive can read data in HCatalog directly, because HCatalog is based on Hive’s metastore. Other tools require interfaces, such as ‘HCatLoader’ and ‘HCatStorer’ for Pig.

• In other words, HCatalog can be seen as a project enabling non-Hive scripts to access Hive metastore tables.
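As a sketch of what that access looks like from Pig, a script using HCatLoader and HCatStorer reads and writes Hive tables without declaring schemas (the table names below are hypothetical; depending on the release, the classes live under org.apache.hcatalog.pig or the newer org.apache.hive.hcatalog.pig):

```pig
-- Read a Hive/HCatalog table; column names and types come from the metastore.
posts = LOAD 'default.posts' USING org.apache.hcatalog.pig.HCatLoader();

-- Keep only questions (posttypeid = 1 in the StackOverflow dump).
questions = FILTER posts BY posttypeid == 1;

-- Write the result back to another Hive table.
STORE questions INTO 'default.questions' USING org.apache.hcatalog.pig.HCatStorer();
```

Pig needs the HCatalog jars on its classpath, which is typically done by launching it as `pig -useHCatalog`.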

Page 9: Hive + HCatalog
Page 10: Hive + HCatalog

HCatalog

• As mentioned earlier, we do not get the efficiency of an RDBMS despite the data being presented in a relational view.

Page 11: Hive + HCatalog

Setup

Installing Hive requires some work; however, it ships with HCatalog out of the box, sharing Hive’s metastore.

1. Download and unpack the tarball (.tar.gz).

2. Set the environment variable HIVE_HOME to point to the installation directory:

export HIVE_HOME="/usr/local/hive-0.12.0"

3. Add $HIVE_HOME/bin to your PATH:

export PATH=$HIVE_HOME/bin:$PATH

4. You will need Hadoop installed to continue. Create the following directories in HDFS for Hive’s warehouse and scratch space, then make them group-writable like so:

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Page 12: Hive + HCatalog

Setup

5. Copy the file hive-default.xml.template in /hive-0.12.0/conf and rename the copy hive-site.xml.

6. Finally, run $HIVE_HOME/hcatalog/sbin/hcat_server.sh start in the terminal.

7. Create a table in Hive using CREATE TABLE.

8. Load data:

LOAD DATA LOCAL INPATH './files/stackoverflow.txt' OVERWRITE INTO TABLE posts;
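The CREATE TABLE in step 7 has to match the layout of the file loaded in step 8. A plausible definition for a tab-delimited StackOverflow extract (the column set here is an assumption for illustration, not the actual demo schema) might be:

```sql
CREATE TABLE posts (
  id         INT,
  posttypeid INT,
  title      STRING,
  body       STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```

LOAD DATA then simply moves the file into the table’s warehouse directory; Hive applies the delimiters at read time.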

Page 13: Hive + HCatalog
Page 14: Hive + HCatalog

Hive + HCatalog

• If you’ve made it this far, you can use your knowledge of SQL to run HiveQL queries.

SELECT column_name FROM table_name;

SELECT a.* FROM a JOIN b ON (a.id = b.id);

“Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.”

http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
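To make the restriction concrete: the left semi join below is HiveQL’s idiomatic substitute for an IN/EXISTS subquery, while a non-equality condition like the commented one is rejected (table names are illustrative):

```sql
-- Supported: rows of a that have at least one matching id in b.
SELECT a.* FROM a LEFT SEMI JOIN b ON (a.id = b.id);

-- Not supported: a non-equality join condition.
-- SELECT a.* FROM a JOIN b ON (a.id < b.id);
```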

Page 15: Hive + HCatalog

Setup #2

• Download the correct version of HortonWorks Sandbox for your virtual machine setup from the HortonWorks webpage.

• Double-click the HortonWorks Sandbox virtual machine launch file (it’s the only file in the folder you just downloaded).

• Point your browser to 127.0.0.1:8888

Page 16: Hive + HCatalog

Demo

Loading data into HCat and analyzing it with Hive

Page 17: Hive + HCatalog

Demo

• Fire up HortonWorks in VirtualBox

• Click on “HCatalog” on the upper toolbar

Page 18: Hive + HCatalog

• On the left-hand side, choose “Create New Table Manually” and give it a name. Then:

Page 19: Hive + HCatalog
Page 20: Hive + HCatalog

• Finally, select a file to upload.

• Alternatively, use Hive for the whole process:
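Doing the whole process in Hive amounts to creating the table, loading the file, and querying it from the Hive shell; a minimal sketch, assuming a tab-delimited export named stackoverflow.txt and a simplified column set:

```sql
CREATE TABLE post_etl (id INT, posttypeid INT, body STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH './files/stackoverflow.txt'
OVERWRITE INTO TABLE post_etl;

SELECT COUNT(*) FROM post_etl;
```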

Page 21: Hive + HCatalog
Page 22: Hive + HCatalog

• Once the data is loaded into HCat, we can take a closer look at the data.

SELECT * FROM post_etl WHERE posttypeid = 1;

Page 23: Hive + HCatalog

Q&A