
Hadoop Training #6: Introduction to Hive

DESCRIPTION

Hive is a powerful data warehousing application built on top of Hadoop that lets you query your data with SQL. This lecture gives an overview of Hive and its query language, and includes a work-along exercise; the next video is a screencast of a user performing this exercise. See http://www.cloudera.com/hadoop-training-basic for more training videos.

Page 1: Hadoop Training #6: Introduction to Hive

Introduction to Hive

© 2009 Cloudera, Inc.

Page 2: Hadoop Training #6: Introduction to Hive

Outline

• Motivation

• Overview

• Data Model

• Working with Hive

• Wrap up & Conclusions

Page 3: Hadoop Training #6: Introduction to Hive

Background

• Started at Facebook

• Data was collected by nightly cron jobs into Oracle DB

• “ETL” via hand-coded python

• Grew from tens of GBs (2006) to 1 TB/day of new data (2007), and is now roughly 10x that

Page 4: Hadoop Training #6: Introduction to Hive

Hadoop as Enterprise Data Warehouse

• Scribe and MySQL data loaded into Hadoop HDFS

• Hadoop MapReduce jobs to process data

• Missing components:

– Command-line interface for “end users”

– Ad-hoc query support

• … without writing full MapReduce jobs

– Schema information

Page 5: Hadoop Training #6: Introduction to Hive

Hive Applications

• Log processing

• Text mining

• Document indexing

• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling, hypothesis testing

Page 6: Hadoop Training #6: Introduction to Hive

Hive Components

• Shell: allows interactive queries, like the MySQL shell connected to a database
– Also supports web and JDBC clients

• Driver: session handles, fetch, execute

• Compiler: parse, plan, optimize

• Execution engine: DAG of stages (M/R, HDFS, or metadata)

• Metastore: schema, location in HDFS, SerDe

Page 7: Hadoop Training #6: Introduction to Hive

Data Model

• Tables
– Typed columns (int, float, string, date, boolean)
– Also list and map types (for JSON-like data)

• Partitions
– e.g., to range-partition tables by date

• Buckets
– Hash partitions within ranges (useful for sampling and join optimization; see the sketch after this list)
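
As a minimal sketch of how these concepts show up in a table definition (the table name, columns, and bucket count below are hypothetical), partitions are declared with PARTITIONED BY and buckets with CLUSTERED BY ... INTO N BUCKETS:

hive> CREATE TABLE page_views (user_id INT, url STRING)
      PARTITIONED BY (dt STRING)
      CLUSTERED BY (user_id) INTO 32 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;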

Page 8: Hadoop Training #6: Introduction to Hive

Metastore

• Database: namespace containing a set of tables

• Holds table definitions (column types, physical layout)

• Partition data

• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases
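
For example, pointing the metastore at MySQL instead of the default embedded Derby comes down to the JDO connection properties in hive-site.xml. A minimal sketch, where the host and database name are placeholders:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>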

Page 9: Hadoop Training #6: Introduction to Hive

Physical Layout

• Warehouse directory in HDFS
– e.g., /home/hive/warehouse

• Tables stored in subdirectories of the warehouse
– Partitions and buckets form subdirectories of tables (see the sketch after this list)

• Actual data stored in flat files
– Control character-delimited text, or SequenceFiles
– With a custom SerDe, can use an arbitrary file format
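
For instance, a partitioned, bucketed table could be laid out like this under the warehouse directory (the table, partition, and file names here are hypothetical):

/home/hive/warehouse/page_views/                          <- table directory
/home/hive/warehouse/page_views/dt=2009-04-01/            <- one partition (dt column)
/home/hive/warehouse/page_views/dt=2009-04-01/000000_0    <- data file, one per bucket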

Page 10: Hadoop Training #6: Introduction to Hive

Starting the Hive shell

• Start a terminal and run

$ cd /usr/share/cloudera/hive/

$ bin/hive

• Should see a prompt like:

hive>

Page 11: Hadoop Training #6: Introduction to Hive

Creating tables

hive> SHOW TABLES;

hive> CREATE TABLE shakespeare (freq

INT, word STRING) ROW FORMAT

DELIMITED FIELDS TERMINATED BY '\t'

STORED AS TEXTFILE;

hive> DESCRIBE shakespeare;

Page 12: Hadoop Training #6: Introduction to Hive

Generating Data

• Let's generate (frequency, word) data from the Shakespeare data set:

$ hadoop jar \

$HADOOP_HOME/hadoop-*-examples.jar \

grep input shakespeare_freq '\w+'

Page 13: Hadoop Training #6: Introduction to Hive

Loading data

• Remove the MapReduce job logs:

$ hadoop fs -rmr shakespeare_freq/_logs

• Load dataset into Hive:

hive> LOAD DATA INPATH 'shakespeare_freq'

INTO TABLE shakespeare;

Page 14: Hadoop Training #6: Introduction to Hive

Selecting data

hive> SELECT * FROM shakespeare LIMIT 10;

hive> SELECT * FROM shakespeare

WHERE freq > 100 SORT BY freq ASC

LIMIT 10;

Page 15: Hadoop Training #6: Introduction to Hive

Most common frequency

hive> SELECT freq, COUNT(1) AS f2

FROM shakespeare GROUP BY freq

SORT BY f2 DESC LIMIT 10;

hive> EXPLAIN SELECT freq, COUNT(1) AS f2

FROM shakespeare GROUP BY freq

SORT BY f2 DESC LIMIT 10;

Page 16: Hadoop Training #6: Introduction to Hive

Joining tables

• A powerful feature of Hive is the ability to create queries that join tables together

• We have (freq, word) data for Shakespeare

• Can also calculate it for KJV

• Let’s see what words show up a lot in both

Page 17: Hadoop Training #6: Introduction to Hive

Create the dataset:

$ tar zxf ~/bible.tar.gz -C ~

$ hadoop fs -put ~/bible bible

$ hadoop jar \

$HADOOP_HOME/hadoop-*-examples.jar \

grep bible bible_freq '\w+'

Page 18: Hadoop Training #6: Introduction to Hive

Create the new table

hive> CREATE TABLE kjv (freq INT,

word STRING) ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' STORED AS

TEXTFILE;

hive> SHOW TABLES;

hive> DESCRIBE kjv;

Page 19: Hadoop Training #6: Introduction to Hive

Import data to Hive

$ hadoop fs -rmr bible_freq/_logs

hive> LOAD DATA INPATH 'bible_freq'

INTO TABLE kjv;

Page 20: Hadoop Training #6: Introduction to Hive

Create an intermediate table

hive> CREATE TABLE merged

(word STRING, shake_f INT,

kjv_f INT);

Page 21: Hadoop Training #6: Introduction to Hive

Running the join

hive> INSERT OVERWRITE TABLE merged

SELECT s.word, s.freq, k.freq FROM

shakespeare s JOIN kjv k ON

(s.word = k.word)

WHERE s.freq >= 1 AND k.freq >= 1;

hive> SELECT * FROM merged LIMIT 20;

Page 22: Hadoop Training #6: Introduction to Hive

Most common intersections

• What words appeared most frequently in both corpora?

hive> SELECT word, shake_f, kjv_f,

(shake_f + kjv_f) AS ss

FROM merged SORT BY ss DESC

LIMIT 20;

Page 23: Hadoop Training #6: Introduction to Hive

Some more advanced features…

• TRANSFORM: can embed custom map/reduce scripts in SQL statements (see the sketch after this list)

• Custom SerDe: Can use arbitrary file formats

• Metastore check tool

• Structured query log
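
A minimal sketch of TRANSFORM (the script name is hypothetical; it would read tab-separated rows on stdin and write tab-separated rows on stdout):

hive> ADD FILE my_filter.py;

hive> SELECT TRANSFORM (freq, word)
      USING 'python my_filter.py'
      AS (freq, word)
      FROM shakespeare;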

Page 24: Hadoop Training #6: Introduction to Hive

Project Status

• Open source, Apache 2.0 license

• Official subproject of Apache Hadoop

• 4 committers (all from Facebook)

• First release candidate coming soon

Page 25: Hadoop Training #6: Introduction to Hive

Related work

• Parallel databases: Gamma, Bubba, Volcano

• Google: Sawzall

• Yahoo: Pig

• IBM Research: JAQL

• Microsoft: DryadLINQ, SCOPE

• Greenplum: YAML MapReduce

• Aster Data: In-database MapReduce

• Business.com: CloudBase

Page 26: Hadoop Training #6: Introduction to Hive

Conclusions

• Supports rapid iteration of ad-hoc queries

• Can perform complex joins with minimal code

• Scales to handle much more data than many similar systems

Page 27: Hadoop Training #6: Introduction to Hive