40
CS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1

CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

CS 626 Large Scale Data Science

Jun ZhangMarch 5, 2020

Originally prepared by Licong Cui

Lecture 11 – Apache Hive

1

Page 2: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Review: Hadoop Ecosystem

2

Page 3: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Review: Pig

A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

Pig Engine

Pig Latin3

Page 4: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Review: WordCount Using Pig

lines = LOAD ‘cs626/words.txt' AS (line:chararray);words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;grouped = GROUP words BY word;wordcount = FOREACH grouped GENERATE group, COUNT(words);DUMP wordcount;

4

Page 5: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Pig vs MapReduce vs Hive

1https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163

1

5

Page 6: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Outline

What is Hive?Data TypesData ModelsHive ArchitectureHive vs Traditional DatabaseHive vs Pig

6

Page 7: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive

Facebook Data warehousing infrastructure based on

Hadoop Designed to Enable easy data summarization Ad-hoc querying Analysis of large volumes of data

HiveQL - Hive’s query language7

Page 8: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive

Organize data into tables

Metastore

Metadata (table schemas)

8

Page 9: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Run Hive

Interactive mode Hive shell: hive

Non-interactive mode Local: hive –f script.q

Hive web interface

JDBC (Java Database Connectivity)

9

Page 10: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

HiveQL Data Types

Primitive data types

Complex data types

10

Page 11: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Primitive Data Types

Boolean

Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL

String: STRING, VARCHAR

Timestamp: TIMESTAMP, DATE

11

Page 12: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Primitive Data Types (cont.)

12

Page 13: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Primitive Data Types (cont.)

13

Page 14: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Complex Data Types

14

Page 15: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Example Complex Data Types

15

Page 16: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Example Complex Data Types (cont.)

16

Page 17: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

HiveQL Example

17

Page 18: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

HiveQL Example (cont.)

18

Page 19: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

HiveQL Example (cont.)

19

Page 20: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Data Model

Tables

Partitions

Buckets

20

Page 21: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Tables

Analogous to relational tables Hive moves data to its warehouse directory Each table has corresponding directory in HDFS Hive does not check that the files in the table

directory conform to the schema at the loading time Example:

hdfs://user/tom/data.txt ==> hdfs://user/hive/warehouse/managed_table

21

Page 22: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

External Tables

Does not move data to Hive’s warehouse directory

Point to existing data directories in HDFS

Data is assumed to be in Hive-compatible format

Dropping external table drops only the metadata

Example:

22

Page 23: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Partitions

Hive organize tables into partitions

Partitions determine distribution of data within subdirectories

Example:

23

Page 24: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Partitions (cont.)

Example:

SELECT statements:

24

Page 25: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Buckets

Data in each partition divided into buckets

Each bucket is stored as a file in partition directory

Hash function: H(column) mod num_buckets = bucket_number

Example:

25

Page 26: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive Architecture

26

Page 27: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive Architecture

27

Page 28: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Thrift Server

Framework for cross language services

Server written in Java

Support for clients written in different languages JDBC(java), ODBC(c++), php, perl, python scripts

28

Page 29: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Metastore

System catalog which contains metadata about the Hive tables

Stored in RDBMS/local file system HDFS too slow (not optimized for random access) Derby, MySQL

Objects of Metastore Database - Namespace of tables Table - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage

29

Page 30: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Driver

Driver Maintains the lifecycle of HiveQL statement

Query Compiler Compiles HiveQL in a DAG of map reduce tasks

Executor Executes the tasks plan generated by the compiler in

proper dependency order Interacts with the underlying Hadoop instance

30

Page 31: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive vs Traditional Database

Schema on Read Versus Schema on Write Traditional database: schema on write table’s schema is enforced at data load time

Hive: schema on read does not verify the data when it is loaded, but rather when

a query is issued

31

Page 32: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive vs Traditional Database (cont.)

Updates, Transactions, and Indexes Mainstays of traditional databases

HDFS does not provide in-place file updates Changes resulting from inserts, updates, and deletes are

stored in small delta files

Delta files are periodically merged into the base table files by MapReduce jobs

32

Page 33: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive vs. RDBMS

33

Page 34: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

SQL vs HiveQL

34

Cheat sheet at class webpage

Page 35: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

SQL vs HiveQL (cont.)

35

Page 36: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Pig vs HivePig Hive

Procedural data flow language

Declarative SQLishLanguage

36

Page 37: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Pig vs Hive (cont.)

Pig Mainly for data transformations and processing

Unstructured and structured data

Hive Mainly for data warehousing and querying data

Structured data

Lower learning curve than Pig or MapReduce

HiveQL is much closer to SQL than Pig37

Page 38: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Hive, Pig, and Hadoop Benchmark

Version: Hadoop – 0.18x, Pig:786346, Hive:786346

38

Page 39: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

Bigger Picture

Store large amounts of data to HDFS

Process raw data using Pig

Build schema using Hive

Querying data using Hive

39

Page 40: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver

References

Hadoop: The Definitive Guide (By Tom White) https://courses.engr.illinois.edu/cs525/sp201

5/ http://www.slideshare.net/chirag064/hive-

warehousing-over-hadoop https://acadgild.com/blog/working-with-

hive-complex-data-types/

40