CS 626 Large Scale Data Science

Jun Zhang

March 5, 2020

Originally prepared by Licong Cui

Lecture 11 – Apache Hive

Review: Hadoop Ecosystem

Review: Pig

A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

Pig Engine

Pig Latin

Review: WordCount Using Pig

lines = LOAD 'cs626/words.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
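For comparison, the same data flow can be sketched in plain Python. This is only an illustration of the load, tokenize, group, and count steps on a single machine. The sample input is hypothetical, and whitespace splitting is a simplification of Pig's TOKENIZE.

```python
from collections import Counter


def word_count(lines):
    """Count words across lines: tokenize and flatten, then group and count."""
    # str.split() on whitespace is a simplification of Pig's TOKENIZE
    words = [word for line in lines for word in line.split()]
    # Counter plays the role of GROUP ... BY word followed by COUNT
    return Counter(words)


# Hypothetical input standing in for the contents of cs626/words.txt
sample_lines = ["hive pig hadoop", "pig hive", "hive"]
wordcount = word_count(sample_lines)
for word, count in sorted(wordcount.items()):
    print(word, count)
```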

Pig vs MapReduce vs Hive

Source: https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163

Outline

What is Hive?

Data Types

Data Models

Hive Architecture

Hive vs Traditional Database

Hive vs Pig

Hive

Data warehousing infrastructure based on Hadoop, originally developed at Facebook

Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data

HiveQL - Hive’s query language

Hive

Organizes data into tables

Metastore: holds the metadata (table schemas)

Run Hive

Interactive mode (Hive shell): hive

Non-interactive mode (local): hive -f script.q

Hive web interface

JDBC (Java Database Connectivity)

HiveQL Data Types

Primitive data types

Complex data types

Primitive Data Types

Boolean

Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL

String: STRING, VARCHAR

Timestamp: TIMESTAMP, DATE

Primitive Data Types (cont.)

Primitive Data Types (cont.)

Complex Data Types

Example Complex Data Types

Example Complex Data Types (cont.)

HiveQL Example

HiveQL Example (cont.)

HiveQL Example (cont.)

Data Model

Tables

Partitions

Buckets

Tables

Analogous to relational tables

Loading data into a managed table moves it into Hive's warehouse directory

Each table has a corresponding directory in HDFS

Hive does not check that the files in the table directory conform to the schema at load time

Example: hdfs://user/tom/data.txt ==> hdfs://user/hive/warehouse/managed_table

External Tables

Does not move data to Hive’s warehouse directory

Point to existing data directories in HDFS

Data is assumed to be in Hive-compatible format

Dropping an external table drops only the metadata; the underlying data is left in place

Example:

Partitions

Hive organizes tables into partitions

Partitions determine how data is distributed across subdirectories of the table directory

Example:
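A minimal Python sketch of this layout, assuming a hypothetical table named logs partitioned by a date column dt and a country column. Hive names each partition subdirectory column=value. The table, columns, and values here are made up for illustration, and the pruning helper only mimics the idea that a query filtering on a partition column needs to read just the matching subdirectories.

```python
from pathlib import PurePosixPath

# Warehouse root, following the path used on the Tables slide
WAREHOUSE = PurePosixPath("/user/hive/warehouse")


def partition_dir(table, partition):
    """Return the directory of one partition: one column=value level per partition column."""
    path = WAREHOUSE / table
    for column, value in partition.items():
        path = path / f"{column}={value}"
    return path


def prune(partitions, **wanted):
    """Keep only the partitions whose values match the filter on partition columns."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]


# Hypothetical partitions of the hypothetical table 'logs'
partitions = [
    {"dt": "2020-03-04", "country": "US"},
    {"dt": "2020-03-05", "country": "US"},
    {"dt": "2020-03-05", "country": "GB"},
]

# Only the directories for dt=2020-03-05 would need to be scanned
selected = [str(partition_dir("logs", p)) for p in prune(partitions, dt="2020-03-05")]
for directory in selected:
    print(directory)
```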

Partitions (cont.)

Example:

SELECT statements:

Buckets

Data in each partition is divided into buckets

Each bucket is stored as a file in the partition directory

Hash function: H(column) mod num_buckets = bucket_number

Example:
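As a rough illustration of the hash rule above, the Python sketch below assigns rows to buckets. CRC32 is used here only as a stand-in hash and is not Hive's actual hash function; the rows, the user_id column, and the bucket count are hypothetical.

```python
import zlib


def bucket_number(value, num_buckets):
    """Compute H(column) mod num_buckets, with CRC32 standing in for H."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def assign_buckets(rows, column, num_buckets):
    """Group rows by bucket number; each list corresponds to one bucket file."""
    buckets = {b: [] for b in range(num_buckets)}
    for row in rows:
        buckets[bucket_number(row[column], num_buckets)].append(row)
    return buckets


# Hypothetical rows bucketed on user_id into 4 buckets
rows = [{"user_id": i, "name": f"user{i}"} for i in range(10)]
buckets = assign_buckets(rows, "user_id", 4)
for number, members in buckets.items():
    print(number, [row["user_id"] for row in members])
```

Because the bucket depends only on the column value, rows with the same value always land in the same bucket.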

Hive Architecture

Hive Architecture

Thrift Server

Framework for cross-language services

Server written in Java

Supports clients written in different languages: JDBC (Java), ODBC (C++), and PHP, Perl, and Python scripts

Metastore

System catalog that contains metadata about the Hive tables

Stored in an RDBMS (e.g., Derby, MySQL) or on the local file system

HDFS is too slow for this, since it is not optimized for random access

Objects of the Metastore:

Database - namespace of tables

Table - list of columns, types, owner, storage, SerDe

Partition - partition-specific columns, SerDe, and storage

Driver

Driver: maintains the lifecycle of a HiveQL statement

Query Compiler: compiles HiveQL into a DAG of MapReduce tasks

Executor: executes the task plan generated by the compiler in proper dependency order, and interacts with the underlying Hadoop instance
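The idea of running a DAG of tasks in dependency order can be sketched in Python with the standard-library graphlib module (Python 3.9+). This is not Hive's executor; the task names and the plan are hypothetical, and run_task is a placeholder for submitting a job to Hadoop.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on
plan = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}


def execute(plan, run_task):
    """Run every task after all of its dependencies and return the order used."""
    order = []
    for task in TopologicalSorter(plan).static_order():
        run_task(task)  # placeholder for launching the corresponding job
        order.append(task)
    return order


order = execute(plan, lambda task: print("running", task))
```

A topological order guarantees that join starts only after both scans and that aggregate starts only after join.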

Hive vs Traditional Database

Schema on read versus schema on write

Traditional database: schema on write; the table’s schema is enforced at data load time

Hive: schema on read; data is not verified when it is loaded, but rather when a query is issued
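A minimal Python sketch of schema on read, assuming a hypothetical three-column schema and comma-delimited text. The raw lines are stored without any validation, and the schema is applied only when rows are read. Fields that are missing or fail to parse become None here, mirroring Hive's behavior of returning NULL for such fields.

```python
# Hypothetical schema: column name and the Python type used to parse it
SCHEMA = [("id", int), ("name", str), ("score", float)]


def read_row(line, schema=SCHEMA, delimiter=","):
    """Apply the schema to one raw line at read time."""
    fields = line.split(delimiter)
    row = {}
    for i, (column, cast) in enumerate(schema):
        try:
            row[column] = cast(fields[i])
        except (IndexError, ValueError):
            # Missing or malformed field: surfaces as a null value at query time
            row[column] = None
    return row


# "Loading" just keeps the raw lines; nothing is checked at this point
raw_lines = ["1,alice,9.5", "two,bob,7.0", "3,carol"]

# The schema is applied only when the data is read
rows = [read_row(line) for line in raw_lines]
for row in rows:
    print(row)
```

The trade-off is a very fast load, since no parsing happens up front, at the cost of discovering bad data only at query time.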

Hive vs Traditional Database (cont.)

Updates, transactions, and indexes are mainstays of traditional databases

HDFS does not provide in-place file updates

Changes resulting from inserts, updates, and deletes are stored in small delta files

Delta files are periodically merged into the base table files by MapReduce jobs
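The merge step can be pictured with a toy Python model. This is an assumption-laden sketch and not Hive's actual file format or compaction logic: the base table is a dict keyed by a hypothetical row id, and each delta is a list of (operation, row id, row) tuples applied in order.

```python
def merge_deltas(base, deltas):
    """Apply delta operations in order and return a new base table (row_id -> row)."""
    merged = dict(base)  # the original base is left untouched
    for delta in deltas:
        for op, row_id, row in delta:
            if op in ("insert", "update"):
                merged[row_id] = row
            elif op == "delete":
                merged.pop(row_id, None)
            else:
                raise ValueError(f"unknown operation: {op}")
    return merged


# Hypothetical base table and two delta files
base = {1: {"name": "alice"}, 2: {"name": "bob"}}
deltas = [
    [("update", 2, {"name": "robert"}), ("insert", 3, {"name": "carol"})],
    [("delete", 1, None)],
]

compacted = merge_deltas(base, deltas)
print(compacted)
```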

Hive vs. RDBMS

SQL vs HiveQL

Cheat sheet at class webpage

SQL vs HiveQL (cont.)

Pig vs Hive

Pig: procedural data flow language

Hive: declarative SQL-like language

Pig vs Hive (cont.)

Pig: mainly for data transformations and processing; handles unstructured and structured data

Hive: mainly for data warehousing and querying data; handles structured data

Hive has a lower learning curve than Pig or MapReduce

HiveQL is much closer to SQL than Pig Latin is

Hive, Pig, and Hadoop Benchmark

Versions: Hadoop: 0.18x, Pig: 786346, Hive: 786346

Bigger Picture

Store large amounts of data in HDFS

Process raw data using Pig

Build schema using Hive

Query data using Hive

References

Hadoop: The Definitive Guide (by Tom White)

https://courses.engr.illinois.edu/cs525/sp2015/

http://www.slideshare.net/chirag064/hive-warehousing-over-hadoop

https://acadgild.com/blog/working-with-hive-complex-data-types/
