
CS 626 Large Scale Data Science

Jun Zhang

March 5, 2020

Originally prepared by Licong Cui

Lecture 11 – Apache Hive

Review: Hadoop Ecosystem

Review: Pig

A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

Pig Engine

Pig Latin

Review: WordCount Using Pig

lines = LOAD 'cs626/words.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
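For readers less familiar with Pig Latin, the same data flow (tokenize, group, count) can be sketched in plain Python. This is only an illustration on in-memory strings, not how Pig executes the script; the function name `word_count` and the sample input are hypothetical, and `str.split()` splits on whitespace only, which is an assumption that differs slightly from Pig's `TOKENIZE`.

```python
from collections import Counter

def word_count(lines):
    """Mirror the Pig script: tokenize each line, group by word, count."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one record per word
    words = [word for line in lines for word in line.split()]
    # GROUP words BY word, then COUNT(words) for each group
    return Counter(words)

# Hypothetical sample input standing in for cs626/words.txt
counts = word_count(["hive pig hive", "pig hadoop"])
for word, n in sorted(counts.items()):
    print(word, n)
```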

Pig vs MapReduce vs Hive

Source: https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163

Outline

What is Hive?

Data Types

Data Models

Hive Architecture

Hive vs Traditional Database

Hive vs Pig

What is Hive?

Data warehousing infrastructure based on Hadoop, originally developed at Facebook

Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data

HiveQL: Hive’s query language

Organize data into tables

Metastore

Metadata (table schemas)

Run Hive

Interactive mode: Hive shell, started with: hive

Non-interactive mode (local): hive -f script.q

Hive web interface

JDBC (Java Database Connectivity)

HiveQL Data Types

Primitive data types

Complex data types

Primitive Data Types

Boolean: BOOLEAN

Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL

String: STRING, VARCHAR

Timestamp: TIMESTAMP, DATE

Primitive Data Types (cont.)

Complex Data Types

Example Complex Data Types

Example Complex Data Types (cont.)

HiveQL Example

HiveQL Example (cont.)

Data Model

Tables

Partitions

Buckets

Tables

Analogous to relational tables

Hive moves data to its warehouse directory

Each table has a corresponding directory in HDFS

Hive does not check at load time that the files in the table directory conform to the schema

Example:

hdfs://user/tom/data.txt ==> hdfs://user/hive/warehouse/managed_table

External Tables

Hive does not move the data to its warehouse directory

The table points to an existing data directory in HDFS

Data is assumed to be in a Hive-compatible format

Dropping an external table drops only the metadata; the data files are left in place

Example:

Partitions

Hive organizes tables into partitions

Partition columns determine how data is distributed across subdirectories of the table directory
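To make the directory layout concrete, here is a minimal Python sketch of how partition column values map to nested `column=value` subdirectories, and how a filter on a partition column narrows the set of directories to read. The table name `logs`, the columns `dt` and `country`, and the helper names are hypothetical; the sketch only builds path strings and does not touch HDFS.

```python
def partition_path(table_dir, partition_spec):
    """Build the subdirectory for one partition: <table_dir>/<col>=<value>/..."""
    parts = [f"{col}={val}" for col, val in partition_spec.items()]
    return "/".join([table_dir.rstrip("/")] + parts)

def prune(partitions, **wanted):
    """Keep only the partitions whose columns match the filter values."""
    return [p for p in partitions
            if all(p.get(col) == val for col, val in wanted.items())]

# Hypothetical table directory and partitions
table_dir = "/user/hive/warehouse/logs"
partitions = [
    {"dt": "2020-03-04", "country": "US"},
    {"dt": "2020-03-05", "country": "US"},
    {"dt": "2020-03-05", "country": "GB"},
]

# A query filtering on dt only needs the matching subdirectories
for spec in prune(partitions, dt="2020-03-05"):
    print(partition_path(table_dir, spec))
```

Because the filter is resolved against directory names, files in non-matching partitions never need to be opened.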

Example:

Partitions (cont.)

Example:

SELECT statements:

Buckets

Data in each partition is divided into buckets

Each bucket is stored as a file in the partition directory

Hash function: H(column) mod num_buckets = bucket_number
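The bucketing rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the hash of a non-negative integer key is taken to be the integer itself (a simplifying choice for this example), rows are plain dictionaries, and the names `bucket_number` and `assign_buckets` are hypothetical.

```python
from collections import defaultdict

def bucket_number(column_hash, num_buckets):
    """H(column) mod num_buckets, as stated on the slide."""
    return column_hash % num_buckets

def assign_buckets(rows, key, num_buckets, hash_fn=lambda value: value):
    """Group rows by bucket; hash_fn defaults to identity (assumption for int keys)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_number(hash_fn(row[key]), num_buckets)].append(row)
    return dict(buckets)

# Six hypothetical rows with ids 0..5, spread over 4 buckets
rows = [{"id": i} for i in range(6)]
buckets = assign_buckets(rows, "id", 4)
for b in sorted(buckets):
    print(b, [r["id"] for r in buckets[b]])
```

Rows with the same key always land in the same bucket, which is what makes bucketed tables useful for sampling and for joins on the bucketing column.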

Example:

Hive Architecture

Thrift Server

Framework for cross language services

Server written in Java

Supports clients written in different languages: JDBC (Java), ODBC (C++), PHP, Perl, and Python scripts

Metastore

System catalog which contains metadata about the Hive tables

Stored in an RDBMS (e.g., Derby, MySQL) or the local file system

HDFS is too slow for this purpose (not optimized for random access)

Objects of the Metastore:

Database: namespace of tables

Table: list of columns, types, owner, storage, SerDe

Partition: partition-specific column, SerDe, and storage
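One way to picture the three Metastore objects is as nested records. The following Python sketch models them with dataclasses; the field names follow the slide, but the class layout, the sample table, and the paths are hypothetical and do not reflect the Metastore's actual database schema.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    values: dict   # partition column -> value, e.g. {"dt": "2020-03-05"}
    storage: str   # location of this partition's files
    serde: str     # serializer/deserializer used for this partition

@dataclass
class Table:
    name: str
    columns: list  # (column name, type) pairs
    owner: str
    storage: str
    serde: str
    partitions: list = field(default_factory=list)

@dataclass
class Database:
    name: str                                   # namespace of tables
    tables: dict = field(default_factory=dict)  # table name -> Table

# Hypothetical catalog entry
logs = Table(
    name="logs",
    columns=[("ts", "TIMESTAMP"), ("msg", "STRING")],
    owner="tom",
    storage="/user/hive/warehouse/logs",
    serde="LazySimpleSerDe",
)
logs.partitions.append(
    Partition({"dt": "2020-03-05"}, logs.storage + "/dt=2020-03-05", logs.serde)
)
default_db = Database("default", {"logs": logs})
print(default_db.tables["logs"].partitions[0].storage)
```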

Driver

Driver: maintains the lifecycle of a HiveQL statement

Query Compiler: compiles HiveQL into a DAG of MapReduce tasks

Executor: executes the task plan generated by the compiler in proper dependency order, and interacts with the underlying Hadoop instance
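"Proper dependency order" means that a task runs only after every task it depends on has finished. A minimal Python sketch of that idea uses a topological sort over a small task graph. The task names and the plan are hypothetical, and this is not Hive's executor; it only shows the ordering constraint, using the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

def execution_order(plan):
    """plan maps each task to the set of tasks it depends on."""
    return list(TopologicalSorter(plan).static_order())

# Hypothetical plan: two scans feed a join, which feeds an aggregation
plan = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "aggregate": {"join"},
}

order = execution_order(plan)
print(order)  # both scans appear before "join"; "aggregate" comes last
```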

Hive vs Traditional Database

Schema on Read Versus Schema on Write

Traditional database: schema on write; the table’s schema is enforced at data load time

Hive: schema on read; data is not verified when it is loaded, but rather when a query is issued
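The schema-on-read idea can be sketched in Python: raw lines are stored untouched, and the schema is applied only when rows are read. In this illustration, fields that are missing or cannot be parsed become `None`, analogous to a NULL in query results. The function name, the tab delimiter, and the sample data are assumptions made for the example.

```python
def read_with_schema(raw_lines, schema, delimiter="\t"):
    """Apply (name, cast) pairs at read time; bad or missing fields become None."""
    for line in raw_lines:
        fields = line.rstrip("\n").split(delimiter)
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (ValueError, IndexError):
                row[name] = None  # analogous to NULL
        yield row

# "Loading" is just keeping the raw lines; nothing is validated here
stored = ["1\talice", "oops\tbob", "3"]

# The schema is only applied when the data is queried
schema = [("id", int), ("name", str)]
rows = list(read_with_schema(stored, schema))
for row in rows:
    print(row)
```

The trade-off is a very fast load (a file copy or move) in exchange for discovering malformed data only at query time.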

Hive vs Traditional Database (cont.)

Updates, transactions, and indexes are mainstays of traditional databases

HDFS does not provide in-place file updates

Changes resulting from inserts, updates, and deletes are stored in small delta files

Delta files are periodically merged into the base table files by MapReduce jobs
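As a toy model of the merge step, the Python sketch below folds a sequence of deltas into a base table keyed by row id. The in-memory representation (a dictionary for the base, lists of `(op, row_id, row)` tuples for deltas) is an assumption made for illustration and is not Hive's on-disk format.

```python
def merge_deltas(base, deltas):
    """Apply deltas in commit order and return the new base table.

    base:   dict mapping row_id -> row
    deltas: list of deltas, each a list of (op, row_id, row) tuples
    """
    merged = dict(base)  # leave the old base untouched
    for delta in deltas:
        for op, row_id, row in delta:
            if op == "delete":
                merged.pop(row_id, None)
            else:  # "insert" or "update"
                merged[row_id] = row
    return merged

# Hypothetical base table and two deltas
base = {1: {"name": "alice"}, 2: {"name": "bob"}}
deltas = [
    [("update", 2, {"name": "robert"}), ("insert", 3, {"name": "carol"})],
    [("delete", 1, None)],
]
new_base = merge_deltas(base, deltas)
print(new_base)
```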

Hive vs. RDBMS

SQL vs HiveQL

Cheat sheet at class webpage

SQL vs HiveQL (cont.)

Pig vs Hive

Pig: procedural data flow language

Hive: declarative, SQL-like language

Pig vs Hive (cont.)

Pig: mainly for data transformations and processing

Unstructured and structured data

Hive: mainly for data warehousing and querying data

Structured data

Lower learning curve than Pig or MapReduce

HiveQL is much closer to SQL than Pig Latin is

Hive, Pig, and Hadoop Benchmark

Versions: Hadoop 0.18x; Pig: 786346; Hive: 786346

Bigger Picture

Store large amounts of data in HDFS

Process raw data using Pig

Build schema using Hive

Query data using Hive

