Apache Hive for modern DBAs
Luís Marques
About me
● Oracle ACE
● Data and Linux geek
● Long-time open source supporter
● Works for @redgluept as Data Architect
● @drune
Big Data Thinking Strategy
● Think small
● Think big
● Don't think at all (the hype is here)
What is Apache Hive?
● Open source, TB/PB-scale data warehousing framework built on Hadoop
● The first and most complete SQL-on-"Hadoop" solution
● SQL:2003 and SQL:2011 compatible
● Stores data in several formats
● Several execution engines available
● Interactive query support (in-memory cache)
Apache Hive - Before you ask
● Data warehouse/OLAP activities (data mining, data exploration, batch processing, ETL, etc.): "the heavy lifting of data"
● Low-cost scaling, built with extensibility in mind
● Use it for large datasets (gigabyte/terabyte scale)
● Don't use Hive for any OLTP activities
● ACID exists, but is not recommended yet
The reason behind Hive
I had written, as part of working with the Feed team - what became - a rather complicated MR job to rank friends by mutual friends.
In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how hard it was to write an optimal MR job (particularly on large data sets).
Assembling data into complex data structures was also painful.
I really wanted to see these types of operators exposed in a high level declarative form so that the average user would never have to go through this. Fortunately - our team had Oracle veterans well versed in the art of SQL.
Joydeep Sen Sarma (Facebook)
The reason behind Hive
Instead of complex MR jobs
You have declarative language...
Apache Hive versions & branches
master:
● Version 2.x
● New code, new features
● Hadoop 2.x supported
branch-1:
● Version 1.x
● Stable features, backwards compatibility
● Critical bug fixes
● Hadoop 1.x and 2.x supported
Data Model (data units & types)
● Supports primitive column types (integers, numbers, strings, dates/times and booleans)
● Supports complex types: structs, maps and arrays
● Concept of databases, tables, partitions and buckets
● SerDe: the serialization/deserialization API used to move data in and out of tables
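As a sketch of the SerDe mechanism, the table below uses the JSON SerDe shipped with HCatalog; the table and column names are illustrative:

```sql
-- Hypothetical table whose rows are parsed from JSON files by a SerDe
CREATE TABLE flights_json (
  flightID   STRING,
  passengers ARRAY<STRING>,
  aircraft   STRUCT<name:STRING, seats:INT>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```

The SerDe class, not the query engine, decides how raw bytes become typed columns, which is why the same table DDL works over many file layouts.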
Data Model (partitions & bucketing)
● Partitioning: used to distribute load horizontally; gives a performance benefit and organizes the data

PARTITIONED BY (flightName STRING, AircraftName STRING)
/employees/flightName=ABC/AircraftName=XYZ

● Buckets (clusters): decompose data sets into more manageable parts, help with map-side joins, and allow correct sampling within a bucket

"Records with the same flightID will always be stored in the same bucket. Assuming the number of flightIDs is much greater than the number of buckets, each bucket will have many flightIDs."

CLUSTERED BY (flightID) INTO XX BUCKETS;
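Putting the two clauses together, a sketch of a table that is both partitioned and bucketed (table name, columns and bucket count are illustrative):

```sql
-- Partitions become directories on HDFS; buckets become files within them
CREATE TABLE flights (
  flightID  STRING,
  departure TIMESTAMP
)
PARTITIONED BY (flightName STRING, AircraftName STRING)
CLUSTERED BY (flightID) INTO 32 BUCKETS
STORED AS ORC;
```

Note that partition columns are not stored in the data files; they are encoded in the directory path shown above.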
Data Model (complex data types)
● Array: ordered collection of fields, all of the same type — array(1, 2)
● Map: unordered key-value pairs; keys are primitives, values may be any type — map('a', 1, 'b', 2)
● Struct: a collection of named fields — struct('a', 10, 2.5)
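A sketch of how these types are accessed in a query (table and column names are illustrative):

```sql
-- Element, key and field access on the three complex types
SELECT
  passengers[0],        -- first element of an ARRAY
  fares['economy'],     -- value lookup in a MAP by key
  aircraft.name         -- named field access on a STRUCT
FROM flights_complex;
```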
Data model
HiveQL
● HiveQL is an SQL-like query language for Hive
● Supports DDL and DML
● Supports multi-table inserts
● Possible to write custom MapReduce scripts
● Supports UDFs, UDAFs and UDTFs
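A multi-table insert, one of the bullets above, scans the source once and feeds several targets; all table and column names here are illustrative:

```sql
-- One pass over "flights" populates two tables
FROM flights f
INSERT OVERWRITE TABLE long_haul  SELECT f.* WHERE f.distance >= 5000
INSERT OVERWRITE TABLE short_haul SELECT f.* WHERE f.distance <  5000;
```

Because the source is read only once, this is much cheaper than running two independent INSERT ... SELECT statements.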
DDL (some examples)
hive> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX
hive> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
hive> TRUNCATE TABLE
hive> ALTER DATABASE/SCHEMA, TABLE, VIEW
hive> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS, PARTITIONS, FUNCTIONS
hive> DESCRIBE DATABASE/SCHEMA, table_name, view_name
File formats
● Parquet: compressed, efficient columnar data representation available to any project in the Hadoop ecosystem
● ORC: made for Hive; supports the Hive type model, columnar storage, block compression, predicate pushdown, ACID*, etc.
● Avro: uses JSON for defining data types and protocols, and serializes data in a compact binary format
● Compressed file formats (LZO, GZIP)
● Plain text files
● Any other data with a known format can be read (CSV, JSON, XML, etc.)
ORC
● Stored as columns and compressed = smaller disk reads
● ORC has a built-in index with min/max values and other aggregates (e.g. sum) = entire blocks can be skipped to speed up reads
● ORC implements predicate pushdown and bloom filters
● ORC scales
● You should use it :-)
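A minimal sketch of an ORC-backed table, assuming an illustrative table name; `orc.compress` selects the block compression codec:

```sql
-- ORC storage with ZLIB block compression
CREATE TABLE flights_orc (
  flightID  STRING,
  departure TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```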
Indexing
● Not generally recommended, because of ORC
● ORC has built-in indexes which allow the format to skip blocks of data during reads
● Hive indexes are implemented as tables
● Compact indexes and bitmap indexes are supported
● Index tables record which data is in which blocks and are used to skip data (as ORC already does)
● Not supported on the Tez engine; they are ignored
● Indexes in Hive are not like indexes in other databases
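For reference, a sketch of the compact-index DDL mentioned above (index and table names are illustrative; this feature was later removed in Hive 3.0):

```sql
-- A compact index is itself stored as a table and must be rebuilt explicitly
CREATE INDEX flights_idx
ON TABLE flights (flightID)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX flights_idx ON flights REBUILD;  -- populate the index table
```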
File formats & Indexing
Hive Architecture
● Clients: Hive Web Interface, Hive CLI (beeline, hive), Hive JDBC/ODBC
● Thrift Server (HiveServer2)
● Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator, Query Plan Generator), Optimizer, Executor, Metastore client
● Metastore RDBMS
● Execution Engines: MapReduce, Tez, Spark
● Resource Management: YARN
● Storage: HDFS, HBase, Azure Storage, Amazon S3
Metastore
● Typically stored in an RDBMS (MySQL, SQL Server, PostgreSQL, Derby*) for ACID and concurrency on metadata queries
● Contains metadata for databases, tables and partitions
● Provides two features: data discovery and data abstraction
● Data abstraction: provides information about data formats, extractors and loaders at table creation, which is then reused (cf. Oracle's dictionary tables)
● Data discovery: discover relevant and specific data, and allow other tools to use the metadata to explore the data (e.g. SparkSQL)
See it in action
Execution engines
● 3 execution engines are available:
○ MapReduce (mr)
○ Tez
○ Spark

MR: the original, most stable and most reliable; batch-oriented, disk-based parallelism (like traditional Hadoop MR jobs).
Tez: high-performance batch and interactive data processing. Stable 99% of the time. The one you should use; the default on HDP.
Spark: uses Apache Spark (an in-memory computing platform). High performance (like Tez); not used in production (yet), but making good progress.
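The engine is selected per session through a configuration property, so switching between the three is a one-liner:

```sql
-- Accepted values: mr, tez, spark
SET hive.execution.engine=tez;
```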
MapReduce vs Tez/Spark
MapReduce:
● One pair of map and reduce performs one level of aggregation over the data; complex computations typically require multiple such steps
Tez/Spark:
● DAG (Directed Acyclic Graph)
● The graph has no cycles, because the fault-tolerance mechanism used by Tez is re-execution of failed tasks
● The limitations of MapReduce in Hadoop were a key reason to introduce DAGs
● Pipelines consecutive map steps into one
● Enforces concurrency and serialization between MapReduce jobs
Tez & DAGs
DAG definition:
● Data processing is expressed in the form of a directed acyclic graph (DAG)
Two main components:
● Vertices: represent the processing of the data in the graph
○ The user logic that analyses and modifies the data sits in the vertices
● Edges: represent the movement of data between processing stages
○ Define the routing of data between tasks (One-To-One, Broadcast, Scatter-Gather)
○ Define when a consumer task is scheduled (Sequential, Concurrent)
○ Define the lifetime/reliability of a task output
Hive Cost Based Optimizer - Why
● Distributed SQL query processing in Hadoop differs from conventional relational query engines in how intermediate result sets are handled
● Query processing requires sorting and reassembling intermediate result sets: shuffling
● Most of the existing optimizations in Hive aim at minimizing shuffling cost, plus logical optimizations like filter pushdown, projection pruning and partition pruning
● Join reordering and join-algorithm selection become possible with a cost-based optimizer
Hive CBO - What to get
● Based on a project called Apache Calcite (https://calcite.apache.org/)
● With a cost-based optimizer you can get:
○ The order in which to perform joins (join reordering)
○ The algorithm to use for a join
○ Whether an intermediate result should be persisted or recomputed on failure
○ The degree of parallelism at any operator (number of mappers and reducers)
○ Semi-join selection
○ (other optimizer tricks, like histograms)
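A sketch of enabling the CBO and gathering the statistics its cost model relies on; the table name is illustrative:

```sql
-- Turn on Calcite-based optimization and stats-driven planning
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Column statistics feed the cost model used for join reordering
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;
```

Without up-to-date statistics the optimizer falls back to crude defaults, so the ANALYZE step matters as much as the flags.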
Execution Engines
Hive - The present-future
● Tez and Spark head to head on performance and stability
● LLAP (Live Long and Process): interactive Hive queries
● ACID
Hive next big thing: LLAP
● Sub-second queries (interactive queries)
● In-memory caching layer with async I/O
● Fast concurrent execution
● Move from disk-oriented to memory-oriented execution (the trend)
● Disks are connected to the CPU via the network, so data locality becomes less relevant
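As a sketch (assuming an HDP-style deployment with LLAP daemons already running), queries are directed at LLAP with session-level settings:

```sql
-- Route query execution through the long-lived LLAP daemons
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;
```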
Thank you
Questions?
@drune
https://www.linkedin.com/in/lcmarques/
@redgluept
www.redglue.eu