
Apache Hive for modern DBAs


Page 1: Apache Hive for modern DBAs

Apache Hive for modern DBAs

Luís Marques

Page 4: Apache Hive for modern DBAs

About me

● Oracle ACE

● Data and Linux geek

● Long-time open source supporter

● Works for @redgluept as Data Architect

● @drune

Page 5: Apache Hive for modern DBAs

Big Data Thinking Strategy

● Think small

● Think big

● Don’t think at all (the hype is here)

Page 6: Apache Hive for modern DBAs

What is Apache Hive?

● Open source, TB/PB-scale data warehousing framework based on Hadoop

● The first and most complete SQL-on-”Hadoop” solution

● SQL:2003 and SQL:2011 compatible

● Data stored in several formats

● Several execution engines available

● Interactive query support (in-memory cache)

Page 7: Apache Hive for modern DBAs

Apache Hive - Before you ask

● Data warehouse/OLAP activities (data mining, data exploration, batch processing, ETL, etc.) - “the heavy lifting of data”

● Low-cost scaling, built with extensibility in mind

● Use it at large dataset (gigabyte/terabyte) scale

● Don’t use Hive for any OLTP activities

● ACID exists, but is not recommended yet

Page 8: Apache Hive for modern DBAs

The reason behind Hive

I had written, as part of working with the Feed team - what became - a rather complicated MR job to rank friends by mutual friends.

In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how hard it was to write an optimal MR job (particularly on large data sets).

Assembling data into complex data structures was also painful.

I really wanted to see these types of operators exposed in a high level declarative form so that the average user would never have to go through this. Fortunately - our team had Oracle veterans well versed in the art of SQL.

Joydeep Sen Sarma (Facebook)

Page 9: Apache Hive for modern DBAs

The reason behind Hive

Instead of complex MR jobs

You have declarative language...

Page 10: Apache Hive for modern DBAs

Apache Hive versions & branches

● master (Version 2.x): new code and new features; Hadoop 2.x supported

● branch-1 (Version 1.x): stable features, backwards compatibility, critical bug fixes; Hadoop 1.x and 2.x supported

Page 11: Apache Hive for modern DBAs

Data Model (data units & types)

● Supports primitive column types (integers, floating-point numbers, strings, dates/times and booleans)

● Supports complex types: structs, maps and arrays

● Concept of databases, tables, partitions and buckets

● SerDe: the serialization/deserialization API used to move data in and out of tables

Page 12: Apache Hive for modern DBAs

Data Model (partitions & bucketing)

● Partitioning: used for distributing load horizontally, improving performance and organizing data

PARTITIONED BY (flightName STRING, AircraftName STRING)

/employees/flightName=ABC/AircraftName=XYZ

● Buckets (clusters): decompose data sets into more manageable parts; help with map-side joins and with correct sampling within the same bucket

“Records with the same flightID will always be stored in the same bucket. Assuming the number of flightIDs is much greater than the number of buckets, each bucket will have many flightIDs”

CLUSTERED BY (flightID) INTO XX BUCKETS;
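The two clauses combine in a single table definition; a minimal sketch, assuming a hypothetical flights table (column names and bucket count are illustrative):

```sql
-- Partitioned and bucketed table (hypothetical schema)
CREATE TABLE flights (
  flightID INT,
  origin STRING,
  departureTime TIMESTAMP
)
PARTITIONED BY (flightName STRING, aircraftName STRING)  -- one HDFS directory per partition
CLUSTERED BY (flightID) INTO 32 BUCKETS                  -- rows hashed on flightID into 32 files
STORED AS ORC;
```

Note that the partition columns are not repeated in the main column list: Hive derives their values from the directory layout (e.g. /flightName=ABC/aircraftName=XYZ).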

Page 13: Apache Hive for modern DBAs

Data Model (complex data types)

Array: an ordered collection of fields, all of the same type. Example: array(1, 2)

Map: unordered key-value pairs. Keys are primitives, values can be any type. Example: map(‘a’, 1, ‘b’, 2)

Struct: a collection of named fields. Example: struct(‘a’, 10, 2.5)
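A minimal sketch of the three complex types in a table definition and query, assuming a hypothetical crew table:

```sql
-- Hypothetical table using all three complex types
CREATE TABLE crew (
  name STRING,
  certifications ARRAY<STRING>,
  hoursByAircraft MAP<STRING, INT>,
  base STRUCT<city:STRING, country:STRING>
);

SELECT certifications[0],         -- array access by index
       hoursByAircraft['A320'],   -- map lookup by key
       base.city                  -- struct field access
FROM crew;
```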

Page 14: Apache Hive for modern DBAs

Data model

Page 15: Apache Hive for modern DBAs

HiveQL

● HiveQL is an SQL-like query language for Hive

● Supports DDL and DML

● Supports multi-table inserts

● Possible to write custom map-reduce scripts

● Supports UDFs, UDAFs and UDTFs
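A multi-table insert scans the source once and writes to several targets; a sketch assuming hypothetical flights, delayed_flights and ontime_flights tables:

```sql
-- One scan of flights feeds two destination tables
FROM flights f
INSERT OVERWRITE TABLE delayed_flights
  SELECT f.flightID, f.delay WHERE f.delay > 15
INSERT OVERWRITE TABLE ontime_flights
  SELECT f.flightID, f.delay WHERE f.delay <= 15;
```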

Page 16: Apache Hive for modern DBAs

DDL (some examples)

HIVE> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX

HIVE> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX

HIVE> TRUNCATE TABLE

HIVE> ALTER DATABASE/SCHEMA, TABLE, VIEW

HIVE> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS,

PARTITIONS, FUNCTIONS

HIVE> DESCRIBE DATABASE/SCHEMA, table_name, view_name

Page 17: Apache Hive for modern DBAs

File formats

● Parquet: compressed, efficient columnar data representation available to any project in the Hadoop ecosystem

● ORC: made for Hive; supports the Hive type model, columnar storage, block compression, predicate pushdown, ACID*, etc.

● Avro: uses JSON for defining data types and protocols, and serializes data in a compact binary format

● Compressed file formats (LZO, GZIP)

● Plain text files

● Any other data subject to a format can be read (CSV, JSON, XML, etc.)

Page 18: Apache Hive for modern DBAs

ORC

● Stored as columns and compressed = smaller disk reads

● ORC has a built-in index, min/max values and other aggregates (e.g. count, sum) = entire blocks can be skipped to speed up reads

● ORC implements predicate pushdown and bloom filters

● ORC scales

● You should use it :-)
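Enabling the features above is mostly a matter of table properties; a sketch assuming a hypothetical flights_orc table (property values are illustrative):

```sql
CREATE TABLE flights_orc (
  flightID INT,
  delay INT
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',                  -- block compression
  'orc.bloom.filter.columns' = 'flightID'   -- bloom filter for point lookups
);
```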

Page 19: Apache Hive for modern DBAs

Indexing

● Not recommended, because of ORC

● ORC has built-in indexes which allow the format to skip blocks of data during reads

● Hive indexes are implemented as tables

● Compact indexes and bitmap indexes are supported

● Index tables hold information about which data is in which blocks and are used to skip data (as ORC already does)

● Not supported on the Tez engine - ignored

● Indexes in Hive are not like indexes in other databases.
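For completeness, this is roughly what a compact index looks like in HiveQL - a sketch assuming a hypothetical flights table, and rarely worth it given ORC:

```sql
CREATE INDEX flights_flightid_idx
ON TABLE flights (flightID)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- The index table is only populated when explicitly rebuilt
ALTER INDEX flights_flightid_idx ON flights REBUILD;
```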

Page 20: Apache Hive for modern DBAs

File formats & Indexing

Page 21: Apache Hive for modern DBAs

Hive Architecture

Clients: Hive Web Interface, Hive CLI (beeline, hive), Hive JDBC/ODBC

Thrift Server (HiveServer2)

Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator, Query Plan Generator), Optimizer, Executor, Metastore client

Metastore: RDBMS

Execution engines: MapReduce, Tez, Spark

Resource management: YARN

Storage: HDFS, HBase, Azure Storage, Amazon S3

Page 22: Apache Hive for modern DBAs

Metastore

● Typically stored in an RDBMS (MySQL, SQL Server, PostgreSQL, Derby*) - ACID and concurrency on metadata queries

● Contains metadata for databases, tables and partitions

● Provides two features: data discovery and data abstraction

● Data abstraction: provides information about data formats, extractors and loaders at table creation time, reused afterwards (cf. Oracle’s dictionary tables)

● Data discovery: discover relevant and specific data; allows other tools to use the metadata to explore the data (e.g. SparkSQL)

Page 23: Apache Hive for modern DBAs

See it in action

Page 24: Apache Hive for modern DBAs

Execution engines

● 3 execution engines are available:

○ MapReduce (mr)

○ Tez

○ Spark

MR: the original, most stable and most reliable; batch oriented, disk-based parallel execution (like traditional Hadoop MR jobs).

Tez: high-performance batch and interactive data processing. Stable 99% of the time. The one you should use. Default on HDP.

Spark: uses Apache Spark (an in-memory computing platform). High performance (like Tez), not used in production (yet), but making good progress.
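The engine is chosen per session; a minimal sketch:

```sql
-- Choose the execution engine for the current session
SET hive.execution.engine=tez;   -- alternatives: mr, spark
```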

Page 25: Apache Hive for modern DBAs

MapReduce vs Tez/Spark

MapReduce:

● One pair of map and reduce does one level of aggregation over the data. Complex computations typically require multiple such steps.

Tez/Spark:

● DAG (Directed Acyclic Graph)

● The graph has no cycles because the fault-tolerance mechanism used by Tez is re-execution of failed tasks

● The limitations of MapReduce in Hadoop were a key reason to introduce DAGs

● Pipelines consecutive map steps into one

● Removes the enforced concurrency and serialization between MapReduce jobs

Page 26: Apache Hive for modern DBAs

Tez & DAGs

DAG definition:

● Data processing is expressed in the form of a directed acyclic graph (DAG)

Two main components:

● Vertices: nodes in the graph, representing processing of data

○ User logic that analyses and modifies the data sits in the vertices

● Edges: represent movement of data between the processing steps

○ Define the routing of data between tasks (One-To-One, Broadcast, Scatter-Gather)

○ Define when a consumer task is scheduled (Sequential, Concurrent)

○ Define the lifetime/reliability of a task output

Page 27: Apache Hive for modern DBAs

Hive Cost Based Optimizer - Why

● Distributed SQL query processing in Hadoop differs from conventional relational query engines in its handling of intermediate result sets

● Query processing requires sorting and reassembling of intermediate result sets - shuffling

● Most of the existing optimizations in Hive are about minimizing shuffling cost, plus logical optimizations like filter pushdown, projection pruning and partition pruning

● Join reordering and join-algorithm selection become possible with a cost-based optimizer.

Page 28: Apache Hive for modern DBAs

Hive CBO - What to get

● Based on a project called Apache Calcite (https://calcite.apache.org/)

● With a cost-based optimizer you get:

○ Join ordering (join reordering)

○ The algorithm to use for a join

○ Whether an intermediate result should be persisted or recomputed on failure

○ The degree of parallelism at any operator (number of mappers and reducers)

○ Semi-join selection

○ (other optimizer tricks, like histograms)
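The CBO is driven by configuration and by table/column statistics; a minimal sketch (the flights table is hypothetical):

```sql
-- Enable the cost-based optimizer and statistics usage
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- The CBO needs statistics to cost alternative plans
ANALYZE TABLE flights COMPUTE STATISTICS;
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;
```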

Page 29: Apache Hive for modern DBAs

Execution Engines

Page 30: Apache Hive for modern DBAs

Hive - The present-future

● Tez and Spark head to head on performance and stability

● LLAP (Live Long and Process) - Hive interactive queries

● ACID

Page 31: Apache Hive for modern DBAs

Hive next big thing: LLAP

● Sub-second queries (interactive queries)

● In-memory caching layer with async I/O

● Fast concurrent execution

● Move from disk-oriented to memory-oriented execution (the trend)

● Disks are connected to CPUs via the network - data locality is no longer relevant

Page 32: Apache Hive for modern DBAs

Thank you

Questions?

@drune

https://www.linkedin.com/in/lcmarques/

[email protected]

@redgluept

www.redglue.eu

Page 33: Apache Hive for modern DBAs