Apache Hive for modern DBAs
Luís Marques
About me
● Oracle ACE
● Data and Linux geek
● Long-time open source supporter
● Works for @redgluept as Data Architect
● @drune
Big Data Thinking Strategy
● Think small
● Think big
● Don't think at all (the hype is here)
What is Apache Hive?
● Open source, TB/PB-scale data warehousing framework built on Hadoop
● The first and most complete SQL-on-"Hadoop" solution
● SQL:2003 and SQL:2011 compatible
● Stores data in several formats
● Several execution engines available
● Interactive query support (in-memory cache)
Apache Hive - Before you ask
● Data warehouse/OLAP activities (data mining, data exploration, batch processing, ETL, etc.): "the heavy lifting of data"
● Low-cost scaling, built with extensibility in mind
● Use it for large datasets (gigabyte/terabyte scale)
● Don't use Hive for any OLTP activities
● ACID exists, but is not recommended yet
The reason behind Hive
I had written, as part of working with the Feed team - what became - a rather complicated MR job to rank friends by mutual friends.
In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how hard it was to write an optimal MR job (particularly on large data sets).
Assembling data into complex data structures was also painful.
I really wanted to see these types of operators exposed in a high level declarative form so that the average user would never have to go through this. Fortunately - our team had Oracle veterans well versed in the art of SQL.
Joydeep Sen Sarma (Facebook)
The reason behind Hive
Instead of complex MR jobs
You have declarative language...
Apache Hive versions & branches
master:
● Version 2.x
● New code, new features
● Hadoop 2.x supported
branch-1:
● Version 1.x
● Stable features, backwards compatibility
● Critical bug fixes
● Hadoop 1.x and 2.x supported
Data Model (data units & types)
● Supports primitive column types (integers, numbers, strings, dates/times and booleans)
● Supports complex types: structs, maps and arrays
● Concept of databases, tables, partitions and buckets
● SerDe: the serialization/deserialization API used to move data in and out of tables
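As a sketch of the SerDe mechanism, the table below uses the JSON SerDe shipped with HCatalog; the table and column names are illustrative:

```sql
-- Hypothetical table whose rows are parsed from JSON files by a SerDe
CREATE TABLE flights_json (
  flightID   STRING,
  passengers ARRAY<STRING>,
  aircraft   STRUCT<name:STRING, seats:INT>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```

The SerDe class, not the query engine, decides how raw bytes become typed columns, which is why the same table DDL works over many file layouts.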
Data Model (partitions & bucketing)
● Partitioning: used to distribute load horizontally; gives a performance benefit and organizes the data

PARTITIONED BY (flightName STRING, AircraftName STRING)
/employees/flightName=ABC/AircraftName=XYZ

● Buckets (clusters): decompose data sets into more manageable parts, help with map-side joins, and allow correct sampling within a bucket

"Records with the same flightID will always be stored in the same bucket. Assuming the number of flightIDs is much greater than the number of buckets, each bucket will have many flightIDs."

CLUSTERED BY (flightID) INTO XX BUCKETS;
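Putting the two clauses together, a sketch of a table that is both partitioned and bucketed (table name, columns and bucket count are illustrative):

```sql
-- Partitions become directories on HDFS; buckets become files within them
CREATE TABLE flights (
  flightID  STRING,
  departure TIMESTAMP
)
PARTITIONED BY (flightName STRING, AircraftName STRING)
CLUSTERED BY (flightID) INTO 32 BUCKETS
STORED AS ORC;
```

Note that partition columns are not stored in the data files; they are encoded in the directory path shown above.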
Data Model (complex data types)
● Array: ordered collection of fields, all of the same type — array(1, 2)
● Map: unordered key-value pairs; keys are primitives, values may be any type — map('a', 1, 'b', 2)
● Struct: a collection of named fields — struct('a', 10, 2.5)
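A sketch of how these types are accessed in a query (table and column names are illustrative):

```sql
-- Element, key and field access on the three complex types
SELECT
  passengers[0],        -- first element of an ARRAY
  fares['economy'],     -- value lookup in a MAP by key
  aircraft.name         -- named field access on a STRUCT
FROM flights_complex;
```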
Data model
HiveQL
● HiveQL is an SQL-like query language for Hive
● Supports DDL and DML
● Supports multi-table inserts
● Possible to write custom MapReduce scripts
● Supports UDFs, UDAFs and UDTFs
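A multi-table insert, one of the bullets above, scans the source once and feeds several targets; all table and column names here are illustrative:

```sql
-- One pass over "flights" populates two tables
FROM flights f
INSERT OVERWRITE TABLE long_haul  SELECT f.* WHERE f.distance >= 5000
INSERT OVERWRITE TABLE short_haul SELECT f.* WHERE f.distance <  5000;
```

Because the source is read only once, this is much cheaper than running two independent INSERT ... SELECT statements.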
DDL (some examples)
hive> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX
hive> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
hive> TRUNCATE TABLE
hive> ALTER DATABASE/SCHEMA, TABLE, VIEW
hive> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS, PARTITIONS, FUNCTIONS
hive> DESCRIBE DATABASE/SCHEMA, table_name, view_name
File formats
● Parquet: compressed, efficient columnar data representation available to any project in the Hadoop ecosystem
● ORC: made for Hive; supports the Hive type model, columnar storage, block compression, predicate pushdown, ACID*, etc.
● Avro: uses JSON for defining data types and protocols, and serializes data in a compact binary format
● Compressed file formats (LZO, GZIP)
● Plain text files
● Any other data with a known format can be read (CSV, JSON, XML, etc.)
ORC
● Stored as columns and compressed = smaller disk reads
● ORC has a built-in index with min/max values and other aggregates (e.g. sum) = entire blocks can be skipped to speed up reads
● ORC implements predicate pushdown and bloom filters
● ORC scales
● You should use it :-)
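A minimal sketch of an ORC-backed table, assuming an illustrative table name; `orc.compress` selects the block compression codec:

```sql
-- ORC storage with ZLIB block compression
CREATE TABLE flights_orc (
  flightID  STRING,
  departure TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```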
Indexing
● Not generally recommended, because of ORC
● ORC has built-in indexes which allow the format to skip blocks of data during reads
● Hive indexes are implemented as tables
● Compact indexes and bitmap indexes are supported
● Index tables record which data is in which blocks and are used to skip data (as ORC already does)
● Not supported on the Tez engine; they are ignored
● Indexes in Hive are not like indexes in other databases
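For reference, a sketch of the compact-index DDL mentioned above (index and table names are illustrative; this feature was later removed in Hive 3.0):

```sql
-- A compact index is itself stored as a table and must be rebuilt explicitly
CREATE INDEX flights_idx
ON TABLE flights (flightID)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX flights_idx ON flights REBUILD;  -- populate the index table
```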
File formats & Indexing
Hive Architecture
● Clients: Hive Web Interface, Hive CLI (beeline, hive), Hive JDBC/ODBC
● Thrift Server (HiveServer2)
● Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator, Query Plan Generator), Optimizer, Executor, Metastore client
● Metastore RDBMS
● Execution Engines: MapReduce, Tez, Spark
● Resource Management: YARN
● Storage: HDFS, HBase, Azure Storage, Amazon S3
Metastore
● Typically stored in an RDBMS (MySQL, SQL Server, PostgreSQL, Derby*) for ACID and concurrency on metadata queries
● Contains metadata for databases, tables and partitions
● Provides two features: data discovery and data abstraction
● Data abstraction: provides information about data formats, extractors and loaders at table creation, which is then reused (cf. Oracle's dictionary tables)
● Data discovery: discover relevant and specific data, and allow other tools to use the metadata to explore the data (e.g. SparkSQL)
See it in action
Execution engines
● 3 execution engines are available:
○ MapReduce (mr)
○ Tez
○ Spark

MR: the original, most stable and most reliable; batch-oriented, disk-based parallelism (like traditional Hadoop MR jobs).
Tez: high-performance batch and interactive data processing. Stable 99% of the time. The one you should use; the default on HDP.
Spark: uses Apache Spark (an in-memory computing platform). High performance (like Tez); not used in production (yet), but making good progress.
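The engine is selected per session through a configuration property, so switching between the three is a one-liner:

```sql
-- Accepted values: mr, tez, spark
SET hive.execution.engine=tez;
```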
MapReduce vs Tez/Spark
MapReduce:
● One pair of map and reduce performs one level of aggregation over the data; complex computations typically require multiple such steps
Tez/Spark:
● DAG (Directed Acyclic Graph)
● The graph has no cycles, because the fault-tolerance mechanism used by Tez is re-execution of failed tasks
● The limitations of MapReduce in Hadoop were a key reason to introduce DAGs
● Pipelines consecutive map steps into one
● Enforces concurrency and serialization between MapReduce jobs
Tez & DAGs
DAG definition:
● Data processing is expressed in the form of a directed acyclic graph (DAG)
Two main components:
● Vertices: represent the processing of the data in the graph
○ The user logic that analyses and modifies the data sits in the vertices
● Edges: represent the movement of data between processing stages
○ Define the routing of data between tasks (One-To-One, Broadcast, Scatter-Gather)
○ Define when a consumer task is scheduled (Sequential, Concurrent)
○ Define the lifetime/reliability of a task output
Hive Cost Based Optimizer - Why
● Distributed SQL query processing in Hadoop differs from conventional relational query engines in how intermediate result sets are handled
● Query processing requires sorting and reassembling intermediate result sets: shuffling
● Most of the existing optimizations in Hive aim at minimizing shuffling cost, plus logical optimizations like filter pushdown, projection pruning and partition pruning
● Join reordering and join-algorithm selection become possible with a cost-based optimizer
Hive CBO - What to get
● Based on a project called Apache Calcite (https://calcite.apache.org/)
● With a cost-based optimizer you can get:
○ The order in which to perform joins (join reordering)
○ The algorithm to use for a join
○ Whether an intermediate result should be persisted or recomputed on failure
○ The degree of parallelism at any operator (number of mappers and reducers)
○ Semi-join selection
○ (other optimizer tricks, like histograms)
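A sketch of enabling the CBO and gathering the statistics its cost model relies on; the table name is illustrative:

```sql
-- Turn on Calcite-based optimization and stats-driven planning
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Column statistics feed the cost model used for join reordering
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS;
```

Without up-to-date statistics the optimizer falls back to crude defaults, so the ANALYZE step matters as much as the flags.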
Execution Engines
Hive - The present-future
● Tez and Spark head to head on performance and stability
● LLAP (Live Long and Process): interactive Hive queries
● ACID
Hive next big thing: LLAP
● Sub-second queries (interactive queries)
● In-memory caching layer with async I/O
● Fast concurrent execution
● Move from disk-oriented to memory-oriented execution (the trend)
● Disks are connected to the CPU via the network, so data locality becomes less relevant
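As a sketch (assuming an HDP-style deployment with LLAP daemons already running), queries are directed at LLAP with session-level settings:

```sql
-- Route query execution through the long-lived LLAP daemons
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;
```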
Thank you
Questions?
@drune
https://www.linkedin.com/in/lcmarques/
@redgluept
www.redglue.eu