1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Phoenix + Apache HBase: An Enterprise-Grade Data Warehouse
Ankit Singhal, Rajeshbabu, Josh Elser
June 30, 2016
About Us
Ankit Singhal
– Committer and member of the Apache Phoenix PMC
– MTS at Hortonworks
Rajeshbabu
– Committer and member of the Apache Phoenix PMC
– Committer in Apache HBase
– MTS at Hortonworks
Josh Elser
– Committer in Apache Phoenix
– Committer and member of the Apache Calcite PMC
– MTS at Hortonworks
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
– Phoenix Query Server
– Q&A
Data Warehouse
An EDW helps organize and aggregate analytical data from various functional domains, and serves as a critical repository for an organization's operations.
[Diagram: sources (files, IoT data, OLTP) flow through ETL into a staging area, then into the data warehouse and data marts, which feed visualization or BI tools.]
Phoenix Offerings and Interoperability
[Diagram: Phoenix interoperates across the ETL, data warehouse, and visualization/BI layers.]
HBase & Phoenix
– HBase: a distributed NoSQL store
– Phoenix: provides OLTP and analytics over HBase
[Diagram: an application embeds the Phoenix client atop the HBase client; the HBase client talks to ZooKeeper and to RegionServers backed by HDFS, with the Phoenix coprocessor loaded in each RegionServer alongside the table regions.]
Open Source Data Warehouse
[Chart: data warehouse options positioned by hardware cost (specialized vs. commodity H/W) and software cost (licensing cost vs. no cost): SMP and MPP systems sit in the specialized, licensed quadrant; open source MPP and HBase + Phoenix sit in the commodity hardware, no-cost quadrant.]
Phoenix & HBase as a Data Warehouse: Architecture
– Runs on commodity H/W
– True MPP
– O/S and H/W flexibility
– Supports OLTP and ROLAP
Phoenix & HBase as a Data Warehouse: Scalability
– Linear scalability for storage
– Linear scalability for memory
– Open to third-party storage
Phoenix & HBase as a Data Warehouse: Reliability
– Highly available
– Replication for disaster recovery
– Fully ACID for data integrity
Phoenix & HBase as a Data Warehouse: Manageability
– Performance tuning
– Data modeling & schema evolution
– Data pruning
– Online expansion or upgrade
– Data backup and recovery
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
Who uses Phoenix?
Analytics Use Case (Web Advertising Company)
Functional requirements
– Create a single source of truth
– Cross-dimensional queries on 50+ dimensions and 80+ metrics
– Support fast Top-N queries
Non-functional requirements
– Response time under 3 seconds for slice and dice
– 250+ concurrent users
– 100k+ analytics queries/day
– Highly available
– Linear scalability
Data Warehouse Capacity
Data size (ETL input)
– 24 TB/day of raw data system-wide
– 25 billion impressions
HBase input (cube)
– 6 billion rows of aggregated data (100 GB/day)
HBase cluster size
– 65 HBase nodes
– 520 TB of disk
– 4.1 TB of memory
Use Case Architecture
[Diagram: ad servers and click tracking feed Apache Kafka. A real-time path runs ETL (filter, aggregate) into an in-memory store; a batch path runs Camus from Kafka to HDFS, then ETL and a data uploader into HBase. Both paths converge behind a data API that serves the analytics UI.]
Analytics Data Warehouse Architecture
[Diagram: ETL output on HDFS is bulk-loaded into HBase, where the generated cubes are stored; the analytics UI converts slice-and-dice operations into SQL queries through the data API, with backup and recovery on the HBase side.]
Time Series Use Case (Apache Ambari)
Functional requirements
– Store all cluster metrics collected every second (10k to 100k metrics/second)
– Optimize storage and access for time series data
Non-functional requirements
– Near real-time response time
– Scalable
– Real-time ingestion
Ambari Metrics System (AMS)
AMS Architecture
[Diagram: metric monitors on each host and Hadoop sinks send metrics to the Metric Collector, which stores them via Phoenix into HBase; the Ambari Server reads metrics back from the collector.]
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
Schema Design: Primary Key Design
– The most important criterion driving overall query performance on the table
– The primary key should be composed of the most-used predicate columns in your queries
– In most cases, the leading part of the primary key should let queries be converted into point lookups or range scans in HBase
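As an illustrative sketch (the table and column names here are ours, not from the deck), a time-series table queried mostly by metric name and time range would lead its key with those columns:

```sql
-- Hypothetical table: queries filter on METRIC_NAME first, then on a
-- time range, so those columns lead the composite primary key.
CREATE TABLE IF NOT EXISTS METRIC_RECORD (
    METRIC_NAME VARCHAR NOT NULL,
    SERVER_TIME DATE NOT NULL,
    HOSTNAME    VARCHAR,
    METRIC_VAL  DOUBLE,
    CONSTRAINT PK PRIMARY KEY (METRIC_NAME, SERVER_TIME)
);

-- This predicate becomes a range scan over a single key prefix rather
-- than a full table scan:
SELECT * FROM METRIC_RECORD
 WHERE METRIC_NAME = 'cpu.user' AND SERVER_TIME > CURRENT_DATE() - 1;
```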
Schema Design: Salting vs. Pre-split
– Use salting to alleviate write hot-spotting:
CREATE TABLE …(
…
) SALT_BUCKETS = N
– The number of buckets should be equal to the number of RegionServers
– Otherwise, pre-split the table if you know the row key data set:
CREATE TABLE …(
…
) SPLIT ON (…)
Schema Design: Table Properties
– Use block encoding and/or compression for better performance:
CREATE TABLE …(
…
) DATA_BLOCK_ENCODING = 'FAST_DIFF', COMPRESSION = 'SNAPPY'
– Use region replication for read high availability:
CREATE TABLE …(
…
) "REGION_REPLICATION" = "2"
Schema Design: Table Properties (continued)
– Set UPDATE_CACHE_FREQUENCY to a larger value to avoid frequently contacting the server for metadata updates:
CREATE TABLE …(
…
) UPDATE_CACHE_FREQUENCY = 300000
Schema Design: Column Families
– Divide columns into multiple column families if some columns are rarely accessed
– HBase reads only the files of the column families referenced in the query, reducing I/O

pk1, pk2 | CF1: Col1, Col2, Col3, Col4 | CF2: Col5, Col6, Col7
(CF1 holds frequently accessed columns; CF2 holds rarely accessed columns)
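A minimal sketch of the split (hypothetical table and family names; in Phoenix a column family is written as a prefix on the column name):

```sql
-- Family A holds hot columns, family B holds cold ones.
CREATE TABLE IF NOT EXISTS USER_PROFILE (
    USER_ID  VARCHAR NOT NULL PRIMARY KEY,
    A.NAME   VARCHAR,   -- frequently read
    A.EMAIL  VARCHAR,   -- frequently read
    B.BIO    VARCHAR,   -- rarely read
    B.AVATAR VARBINARY  -- rarely read
);

-- Touches only the store files of family A:
SELECT A.NAME, A.EMAIL FROM USER_PROFILE WHERE USER_ID = 'u1';
```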
Secondary Indexes
– Global indexes: optimized for read-heavy use cases
CREATE INDEX idx ON table(…)
– Local indexes: optimized for write-heavy and space-constrained use cases
CREATE LOCAL INDEX idx ON table(…)
– Functional indexes: allow you to create indexes on arbitrary expressions
CREATE INDEX UPPER_NAME_INDEX ON EMP (UPPER(FIRSTNAME || ' ' || LASTNAME))
Secondary Indexes (continued)
– Use covered indexes to efficiently scan the index table instead of the primary table:
CREATE INDEX idx ON table(…) INCLUDE(…)
– Pass an index hint to guide the query optimizer to select the right index for a query:
SELECT /*+ INDEX(<table> <index>) */ …
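A hedged worked example of both ideas (table, index, and column names are illustrative):

```sql
-- Covered index: EMAIL is the indexed column, NAME is carried along in
-- the index rows so the query below never touches the USERS table.
CREATE INDEX IDX_EMAIL ON USERS (EMAIL) INCLUDE (NAME);

-- Answered entirely from IDX_EMAIL:
SELECT NAME FROM USERS WHERE EMAIL = 'a@example.com';

-- Forcing the optimizer's choice with an index hint:
SELECT /*+ INDEX(USERS IDX_EMAIL) */ NAME
  FROM USERS WHERE EMAIL = 'a@example.com';
```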
Row Timestamp Column
– Maps the HBase native row timestamp to a Phoenix column
– Leverages HBase optimizations such as setting the minimum and maximum time range on scans to entirely skip store files that fall outside that range
– Perfect for time series use cases
– Syntax:
CREATE TABLE …(
CREATED_DATE DATE NOT NULL,
…
CONSTRAINT PK PRIMARY KEY (CREATED_DATE ROW_TIMESTAMP, …)
)
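A fuller sketch of the syntax above (the EVENTS table and its columns are hypothetical):

```sql
-- The leading key column is mapped to the HBase cell timestamp via
-- ROW_TIMESTAMP.
CREATE TABLE IF NOT EXISTS EVENTS (
    CREATED_DATE DATE NOT NULL,
    EVENT_ID     VARCHAR NOT NULL,
    PAYLOAD      VARCHAR,
    CONSTRAINT PK PRIMARY KEY (CREATED_DATE ROW_TIMESTAMP, EVENT_ID)
);

-- The time-range predicate lets HBase skip store files whose time
-- range ends before yesterday:
SELECT * FROM EVENTS WHERE CREATED_DATE >= CURRENT_DATE() - 1;
```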
Use of Statistics
[Diagram: without statistics, clients parallelize a scan only along region boundaries (A, F, L, R); with statistics, each region is divided into smaller guidepost chunks (A, C, F, I, L, O, R, U), giving clients finer-grained parallelism.]
Skip Scan
– Phoenix supports skip scans to jump directly to matching keys when the query has key sets in the predicate:

SELECT * FROM METRIC_RECORD
 WHERE METRIC_NAME LIKE 'abc%' AND HOSTNAME IN ('host1', 'host2');

CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 RANGES OVER METRIC_RECORD ['abc','host1'] - ['abd','host2']

[Diagram: the client's skip scan touches only the regions containing matching key ranges across the RegionServers.]
Join Optimizations
– Hash join: outperforms other join algorithms when one of the relations is small, or the records matching the predicate fit into memory
– Sort-merge join: use when both relations are very large
– NO_STAR_JOIN hint: for multiple inner-join queries, Phoenix applies a star-join optimization by default; use this hint when the overall size of all right-hand-side tables would exceed the memory size limit
– NO_CHILD_PARENT_JOIN_OPTIMIZATION hint: prevents use of the child-parent join optimization
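The hints above can be sketched in queries like these (the ORDERS/CUSTOMERS/PRODUCTS tables are hypothetical; USE_SORT_MERGE_JOIN is Phoenix's hint for forcing the sort-merge algorithm):

```sql
-- Force sort-merge when both relations are too large to hash in memory:
SELECT /*+ USE_SORT_MERGE_JOIN */ o.ORDER_ID, c.NAME
  FROM ORDERS o
  JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID;

-- Disable the default star-join optimization when the combined
-- right-hand-side tables would exceed the memory limit:
SELECT /*+ NO_STAR_JOIN */ o.ORDER_ID, c.NAME, p.TITLE
  FROM ORDERS o
  JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID
  JOIN PRODUCTS  p ON o.PRODUCT_ID  = p.PRODUCT_ID;
```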
Optimize Writes
– UPSERT VALUES
– Call it multiple times before committing, to batch mutations
– Use a prepared statement when you run the same statement repeatedly
– UPSERT SELECT
– Configure phoenix.mutate.batchSize based on row size
– Set auto-commit to true to write scan results directly to HBase
– Set auto-commit to true when running UPSERT SELECT on the same table, so that writes happen on the server
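A hedged sketch of the batching pattern, assuming a hypothetical table METRIC_RECORD(METRIC_NAME, SERVER_TIME, HOSTNAME, METRIC_VAL):

```sql
-- With auto-commit OFF, these mutations are buffered client-side and
-- sent to HBase as one batch when the JDBC connection commits.
UPSERT INTO METRIC_RECORD VALUES ('cpu.user', CURRENT_DATE(), 'host1', 0.42);
UPSERT INTO METRIC_RECORD VALUES ('cpu.user', CURRENT_DATE(), 'host2', 0.37);
-- ... more rows, then Connection.commit() from the application ...

-- With auto-commit ON and the same source and target table, the
-- UPSERT SELECT can run server-side, avoiding a client round trip:
UPSERT INTO METRIC_RECORD
SELECT METRIC_NAME, SERVER_TIME, HOSTNAME, METRIC_VAL
  FROM METRIC_RECORD WHERE METRIC_NAME = 'cpu.user';
```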
Hints
– Some other important hints: SERIAL (serial scan), RANGE_SCAN, SMALL (small scan)
Additional References
For more optimizations, refer to these documents:
– http://phoenix.apache.org/tuning.html
– https://hbase.apache.org/book.html#performance
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
– Phoenix Query Server
Apache Phoenix Query Server
– A standalone, optional service that proxies user requests to HBase/Phoenix
– Reference client implementation via JDBC: "thick" versus "thin"
– First introduced in Apache Phoenix 4.4.0
– Built on Apache Calcite's Avatica: "a framework for building database drivers"
Traditional Apache Phoenix RPC Model
[Diagram: the application embeds the "thick" Phoenix client atop the HBase client, talking directly to ZooKeeper and to the RegionServers (with the Phoenix coprocessor) over HDFS.]
Query Server Model
Table,a,123
Table,,123
RegionServer
HDFS
HBase client
Phoenix client
Phx coprocZooKeeper Table,b,123
Table,a,123Phx coproc
Table,d,123
Table,b,123Phx coproc
RegionServer RegionServer
Query Server
Application
Query Server Technology
– HTTP server and wire API definition
– Pluggable serialization (Google Protocol Buffers)
– "Thin" JDBC driver (over HTTP)
– Other goodies:
– Pluggable metrics system
– TCK (technology compatibility kit)
– SPNEGO for Kerberos authentication
– Horizontally scalable with load balancing
Query Server Clients
– Go database/sql/driver: https://github.com/Boostport/avatica
– .NET driver: https://github.com/Azure/hdinsight-phoenix-sharp and https://www.nuget.org/packages/Microsoft.Phoenix.Client/1.0.0-preview
– ODBC: built by http://www.simba.com/, also available from Hortonworks
– Python DB API v2.0 (not "battle tested"): https://bitbucket.org/lalinsky/python-phoenixdb
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
– Phoenix Query Server
– Q&A
Phoenix & HBase
We hope to see you all migrating to Phoenix & HBase, and we look forward to your questions on the user mailing lists.
Get involved on the mailing lists:
user@phoenix.apache.org
user@hbase.apache.org
You can reach us at:
[email protected]
[email protected]
[email protected]
Thank You