1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Phoenix + Apache HBase: An Enterprise-Grade Data Warehouse
Ankit Singhal, Rajeshbabu, Josh Elser
June 30, 2016
About Us
Ankit Singhal
– Committer and member of the Apache Phoenix PMC
– MTS at Hortonworks
Rajeshbabu
– Committer and member of the Apache Phoenix PMC
– Committer in Apache HBase
– MTS at Hortonworks
Josh Elser
– Committer in Apache Phoenix
– Committer and member of the Apache Calcite PMC
– MTS at Hortonworks
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
– Phoenix Query Server
– Q&A
Data Warehouse
An EDW helps organize and aggregate analytical data from various functional domains, and serves as a critical repository for an organization's operations.
[Diagram: sources (files, IoT data, OLTP) flow through ETL into a staging area, then into the data warehouse and data marts, which feed visualization or BI tools.]
Phoenix Offerings and Interoperability
[Diagram: Phoenix interoperates across the ETL, data warehouse, and visualization/BI layers.]
HBase & Phoenix
– HBase: a distributed NoSQL store
– Phoenix: provides OLTP and analytics over HBase
[Diagram: an application embeds the Phoenix client atop the HBase client; the HBase client talks to ZooKeeper and to RegionServers backed by HDFS, with the Phoenix coprocessor loaded in each RegionServer alongside the table regions.]
Open Source Data Warehouse
[Chart: data warehouse options positioned by hardware cost (specialized vs. commodity H/W) and software cost (licensing cost vs. no cost): SMP and MPP systems sit in the specialized, licensed quadrant; open source MPP and HBase + Phoenix sit in the commodity hardware, no-cost quadrant.]
Phoenix & HBase as a Data Warehouse: Architecture
– Runs on commodity H/W
– True MPP
– O/S and H/W flexibility
– Supports OLTP and ROLAP
Phoenix & HBase as a Data Warehouse: Scalability
– Linear scalability for storage
– Linear scalability for memory
– Open to third-party storage
Phoenix & HBase as a Data Warehouse: Reliability
– Highly available
– Replication for disaster recovery
– Fully ACID for data integrity
Phoenix & HBase as a Data Warehouse: Manageability
– Performance tuning
– Data modeling & schema evolution
– Data pruning
– Online expansion or upgrade
– Data backup and recovery
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
Who uses Phoenix?
Analytics Use Case (Web Advertising Company)
Functional requirements
– Create a single source of truth
– Cross-dimensional queries on 50+ dimensions and 80+ metrics
– Support fast Top-N queries
Non-functional requirements
– Response time under 3 seconds for slice and dice
– 250+ concurrent users
– 100k+ analytics queries/day
– Highly available
– Linear scalability
Data Warehouse Capacity
Data size (ETL input)
– 24 TB/day of raw data system-wide
– 25 billion impressions
HBase input (cube)
– 6 billion rows of aggregated data (100 GB/day)
HBase cluster size
– 65 HBase nodes
– 520 TB of disk
– 4.1 TB of memory
Use Case Architecture
[Diagram: ad servers and click tracking feed Apache Kafka. A real-time path runs ETL (filter, aggregate) into an in-memory store; a batch path runs Camus from Kafka to HDFS, then ETL and a data uploader into HBase. Both paths converge behind a data API that serves the analytics UI.]
Analytics Data Warehouse Architecture
[Diagram: ETL output on HDFS is bulk-loaded into HBase, where the generated cubes are stored; the analytics UI converts slice-and-dice operations into SQL queries through the data API, with backup and recovery on the HBase side.]
Time Series Use Case (Apache Ambari)
Functional requirements
– Store all cluster metrics collected every second (10k to 100k metrics/second)
– Optimize storage and access for time series data
Non-functional requirements
– Near real-time response time
– Scalable
– Real-time ingestion
Ambari Metrics System (AMS)
AMS Architecture
[Diagram: metric monitors on each host and Hadoop sinks send metrics to the Metric Collector, which stores them via Phoenix into HBase; the Ambari Server reads metrics back from the collector.]
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
Schema Design: Primary Key Design
– The most important criterion driving overall query performance on the table
– The primary key should be composed of the most-used predicate columns in your queries
– In most cases, the leading part of the primary key should let queries be converted into point lookups or range scans in HBase
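As an illustrative sketch (the table and column names here are ours, not from the deck), a time-series table queried mostly by metric name and time range would lead its key with those columns:

```sql
-- Hypothetical table: queries filter on METRIC_NAME first, then on a
-- time range, so those columns lead the composite primary key.
CREATE TABLE IF NOT EXISTS METRIC_RECORD (
    METRIC_NAME VARCHAR NOT NULL,
    SERVER_TIME DATE NOT NULL,
    HOSTNAME    VARCHAR,
    METRIC_VAL  DOUBLE,
    CONSTRAINT PK PRIMARY KEY (METRIC_NAME, SERVER_TIME)
);

-- This predicate becomes a range scan over a single key prefix rather
-- than a full table scan:
SELECT * FROM METRIC_RECORD
 WHERE METRIC_NAME = 'cpu.user' AND SERVER_TIME > CURRENT_DATE() - 1;
```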
Schema Design: Salting vs. Pre-split
– Use salting to alleviate write hot-spotting:
CREATE TABLE …(
…
) SALT_BUCKETS = N
– The number of buckets should be equal to the number of RegionServers
– Otherwise, pre-split the table if you know the row key data set:
CREATE TABLE …(
…
) SPLIT ON (…)
Schema Design: Table Properties
– Use block encoding and/or compression for better performance:
CREATE TABLE …(
…
) DATA_BLOCK_ENCODING = 'FAST_DIFF', COMPRESSION = 'SNAPPY'
– Use region replication for read high availability:
CREATE TABLE …(
…
) "REGION_REPLICATION" = "2"
Schema Design: Table Properties (continued)
– Set UPDATE_CACHE_FREQUENCY to a larger value to avoid frequently contacting the server for metadata updates:
CREATE TABLE …(
…
) UPDATE_CACHE_FREQUENCY = 300000
Schema Design: Column Families
– Divide columns into multiple column families if some columns are rarely accessed
– HBase reads only the files of the column families referenced in the query, reducing I/O

pk1, pk2 | CF1: Col1, Col2, Col3, Col4 | CF2: Col5, Col6, Col7
(CF1 holds frequently accessed columns; CF2 holds rarely accessed columns)
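A minimal sketch of the split (hypothetical table and family names; in Phoenix a column family is written as a prefix on the column name):

```sql
-- Family A holds hot columns, family B holds cold ones.
CREATE TABLE IF NOT EXISTS USER_PROFILE (
    USER_ID  VARCHAR NOT NULL PRIMARY KEY,
    A.NAME   VARCHAR,   -- frequently read
    A.EMAIL  VARCHAR,   -- frequently read
    B.BIO    VARCHAR,   -- rarely read
    B.AVATAR VARBINARY  -- rarely read
);

-- Touches only the store files of family A:
SELECT A.NAME, A.EMAIL FROM USER_PROFILE WHERE USER_ID = 'u1';
```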
Secondary Indexes
– Global indexes: optimized for read-heavy use cases
CREATE INDEX idx ON table(…)
– Local indexes: optimized for write-heavy and space-constrained use cases
CREATE LOCAL INDEX idx ON table(…)
– Functional indexes: allow you to create indexes on arbitrary expressions
CREATE INDEX UPPER_NAME_INDEX ON EMP (UPPER(FIRSTNAME || ' ' || LASTNAME))
Secondary Indexes (continued)
– Use covered indexes to efficiently scan the index table instead of the primary table:
CREATE INDEX idx ON table(…) INCLUDE(…)
– Pass an index hint to guide the query optimizer to select the right index for a query:
SELECT /*+ INDEX(<table> <index>) */ …
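A hedged worked example of both ideas (table, index, and column names are illustrative):

```sql
-- Covered index: EMAIL is the indexed column, NAME is carried along in
-- the index rows so the query below never touches the USERS table.
CREATE INDEX IDX_EMAIL ON USERS (EMAIL) INCLUDE (NAME);

-- Answered entirely from IDX_EMAIL:
SELECT NAME FROM USERS WHERE EMAIL = 'a@example.com';

-- Forcing the optimizer's choice with an index hint:
SELECT /*+ INDEX(USERS IDX_EMAIL) */ NAME
  FROM USERS WHERE EMAIL = 'a@example.com';
```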
Row Timestamp Column
– Maps the HBase native row timestamp to a Phoenix column
– Leverages HBase optimizations such as setting the minimum and maximum time range on scans to entirely skip store files that fall outside that range
– Perfect for time series use cases
– Syntax:
CREATE TABLE …(
CREATED_DATE DATE NOT NULL,
…
CONSTRAINT PK PRIMARY KEY (CREATED_DATE ROW_TIMESTAMP, …)
)
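A fuller sketch of the syntax above (the EVENTS table and its columns are hypothetical):

```sql
-- The leading key column is mapped to the HBase cell timestamp via
-- ROW_TIMESTAMP.
CREATE TABLE IF NOT EXISTS EVENTS (
    CREATED_DATE DATE NOT NULL,
    EVENT_ID     VARCHAR NOT NULL,
    PAYLOAD      VARCHAR,
    CONSTRAINT PK PRIMARY KEY (CREATED_DATE ROW_TIMESTAMP, EVENT_ID)
);

-- The time-range predicate lets HBase skip store files whose time
-- range ends before yesterday:
SELECT * FROM EVENTS WHERE CREATED_DATE >= CURRENT_DATE() - 1;
```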
Use of Statistics
[Diagram: without statistics, clients parallelize a scan only along region boundaries (A, F, L, R); with statistics, each region is divided into smaller guidepost chunks (A, C, F, I, L, O, R, U), giving clients finer-grained parallelism.]
Skip Scan
– Phoenix supports skip scans to jump directly to matching keys when the query has key sets in the predicate:

SELECT * FROM METRIC_RECORD
 WHERE METRIC_NAME LIKE 'abc%' AND HOSTNAME IN ('host1', 'host2');

CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 RANGES OVER METRIC_RECORD ['abc','host1'] - ['abd','host2']

[Diagram: the client's skip scan touches only the regions containing matching key ranges across the RegionServers.]
Join Optimizations
– Hash join: outperforms other join algorithms when one of the relations is small, or the records matching the predicate fit into memory
– Sort-merge join: use when both relations are very large
– NO_STAR_JOIN hint: for multiple inner-join queries, Phoenix applies a star-join optimization by default; use this hint when the overall size of all right-hand-side tables would exceed the memory size limit
– NO_CHILD_PARENT_JOIN_OPTIMIZATION hint: prevents use of the child-parent join optimization
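The hints above can be sketched in queries like these (the ORDERS/CUSTOMERS/PRODUCTS tables are hypothetical; USE_SORT_MERGE_JOIN is Phoenix's hint for forcing the sort-merge algorithm):

```sql
-- Force sort-merge when both relations are too large to hash in memory:
SELECT /*+ USE_SORT_MERGE_JOIN */ o.ORDER_ID, c.NAME
  FROM ORDERS o
  JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID;

-- Disable the default star-join optimization when the combined
-- right-hand-side tables would exceed the memory limit:
SELECT /*+ NO_STAR_JOIN */ o.ORDER_ID, c.NAME, p.TITLE
  FROM ORDERS o
  JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID
  JOIN PRODUCTS  p ON o.PRODUCT_ID  = p.PRODUCT_ID;
```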
Optimize Writes
– UPSERT VALUES
– Call it multiple times before committing, to batch mutations
– Use a prepared statement when you run the same statement repeatedly
– UPSERT SELECT
– Configure phoenix.mutate.batchSize based on row size
– Set auto-commit to true to write scan results directly to HBase
– Set auto-commit to true when running UPSERT SELECT on the same table, so that writes happen on the server
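A hedged sketch of the batching pattern, assuming a hypothetical table METRIC_RECORD(METRIC_NAME, SERVER_TIME, HOSTNAME, METRIC_VAL):

```sql
-- With auto-commit OFF, these mutations are buffered client-side and
-- sent to HBase as one batch when the JDBC connection commits.
UPSERT INTO METRIC_RECORD VALUES ('cpu.user', CURRENT_DATE(), 'host1', 0.42);
UPSERT INTO METRIC_RECORD VALUES ('cpu.user', CURRENT_DATE(), 'host2', 0.37);
-- ... more rows, then Connection.commit() from the application ...

-- With auto-commit ON and the same source and target table, the
-- UPSERT SELECT can run server-side, avoiding a client round trip:
UPSERT INTO METRIC_RECORD
SELECT METRIC_NAME, SERVER_TIME, HOSTNAME, METRIC_VAL
  FROM METRIC_RECORD WHERE METRIC_NAME = 'cpu.user';
```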
Hints
– Some other important hints: SERIAL (serial scan), RANGE_SCAN, SMALL (small scan)
Additional References
For more optimizations, refer to these documents:
– http://phoenix.apache.org/tuning.html
– https://hbase.apache.org/book.html#performance
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
– Phoenix Query Server
Apache Phoenix Query Server
– A standalone, optional service that proxies user requests to HBase/Phoenix
– Reference client implementation via JDBC: "thick" versus "thin"
– First introduced in Apache Phoenix 4.4.0
– Built on Apache Calcite's Avatica: "a framework for building database drivers"
Traditional Apache Phoenix RPC Model
[Diagram: the application embeds the "thick" Phoenix client atop the HBase client, talking directly to ZooKeeper and to the RegionServers (with the Phoenix coprocessor) over HDFS.]
Query Server Model
Table,a,123
Table,,123
RegionServer
HDFS
HBase client
Phoenix client
Phx coprocZooKeeper Table,b,123
Table,a,123Phx coproc
Table,d,123
Table,b,123Phx coproc
RegionServer RegionServer
Query Server
Application
Query Server Technology
– HTTP server and wire API definition
– Pluggable serialization (Google Protocol Buffers)
– "Thin" JDBC driver (over HTTP)
– Other goodies:
– Pluggable metrics system
– TCK (technology compatibility kit)
– SPNEGO for Kerberos authentication
– Horizontally scalable with load balancing
Query Server Clients
– Go database/sql/driver: https://github.com/Boostport/avatica
– .NET driver: https://github.com/Azure/hdinsight-phoenix-sharp and https://www.nuget.org/packages/Microsoft.Phoenix.Client/1.0.0-preview
– ODBC: built by http://www.simba.com/, also available from Hortonworks
– Python DB API v2.0 (not "battle tested"): https://bitbucket.org/lalinsky/python-phoenixdb
Agenda
– Phoenix & HBase as an Enterprise Data Warehouse
– Use Cases
– Optimizations
– Phoenix Query Server
– Q&A
Phoenix & HBase
We hope to see you all migrating to Phoenix & HBase, and we look forward to your questions on the user mailing lists.
Get involved on the mailing lists:
user@phoenix.apache.org
user@hbase.apache.org
You can reach us at:
[email protected]
[email protected]
[email protected]
Thank You