Page 1 © Hortonworks Inc. 2014
SQL on HBase with Phoenix
Page 2
Agenda
What Is Apache HBase
• High Level Overview.
• Technical Detail.
What Is Apache Phoenix
• Overview.
• What's New.
• Secondary Index Demo.
Page 3
New Data Requires a New Data Architecture
Source: IDC. 2.8 ZB of data in 2012, growing to 40 ZB by 2020; 85% of it from new data types; 15x growth in machine data by 2020.
Traditional sources: OLTP, ERP, CRM systems. New data types: unstructured documents, emails, clickstream, server logs, sentiment and web data, sensor and machine data, geolocation.
A modern database needs to be more scalable, handle new data types, and be intelligent and predictive.
Page 4
What Is Apache HBase?
• 100% open source.
• Store and process petabytes of data.
• Flexible schema; scale out on commodity servers.
• High performance, high availability.
• Integrated with YARN.
• SQL and NoSQL interfaces.
[Figure: HBase RegionServers 1..N running on YARN, the data operating system, with HDFS as permanent data storage.]
Dynamic schema. Scales horizontally to petabytes of data. Directly integrated with Hadoop.
Page 5
Kinds of Apps Built with HBase
Interested? See HBase Case Studies later in this document.
• Write-heavy, low-latency workloads.
• Search / indexing.
• Messaging.
• Audit / log archive.
• Advertising data cubes.
• Time series, sensor / device data.
Page 6
HBase is Deeply Integrated with Hadoop
• Data is stored in HDFS. You can store more data and re-use existing HDFS expertise.
• HBase is integrated with YARN.
• Analytics in-place using Hive, Pig, Spark and more.
Page 7
Who’s Using HBase?
Page 8
HBase Technical Details
Spring 2014 Version 1.0
Page 9
HBase Technical Details
Based on Google BigTable:
• Dynamic schema.
• Good for very sparse datasets.
• All data is range-partitioned for trivial horizontal scaling across commodity hardware.
Directly integrated with HDFS and Hadoop:
• Analyze data in HBase with any Hadoop ecosystem tool (Hive, Pig, MapReduce, Tez, etc.).
• Re-use existing Hadoop skills to run HBase.
Page 10
Page 11
Logical Architecture: Distributed, persistent partitions of a BigTable
[Figure: Table A is split by rowkey range into Region 1 (a-d), Region 2 (e-h), Region 3 (i-l), and Region 4 (m-p). Region Server 7 hosts Table A Regions 1 and 2, Table G Region 1070, and Table L Region 25; Region Server 86 hosts Table A Region 3, Table C Region 30, and Table F Regions 160 and 776; Region Server 367 hosts Table A Region 4, Table C Region 17, Table E Region 52, and Table P Region 1116.]
Legend:
• A single table is partitioned into Regions of roughly equal size.
• Regions are assigned to Region Servers across the cluster.
• Region Servers host roughly the same number of Regions.
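The range partitioning described in the legend can be sketched with a sorted map: the region owning a rowkey is the one with the greatest start key less than or equal to it. This is a simplified model under illustrative names; a real HBase client locates regions through the hbase:meta catalog table.

```java
import java.util.TreeMap;

public class RegionLookup {
    // Maps each region's start rowkey to a region name.
    // Simplified sketch of HBase range partitioning.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // The region owning a rowkey is the one with the greatest
    // start key <= rowkey (assumes the key is covered by some region).
    public String regionFor(String rowkey) {
        return regionsByStartKey.floorEntry(rowkey).getValue();
    }

    public static void main(String[] args) {
        RegionLookup tableA = new RegionLookup();
        tableA.addRegion("a", "Region 1"); // rowkeys a..d
        tableA.addRegion("e", "Region 2"); // rowkeys e..h
        tableA.addRegion("i", "Region 3"); // rowkeys i..l
        tableA.addRegion("m", "Region 4"); // rowkeys m..p
        System.out.println(tableA.regionFor("cat"));   // Region 1
        System.out.println(tableA.regionFor("mouse")); // Region 4
    }
}
```

Because the split keys live in one sorted structure, adding a region (a split) is just another map entry; no data outside the affected range is touched.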
Page 12
Logical Data Model: A sparse, multi-dimensional, sorted map
Legend:
• Rows are sorted by rowkey.
• Within a row, values are located by column family and qualifier.
• Values also carry a timestamp; there can be multiple versions of a value.
• Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
[Figure: Table A shown as nested maps, with each cell addressed by rowkey, column family, column qualifier, and timestamp; e.g. row "a" holds timestamped versions under cf1 qualifiers "foo" and "bar", while row "b" holds a 3.6 kb PNG thumbnail under cf2 qualifier "thumb".]
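The sparse, sorted, multi-dimensional map above can be mimicked with nested sorted maps. This is a toy sketch: the class and method names are invented, values are strings for readability, and real HBase stores all coordinates and values as raw bytes in HFiles.

```java
import java.util.Comparator;
import java.util.TreeMap;

public class SortedCellMap {
    // rowkey -> column family -> qualifier -> timestamp -> value.
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> rows =
        new TreeMap<>();

    public void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            // newest timestamp first, so firstEntry() is the latest version
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version of a cell, or null if the cell is absent
    // (sparse: missing cells take no space at all).
    public String getLatest(String row, String family, String qualifier) {
        var families = rows.get(row);
        if (families == null) return null;
        var qualifiers = families.get(family);
        if (qualifiers == null) return null;
        var versions = qualifiers.get(qualifier);
        if (versions == null) return null;
        return versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SortedCellMap tableA = new SortedCellMap();
        tableA.put("a", "cf1", "foo", 1368393847L, "world");
        tableA.put("a", "cf1", "foo", 1368394925L, "13.6");
        System.out.println(tableA.getLatest("a", "cf1", "foo")); // 13.6
    }
}
```

Note how every lookup dimension is a sorted map: that is what makes rowkey range scans and versioned reads cheap.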
Page 13
HBase HA Overview (Introduced in HDP 2.1)
[Figure: Clients consult ZooKeeper and the HMaster, then read and write to HBase RegionServers. Each RegionServer holds an in-memory cache and serves some regions as Primary (e.g. regions 0-99) while holding Standby replicas of others (e.g. regions 100-199, 200-299). All RegionServers persist HFiles to HDFS.]
• HBase HA: real-time replication.
• Low-latency reads and writes.
• Data stored to HDFS; read or write directly from Hadoop tools (Hive, Pig, MapReduce).
• Cluster topology and data placement.
Page 14
Apache Phoenix
Spring 2014 Version 1.0
The SQL Skin for HBase
Page 15 © Hortonworks Inc. 2014
Apache Phoenix: A SQL Skin for HBase
• Provides a SQL interface for managing data in HBase.
• Supports a large subset of the SQL:1999 mandatory feature set.
• Create tables, insert and update data, and perform low-latency point lookups through JDBC.
• The Phoenix JDBC driver is easily embedded in any app that supports JDBC.
Phoenix Makes HBase Better:
• Oriented toward online / semi-transactional apps.
• If HBase is a good fit for your app, Phoenix makes it even better.
• Phoenix gets you out of the "one table per query" model many other NoSQL stores force you into.
Page 16
Apache Phoenix: Current Capabilities
Feature | Supported?
Common SQL Datatypes | Yes
Inserts and Updates | Yes
SELECT, DISTINCT, GROUP BY, HAVING | Yes
NOT NULL and Primary Key constraints | Yes
Inner and Outer JOINs | Yes
Views | Yes
Subqueries | HDP 2.2
Robust Secondary Indexes | HDP 2.2
Page 17
Apache Phoenix: Future Capabilities
Feature | Supported?
Multi-Table Transactions | Future
Scalable Joins (Fact-to-Fact) | Future
Analytics, Windowing Functions | Future
Page 18
Phoenix Provides Familiar SQL Constructs
Compare: Phoenix versus the native API
Code:

// HBase Native API.
HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("us_population");
HColumnDescriptor state = new HColumnDescriptor("state".getBytes());
HColumnDescriptor city = new HColumnDescriptor("city".getBytes());
HColumnDescriptor population = new HColumnDescriptor("population".getBytes());
desc.addFamily(state);
desc.addFamily(city);
desc.addFamily(population);
hbase.createTable(desc);

-- Phoenix DDL.
CREATE TABLE us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city));

Notes:
• Familiar SQL syntax.
• Provides additional constraint checking.
Page 19
Phoenix: Architecture
[Figure: A user's Java application embeds the Phoenix JDBC Driver, which talks to the HBase cluster, where a Phoenix coprocessor runs alongside each RegionServer.]
Page 20
Phoenix Performance
Performance characterization:
• Suitable for tens of thousands of point lookups per second.
• Suitable for thousands of aggregations / filtered searches per second.
• Supports extremely high concurrency.
Performance optimizations:
• Column skipping.
• Table salting.
• Skip scans.
Typical latencies:
• Indexed point lookups in milliseconds.
• Aggregation and Top-N queries in a few seconds over large datasets.
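Table salting, one of the optimizations listed above, prepends a deterministic bucket prefix to each rowkey so that monotonically increasing keys (timestamps, sequence numbers) spread across several regions instead of hot-spotting one RegionServer. A minimal sketch of the idea; the helper name and string prefix are illustrative, not Phoenix's API (Phoenix uses a single leading salt byte):

```java
public class Salter {
    // Number of salt buckets; comparable to Phoenix's SALT_BUCKETS option.
    static final int SALT_BUCKETS = 4;

    // Derive the bucket deterministically from the rowkey itself, so the
    // same key always lands in the same bucket and point lookups still work.
    static String salt(String rowkey) {
        int bucket = Math.abs(rowkey.hashCode() % SALT_BUCKETS);
        return bucket + "-" + rowkey;
    }

    public static void main(String[] args) {
        // Sequential keys scatter across bucket prefixes 0..3.
        System.out.println(salt("2014-01-01T00:00:01"));
        System.out.println(salt("2014-01-01T00:00:02"));
        System.out.println(salt("2014-01-01T00:00:03"));
    }
}
```

The trade-off: a full range scan must now fan out to every bucket, which is why salting helps write-heavy sequential workloads more than scan-heavy ones.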
Page 21
Phoenix Use Cases
Phoenix is for:
• Rapidly and easily building an application backed by HBase.
• Making use of your existing SQL skills and investment.
• High-performance aggregations of moderately sized datasets inside HBase.
Phoenix is not for:
• Sophisticated SQL queries involving large joins or advanced SQL features.
• Queries requiring large scans that do not use indexes.
• ETL.
Page 22
Phoenix: Futures
Short-term focus:
• Transactions.
• Scalable joins.
• Analytical capabilities.
Long-term focus: make Phoenix the primary interface for HBase.
• Build HBase applications using Phoenix.
• Configure cluster security and replication using Phoenix.
• Integration with BI tools like MicroStrategy.
Page 23
What’s New in Apache Phoenix
Page 24
What's New in Apache Phoenix
Phoenix in HDP 2.2:
• Based on Apache Phoenix 4.2.
• 8 new features; 143 total improvements and fixes.
Notable new features:
• Robust secondary indexes.
• Sub-joins.
• Basic window functions.
• Bulk loader improvements.
Page 25
Robust Secondary Indexes
Background / refresher:
• Phoenix supports local and global secondary indexes.
• Updating a global index may require coordination with another RegionServer.
• See the Phoenix docs for guidance on which to use when.
Before Phoenix 4.1 (HDP 2.1):
• With global indexes, if the RegionServer serving the index key was down, RegionServers would abort.
• Note: this does not affect local indexes.
Phoenix 4.1+: if the global index cannot be updated:
• The index is temporarily disabled.
• A background job is launched to rebuild the index.
• Reads go directly to the base table rather than accessing the index.
• Writes continue to update the index.
• Controlled by: phoenix.index.failure.handling.rebuild
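The failure-handling flow above amounts to a small state machine. The sketch below uses invented names purely to illustrate the lifecycle; it is not Phoenix's internal API:

```java
public class IndexState {
    // Illustrative lifecycle of a Phoenix global index after a write failure.
    enum State { ACTIVE, DISABLED, REBUILDING }

    private State state = State.ACTIVE;

    // A global index write failed: temporarily disable the index.
    void onWriteFailure() { state = State.DISABLED; }

    // Background rebuild job launched (phoenix.index.failure.handling.rebuild).
    void startRebuild()  { state = State.REBUILDING; }

    // Rebuild caught up: the index serves reads again.
    void onRebuildComplete() { state = State.ACTIVE; }

    // While the index is not ACTIVE, reads bypass it and scan the base table;
    // writes continue to update the index throughout.
    boolean readsUseIndex() { return state == State.ACTIVE; }

    public static void main(String[] args) {
        IndexState ix = new IndexState();
        ix.onWriteFailure();
        System.out.println(ix.readsUseIndex()); // false
        ix.startRebuild();
        ix.onRebuildComplete();
        System.out.println(ix.readsUseIndex()); // true
    }
}
```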
Page 26
Improved SQL: Sub-Joins
Example:
select * from A
  left join (B join C on B.bc_id = C.bc_id)
  on A.ab_id = B.ab_id and A.ac_id = C.ac_id;
Caveats related to joins still apply:
• Still broadcast joins only.
Page 27
Phoenix: Basic Window Functions
FIRST_VALUE, LAST_VALUE, NTH_VALUE
• No OVER or PARTITION BY.
• The function is applied to each group produced by GROUP BY.
Example:
SELECT FIRST_VALUE("column1")
  WITHIN GROUP (ORDER BY column2 ASC)
FROM table
GROUP BY column3;
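The grouped FIRST_VALUE semantics can be mimicked in plain Java: within each GROUP BY group, order the rows and take the leading value. This is only a sketch of the behavior (Phoenix evaluates this server-side); the class and record names are invented:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FirstValueDemo {
    record Row(String column1, int column2, String column3) {}

    // Equivalent of:
    //   SELECT FIRST_VALUE("column1") WITHIN GROUP (ORDER BY column2 ASC)
    //   FROM t GROUP BY column3;
    static Map<String, String> firstValueByGroup(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            Row::column3,                                  // GROUP BY column3
            Collectors.collectingAndThen(
                Collectors.minBy(Comparator.comparingInt(Row::column2)), // ORDER BY column2 ASC, keep first
                r -> r.get().column1())));                 // FIRST_VALUE("column1")
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("a", 2, "g1"),
            new Row("b", 1, "g1"),
            new Row("c", 5, "g2"));
        System.out.println(firstValueByGroup(rows).get("g1")); // b
    }
}
```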
Page 28
ENCODE, DECODE
DECODE:
• Supports hexadecimal format.
• DECODE('000000008512af277ffffff8', 'hex')
ENCODE:
• Supports hexadecimal and Base62.
• ENCODE(1, 'base62')
What is Base62?
• Encodes data using only letters and digits.
• Commonly used for things like URL shorteners.
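A minimal Base62 encoder makes the idea concrete. This sketch assumes the conventional 0-9, A-Z, a-z alphabet; Phoenix's exact alphabet and ENCODE output format may differ:

```java
public class Base62 {
    // Assumed alphabet: digits, then uppercase, then lowercase (62 symbols).
    static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Repeatedly take the remainder mod 62, exactly like converting to any base;
    // the result uses only letters and digits, which is why URL shorteners use it.
    static String encode(long n) {
        if (n == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            sb.insert(0, ALPHABET.charAt((int) (n % 62)));
            n /= 62;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(1));   // 1
        System.out.println(encode(62));  // 10
        System.out.println(encode(125)); // 125 = 2*62 + 1 -> 21
    }
}
```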
Page 29
Demo: Phoenix Secondary Indexes
Page 30
Secondary Index Recap
Index management via JDBC:
• CREATE INDEX my_index ON my_table (v1);
• DROP INDEX my_index ON my_table;
• ALTER INDEX my_index ON my_table DISABLE / REBUILD;
Index population during bulk import (recommended):
• Use the CsvBulkLoadTool utility (not psql.py).
• Add the --index-table argument to specify your target index.

HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf \
hadoop jar phoenix-4.0.0.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table EXAMPLE --input /data/example.csv