Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov

Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- leveraging apache geode to build a poor mans sap hana

Whoami

Christian Tzolov

Technical Architect at Pivotal, BigData, Hadoop, Spring XD, Apache Committer, Crunch PMC member

blog.tzolov.net @christzolov

Contents

• Data Systems - Principles

• Use Case: OLTP and OLAP Data Systems Integration

• Passive Data Synchronization (Demo)

• Federated Queries With HAWQ

• HAWQ Web Tables

• HAWQ PXF Architecture

• Geode PXF (Demo)

Data Systems

Compute Arbitrary Functions on Arbitrary Data

Architectural Patterns

• Data Lake

• Lambda

• Kappa

• Tachyon

• …

Integration Stack

• Apache HDFS: Data Lake - PHD or HDP Hadoop

• Apache HAWQ: SQL on Hadoop (OLAP)

• Apache Geode: In-memory data grid (OLTP)

• Spring XD: Integration and Streaming Runtime

• Apache Ambari: Manages All Clusters

• Apache Zeppelin: Web UI for interaction with Data Systems

[Diagram: Hadoop/HDFS, Geode, HAWQ, Spring XD, Ambari, Zeppelin]

Apache Geode (OLTP)

• Cache - Performance / Consistency / Resiliency

• Region - Highly available, redundant, distributed Map

China Railway Corporation

• 5,700 train stations

• 4.5 million tickets per day

• 20 million daily users

• 1.4 billion page views per day

• 40,000 visits per second

Indian Railways

• 7,000 stations

• 72,000 miles of track

• 23 million passengers daily

• 120,000 concurrent users

• 10,000 transactions per minute

Apache HAWQ (OLAP)

• Built around a Greenplum MPP DB (C and C++)

• Hadoop Native: Parquet, HDFS and YARN

• 100% ANSI SQL compliant: SQL-92/99/2003…

• Extensible - Web Tables, PXF

• Connectivity: ODBC and JDBC

• Access internal store: HAWQ(Parquet)InputFormat

HAWQ - TPC-DS

• Outperforms Impala by 454% overall

• 344% performance improvement over Hive/Tez

• Runs 100% of the TPC-DS queries, unlike Impala or Hive

• References: http://bit.ly/1NUDcLl, https://github.com/dbbaskette/pivbench

Spring XD

Orchestrates and automates all steps across multiple data stream pipelines

Sources: • HTTP • Tail • File • Mail • Twitter • Gemfire • Syslog • TCP • UDP • JMS • RabbitMQ • MQTT • Kafka • Reactor TCP/UDP

Processors: • Filter • Transformer • Object-to-JSON • JSON-to-Tuple • Splitter • Aggregator • HTTP Client • Groovy Scripts • Java Code • JPMML Evaluator • Spark Streaming

Sinks: • File • HDFS • JDBC • TCP • Log • Mail • RabbitMQ • Gemfire • Splunk • MQTT • Kafka • Dynamic Router • Counters

Ambari Management

Use Case: Join OLTP and OLAP Data Systems

Use Case

• Integrate Geode with HAWQ

• Unified data view

• Slowly Changing Dimensions (SCDs)

• Keep the Operational and Historical data in Sync

Passive Data Synchronization

Passive Sync Architecture

Passive Sync Improved (gpfdist)

Passive Sync Improved Demo

Federated Queries With HAWQ

HAWQ Web Tables

• HAWQ Web Table - access dynamic data sources on a web server or by executing OS scripts

• Leverage Geode REST API and OQL

• SpringBoot Controller to convert JSON into TSV

CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...)
EXECUTE E'curl http://<hostname>/gemfire-api/v1/queries/adhoc?q=<URLencoded OQL statement>'
ON MASTER
FORMAT 'text' (delimiter '|' null 'null' escape E'\\');
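
The web-table approach needs two pieces of glue: URL-encoding the OQL statement for the REST endpoint, and flattening the JSON reply into delimited text that matches the FORMAT clause. Below is a minimal Python sketch of both steps. It assumes the REST reply is a JSON array of flat objects; the function names, the column list, and the sample data are hypothetical, and the converter described on the slide is a Spring Boot controller, not this code.

```python
import json
from urllib.parse import quote


def adhoc_query_url(hostname, oql):
    """Build the ad hoc query URL from the EXECUTE clause, URL-encoding the OQL."""
    return "http://{}/gemfire-api/v1/queries/adhoc?q={}".format(
        hostname, quote(oql, safe="")
    )


def escape_field(value, delimiter="|", escape="\\"):
    """Render one value for a text-format table: null marker, escaped specials."""
    if value is None:
        return "null"  # matches null 'null' in the FORMAT clause
    text = str(value)
    text = text.replace(escape, escape + escape)        # escape the escape char first
    text = text.replace(delimiter, escape + delimiter)  # then the column delimiter
    return text.replace("\n", escape + "n")             # keep one record per line


def json_to_delimited(json_text, columns, delimiter="|"):
    """Convert a JSON array of flat objects into delimiter-separated lines."""
    rows = json.loads(json_text)
    lines = [
        delimiter.join(escape_field(row.get(col), delimiter) for col in columns)
        for row in rows
    ]
    return "\n".join(lines)
```

Passing the column order explicitly keeps the output aligned with the column list declared in the external table, independent of the key order in the JSON reply.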

HAWQ Web Tables Architecture

Access dynamic data sources on a web server or by executing OS scripts.

HAWQ Web Tables Limitations

• Not Scalable

• No Push Down Predicates

• Static

• No Compression

• Requires Additional Components

P(ivotal) Extension Framework (PXF)

• Java-Based

• Parallel, High Throughput Data Access

• ANSI-compliant SQL On Any Dataset

• Wide variety of PXF plugins

PXF Architecture

PXF Data Model

• Data Source is modeled as a collection of one or more Fragments.

• Each Fragment consists of many Rows that in turn are split into typed Fields.

• Analyzer (optional) provides PXF statistical data for the HAWQ query optimizer.

• Metadata describes the data source: locations, access attributes, table schemas, formats, SQL query filters, etc.
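
The Data Source, Fragment, Row, Field hierarchy can be pictured with a few plain data classes. This is an illustrative Python sketch only: the class names mirror the terms on the slide, but the attributes and the `scan` method are assumptions made for the example and are not the PXF Java API.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class Field:
    type_name: str  # hypothetical type tag, e.g. "int" or "text"
    value: Any


@dataclass
class Row:
    fields: List[Field]


@dataclass
class Fragment:
    source: str  # hypothetical locator of this slice of the data source
    rows: List[Row] = field(default_factory=list)


@dataclass
class DataSource:
    fragments: List[Fragment]

    def scan(self) -> Iterator[Tuple[Any, ...]]:
        """Walk every fragment and yield each row as a tuple of field values."""
        for fragment in self.fragments:
            for row in fragment.rows:
                yield tuple(f.value for f in row.fields)
```

Because each Fragment is self-contained, separate workers can read different fragments independently.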

PXF Processors

[Class diagram]

• Plugin, InputData

• Fragmenter: getFragments() - CustomFragmenter

• Analyzer: getEstimatedStat() - CustomAnalyzer

• ReadAccessor: openForRead(), readNextObject(), closeForRead()

• WriteAccessor: openForWrite(), writeNextObject(), closeForWrite()

• ReadResolver: getFields(OneRow)

• WriteResolver: getFields(OneRow)

• CustomAccessor, CustomResolver

Legend: Extend Class, Implement Interface

PXF Deployment Model

[Diagram: components and message flows]

• HAWQ Master: Query Dispatcher (SQL query, Result)

• NameNode: PXF Service (Metadata request, Fragment list)

• Data Node X: PXF Service, Query Executor (data request for Fragment X, pxfwritable records)

• Data Node Z: PXF Service, Query Executor (data request for Fragment Z, pxfwritable records)

• Scan plan, Result

• External (Distributed) Data System

• Parallel execution
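
The flow in the deployment diagram (a metadata request returns the fragment list, then each executor issues a data request for its own fragment in parallel) can be imitated in a few lines. This is a toy Python sketch: the in-memory `EXTERNAL_SYSTEM` dictionary, the function names, and the use of a thread pool are all assumptions for illustration and say nothing about how HAWQ or PXF are actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the external (distributed) data system.
EXTERNAL_SYSTEM = {
    "fragment-x": [("alice", 10), ("bob", 20)],
    "fragment-z": [("carol", 30)],
}


def list_fragments():
    """Metadata request: return the fragment list."""
    return sorted(EXTERNAL_SYSTEM)


def read_fragment(fragment_id):
    """Data request for one fragment: return its records."""
    return list(EXTERNAL_SYSTEM[fragment_id])


def run_query():
    """Fetch all fragments concurrently and merge the records in fragment order."""
    fragments = list_fragments()
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        per_fragment = list(pool.map(read_fragment, fragments))
    return [record for records in per_fragment for record in records]
```

The point of the pattern is that read throughput grows with the number of fragments that can be fetched at the same time.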

PXF External Tables

CREATE EXTERNAL TABLE ext_table_name (<Attribute list, …>)
LOCATION('pxf://<host>:<port>/path/to/data?
    FRAGMENTER=package.name.FragmenterForX&
    ACCESSOR=package.name.AccessorForX&
    RESOLVER=package.name.ResolverForX&
    <Other custom user options>=<Value>')
FORMAT 'custom' (formatter='pxfwritable_import');

PXF Gallery

• HdfsTextSimple

• HdfsTextMulti

• Hive

• HiveRC

• HiveText

• HBase

• Avro

• Accumulo

• Cassandra

• JSON

• Redis

• Geode/Gemfire

• JDBC

HAWQ PXF/Geode

Federated Queries with PXF/Geode - Architecture

PXF/Geode Table

CREATE EXTERNAL TABLE <GEMFIRE_TABLE_NAME> (...)
LOCATION('pxf://<namenode>/<path>?
    PROFILE=GEMFIRE&
    LOCATORS=<gemfire-server:port>&
    REGION=<region-name>')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Geode Profile

<profile>
    <name>GEMFIRE</name>
    <description>A profile for reading Gemfire data</description>
    <plugins>
        <fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</fragmenter>
        <accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor>
        <resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver>
    </plugins>
</profile>
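
A profile is a named shorthand for the FRAGMENTER, ACCESSOR and RESOLVER options that a PXF LOCATION string would otherwise spell out. The Python sketch below parses the profile shown above and maps it to that option set. The function name is hypothetical and the parsing is only an illustration of the mapping, not how PXF itself loads profiles.

```python
import xml.etree.ElementTree as ET

# The profile definition from the slide, reproduced as a string.
PROFILE_XML = """
<profile>
    <name>GEMFIRE</name>
    <description>A profile for reading Gemfire data</description>
    <plugins>
        <fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</fragmenter>
        <accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor>
        <resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver>
    </plugins>
</profile>
"""


def plugins_for_profile(xml_text):
    """Return (profile name, {OPTION: class name}) for one <profile> element."""
    profile = ET.fromstring(xml_text)
    name = profile.findtext("name")
    plugins = {
        child.tag.upper(): child.text.strip()
        for child in profile.find("plugins")
    }
    return name, plugins
```

With this mapping, `PROFILE=GEMFIRE` in a table definition stands in for the three explicit plugin options.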

Federated Queries With PXF/Geode - Demo

Stay Connected

• PXF Maven Repository: https://bintray.com/big-data/maven/pxf/view

• PXF Community Plugins: https://bintray.com/big-data/maven/pxf-plugins/view

• Apache HAWQ: https://github.com/apache/incubator-hawq

• Apache Geode: https://github.com/apache/incubator-geode

• Apache Zeppelin: https://zeppelin.incubator.apache.org

• Spring XD: http://projects.spring.io/spring-xd/