Jethro for Tableau webinar (11 15)

Webinar Topics

• Who is Jethro?

• Tableau & Big Data: Extract vs. Live Connect

• Big Data Platforms: Hadoop vs. EDW Appliances

• Two DB architectures: Full-scan vs. Index Access

• Live Demo: Tableau over Impala / Redshift / Jethro

• What is Jethro for Tableau and how it accelerates Tableau’s performance

• Q&A

About Us

• What does Jethro do?
– SQL engine optimized for accelerating BI on big data

• How does it work?
– Combines columnar SQL DB design with full-indexing technology

• Where is it?
– In dev since 2012; GA: mid 2015
– Download & free eval

• When to use it?
– BI Spinner Syndrome (BSS)

• Partnerships
– BI and Hadoop vendors

• Speaker
– Eli Singer, CEO JethroData
– [email protected]
– 917.509.6111

• Experience
– Long-time DBA
– Over 20 years of leading tech startups

• Where to find us
– Jethrodata.com
– @JethroData


http://www.jethrodata.com/

https://twitter.com/JethroData


Tableau and Big Data: Extract (In-Mem)

[Diagram: Tableau loads an Extract from EDW / Hadoop]

• Typical Tableau usage is based on extracting selective data from remote sources

• Extracted data is then dynamically loaded into Tableau memory for interactive analysis

• Limitations: Performance degradation and scale (typically ~200M rows)

Tableau and Big Data: Live Connect (In-DB)

[Diagram: Tableau queries EDW / Hadoop through Live Connect]

• Tableau issues SQL queries to the target DB for every user interaction

• DB retrieves the requested data and returns it to Tableau

• Limitation: DB performance is significantly slower than in-mem speed


Big Data Platforms: Hadoop Vs. EDW Appliances

• 10x-100x data

• 1/10 HW $ cost

• Open platform

Analytics: ETL, Predictive, Reporting, BI

SQL enables the change of data platform while keeping the analytic apps intact

The Hadoop Trade-Off: Scale & Cost Vs. Performance

[Diagram: SQL-on-Hadoop serves ETL, Predictive, and Reporting workloads; BI is crossed out as too SLOW in Hadoop]

It’s unrealistic to expect the same performance when the data is much larger and highly optimized hardware is replaced with commodity boxes.

SQL-on-Hadoop – MPP / Full Scan Architecture

Architecture: MPP / Full-Scan (All SQL-on-Hadoop)

A library analogy: billions of books, thousands of racks

Query: List books by author “Stephen King”

Process: Each librarian is assigned a rack, pulls every book on it, checks whether the author is “Stephen King”, and if so gets the book title

Result: Too slow, costly, unscalable. Unsuitable for BI

SQL-on-Hadoop – Index-Access Architecture

Architecture: Index Access (Only Jethro)

Query: List books by author “Stephen King”

Process: Access Author index, entry of “Stephen King”, get list of books, fetch only these books

Result: Fast, minimal resources, scalable. Optimal for BI
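The two library walk-throughs can be sketched in a few lines of Python. This is a minimal illustration on assumed toy data, not the implementation of Jethro or of any SQL-on-Hadoop engine; the catalog contents and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy catalog: (title, author) pairs standing in for the library.
BOOKS = [
    ("Carrie", "Stephen King"),
    ("Emma", "Jane Austen"),
    ("It", "Stephen King"),
    ("Persuasion", "Jane Austen"),
    ("Dracula", "Bram Stoker"),
    ("Misery", "Stephen King"),
]


def full_scan(books, author):
    """Check every book, as the librarians do in the full-scan picture."""
    touched = 0
    titles = []
    for title, book_author in books:
        touched += 1
        if book_author == author:
            titles.append(title)
    return titles, touched


def build_author_index(books):
    """Map each author to the positions of their books (an inverted list)."""
    index = defaultdict(list)
    for position, (_, author) in enumerate(books):
        index[author].append(position)
    return index


def index_access(books, index, author):
    """Look up the author's entry, then fetch only the listed books."""
    positions = index.get(author, [])
    titles = [books[p][0] for p in positions]
    return titles, len(positions)


AUTHOR_INDEX = build_author_index(BOOKS)
scan_titles, scan_touched = full_scan(BOOKS, "Stephen King")
index_titles, index_touched = index_access(BOOKS, AUTHOR_INDEX, "Stephen King")
print(scan_titles, scan_touched)
print(index_titles, index_touched)
```

Both paths return the same titles, but the scan touches every book while the index path touches only the matching ones. The gap grows with the size of the catalog.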


SQL on Hadoop – Competitive Landscape

Full-scan based solutions: read all rows. Every time.

• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• Teradata/SQL-H
• …

Index based solution: reads ONLY needed rows.

• Jethro

Use-case comparison:

• Full-scan: optimal for predictive, reporting
• Index: optimal for interactive BI

LIVE Benchmark: BI on Hadoop (and Redshift)

Hardware – AWS

• Hadoop: CDH 5.4
• 6 nodes: m1.xlarge, r3.xlarge
• Jethro: r3.8xlarge

• Point browser at: tableau.jethrodata.com
– UID/PWD: demo / demo

• Choose workbook: “Jethro”, “Impala”, “Redshift”

• BI Dashboard: choose year, category or any other filter to drill-down

• Data
– Based on TPC-DS benchmark
– 1TB raw data (400GB fact)
– Fact table: ~2.9B rows
– Dimensions: 7

Setup     Data format             Hadoop cluster / compute cluster      Total RAM, CPU    AWS $ per hr.
Jethro    Jethro indexes (250GB)  3x m1.xlarge / 2x r3.4xlarge (spot)   289GB, 44 cores   $0.80
Impala    Parquet (160GB)         8x r3.2xlarge, 1x r3.xlarge           510GB, 68 cores   $5.95
Redshift  Redshift (229GB)        8x dc1.large                          120GB, 16 cores   $2.00

http://tableau.jethrodata.com/

What Is Jethro for Tableau?

[Diagram: Tableau, Jethro, and EDW / Hadoop / Cloud / Local FS / NAS, connected by Extract, Store, and Live Connect steps]

• An indexing & caching server

• Relevant data is extracted from EDW / Hadoop into Jethro. No size limitation

• Jethro then fully indexes the data (every column!)

• Jethro’s column and index files are stored back in Hadoop (or other storage system)

• Tableau uses Live Connect to send SQL queries to Jethro (ODBC)

• Jethro uses indexes to speed up queries and return results to Tableau


Selecting Data for Jethro Acceleration

• Select only Tableau “worthy” datasets
– Not ALL data in Hadoop should have Jethro

• Use any ETL tool to extract from source
– Jethro receives data in a CSV/delimited format
– Extracted data can be temporarily stored in a file or “piped” live to Jethro

• After initial creation, incremental loads are supported
– As frequently as every few min

• Jethro stores its version of the dataset back in HDFS
– Can also use local filesystem, network storage or cloud storage

• Load is fast
– ~1B rows/hour
– Data is highly compressed: 1TB -> 400GB data + indexes
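The extract step can be illustrated with a small Python sketch that writes delimited rows to a stream, either everything for an initial load or only newer rows for an incremental one. It is an assumption-laden example: the source rows, the `|` delimiter, the id-based watermark, and the function names are all hypothetical, and the Jethro loader itself is not shown.

```python
import csv
import sys

# Hypothetical source rows: (id, day, prod, sales). A real extract would come
# from the EDW or Hadoop through an ETL tool.
SOURCE_ROWS = [
    (1, "2015-11-01", "abc", 10.0),
    (2, "2015-11-01", "xyz", 4.5),
    (3, "2015-11-02", "abc", 7.25),
    (4, "2015-11-02", "abc", 1.0),
]


def rows_after(rows, last_loaded_id):
    """Select only rows newer than the last loaded id (an incremental extract)."""
    return [row for row in rows if row[0] > last_loaded_id]


def write_delimited(rows, out, delimiter="|"):
    """Write rows as delimited text to any file-like object; return row count."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Initial load: every row. Writing to stdout lets the output be piped onward.
write_delimited(rows_after(SOURCE_ROWS, 0), sys.stdout)
```

For a later incremental load, the same call with the last loaded id (for example `rows_after(SOURCE_ROWS, 2)`) emits only the new rows.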

[Diagram: Extract from EDW / Hadoop into Jethro]

Index-Access – How it works

[Diagram: Jethro query nodes in front of a cluster of data nodes]

1. Index access

2. Read data only for required rows

Performance and resources based on the size of the working-set

Storage:
- HDFS
- Cloud (S3, EFS)
- NAS/SAN
- Local FS

Tableau: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
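As a rough sketch of the two steps applied to this query, the Python below keeps a tiny table in column form, uses an inverted list on `prod` to find matching rows, and then reads `day` and `sales` only for those rows. The table contents and helper names are hypothetical, and the code makes no claim about how Jethro lays out or executes anything internally.

```python
from collections import defaultdict

# Hypothetical columnar table t1: one list per column, aligned by row number.
T1 = {
    "day":   ["mon", "mon", "tue", "tue", "wed"],
    "prod":  ["abc", "xyz", "abc", "abc", "xyz"],
    "sales": [10, 4, 7, 1, 9],
}


def build_index(column):
    """Inverted list: each distinct value maps to the rows that contain it."""
    index = defaultdict(list)
    for row, value in enumerate(column):
        index[value].append(row)
    return index


PROD_INDEX = build_index(T1["prod"])


def sales_by_day(table, prod_index, prod):
    """SELECT day, sum(sales) FROM t1 WHERE prod = ? GROUP BY day."""
    # Step 1: index access finds the matching row numbers.
    rows = prod_index.get(prod, [])
    # Step 2: read only the needed columns, and only for those rows.
    totals = defaultdict(int)
    for row in rows:
        totals[table["day"][row]] += table["sales"][row]
    return dict(totals), len(rows)


result, rows_read = sales_by_day(T1, PROD_INDEX, "abc")
print(result, rows_read)
```

The work done depends on how many rows match the filter (the working set), not on the total number of rows in the table.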

Jethro Indexes – Superior Technology

Patent pending: http://www.google.com/patents/WO2013001535A3?cl=en

• Complete
– Every column is indexed

• Simple
– Inverted-list indexes map each column value to a list of rows

• Fast to read
– Direct access to a value entry
– No need to scan entire index, or load index to memory

• Scalable
– Distributed, highly hierarchical compressed bitmaps
– Appendable index structure for fast incremental loads
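To make the idea of per-value bitmaps concrete, here is a toy Python sketch. It uses plain uncompressed Python integers as bitmaps, which is an assumption made for brevity; the hierarchical, compressed, distributed format described on the slide is not reproduced, and the class and column names are hypothetical.

```python
class BitmapIndex:
    """Toy bitmap index: one integer bitmap per distinct column value."""

    def __init__(self):
        self.bitmaps = {}   # value -> int whose bit i is set if row i matches
        self.row_count = 0

    def append(self, values):
        """Add new rows at the end; bits for earlier rows are left unchanged."""
        for value in values:
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << self.row_count)
            self.row_count += 1

    def lookup(self, value):
        """Direct access to one value's entry; other entries are not read."""
        return self.bitmaps.get(value, 0)


def rows_of(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


year = BitmapIndex()
category = BitmapIndex()
year.append([2014, 2015, 2015, 2014])
category.append(["books", "books", "music", "music"])

# Filters on two columns combine with a bitwise AND of their bitmaps.
before_load = rows_of(year.lookup(2015) & category.lookup("books"))
print(before_load)

# An incremental load appends rows 4 and 5 to both indexes.
year.append([2015, 2015])
category.append(["books", "music"])
after_load = rows_of(year.lookup(2015) & category.lookup("books"))
print(after_load)
```

Because every column has its own index, a filter on any combination of columns reduces to combining a few bitmaps before any row data is read.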


Adaptive Optimization: Active Cache of Query Results

• Reuse of intermediate/final query results
– Repeat queries return immediately

• Addresses wide top-of-the-funnel queries
– Exploration starts with queries with no/few filters
– Those queries are likely to be repeated in dashboard scenarios

• Transparently adapts to incremental loads
– Execution on delta data + merge saved results
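The "delta + merge" behavior can be sketched as a cache that remembers, for each query, both the saved result and how many rows it covered. This is a simplified illustration under stated assumptions: the rows, the single-filter query shape, and the `ResultCache` class are hypothetical, and the toy engine scans rows rather than using indexes.

```python
# Hypothetical fact rows: (day, prod, sales), in load order.
ROWS = [
    ("mon", "abc", 10),
    ("mon", "xyz", 4),
    ("tue", "abc", 7),
]


class ResultCache:
    """Caches per-query aggregates together with the row count they cover."""

    def __init__(self):
        self.entries = {}      # prod filter -> (totals, rows_covered)
        self.rows_scanned = 0  # rows the toy engine had to process so far

    def sales_by_day(self, rows, prod):
        totals, covered = self.entries.get(prod, ({}, 0))
        # Execute only on rows loaded since the saved result (the delta).
        delta = {}
        for day, row_prod, sales in rows[covered:]:
            self.rows_scanned += 1
            if row_prod == prod:
                delta[day] = delta.get(day, 0) + sales
        # Merge the delta into the saved result and store it again.
        merged = dict(totals)
        for day, amount in delta.items():
            merged[day] = merged.get(day, 0) + amount
        self.entries[prod] = (merged, len(rows))
        return dict(merged)


cache = ResultCache()
first = cache.sales_by_day(ROWS, "abc")
scanned_first = cache.rows_scanned

repeat = cache.sales_by_day(ROWS, "abc")      # repeat query: nothing to scan
scanned_repeat = cache.rows_scanned

ROWS.extend([("tue", "abc", 1), ("wed", "abc", 5)])   # incremental load
after_load = cache.sales_by_day(ROWS, "abc")  # only the two new rows are scanned
scanned_after_load = cache.rows_scanned
print(first, repeat, after_load)
```

The repeat query does no scanning at all, and the query after the load processes only the newly added rows before merging them with the saved totals.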

[Charts: Index Performance, Cache Performance, and Index + Cache. Each plots query speed (Slow to Fast) against query selectivity (Few to More).]

Summary: Why Is Index Access Optimal for BI?

1. Use of indexes eliminates need to read unnecessary data

2. The deeper you go, the faster it gets: as users drill down and add more filters, queries run faster

3. Unlimited flexibility: users can aggregate and filter by any columns they choose with no performance penalty

4. Concurrent users accessing dashboards generate repeatable queries that result in high cache efficiency

5. Shields BI workload from other analytics overwhelming the cluster

Ready to Try Jethro?

1. Register: jethrodata.com/download-jethro-for-tableau

2. Schedule a 45min POC review with Jethro SA (free!)

3. One time setup
- Download and install Jethro on a server / VM
- Start services, configure instance

4. Extract & Load data

5. Use Tableau
- Install ODBC driver
- Point Tableau data source at Jethro

That’s It!

http://www.jethrodata.com/download-jethro-for-tableau



Q&A

Thank You!