Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240

Insights of Approximate Query Processing Systems

Presented by: Huanyi Chen

Ruoxi Zhang

Agenda§ Introduction

§ Background

§ VerdictDB & SnappyData

§ Experiment Setup

§ Evaluation

§ Insights

Insights of Approximate Query Processing Systems PAGE 2

Why AQP?


# of Day Income (CAD)1 1502 2403 1804 2005 2306 1907 180

Avg(Income) 195.71

shop income

Why AQP?


# of Day Income (CAD)1 1502 2403 1804 2005 2306 1907 180

Avg(Income)

186.67shop income

more efficient (50% rows)accuracy > 95%

195.71

Why AQP?


99.9% Identical100x-200x Faster

Sampling Based AQP


Sampling Based AQP


Query Column Set (QCS)

Why SnappyData & VerdictDB ?


Name Online/Offline

Distributed/Standalone Platform Algorithm Skewed

BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes

Sapprox Online Distributed Hadoop Distribution-aware Online sampling No

Approxhadoop Online Distributed Hadoop Approximation-enabled MapReduce No

Quickr Online Distributed N/A ASALQA algorithm No

SnappyData Online Distributed Spark and GemFireSpark as a computational engine; GemFire as transactional store

No

FluoDB Online Distributed Spark Mini-batch execution OLA Model No

XDB Online Standalone PostgreSQL Wander join No

VerdictDB Online Standalone Spark SQL Database learning No

IDEA Online Standalone N/AReuse answers of past overlapping queries for new query

No

BEAS Online Standalone Commercial DBMS Approximability theorem NoABS Online Standalone N/A Bootstrap No

• Spark• Open-source*

SnappyData


SDE is NOT open

source

SnappyData


+ WITH ERROR

QCSFRACTION

VerdictDB


VerdictDB


Experiment Setup


§ Cluster Setup§ SnappyData: 1 locator, 1 lead, and 2 servers

Experiment Setup


§ Cluster Setup§ SnappyData: 1 locator, 1 lead, and 2 servers

§ VerdictDB on Spark: 1 master and 2 executors

§ Each Node§ 24/32 GB memory used

§ 500 GB HDD

Experiment Setup


§ TPC-H Benchmark § OLAP

§ 22 queries includes Aggregation, Join, etc.

§ Well known and standard

§ Customizable

§ Data§ 1GB and 10GB

§ Uniformly distributed

Evaluation


SnappyData• Stratified Sampling• In-memory

VerdictDB• Uniform Sampling• Not in-memory (bug?)

SnappyData - Latency


29439

1832

66298092

13993870

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

Q1 Q6 Q14

Execution time (ms) using TPC-H (SF=10, fraction 0.1)

SnappyData SnappyData_AQP (>95% accuracy)

Q1: Up to 3.6x speedup~0.0001 Error

fraction0.1

SnappyData - Accuracy


Base Table

fraction0.01

Sample Tables ...

0.00

0.00

0.01

0.01

0.02

0.02

fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3

Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)

01,0002,0003,0004,0005,0006,0007,000

fraction 0.01

fraction 0.1

fraction 0.2

fraction 0.3

Snappy

Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)

SnappyData- Creating Sample Tables


0

50,000

100,000

150,000

200,000

250,000


Time (ms) for creating SnappyData sample tables with different fractions

VerdictDB - Latency


206034

8891299195

17598 1521024355

0

50,000

100,000

150,000

200,000

250,000

Q1 Q6 Q14

Execution time (ms) using TPC-H (SF=10, fraction 0.1)

SparkSQL VerdictDB (> 95% accuracy)

Up to ~11x speedup!

VerdictDB - Speedup


0

2

4

6

8

10

12

14

Q1 Q6 Q14

Speedup for TPC-H (SF=10, fraction=0.1)

Speedup

0

2

4

6

8

10

12

Q1 Q6 Q14

Speedup for TPC-H (SF=1, fraction=0.1)

Speedup

VerdictDB - Creating Sample Tables


0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000


Time (ms) for creating VerdictDB sample tables with different fraction

fraction0.1

VerdictDB - Accuracy


Base Table

fraction0.01

Sample Tables ...

0

0.05

0.1

0.15

0.2

0.25

fraction 0.01

fraction 0.05

fraction 0.1

fraction 0.2

fraction 0.3

Actual Error for TPC-H Q14 result (SF=10)given different sample tables (fraction)

0

20,000

40,000

60,000

80,000

100,000

120,000

fraction 0.01

fraction 0.05

fraction 0.1 fraction 0.2fraction 0.3 SparkSQL

Time (ms) for TPC-H Q14 result (SF=10)given different sample tables (fraction)

converge!

Other Queries?


Q19Error: ~ 80%

Speedup: ~5.5X

Q14 Error: ~ 1.7%

Speedup: ~1.7X

Other Queries?


Key missing in sample tables!

Careful design of sample tableor original table!

Q7AQP not working!

Insights


§ AQP performs well:§ For aggregate functions such as SUM, AVG and COUNT

§ When WHERE is simple

§ Users’ foreseen is important!§ for both query and original table

Future Work


§ Test error estimation in sampling

§ Other sampling techniques§ Biased Sampling

§ Database learning

§ Approximate hardware

Documents

Insights of Approximate Query Processing Systemstozsu/courses/CS848/W19/projects...Why AQP? Insightsof Approximate Query Processing Systems PAGE 3 # of Day Income (CAD) 1 150 2 240