Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Insights of Approximate Query Processing Systems
Presented by: Huanyi Chen
Ruoxi Zhang
Agenda§ Introduction
§ Background
§ VerdictDB & SnappyData
§ Experiment Setup
§ Evaluation
§ Insights
Insights of Approximate Query Processing Systems PAGE 2
Why AQP?
Insights of Approximate Query Processing Systems PAGE 3
# of Day Income (CAD)1 1502 2403 1804 2005 2306 1907 180
Avg(Income) 195.71
shop income
Why AQP?
Insights of Approximate Query Processing Systems PAGE 4
# of Day Income (CAD)1 1502 2403 1804 2005 2306 1907 180
Avg(Income)
186.67shop income
more efficient (50% rows)accuracy > 95%
195.71
Why AQP?
Insights of Approximate Query Processing Systems PAGE 5
99.9% Identical100x-200x Faster
Sampling Based AQP
Insights of Approximate Query Processing Systems PAGE 6
Sampling Based AQP
Insights of Approximate Query Processing Systems PAGE 7
Query Column Set (QCS)
Why SnappyData & VerdictDB ?
Insights of Approximate Query Processing Systems PAGE 8
Name Online/Offline
Distributed/Standalone Platform Algorithm Skewed
BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes
Sapprox Online Distributed Hadoop Distribution-aware Online sampling No
Approxhadoop Online Distributed Hadoop Approximation-enabled MapReduce No
Quickr Online Distributed N/A ASALQA algorithm No
SnappyData Online Distributed Spark and GemFireSpark as a computational engine; GemFire as transactional store
No
FluoDB Online Distributed Spark Mini-batch execution OLA Model No
XDB Online Standalone PostgreSQL Wander join No
VerdictDB Online Standalone Spark SQL Database learning No
IDEA Online Standalone N/AReuse answers of past overlapping queries for new query
No
BEAS Online Standalone Commercial DBMS Approximability theorem NoABS Online Standalone N/A Bootstrap No
• Spark• Open-source*
SnappyData
Insights of Approximate Query Processing Systems PAGE 9
SDE is NOT open
source
SnappyData
Insights of Approximate Query Processing Systems PAGE 10
+ WITH ERROR
QCSFRACTION
VerdictDB
Insights of Approximate Query Processing Systems PAGE 11
VerdictDB
Insights of Approximate Query Processing Systems PAGE 12
Experiment Setup
Insights of Approximate Query Processing Systems PAGE 13
§ Cluster Setup§ SnappyData: 1 locator, 1 lead, and 2 servers
Experiment Setup
Insights of Approximate Query Processing Systems PAGE 14
§ Cluster Setup§ SnappyData: 1 locator, 1 lead, and 2 servers
§ VerdictDB on Spark: 1 master and 2 executors
§ Each Node§ 24/32 GB memory used
§ 500 GB HDD
Experiment Setup
Insights of Approximate Query Processing Systems PAGE 15
§ TPC-H Benchmark § OLAP
§ 22 queries includes Aggregation, Join, etc.
§ Well known and standard
§ Customizable
§ Data§ 1GB and 10GB
§ Uniformly distributed
Evaluation
Insights of Approximate Query Processing Systems PAGE 16
SnappyData• Stratified Sampling• In-memory
VerdictDB• Uniform Sampling• Not in-memory (bug?)
SnappyData - Latency
Insights of Approximate Query Processing Systems PAGE 17
29439
1832
66298092
13993870
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
Q1 Q6 Q14
Execution time (ms) using TPC-H (SF=10, fraction 0.1)
SnappyData SnappyData_AQP (>95% accuracy)
Q1: Up to 3.6x speedup~0.0001 Error
fraction0.1
SnappyData - Accuracy
Insights of Approximate Query Processing Systems PAGE 18
Base Table
fraction0.01
Sample Tables ...
0.00
0.00
0.01
0.01
0.02
0.02
fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3
Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction)
01,0002,0003,0004,0005,0006,0007,000
fraction 0.01
fraction 0.1
fraction 0.2
fraction 0.3
Snappy
Time (ms) for TPC-H Q14 result (SF=10) given different sample tables (fraction)
SnappyData- Creating Sample Tables
Insights of Approximate Query Processing Systems PAGE 19
0
50,000
100,000
150,000
200,000
250,000
fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3
Time (ms) for creating SnappyData sample tables with different fractions
VerdictDB - Latency
Insights of Approximate Query Processing Systems PAGE 20
206034
8891299195
17598 1521024355
0
50,000
100,000
150,000
200,000
250,000
Q1 Q6 Q14
Execution time (ms) using TPC-H (SF=10, fraction 0.1)
SparkSQL VerdictDB (> 95% accuracy)
Up to ~11x speedup!
VerdictDB - Speedup
Insights of Approximate Query Processing Systems PAGE 21
0
2
4
6
8
10
12
14
Q1 Q6 Q14
Speedup for TPC-H (SF=10, fraction=0.1)
Speedup
0
2
4
6
8
10
12
Q1 Q6 Q14
Speedup for TPC-H (SF=1, fraction=0.1)
Speedup
VerdictDB - Creating Sample Tables
Insights of Approximate Query Processing Systems PAGE 22
0
100,000
200,000
300,000
400,000
500,000
600,000
700,000
800,000
900,000
fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3
Time (ms) for creating VerdictDB sample tables with different fraction
fraction0.1
VerdictDB - Accuracy
Insights of Approximate Query Processing Systems PAGE 23
Base Table
fraction0.01
Sample Tables ...
0
0.05
0.1
0.15
0.2
0.25
fraction 0.01
fraction 0.05
fraction 0.1
fraction 0.2
fraction 0.3
Actual Error for TPC-H Q14 result (SF=10)given different sample tables (fraction)
0
20,000
40,000
60,000
80,000
100,000
120,000
fraction 0.01
fraction 0.05
fraction 0.1 fraction 0.2fraction 0.3 SparkSQL
Time (ms) for TPC-H Q14 result (SF=10)given different sample tables (fraction)
converge!
Other Queries?
Insights of Approximate Query Processing Systems PAGE 24
Q19Error: ~ 80%
Speedup: ~5.5X
Q14 Error: ~ 1.7%
Speedup: ~1.7X
Other Queries?
Insights of Approximate Query Processing Systems PAGE 25
Key missing in sample tables!
Careful design of sample tableor original table!
Q7AQP not working!
Insights
Insights of Approximate Query Processing Systems PAGE 26
§ AQP performs well:§ For aggregate functions such as SUM, AVG and COUNT
§ When WHERE is simple
§ Users’ foreseen is important!§ for both query and original table
Future Work
Insights of Approximate Query Processing Systems PAGE 27
§ Test error estimation in sampling
§ Other sampling techniques§ Biased Sampling
§ Database learning
§ Approximate hardware