Querying Large Scientific Databases Luciano and Brian

Querying Large Scientific Databases - DASlab


Page 1: Querying Large Scientific Databases - DASlab

Querying Large Scientific Databases

Luciano and Brian

Page 2: Querying Large Scientific Databases - DASlab

Scientific Data Challenges

Data is in the terabytes

Queries are exploratory

“Paraphrased, the answer is in the database, but the Nobel-prize winning query is still unknown”

Unhelpful query results that take minutes to process

4 TB Database

VS

Page 3: Querying Large Scientific Databases - DASlab

Research Directions

One-minute database kernels for real-time performance.

Multi-scale query processing for gradual exploration.

Result-set post processing for conveying meaningful data.

Diagram labels: Q1, Scan, Post processing engine, AVG, MAX, Histogram

Page 4: Querying Large Scientific Databases - DASlab

Research Directions

Query morphing to adjust for proximity results.

Query alternatives to cope with lack of providence

Page 5: Querying Large Scientific Databases - DASlab

An Introduction to BlinkDB

Page 6: Querying Large Scientific Databases - DASlab

Motivation of BlinkDB

Real time analytics for web companies

Page 7: Querying Large Scientific Databases - DASlab

Taxonomy of Solutions

Page 8: Querying Large Scientific Databases - DASlab

Query Column Sets - Include Table

SELECT country, SUM(clicks), AVG(time_on_page)
FROM website_data
WHERE referral_website = 'Google'
GROUP BY country

A Query Column Set (QCS) consists of the columns that appear in the WHERE, HAVING, and GROUP BY clauses. Payload columns (those that are only aggregated or returned) are not part of it.

Query Column Set: referral_website, country
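The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `query_column_set` and its parameters are hypothetical, and it assumes the caller has already extracted the column names from each clause.

```python
def query_column_set(where_cols, group_by_cols, having_cols=()):
    """Return the Query Column Set: every column used to filter or group.

    Payload columns (only aggregated or returned) are deliberately not
    passed in, so they never end up in the set.
    """
    return frozenset(where_cols) | frozenset(group_by_cols) | frozenset(having_cols)


# The example query filters on referral_website and groups by country.
# clicks and time_on_page are payload columns, so they are left out.
qcs = query_column_set(where_cols=["referral_website"], group_by_cols=["country"])
print(sorted(qcs))
```

Using a `frozenset` makes the QCS order-independent and hashable, so two queries that filter and group on the same columns map to the same key.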

Page 9: Querying Large Scientific Databases - DASlab

Query Column Sets Distribution

The most popular queries use very few Query Column Sets

Figure panels: By Time, By Rank

The Query Column Sets being used are stable over time

Page 10: Querying Large Scientific Databases - DASlab

Limitations: Lack of (Full) Support for Joins

Full support only for joins where one table fits entirely into memory.

How much is this a problem?

Page 11: Querying Large Scientific Databases - DASlab

System Overview

Modifications to HiveQL to support confidence intervals, error estimates, and time constraints

Integration of sample selection into plan formation

Integration into Hive

Creation and maintenance of samples. Integrates into the distributed file system and cache

Run query Q over file cache when deciding on sample selection

Page 12: Querying Large Scientific Databases - DASlab

HiveQL Extension

SELECT COUNT(*)
FROM Sessions
WHERE Genre = 'western'
GROUP BY OS
ERROR WITHIN 10% AT CONFIDENCE 95%
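To make the error clause concrete, here is a minimal Python sketch of how a COUNT answered from a uniform random sample could be checked against "ERROR WITHIN 10% AT CONFIDENCE 95%". It assumes a normal approximation to the binomial and ignores finite-population corrections. The function names and the example numbers are hypothetical, and this is not claimed to be BlinkDB's actual estimator.

```python
import math


def estimate_count(sample_matches, sample_size, table_size, z=1.96):
    """Estimate COUNT(*) for a predicate from a uniform random sample.

    Returns (estimate, half_width), where half_width is the half-width of
    the confidence interval (z=1.96 corresponds to roughly 95%).
    """
    p = sample_matches / sample_size                 # observed match fraction
    estimate = p * table_size                        # scale up to the full table
    std_err = math.sqrt(p * (1 - p) / sample_size) * table_size
    return estimate, z * std_err


def meets_error_bound(estimate, half_width, max_rel_error=0.10):
    """True if the interval half-width is within the requested relative error."""
    return estimate > 0 and half_width / estimate <= max_rel_error


# Hypothetical numbers: 400 of 10,000 sampled rows match, table has 1M rows.
est, hw = estimate_count(400, 10_000, 1_000_000)
print(round(est), round(hw), meets_error_bound(est, hw))
```

If the bound is not met, the natural response is to rerun on a larger sample, which is the trade-off the time and error constraints expose to the user.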

Page 13: Querying Large Scientific Databases - DASlab

Stratified Samples

Uniform Sample Stratified Sample

Query

Page 14: Querying Large Scientific Databases - DASlab

Stratified Sampling (Continued)

Page 15: Querying Large Scientific Databases - DASlab

Storage of Stratified Samples

Page 16: Querying Large Scientific Databases - DASlab

Limitations of Stratified Samples

Large

Expensive to create

Sorting

Page 17: Querying Large Scientific Databases - DASlab

How to Create Samples

● Count the number of occurrences of each possible ∣ϕ∣-tuple

● Sort the table by the columns in ϕ

● For each ∣ϕ∣-tuple that occurs more than K times, randomly sample K of those occurrences; tuples that occur at most K times are kept in full

● Finally, write the sample out to disk
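The steps above can be sketched in memory as follows. This is a simplified illustration, not the authors' implementation: it groups rows with a dictionary instead of sorting a distributed table, it returns the sample rather than writing it to disk, and the names `stratified_sample`, `rows`, `qcs`, and `k` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, qcs, k, seed=0):
    """Build a stratified sample on the columns in `qcs`, capping each group at k rows.

    rows: list of dicts (one per table row)
    qcs:  tuple of column names (the set ϕ)
    k:    maximum number of rows kept per distinct tuple of qcs values
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:                                  # bucket rows by their ϕ-tuple
        groups[tuple(row[c] for c in qcs)].append(row)

    sample = []
    for key in sorted(groups):                        # emit groups in sorted order
        members = groups[key]
        if len(members) > k:                          # frequent group: keep k random rows
            members = rng.sample(members, k)
        sample.extend(members)                        # rare group: keep every row
    return sample


# Hypothetical data: one very common value and one rare value.
rows = [{"country": "US", "clicks": i} for i in range(100)]
rows += [{"country": "IS", "clicks": i} for i in range(3)]
sample = stratified_sample(rows, qcs=("country",), k=10)
print(len(sample))
```

The rare group survives intact, which is the point of stratifying: a uniform sample of the same size could easily miss it altogether.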

How long does this take?

According to the authors, 5-30 minutes. (To be examined later)

Page 18: Querying Large Scientific Databases - DASlab

What Samples should we create?

We want Query Column Sets that are used by our workload

We want to use QCSs whose samples take up less space
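One way to picture this trade-off between workload coverage and storage is a greedy pass under a storage budget. The sketch below is a simplified heuristic written for illustration only and is not the authors' selection method. The function name `choose_samples`, the tuple layout, and all numbers are hypothetical.

```python
def choose_samples(candidates, budget):
    """Greedily pick QCSs with the best workload share per unit of storage.

    candidates: list of (qcs, workload_share, storage_cost) tuples
    budget:     total storage available for samples (same unit as storage_cost)
    """
    chosen, used = [], 0
    # Rank by "fraction of queries served per unit of storage", best first.
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for qcs, share, cost in ranked:
        if used + cost <= budget:                     # take it only if it still fits
            chosen.append(qcs)
            used += cost
    return chosen


# Hypothetical workload: shares sum to 1, costs are in arbitrary storage units.
candidates = [
    (("country",), 0.5, 10),
    (("country", "os"), 0.3, 40),
    (("city",), 0.2, 5),
]
print(choose_samples(candidates, budget=20))
```

A greedy ranking ignores interactions between samples, for example a sample on one QCS partially serving queries on a subset of its columns, so it should be read as a rough mental model only.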

Page 19: Querying Large Scientific Databases - DASlab

Sample Selection and Runtime

Page 20: Querying Large Scientific Databases - DASlab

Query Planner

Stratified Sample

Uniform Sample

QCS1 QCS2

Sample Selection and Runtime

Page 21: Querying Large Scientific Databases - DASlab

Sample Selection and Runtime

Query Planner

Error Latency Profile

Error Profile

Latency Profile
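A rough sketch of how an error profile could drive sample selection: run the query on a small pilot sample, then extrapolate the size needed to hit the target error. It assumes the error of the aggregate shrinks proportionally to 1/sqrt(n), which holds for mean-like aggregates but is an assumption here. The function names and numbers are hypothetical.

```python
import math


def required_sample_size(pilot_size, pilot_error, target_error):
    """Extrapolate the rows needed, assuming error scales as 1/sqrt(n)."""
    return math.ceil(pilot_size * (pilot_error / target_error) ** 2)


def pick_sample(available_sizes, pilot_size, pilot_error, target_error):
    """Return the smallest pre-built sample size that should meet the target.

    Returns None when no available sample is large enough.
    """
    needed = required_sample_size(pilot_size, pilot_error, target_error)
    for size in sorted(available_sizes):
        if size >= needed:
            return size
    return None


# Hypothetical pilot: 1,000 rows gave 20% relative error; the target is 10%.
print(required_sample_size(1000, 0.2, 0.1))
print(pick_sample([1000, 5000, 20000], 1000, 0.2, 0.1))
```

A latency profile would work the same way in the other direction: estimate how many rows can be scanned within the time bound, then pick the largest sample that fits.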

Page 22: Querying Large Scientific Databases - DASlab

Query Execution Bias Correction

Sample Selection and Runtime

Page 23: Querying Large Scientific Databases - DASlab

Results

Page 24: Querying Large Scientific Databases - DASlab

Query Column Sets Distribution

90% of the queries come from 10% of the templates. However, this is still 18 Query Column Sets for Conviva and 45 different Query Column Sets for Facebook.

Figure panels: By Time, By Rank

Page 25: Querying Large Scientific Databases - DASlab

Performance - Simple Query on Filter, Groupby, Avg.

Page 26: Querying Large Scientific Databases - DASlab

Results

● Is this experiment fair? What portion of the speedup is due to the probabilistic framework? What portion is due to the clustering and redundancy that BlinkDB employs?

● BlinkDB states that sample build time should be between 5 and 30 minutes. This experiment shows a simple two-column filter, GROUP BY, aggregate query taking 20 minutes. Does this seem reasonable?

Page 27: Querying Large Scientific Databases - DASlab

Accuracy of Results

Accuracy of results by sampling method.

Page 28: Querying Large Scientific Databases - DASlab

Extensions and Open Questions

1) Could we carry around only certain payload columns with each QCS?

2) Could we build this index partially and incrementally? Why or why not?

3) How does this apply to scientific data management?