CURE for Cubes: Cubing Using a ROLAP Engine Konstantinos Morfonios Yannis Ioannidis University of Athens VLDB 2006


Introduction

Execution Plan

External Partitioning

Storage Format

Experimental Evaluation

Conclusions

Introduction

SELECT region, sum(revenue)
FROM SALES
WHERE month = 'September'
GROUP BY region

Gray On Data-warehousing: CUBE

Introduction

SELECT A, B, C, SUM(M)
FROM R
GROUP BY A, B, C

SELECT A, B, SUM(M)
FROM R
GROUP BY A, B

SELECT SUM(M)
FROM R
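The queries above are some of the group-bys that a CUBE over A, B, C stands for. As a minimal sketch, the full set can be enumerated as every subset of the grouping attributes; the helper names `cube_group_bys` and `to_sql` are illustrative and do not come from the slides.

```python
from itertools import combinations

def cube_group_bys(dims):
    """Return every subset of the grouping attributes, most detailed first."""
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]

def to_sql(group_by, measure="M", table="R"):
    """Render one group-by of the cube as a SQL string."""
    cols = ", ".join(group_by)
    select = f"SELECT {cols + ', ' if cols else ''}SUM({measure}) FROM {table}"
    return select + (f" GROUP BY {cols}" if cols else "")

# One query per node of the flat cube lattice over A, B, C.
queries = [to_sql(g) for g in cube_group_bys(["A", "B", "C"])]
for q in queries:
    print(q)
```

For three flat dimensions this yields 2^3 = 8 queries, from the most detailed group-by on A, B, C down to the grand total.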

Introduction

Problems
• Construction algorithm
• Storage scheme

Focusing on ROLAP techniques (MVs)
• Stressed to limits?
• Complete solution?
• Unclear (not finished with efficient storage)
• Unclear (not focused on hierarchies)

Introduction

Challenges of hierarchies:
• Number of nodes: $\prod_{i=1}^{D} (L_i + 1)$, often $\gg 2^D$ (D dimensions, $L_i$ hierarchy levels in dimension i) → Efficient execution plan
• Small domains in the higher levels of dimension hierarchies → New partitioning algorithm
• Number of tuples increases → Novel storage scheme

CURE
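To make the growth in the number of nodes concrete, here is a small sketch. It assumes that a dimension with L hierarchy levels contributes L + 1 choices to a node (one per level, plus leaving the dimension out); the function name `lattice_nodes` is illustrative. The input is the three-dimensional hierarchy A0→A1→A2, B0→B1, C0 used in the later slides.

```python
from math import prod

def lattice_nodes(levels_per_dim):
    """Number of cube nodes when dimension i has levels_per_dim[i] hierarchy levels."""
    return prod(levels + 1 for levels in levels_per_dim)

# A has 3 levels, B has 2 levels, C has 1 level.
hierarchical = lattice_nodes([3, 2, 1])
# The same three dimensions without hierarchies: 2^D nodes.
flat = lattice_nodes([1, 1, 1])

print(hierarchical, flat)
```

Under these assumptions the hierarchical lattice has 24 nodes (including the empty grouping), compared with 8 for the flat cube.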

Execution Plan

Extend BUC (Bottom-Up-Cube) [BR99]
• Efficient pipelining
• Cheap identification of some kinds of redundancy
• Inherent support for iceberg cubes and holistic functions
• Existing “BUC-based” methods: BU-BST [WLFY02] and QC-Tables [LPH02]

Execution Plan

Dimensions: A, B, C

[Figure: flat cube lattice with nodes ABC; AB, AC, BC; A, B, C]

Execution Plan

Dimensions: A0→A1→A2, B0→B1, C0

Execution Plan

Dimensions: A0, A1, A2, B0, B1, C0

[Figure: execution plan with the hierarchy levels treated as flat dimensions; nodes include A0, A1, A2, B0, B1, C0, pairs such as A0B0 and A1B1, and triples such as A0B0C0 and A2B1C0]

Height: 3

Execution Plan

Dimensions: A0→A1→A2, B0→B1, C0

[Figure: execution plan over the hierarchical dimensions, with nodes ranging from A2, B1 and C0 to the most detailed node A0B0C0]

Height: 6
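One way to read the two heights is as the length of the longest chain of nodes in each plan. The sketch below builds such a chain under an assumption that is not stated on the slides: a step either refines the current dimension by one hierarchy level or adds the next dimension at its coarsest level. The function name `tallest_path` is illustrative.

```python
def tallest_path(hierarchies):
    """hierarchies: one list of level names per dimension, most detailed first."""
    path, fixed = [], []
    for levels in hierarchies:
        # Walk this dimension from its coarsest level down to its most detailed one.
        for level in reversed(levels):
            path.append("".join(fixed + [level]))
        # Keep the most detailed level before moving on to the next dimension.
        fixed.append(levels[0])
    return path

hierarchical = tallest_path([["A0", "A1", "A2"], ["B0", "B1"], ["C0"]])
flat = tallest_path([["A0"], ["B0"], ["C0"]])
print(hierarchical)
print(flat)
```

With hierarchies the chain passes through six nodes (A2, A1, A0, A0B1, A0B0, A0B0C0); with one level per dimension it passes through three (A0, A0B0, A0B0C0), matching the heights 6 and 3 quoted on the slides.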

Execution Plan

Important properties of BUC-based cubing:
• Recursive calls at higher levels tend to be cheaper
• The benefit of pruning the recursion early at some node N increases with the number of ancestors of N in the execution plan

Advantage of taller execution plans

[Figure: the flat lattice (ABC; AB, AC, BC; A, B, C) next to a subset of it (ABC; AB, AC; A)]

Execution Plan

CURE’s Plan:

[Figure: execution plan over the hierarchical dimensions A0→A1→A2, B0→B1, C0, with nodes ranging from A2, B1 and C0 to A0B0C0]

External Partitioning

[Figure: the fact table R and Memory shown next to the execution plan; R is split into Partitions, and the partitioning is labeled Sound]

External Partitioning

For sound partitioning: |Biggest partition| ≤ |M|
• In flat datasets this holds in general
• In hierarchical datasets…

External Partitioning

|R| = 500 GB, |M| = 1 GB
A0 (50,000)→A1 (500)→A2 (5)
|R|/|M| = 500

[Figure: the execution plan split into two groups of nodes. One group holds the nodes that contain A0 or A1: A0, A1, A0B0, A0B1, A0C0, A1B0, A1B1, A1C0, A0B0C0, A0B1C0, A1B0C0, A1B1C0. The other group holds the remaining nodes: A2, B0, B1, C0, A2B0, A2B1, A2C0, B0C0, B1C0, A2B0C0, A2B1C0]

A2B0C0 is |A0|/|A2| times smaller than R: |A2B0C0| ≈ 50 MB
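The numbers on these slides can be checked with a short calculation. The sketch below assumes uniformly sized partitions and picks the coarsest level of A whose partitions still fit in memory; that selection rule and the names `pick_partition_level`, `R_MB` and `M_MB` are assumptions made for illustration (details of the actual algorithm are in the paper). Sizes are in MB, with 1 GB taken as 1,000 MB.

```python
def pick_partition_level(r_size, mem_size, cardinalities):
    """cardinalities: distinct values per hierarchy level, most detailed first.

    Returns the index of the coarsest level whose average partition fits in
    memory, or None if no level qualifies.
    """
    best = None
    for level, card in enumerate(cardinalities):
        if r_size / card <= mem_size:  # assumes uniformly sized partitions
            best = level
    return best

R_MB, M_MB = 500_000, 1_000          # |R| = 500 GB, |M| = 1 GB
A_CARDS = [50_000, 500, 5]           # A0, A1, A2

level = pick_partition_level(R_MB, M_MB, A_CARDS)
partition_mb = R_MB / A_CARDS[level]

# Node A2B0C0 is |A0|/|A2| times smaller than R.
a2b0c0_mb = R_MB * A_CARDS[2] / A_CARDS[0]
print(level, partition_mb, a2b0c0_mb)
```

Partitioning on A1 gives 500 partitions of about 1 GB each, which just satisfies |Biggest partition| ≤ |M|, whereas partitioning on A2 would give 5 partitions of about 100 GB. The estimate for A2B0C0 is 50 MB, consistent with the figure on the slide.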

Storage Format

Two types of redundancy
• Dimensional Redundancy (DR)
• Aggregational Redundancy (AR)

Storage Format

Example with flat cube only for simplicity

[Figure: flat lattice (ABC; AB, AC, BC; A, B, C) used in place of the hierarchical execution plan]

[Figure: CUBE with DR next to CUBE’ without DR, with labeled tuples t, t1, t2 and t’]

Classify tuples according to AR into:

• Normal Tuples (NTs)

• Trivial Tuples (TTs)

• Common Aggregate Tuples (CATs)

Storage Format


Purpose of the previous example:
• Explanation of different types of redundancy
• Not construction algorithm

Constructing an uncompressed cube and then compressing it would be inefficient

Instead, CURE classifies tuples during construction itself (details in the paper)
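The slides name the three tuple classes without defining them, so the sketch below rests on assumed definitions: a Trivial Tuple (TT) is aggregated from a single fact tuple, a Common Aggregate Tuple (CAT) is aggregated from the same set of more than one fact tuples as some other cube tuple, and every other tuple is a Normal Tuple (NT). It deliberately builds all groups first and classifies afterwards, which is exactly the approach the slide calls inefficient; it only illustrates the categories and is not CURE's construction algorithm. The function name `classify` is illustrative.

```python
from collections import defaultdict
from itertools import combinations

def classify(rows, dims):
    """Label every cube tuple NT, TT or CAT under the assumed definitions."""
    sources = {}  # (node, values) -> set of contributing fact row ids
    for size in range(len(dims) + 1):
        for node in combinations(dims, size):
            groups = defaultdict(set)
            for rid, row in enumerate(rows):
                groups[tuple(row[d] for d in node)].add(rid)
            for values, rids in groups.items():
                sources[(node, values)] = frozenset(rids)
    # Count how many cube tuples are produced from each set of fact rows.
    shared = defaultdict(int)
    for rids in sources.values():
        shared[rids] += 1
    labels = {}
    for key, rids in sources.items():
        if len(rids) == 1:
            labels[key] = "TT"
        elif shared[rids] > 1:
            labels[key] = "CAT"
        else:
            labels[key] = "NT"
    return labels

rows = [{"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 2, "B": 3}]
labels = classify(rows, ["A", "B"])
```

In this toy input the groups A=1, B=1 and (A=1, B=1) all come from the same two fact rows, so they share their aggregates and are labeled CAT, while every group built from the third row alone is a TT.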

Experimental Evaluation

Hierarchical datasets: APB-1
• Product: Code (6,500) → Class (435) → Group (215) → Family (54) → Line (11) → Division (3)
• Customer: Store (640) → Retailer (71)
• Time: Month (17) → Quarter (6) → Year (2)
• Channel: Base (9)

Flat datasets: CovType, Sep85L, Synthetic
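As a worked example of how hierarchies inflate the cube, the node count for APB-1 can be computed from the level counts listed above. This assumes each dimension with $L_i$ levels contributes $L_i + 1$ choices to a node; the level counts (6, 2, 3, 1) are read directly off the dimension definitions.

```latex
\[
\prod_{i=1}^{4} (L_i + 1) = (6+1)(2+1)(3+1)(1+1) = 7 \cdot 3 \cdot 4 \cdot 2 = 168,
\qquad
2^{D} = 2^{4} = 16 .
\]
```

Under that assumption the hierarchical APB-1 cube has 168 nodes, roughly ten times the 16 nodes of a flat cube over the same four dimensions.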

Experimental Evaluation

Two versions of CURE:
• CURE
• CURE+

Experimental Evaluation

[Figure: time (min), on an axis from 0 to 300, vs. number of tuples in the fact table (1.E+06 to 1.E+09) for CURE and CURE+. Annotation: less than 3 hours]

Experimental Evaluation

[Figure: storage space (GB), on an axis from 0 to 10, vs. number of tuples in the fact table (1.E+06 to 1.E+09) for CURE and CURE+. Annotation: ≈ 6.8 GB]

Experimental Evaluation

[Figure: bar chart of time (sec), on an axis from 0 to 300, for BUC, BU-BST, FCURE, FCURE+, CURE and CURE+ on APB 0.4]

Experimental Evaluation

[Figure: bar chart of storage space (MB), on an axis from 0 to 450, for BUC, BU-BST, FCURE, FCURE+, CURE and CURE+ on APB 0.4]

Experimental Evaluation

[Figure: bar chart of time (sec), on an axis from 0 to 14, for BUC, BU-BST, FCURE, FCURE+, CURE and CURE+ on APB 0.4]

Conclusions

Main contribution: CURE
• Efficient execution plan
• New partitioning algorithm
• Novel storage scheme

Main advantages of CURE
• Efficient construction of complete cubes over large datasets with arbitrary hierarchies
• Cube compression
• Optimization opportunities for queries and updates
• Easy implementation

Current and Future Work
• Study of indexing for queries and updates
• Comparison with the most prominent MOLAP and Tree-based techniques

Questions???

Thank you!

Storage Format

[Figure: sequence of animation frames showing the Memory Image and the Disk Image side by side]