1
Scalable Discovery of Unique Column Combinations 1 Hasso Plattner Institute 2 Qatar Computing Research Institute Arvid Heise 1 Jorge-Arnulfo Quiané-Ruiz 2 Ziawasch Abedjan 1 Anja Jentzsch 1 Felix Naumann 1 Motivation Problem Large datasets at very fast rates: - Social networks - Scientific applications - Transactional applications - ... Finding uniques is crucial for: - Query optimization - Anomaly detection - Data modeling - Indexing - ... Unique column combinations often unknown in big datasets Exponential search space Finding unique column combinations is an NP-Hard problem DUCC Results * Work done in the context of Metanome: joint project between HPI and QCRI that provides a fresh view on data profiling and aims at providing scalability for Big Data. Website: http://www.hpi.uni-potsdam.de/naumann/projekte/metanome_data_profiling.html Lattice with eight attributes minimal unique unique maximal non-unique non-unique A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE o Graph divided into uniques and non-uniques o Minimal uniques summarize uniques o Maximal non-uniques summarize non-uniques o DuCC simultaneously detects (non-)uniques o Completeness verified (proof in paper) Modeled as graph coloring problem A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE 1. 2. 3. 4. 5. 6. 7. Random walk graph traversal o Randomly pick next CC from current CC o Go upwards if CC is non- unique o Go downwards if CC is unique o Trace back if no pruned CC is left o Check for holes by comparing min uniques and max non-uniques minimal unique unique maximal non-unique non-unique o Worker receives next CC from traversal algorithm o Performs fast check with PLI intersection o Maintains pruning data structures o Seed provider detects holes and calculates restart CC Modular architecture o Workers exchange minimal uniques and maximal non-uniques o Scale up with local event bus o Scale out with distributed event bus (ZooKeeper) o Fault-tolerance with Map-only Hadoop Job Scalable architecture Scaling the Number of Columns on 100,000 Rows Scaling the Number of Rows on 15 Columns Scaling up/out Solution space and checks Related Work Gordian: [Gordian: efficient and scalable discovery of composite keys. VLDB06] - Row-based approach - Prefix-tree data organization HCA: [Advancing the discovery of unique column combinations. CIKM11] - Column-based approach - Histograms- and value-counting-based Swan: [Detecting Unique Column Combinations on Dynamic Data. ICDE14] - Builds on top of DUCC - Focus on dealing with incremental data Number of Columns log Execution Time (s) 5 10 15 20 25 30 35 40 45 50 55 60 65 70 1 10 100 1000 10000 GORDIAN HCA DUCC log Execution Time (s) Number of Columns 5 10 15 16 10^0 10^1 10^2 10^3 10^4 GORDIAN HCA DUCC log Execution Time (s) Number of Rows 10,000 100,000 539,165 1 10 100 1000 10000 GORDIAN HCA DUCC log Execution Time (s) Number of Rows 10,000 100,000 1,000,000 6,001,215 1 10 100 1000 10000 GORDIAN HCA DUCC Number of Columns Execution Time (s) 0 5,000 10,000 15,000 20,000 20 25 30 35 40 45 50 1 node 2 nodes 4 nodes 8 nodes Minimal uniques Uniqueness checks 50,000 100,000 150,000 200,000 250,000 300,000 350,000 400,000 450,000 10,000 30,000 50,000 70,000 90,000 110,000 R 2 = 0.9983 Uniprot Uniprot TPC-H TPC-H

Scalable Discovery of Unique Column Combinationsda.qcri.org/jquiane/jorgequianeFiles/presentations/...- Prefix-tree data organization HCA: [Advancing the discovery of unique column

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable Discovery of Unique Column Combinationsda.qcri.org/jquiane/jorgequianeFiles/presentations/...- Prefix-tree data organization HCA: [Advancing the discovery of unique column

Scalable Discovery of Unique Column Combinations

1Hasso Plattner Institute 2Qatar Computing Research Institute

Arvid Heise1 Jorge-Arnulfo Quiané-Ruiz2 Ziawasch Abedjan1 Anja Jentzsch1 Felix Naumann1

Motivation Problem Large datasets at very fast rates: -  Social networks -  Scientific applications -  Transactional applications -  ...

Finding uniques is crucial for: -  Query optimization -  Anomaly detection -  Data modeling -  Indexing -  ...

Ø Unique column combinations often unknown in big datasets Ø Exponential search space

Finding unique column combinations is an NP-Hard problem

DUCC

Results

* Work done in the context of Metanome: joint project between HPI and QCRI that provides a fresh view on data profiling and aims at providing scalability for Big Data. Website: http://www.hpi.uni-potsdam.de/naumann/projekte/metanome_data_profiling.html

Lattice with eight attributes

minimal unique

unique

maximal non-unique

non-unique

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE o  Graph divided into uniques and non-uniques

o  Minimal uniques summarize uniques

o  Maximal non-uniques summarize non-uniques

o  DuCC simultaneously detects (non-)uniques

o  Completeness verified (proof in paper)

Modeled as graph coloring problem

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

1.

2. 3.

4. 5.

6.

7.

Random walk graph traversal o  Randomly pick next CC

from current CC o  Go upwards if CC is non-

unique o  Go downwards if CC is

unique o  Trace back if no pruned CC

is left o  Check for holes by

comparing min uniques and max non-uniques

minimal unique

unique

maximal non-unique

non-unique

o  Worker receives next CC from traversal algorithm

o  Performs fast check with PLI intersection

o  Maintains pruning data structures

o  Seed provider detects holes and calculates restart CC

Modular architecture

o  Workers exchange minimal uniques and maximal non-uniques

o  Scale up with local event bus

o  Scale out with distributed event bus (ZooKeeper)

o  Fault-tolerance with Map-only Hadoop Job

Scalable architecture

Scaling the Number of Columns on 100,000 Rows

Scaling the Number of Rows on 15 Columns

Scaling up/out Solution space and checks

Related Work Gordian: [Gordian: efficient and scalable discovery of composite keys. VLDB’06] -  Row-based approach -  Prefix-tree data organization

HCA: [Advancing the discovery of unique column combinations. CIKM’11] -  Column-based approach -  Histograms- and value-counting-based

Swan: [Detecting Unique Column Combinations on Dynamic Data. ICDE’14] -  Builds on top of DUCC -  Focus on dealing with incremental data

Number of Columns

log

Exec

utio

n Ti

me

(s)

5 10 15 20 25 30 35 40 45 50 55 60 65 701

10

100

1000

10000GORDIANHCADUCC

● ●

log

Exec

utio

n Ti

me

(s)

Number of Columns

5 10 15 1610^0

10^1

10^2

10^3

10^4● GORDIAN

HCADUCC

log

Exec

utio

n Ti

me

(s)

Number of Rows

10,000 100,000 539,1651

10

100

1000

10000GORDIANHCADUCC

log

Exec

utio

n Ti

me

(s)

Number of Rows

10,000 100,000 1,000,000 6,001,2151

10

100

1000

10000GORDIANHCADUCC

● ● ● ●●

●●

Number of Columns

Exec

utio

n Ti

me

(s)

0

5,000

10,000

15,000

20,000

20 25 30 35 40 45 50

● 1 node2 nodes4 nodes8 nodes

Minimal uniques

Uni

quen

ess

chec

ks

50,000

100,000

150,000

200,000

250,000

300,000

350,000

400,000

450,000

10,000 30,000 50,000 70,000 90,000 110,000

R2 = 0.9983

Uniprot

Uniprot

TPC-H

TPC-H