
Page 1: Querying Big Data Tractability revisited for querying big data BD-tractability


Querying Big Data: Theory and Practice

Theory– Tractability revisited for querying big data– Parallel scalability– Bounded evaluability

Techniques– Parallel algorithms – Bounded evaluability and access constraints– Query-preserving compression– Query answering using views– Bounded incremental query processing

TDD: Topics in Distributed Databases

Page 2:

Tractability revisited for querying big data

Parallel scalability

Bounded evaluability

New theory for querying big data

To query big data, we have to determine whether it is feasible at all.

For a class Q of queries, can we find an algorithm T such that, given any Q in Q and any big dataset D, T efficiently computes the answer Q(D) of Q in D within our available resources?

Fundamental question

Is this feasible or not for Q?

Page 3:

BD-tractability

Page 4:

The good, the bad and the ugly

Traditional computational complexity theory, developed over almost 50 years:

• The good: polynomial time computable (PTIME)

• The bad: NP-hard (intractable)

• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…

Polynomial time queries become intractable on big data!

What happens when it comes to big data?

Using an SSD with a 6GB/s scan rate, a linear scan of a dataset D would take:

• 1.9 days when D is 1 PB (10^15 B)

• 5.28 years when D is 1 EB (10^18 B)

O(n) time is already beyond reach on big data in practice!

Page 5:

Complexity classes within P

NC (Nick’s class): highly parallel feasible
• parallel polylog time
• polynomially many processors

Too restrictive to include practical queries feasible on big data

BIG open: P = NC?

L: O(log n) space
NL: nondeterministic O(log n) space
polylog-space: log^k(n) space


Polynomial time algorithms are no longer tractable on big data. So we may consider “smaller” complexity classes

parallel log^k(n) time

as hard as P = NP

L ⊆ NL ⊆ polylog-space; NC ⊆ P

Page 6:

Tractability revisited for queries on big data

A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined:

• D’ = Π(D)
• for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC)

BD-tractable queries are feasible on big data

Does it work? If a linear scan of D could be done in log(|D|) time:
• 15 seconds when D is 1 PB, instead of 1.9 days
• 18 seconds when D is 1 EB, rather than 5.28 years

Figure: D is preprocessed into Π(D); queries Q1(Π(D)), Q2(Π(D)), … are then answered in parallel log^k time in |D| and |Q|

Page 7:

BD-tractable queries

A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined:

• D’ = Π(D)
• for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC)

Preprocessing: a common practice of database people

• one-time process, offline, once for all queries in Q
• indices, compression, views, incremental computation, …

not necessarily reduce the size of D

BDTQ0: the set of all BD-tractable query classes

in parallel with more resources

What is the maximum size of D’ ?

Page 8:

What query classes are BD-tractable?

Boolean selection queries
Input: A dataset D
Query: Does there exist a tuple t in D such that t[A] = c?

Build a B+-tree on the A-column values in D. Then all such selection queries can be answered in O(log(|D|)) time.
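The same idea can be sketched in a few lines; this is a toy stand-in (a sorted array plus binary search instead of an actual B+-tree), with hypothetical names:

```python
import bisect

def preprocess(D, attr):
    """One-time PTIME preprocessing: a sorted index on attribute attr."""
    return sorted(t[attr] for t in D)

def exists(index, c):
    """Boolean selection t[attr] = c by binary search: O(log |D|) per query."""
    i = bisect.bisect_left(index, c)
    return i < len(index) and index[i] == c

D = [{"A": 3}, {"A": 7}, {"A": 1}]
idx = preprocess(D, "A")   # built once, reused by all selection queries
print(exists(idx, 7))      # True
print(exists(idx, 5))      # False
```

The preprocessing is paid once; every subsequent query touches only O(log |D|) entries of the index.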

Some natural query classes are BD-tractable

Relational algebra + set recursion on ordered relational databases

D. Suciu and V. Tannen: A query language for NC, PODS 1994

What else?

Graph reachability queries
Input: A directed graph G
Query: Does there exist a path from node s to node t in G?

NL-complete
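For concreteness, a minimal sketch of the query itself (toy code, not from the deck): reachability by breadth-first search in O(|V| + |E|) sequential time. The slide's point is that this class is NL-complete, so whether it is BD-tractable is non-obvious.

```python
from collections import deque

def reachable(edges, s, t):
    """Is there a directed path from s to t? Plain BFS."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

print(reachable([(1, 2), (2, 3)], 1, 3))  # True
print(reachable([(1, 2), (2, 3)], 3, 1))  # False
```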

Page 9:

Dealing with queries that are not BD-tractable

Many query classes are not BD-tractable.

Can we make it BD-tractable?

No. The problem is well known to be P-complete! We need PTIME to process each query (G, (u, v))! Preprocessing does not help us answer such queries.

Breadth-Depth Search (BDS)
Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
Question: Is u visited before v in the breadth-depth search of G?

BDS starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s’s children are visited, it continues with the node on top of the stack, which then plays the role of s.

Is this problem (query class) BD-tractable?

Here D is empty and Q is (G, (u, v)).
What is P-complete?

Page 10:

Make queries BD-tractable

Factorization: partition instances to identify a data part D for preprocessing, and a query part Q for operations

BDTQ: The set of all query classes that can be made BD-tractable

Preprocessing: Π(G) performs BDS on G, and returns a list M consisting of the nodes in V in the same order as they are visited

For all queries (u, v), whether u occurs before v can be decided by a binary search on M, in log(|M|) time

Breadth-Depth Search (BDS)
Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
Question: Is u visited before v in the breadth-depth search of G?

Factorization: D is G = (V, E), Q is (u, v)

after proper factorization
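A toy sketch of this factorization (hypothetical code, not from the paper): Π computes the BDS visit order M once in PTIME; each query (u, v) is then answered from M alone. For simplicity we store positions in a dict (O(1) lookup) rather than the binary search on M described above.

```python
def bds_order(adj, start):
    """Breadth-depth search: visit all children of the current node, push
    them in reverse numbering order, then continue from the stack top."""
    visited, on_list, stack = [start], {start}, []
    cur = start
    while True:
        children = [v for v in sorted(adj.get(cur, [])) if v not in on_list]
        for v in children:                # breadth step: visit all children
            on_list.add(v)
            visited.append(v)
        stack.extend(reversed(children))  # reverse numbering order: lowest on top
        if not stack:
            return visited
        cur = stack.pop()

def preprocess(adj, start):               # the function Pi: one-time, PTIME
    return {v: i for i, v in enumerate(bds_order(adj, start))}

def before(pos, u, v):                    # the query part: O(1) per query
    return pos[u] < pos[v]

pos = preprocess({1: [2, 3], 2: [4]}, 1)  # visit order: 1, 2, 3, 4
print(before(pos, 2, 3))  # True
```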

Page 11:

Fundamental problems for BD-tractability

BD-tractable queries help practitioners determine what query classes are tractable on big data.

Are we done yet?

No: a number of questions arise in connection with any complexity class!

Reductions: how to transform a problem to another in the class that we know how to solve, and hence make it BD-tractable?

Complete problems: Is there a natural problem (a class of queries) that is the hardest one in the complexity class? A problem to which all problems in the complexity class can be reduced

How large is BDTQ? BDTQ0? Compared to P? NC?

Analogous to our familiar NP-complete problems

Why do we care?

Fundamental to any complexity classes: P, NP, …

Name one NP-complete problem that you know

Why do we need reduction?

Page 12:

Reductions

Departing from our familiar polynomial-time reductions, we need reductions that are in NC, and deal with both data D and query Q!

transform a given problem to one that we know how to solve

NC-factor reductions (≤NC): a pair of NC functions that allow re-factorizations (repartitioning the data part and the query part), for BDTQ

F-reductions (≤F): a pair of NC functions that do not allow re-factorizations, for BDTQ0

transformations for making queries BD-tractable

Properties:
• transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then Q1 ≤NC Q3 (similarly for ≤F)
• compatibility: if Q1 ≤NC Q2 and Q2 is in BDTQ, then so is Q1; if Q1 ≤F Q2 and Q2 is in BDTQ0, then so is Q1

to determine whether a query class is BD-tractable

Page 13:

Complete problems for BDTQ

A query class Q is complete for BDTQ if Q is in BDTQ, and moreover, for any query class Q’ in BDTQ, Q’ ≤NC Q

A query class Q is complete for BDTQ0 if Q is in BDTQ0, and for any query class Q’ in BDTQ0, Q’ ≤F Q

There exists a natural query class Q that is complete for BDTQ

BDS is both P-complete and BDTQ-complete!

Is there a complete problem for BDTQ?

Breadth-Depth Search (BDS)

What does this tell us?

Page 14:

Is there a complete problem for BDTQ0?

A query class Q is complete for BDTQ0 if Q is in BDTQ0, and for any query class Q’ in BDTQ0, Q’ ≤F Q

Unless P = NC, a query class complete for BDTQ0 is a witness in P \ NC

Whether P = NC is as hard as whether P = NP. If we can find a complete problem for BDTQ0 and show that it is not in NC, then we settle the big open question of whether P = NC.

It is hard to find a complete problem for BDTQ0

An open problem

Page 15:

Comparing with P and NC

NC ⊆ BDTQ = P

All PTIME query classes can be made BD-tractable!

Unless P = NC, NC ⊊ BDTQ0 ⊊ P

Unless P = NC, not all PTIME query classes are BD-tractable

need proper factorizations to answer PTIME queries on big data

How large is BDTQ? How large is BDTQ0?

separation

Figure: PTIME split into BD-tractable and not-BD-tractable query classes

Properly contained in P

Not all polynomial-time queries are BD-tractable

Page 16:

Polynomial hierarchy revisited

Tractability revisited for querying big data

NP and beyond

Figure: P split into BD-tractable and not-BD-tractable classes, with the BD-tractable part answerable in parallel polylog time

Page 17:

What can we get from BD-tractability?

Guidelines for the following.

Why we need to study theory for querying big data

What query classes are feasible on big data?

What query classes can be made feasible to answer on big data?

How to determine whether it is feasible to answer a class Q of queries on big data?

Reduce Q to a complete problem Qc for BDTQ via ≤NC

If so, how to answer queries in Q?

• Identify factorizations (≤NC reductions) such that Q ≤NC Qc

• Compose the reduction and the algorithm for answering queries of Qc


Page 18:

Parallel scalability


Page 19:

Parallel query answering

BD-tractability is hard to achieve.

Parallel processing is widely used, given more resources

Figure: a shared-nothing architecture: processors (P), each with memory (M) and a database partition (DB), connected by an interconnection network

10,000 processors

Parallel scalable: the more processors, the “better”?

Using 10,000 SSDs of 6GB/s, a linear scan of D might take:
• 1.9 days / 10,000 ≈ 16 seconds when D is 1 PB (10^15 B)
• 5.28 years / 10,000 ≈ 4.63 days when D is 1 EB (10^18 B)

Only ideally!

How to define “better”?

Page 20:

Degree of parallelism -- speedup

Speedup for a given task: TS/TL
• TS: time taken by a traditional DBMS
• TL: time taken by a parallel system with more resources
• TS/TL: more resources mean proportionally less time for the task

Linear speedup: the speedup is N while the parallel system has N times resources of the traditional system

Figure: speed (throughput / response time) vs. resources; linear speedup is a straight line

Question: can we do better than linear speedup?

Page 21:

Degree of parallelism -- scaleup

Scaleup: TS/TL
• a task Q, and a task QN, N times bigger than Q
• a DBMS MS, and a parallel DBMS ML, N times larger

TS: time taken by MS to execute Q

TL: time taken by ML to execute QN

Linear scaleup: TL = TS, i.e., the time is constant if the resources increase in proportion to the increase in problem size

Figure: TS/TL stays constant as resources and problem size grow together

Question: can we do better than linear scaleup?

Page 22:

Better than linear scaleup/speedup?

NO, even hard to achieve linear speedup/scaleup!

• Startup costs: initializing each process

• Interference: competing for shared resources (network, disk, memory or even locks)

• Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part

Linear speedup is the best we can hope for -- optimal!

Give 3 reasons

Think of blocking in MapReduce

Data shipment cost for shared-nothing architectures. In the real world, linear scaleup is too ideal to achieve!

A weaker criterion: the more processors are available, the less response time it takes.

Page 23:

Parallel query answering

Given a big graph G, and n processors S1, …, Sn:
• G is partitioned into fragments (G1, …, Gn)
• G is distributed to the n processors: fragment Gi is stored at Si

Performance guarantees: bounds on response time and data shipment


Parallel query answering
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G

Response time (aka parallel computation cost): the interval from the time when Q is submitted to the time when Q(G) is returned

Data shipment (aka network traffic): the total amount of data shipped between different processors, as messages

What is it? Why do we need to worry about it?

Page 24:

Parallel scalability

A distributed algorithm is useful if it is parallel scalable

Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q Output: Q(G), the answer to Q in G

Complexity:
• t(|G|, |Q|): the time taken by a sequential algorithm with a single processor
• T(|G|, |Q|, n): the time taken by a parallel algorithm with n processors

Parallel scalable: if

T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k)

Polynomial reduction (including the cost of data shipment, k is a constant)

When G is big, we can still query G by adding more processors, if we can afford them.

Page 25:

Linear scalability

Querying big data by adding more processors

An algorithm T for answering a class Q of queries Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q Output: Q(G), the answer to Q in G

Algorithm T is linearly scalable
• in computation if its parallel complexity is a function of |Q| and |G|/n, and
• in data shipment if the total amount of data shipped is a function of |Q| and n

The more processors, the less response time

Independent of the size |G| of big G

Is it always possible?

Page 26:

Graph pattern matching via graph simulation

Input: a graph pattern Q and a graph G

Output: Q(G) is a binary relation S on the nodes of Q and G

O((|V| + |VQ|) (|E| + |EQ|)) time

• each node u in Q is mapped to a node v in G, such that (u, v) ∈ S

• for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G, such that (u’, v’) ∈ S
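The conditions above suggest a simple fixpoint computation; this is a minimal, naive sketch (toy code, not the quoted O((|V|+|VQ|)(|E|+|EQ|)) algorithm): start from all label-compatible pairs and discard (u, v) whenever some pattern edge (u, u') has no matching edge (v, v') with (u', v') still in the relation.

```python
def simulation(q_edges, g_edges, q_label, g_label):
    """Maximal graph simulation of pattern (q_label, q_edges) in data graph."""
    g_succ = {}
    for v, v2 in g_edges:
        g_succ.setdefault(v, set()).add(v2)
    # start with all label-compatible candidate matches
    sim = {u: {v for v in g_label if g_label[v] == lbl}
           for u, lbl in q_label.items()}
    changed = True
    while changed:                      # shrink until a fixpoint is reached
        changed = False
        for u, u2 in q_edges:
            for v in list(sim[u]):
                if not (g_succ.get(v, set()) & sim[u2]):
                    sim[u].discard(v)   # v has no successor matching u2
                    changed = True
    return sim

q_label = {"x": "A", "y": "B"}
g_label = {1: "A", 2: "B", 3: "A"}
# pattern edge x -> y; node 3 has label A but no B-successor, so it drops out
print(simulation([("x", "y")], [(1, 2)], q_label, g_label))
```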


Parallel scalable?

Page 27:

Impossibility

Nontrivial to develop parallel scalable algorithms

There exists NO algorithm for distributed graph simulation that is parallel scalable in either computation or data shipment

Why?

Pattern: 2 nodes. Graph: 2n nodes, distributed to n processors.

Possibility: when G is a tree, parallel scalable in both response time and data shipment

Page 28:

Weak parallel scalability

Rationale: we can partition G as preprocessing, such that
• |Ef| is minimized (an NP-complete problem, but there are effective heuristic algorithms), and
• when G grows, |Ef| does not increase substantially

Algorithm T is weakly parallel scalable
• in computation if its parallel computation cost is a function of |Q|, |G|/n and |Ef|, and
• in data shipment if the total amount of data shipped is a function of |Q| and |Ef|

The cost is not a function of |G| in practice

Doable: graph simulation is weakly parallel scalable

Ef: the set of edges across different fragments

Page 29:

MRC: Scalability of MapReduce algorithms

Characterize scalable MapReduce algorithms in terms of disk usage, memory usage, communication cost, CPU cost and rounds.

For a constant ε > 0 and a dataset D, with |D|^(1-ε) machines, a MapReduce algorithm is in MRC if:
• Disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total
• Memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total
• Data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) data, O(|D|^(2-2ε)) in total
• CPU: in each round, each machine takes time polynomial in |D|
• Rounds: polylog in |D|, that is, log^k(|D|)

the larger D is, the more processors

The response time is still a polynomial in |D|

Page 30:

MMC: a revision of MRC

For a dataset D and n machines, a MapReduce algorithm is in MMC if:
• Disk: each machine uses O(|D|/n) disk, O(|D|) in total
• Memory: each machine uses O(|D|/n) memory, O(|D|) in total
• Data shipment: in each round, each machine sends or receives O(|D|/n) data, O(|D|) in total
• CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine
• Rounds: O(1), a constant number of rounds

Speedup: O(Ts/n) time: the more machines are used, the less time is taken

Compared with BD-tractable and parallel scalability

What algorithms are in MRC? Recursive computation? Restricted to MapReduce

Page 31:

Bounded evaluability

Page 32:

Scale independence

Input: A class Q of queries

Question: Can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M, and
• Q(D) = Q(DQ)?

Making the cost of computing Q(D) independent of |D|!

Independent of the size of D

Particularly useful for

A single dataset D, e.g., the social graph of Facebook

Minimum DQ – the necessary amount of data for answering Q

Figure: Q(D) = Q(DQ), where DQ is a small fraction of D

Page 33:

Facebook: Graph Search

Find me restaurants in New York my friends have been to in 2013

select rid

from friend(pid1, pid2), person(pid, name, city),

dine(pid, rid, dd, mm, yy)

where pid1 = p0 and pid2 = person.pid and

pid2 = dine.pid and city = NYC and yy = 2013

How many tuples do we need to access?

Facebook: 5000 friends per person

Each year has at most 366 days

Each person dines at most once per day

pid is a key for relation person

Access constraints (real-life limits)

Page 34:

Bounded query evaluation

Accessing 5000 + 5000 + 5000 * 366 tuples in total

• Fetch 5000 pid’s for friends of p0 (5000 friends per person)

• For each pid, check whether she lives in NYC (5000 person tuples)

• For each pid living in NYC, find restaurants where they dined in 2013 (at most 5000 * 366 tuples)
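The plan can be sketched directly (toy data and hypothetical index names; each access constraint is realised as an index): the plan touches at most 5000 + 5000 + 5000 * 366 tuples, regardless of how big the dataset is.

```python
# indexes realising the access constraints
friends_of = {"p0": ["p1", "p2"]}          # pid1 -> (pid2, 5000)
city_of = {"p1": "NYC", "p2": "LA"}        # pid -> (city, 1)
dined = {("p1", 2013): ["r1", "r9"],       # (pid, yy) -> (rid, 366)
         ("p2", 2013): ["r3"]}

def restaurants(p0, yy):
    """Restaurants in NYC that friends of p0 visited in year yy."""
    rids = set()
    for pid in friends_of.get(p0, []):             # <= 5000 pid's
        if city_of.get(pid) == "NYC":              # <= 5000 lookups
            rids.update(dined.get((pid, yy), []))  # <= 5000 * 366 rid's
    return rids

print(sorted(restaurants("p0", 2013)))  # ['r1', 'r9']
```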

Contrast this with Facebook: more than 1.38 billion nodes, and over 140 billion links

A query plan

Find me restaurants in New York my friends have been to in 2013

select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013

Page 35:

Access constraints

On a relation schema R: X → (Y, N)
• X, Y: sets of attributes of R
• for any X-value, there exist at most N distinct Y-values
• index on X for Y: given an X-value, find the relevant Y-values

Access schema: A set of access constraints

Combining cardinality constraints and index

friend(pid1, pid2): pid1 → (pid2, 5000) (5000 friends per person)

dine(pid, rid, dd, mm, yy): (pid, yy) → (rid, 366) (each year has at most 366 days and each person dines at most once per day)

person(pid, name, city): pid → (city, 1) (pid is a key for relation person)

Examples

Page 36:

Finding access schema

On a relation schema R: X → (Y, N)

Bounded evaluability: only a small number of access constraints

Functional dependencies X → Y: X → (Y, 1)

Keys X: X → (R, 1)

Domain constraints, e.g., each year has at most 366 days

Real-life bounds: 5000 friends per person (Facebook)

The semantics of real-life data, e.g., accidents in the UK from 1975-2005:

• (dd, mm, yy) → (aid, 610): at most 610 accidents in a day

• aid → (vid, 192): at most 192 vehicles in an accident

Discovery: an extension of functional dependency discovery, e.g., TANE

How to find these?
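A crude sketch of discovery from data (toy code, only loosely analogous to FD discovery a la TANE): for candidate attribute sets X and Y, take N to be the maximum number of distinct Y-values observed for any X-value.

```python
from collections import defaultdict

def discover_bound(D, X, Y):
    """Smallest N such that X -> (Y, N) holds on this instance of D."""
    groups = defaultdict(set)
    for t in D:
        groups[tuple(t[a] for a in X)].add(tuple(t[a] for a in Y))
    return max(len(ys) for ys in groups.values())

dine = [{"pid": "p1", "yy": 2013, "rid": "r1"},
        {"pid": "p1", "yy": 2013, "rid": "r2"},
        {"pid": "p2", "yy": 2013, "rid": "r1"}]
# (pid, yy) -> (rid, N) with N = 2 on this toy instance
print(discover_bound(dine, ["pid", "yy"], ["rid"]))  # 2
```

Note that a bound mined this way holds only on the given instance; domain knowledge (days per year, friend limits) is what makes it a trustworthy constraint.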

Page 37:

Bounded queries

Input: A class Q of queries, an access schema A
Question: Can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M, and
• Q(D) = Q(DQ)?

Examples

The graph search query at Facebook

All Boolean conjunctive queries are bounded
• Boolean: Q(D) is true or false
• Conjunctive: SPC (selection, projection, Cartesian product)

Boundedness: to decide whether it is possible to compute Q(D) by accessing a bounded amount of data at all

What are these?

But how to find DQ?

Page 38:

Boundedly evaluable queries

Input: A class Q of queries, an access schema A
Question: Can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D) = Q(DQ), and moreover,
• DQ can be identified in time determined by Q and A?

Examples

The graph search query at Facebook

All Boolean conjunctive queries are bounded, but are not necessarily effectively bounded!

effectively find

If Q is boundedly evaluable, then for any big D, we can efficiently compute Q(D) by accessing a bounded amount of data!

Page 39:

Deciding bounded evaluability

Input: A query Q, an access schema A Question: Is Q boundedly evaluable under A?

Yes, doable

Conjunctive queries (SPC) with restricted query plans:
• Characterization: sound and complete rules
• PTIME algorithms for checking effective boundedness and for generating query plans, in |Q| and |A|

Relational algebra (SQL): undecidable

Many practical queries are in fact boundedly evaluable!

What can we do?

• Special cases
• Sufficient conditions

Parameterized queries in recommendation systems, even SQL

Page 40:

Techniques for querying big data


Page 41:

An approach to querying big data

Given a query Q, an access schema A and a big dataset D

1. Decide whether Q is effectively bounded under A

2. If so, generate a bounded query plan for Q

3. Otherwise, do one of the following:

① Extend access schema or instantiate some parameters of Q, to make Q effectively bounded

② Use other tricks to make D small (to be seen shortly)

③ Compute approximate query answers to Q in D

Very effective for conjunctive queries

• 77% of conjunctive queries are boundedly evaluable; efficiency: 9 seconds vs. 14 hours with MySQL

• 60% of graph pattern queries (via subgraph isomorphism) are boundedly evaluable; improvement: 4 orders of magnitude

Page 42:

Bounded evaluability using views

Input: A class Q of queries, a set V of views, an access schema A
Question: Can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction DQ of D such that
• |DQ| ≤ M,
• there is a rewriting Q’ of Q using V,
• Q(D) = Q’(DQ, V(D)), and
• DQ can be identified in time determined by Q, V, and A?

access views, and additionally

a bounded amount of data

Query Q may not be boundedly evaluable, but may be boundedly evaluable with views!

Figure: Q(D) = Q’(DQ, V(D))

Page 43:

Incremental bounded evaluability

Input: A class Q of queries, an access schema A
Question: Can we find by using A, for any query Q ∈ Q, any dataset D, and any changes ΔD to D, a fraction DQ of D such that
• |DQ| ≤ M,
• Q(D ⊕ ΔD) = Q(D) ⊕ ΔM, where ΔM is computed from ΔD and DQ, and
• DQ can be identified in time determined by Q and A?

access an additional bounded amount of data

Query Q may not be boundedly evaluable, but may be incrementally boundedly evaluable!

Figure: the new output Q(D ⊕ ΔD) is computed from the old output Q(D), the changes ΔD, and a bounded fraction DQ

Page 44:

Parallel query processing

Divide and conquer

partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites

upon receiving a query Q,

• evaluate Q( Gi ) in parallel

• collect partial answers at a coordinator site, and assemble them to find the answer Q( G ) in the entire G

evaluate Q on smaller Gi

Parallel processing = Partial evaluation + message passing
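The scheme above can be sketched in a few lines (toy code with threads standing in for sites; the "query" here is just a count of nodes whose label matches Q):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_eval(fragment, q):
    """Runs at each site Si: evaluate Q on the local fragment Gi."""
    return sum(1 for label in fragment if label == q)

def answer(fragments, q):
    """Coordinator: collect partial answers and assemble Q(G)."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(partial_eval, fragments, [q] * len(fragments))
    return sum(partials)

G = [["a", "b"], ["a"], ["c", "a"]]  # G partitioned into 3 fragments
print(answer(G, "a"))  # 3
```

For a query like counting, assembly is a plain sum; for pattern matching, the assembly step must also reconcile matches that cross fragment boundaries, which is where the message passing comes in.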

Figure: Q(G) is assembled from partial answers Q(G1), Q(G2), …, Q(Gn)

Graph pattern matching in GRAPE: 21 times faster than MapReduce

44

Page 45:

Query preserving compression

The cost of query processing: f(|G|, |Q|)

Query preserving compression <R, P> for a class L of queries:
• for any data collection G, Gc = R(G)
• for any Q in L, Q(G) = P(Q, Gc)

Figure: G is compressed to Gc = R(G) (compressing); a query Q is then answered as P(Q, Gc) (post-processing), reducing the |G| parameter

18 times faster on average for reachability queries

In contrast to lossless compression, retain only relevant information for answering queries in L. Query preserving!

No need to restore the original graph G or decompress the data.

Better compression ratio!
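A classical example of the idea for reachability queries (not the SIGMOD 2012 algorithm): collapse each strongly connected component, whose nodes are all mutually reachable, into one node. R(G) is the condensation, and P answers reach(s, t) on the smaller graph, so Q(G) = P(Q, Gc) for every reachability query.

```python
from collections import deque

def condense(nodes, edges):
    """Kosaraju's algorithm: comp maps each node to its SCC representative."""
    adj, radj = {u: [] for u in nodes}, {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v); radj[v].append(u)
    order, seen = [], set()
    def dfs(u):                     # iterative DFS recording finish order
        seen.add(u)
        stack = [(u, iter(adj[u]))]
        while stack:
            x, it = stack[-1]
            for y in it:
                if y not in seen:
                    seen.add(y); stack.append((y, iter(adj[y]))); break
            else:
                order.append(x); stack.pop()
    for u in nodes:
        if u not in seen: dfs(u)
    comp = {}
    for u in reversed(order):       # sweep the reverse graph
        if u in comp: continue
        comp[u], stack = u, [u]
        while stack:
            x = stack.pop()
            for y in radj[x]:
                if y not in comp:
                    comp[y] = u; stack.append(y)
    c_edges = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return comp, c_edges

def reach(comp, c_edges, s, t):
    """Answer reach(s, t) on the condensed graph Gc only."""
    adj = {}
    for u, v in c_edges:
        adj.setdefault(u, []).append(v)
    seen, q = {comp[s]}, deque([comp[s]])
    while q:
        u = q.popleft()
        if u == comp[t]: return True
        for v in adj.get(u, []):
            if v not in seen: seen.add(v); q.append(v)
    return False

nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]  # two SCCs: {1,2}, {3,4}
comp, ce = condense(nodes, edges)
print(reach(comp, ce, 1, 4))  # True
print(reach(comp, ce, 4, 1))  # False
```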

Page 46:

Answering queries using views

The complexity is no longer a function of |G|

can we compute Q(G) without accessing G, i.e., independent of |G|?

The cost of query processing: f(|G|, |Q|)

Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that
• Q and Q’ are equivalent: for any G, Q(G) = Q’(G)
• Q’ only accesses V(G)

V(G ) is often much smaller than G (4% -- 12% on real-life data)

Improvement: 31 times faster for graph pattern matching
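A toy illustration (hypothetical query and view names, not the paper's pattern-matching algorithm): the query asks for pairs (x, z) linked by an a-edge followed by a b-edge. With materialised views of the a-edges and b-edges, the rewriting Q' joins the two views and never touches G itself.

```python
def V(G):
    """Materialise the views once: the a-edges and b-edges of G."""
    return {"a": [(u, v) for u, v, lbl in G if lbl == "a"],
            "b": [(u, v) for u, v, lbl in G if lbl == "b"]}

def Q_prime(views):
    """Rewriting of Q: a join over the views only."""
    return {(x, z) for x, y in views["a"]
                   for y2, z in views["b"] if y == y2}

G = [(1, 2, "a"), (2, 3, "b"), (2, 4, "c"), (5, 2, "a")]
views = V(G)                   # typically 4%-12% of |G| on real-life data
print(sorted(Q_prime(views)))  # [(1, 3), (5, 3)]
```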

Page 47:

Incremental query answering

Minimizing unnecessary recomputation

Incremental query processing:

Input: Q, G, Q(G), ∆G

Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M

(ΔG: changes to the input; Q(G): old output; ΔM: changes to the output; Q(G⊕∆G): new output)

When changes ∆G to the data G are small, typically so are the changes ∆M to the output Q(G⊕∆G)

Changes ∆G are typically small

Compute Q(G) once, and then incrementally maintain it

Real-life data is dynamic – constantly changes, ∆G

Re-compute Q(G ⊕ ∆G) starting from scratch?
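A minimal sketch (toy code) for a selection query Q(G) = { e in G | label(e) = "a" }: the change ΔM is computed from ΔG alone, never rescanning G.

```python
def Q(G):
    """The full query: all a-labelled edges."""
    return {e for e in G if e[2] == "a"}

def delta_m(delta_g):
    """delta_g: list of ('+'/'-', edge) updates.
    Returns (inserted, deleted) changes to the output."""
    ins = {e for op, e in delta_g if op == "+" and e[2] == "a"}
    dele = {e for op, e in delta_g if op == "-" and e[2] == "a"}
    return ins, dele

G = {(1, 2, "a"), (2, 3, "b")}
old = Q(G)                                # computed once, from scratch
ins, dele = delta_m([("+", (3, 4, "a")), ("-", (1, 2, "a"))])
new = (old - dele) | ins                  # Q(G (+) dG) = Q(G) (+) dM
print(sorted(new))  # [(3, 4, 'a')]
```

For selections the change propagation is trivial; for queries like graph simulation, computing ΔM from ΔG is the hard part, and "bounded" (next slide) asks whether its cost can depend only on |ΔG| and |ΔM|.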

5%/week in Web graphs

At least twice as fast for pattern matching for changes up to 10%

Page 48:

Complexity of incremental problems

Bounded: the cost is expressible as f(|CHANGED|, |Q|)?

Complexity analysis in terms of the size of changes

Incremental query answering

Input: Q, G, Q(G), ∆G

Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M

The cost of query processing: a function of |G| and |Q|

incremental algorithms: |CHANGED|, the size of the changes in
• the input: ∆G, and
• the output: ∆M

The updating cost that is inherent to the incremental problem itself

Incremental algorithms?

Incremental graph simulation: bounded

Page 49:

A principled approach: Making big data small

Bounded evaluable queries

Parallel query processing (MapReduce, GRAPE, etc)

Query preserving compression: convert big data to small data

Query answering using views: make big data small

Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data

. . .

Combinations of these can do much better than MapReduce!

Including but not limited to graph queries

Yes, MapReduce is useful, but it is not the only way!

Page 50:

Summary and Review

What is BD-tractability? Why do we care about it?

What is parallel scalability? Name a few parallel scalable algorithms

What is bounded evaluability? Why do we want to study it?

How to make big data “small”?

Is MapReduce the only way for querying big data? Can we do better than it?

What is query-preserving data compression? Query answering using views? Bounded incremental query answering?

If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?

Page 51:

Projects (1)

Prove or disprove that one of the following query classes is
• BD-tractable,
• parallel scalable,
• in MMC

If so, give an algorithm as a proof. Otherwise, prove the impossibility but identify practical sub-classes that are scalable.

The query classes include:
• Distance queries on graphs
• Graph pattern matching by subgraph isomorphism
• Graph pattern matching by graph simulation
• Subgraph isomorphism and graph simulation on trees

Experimentally evaluate your algorithms

Both impossibility and possibility results are useful!

Pick one of these

Page 52:

Projects (2)

Improve the performance of graph pattern matching via subgraph isomorphism, using one of the following approaches:

• query-preserving graph compression
• query answering using views

Prove the correctness of your algorithm, give complexity analysis and provide performance guarantees

Experimentally evaluate your algorithm and demonstrate the improvement

A research and development project

Page 53:

Projects (3)

It is known that graph pattern matching via graph simulation can benefit from:

• query-preserving graph compression
• query answering using views

W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012. (query-preserving compression)

W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views, ICDE 2014. (query answering using views)

Implement one of the algorithms
Experimentally evaluate your algorithm and demonstrate the improvement
Bonus: can you combine the two approaches and verify the benefit?

A development project

Page 54:

Projects (4)

Find an application with a set of SPC (conjunctive) queries and a dataset

Identify access constraints on your dataset for your queries
Implement an algorithm that, given a query in your class, decides whether the query is boundedly evaluable under your access constraints

If so, generate a query plan to evaluate your queries by accessing a bounded amount of data

Experimentally evaluate your algorithm and demonstrate the improvement

A development project

Page 55:

Projects (5)

Write a survey on techniques for querying big data, covering:
• parallel query processing
• data compression
• query answering using views
• incremental query processing
• …

Develop a good understanding on the topic

Survey:
• A set of 5-6 representative papers
• A set of criteria for evaluation
• Evaluate each approach based on the criteria
• Make recommendations: what to use in different applications


Page 56:

Reading: data quality

W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012. (available upon request)


– Data consistency (Chapter 2)– Entity resolution (record matching; Chapter 4)– Information completeness (Chapter 5)– Data currency (Chapter 6)– Data accuracy (SIGMOD 2013 paper) – Deducing the true values of objects in data fusion (Chap. 7)

Page 57:

Reading for the next week
http://homepages.inf.ed.ac.uk/wenfei/publication.html

1. M. Arenas, L. E. Bertossi, J. Chomicki: Consistent Query Answers in Inconsistent Databases, PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf

2. Indrajit Bhattacharya and Lise Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf

3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf

4. W. Fan and F. Geerts , Relative information completeness, PODS, 2009.

5. Y. Cao. W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD 2013.

6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.