25
Offering a Precision- Performance Tradeoff for Aggregation Queries over Replicated Data Paper by Chris Olston, Jennifer Widom Presented by Faizaan Kersi Marolia

Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

  • Upload
    turi

  • View
    45

  • Download
    1

Embed Size (px)

DESCRIPTION

Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data. Paper by Chris Olston, Jennifer Widom Presented by Faizaan Kersi Marolia. MOTIVATION. DATA CACHING OR REPLICATION IS A COMMON TECHNIQUE FOR REDUCING THE LATENCY TO ACCESS DATA FROM REMOTE SOURCES - PowerPoint PPT Presentation

Citation preview

Page 1: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Offering a Precision-Performance Tradeoff for

AggregationQueries over Replicated Data

Paper by

Chris Olston, Jennifer Widom

Presented by

Faizaan Kersi Marolia

Page 2: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

MOTIVATION

• DATA CACHING OR REPLICATION IS A COMMON TECHNIQUE FOR REDUCING THE LATENCY TO ACCESS DATA FROM REMOTE SOURCES

• PROPOGATING ALL UPDATES MAY BE INFEASIBLE DUE TO NETWORK AND/OR COMPUTATIONAL RESOURCES

• STALE REPLICATION – CACHED COPIES OF DATA VALUES ARE ALLOWED TO BECOME OUT OF DATE

Page 3: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

MOTIVATION• QUERY OVER CACHED DATA ITEM MAY RETURN

AN UNBOUNDEDLY IMPRECISE ANSWER

• QUERIES OVER REMOTE MASTER DATA WOULD RETURN A PRECISE ANSWER BUT WITH POOR PERFORMANCE

• HENCE WE HAVE A PRECISION-PERFORMANCE TRADEOFF

• IT IS REQUIRED TO LET THE USER CONTROL THE PRECISION AND PERFORMANCE OFFERED BY THE SYSTEM

Page 4: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP• TRADEOFF IN REPLICATION PRECISION AND

PERFORMANCE

• THE SYSTEM SUPPORTS A CONTINUOUS, MONOTONICALLY DECREASING TRADEOFF BETWEEN PRECISION AND PERFORMANCE

Page 5: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG – OVERVIEW• FOCUS IS OVER AGGREGATION QUERIES OVER

NUMERIC DATA

• AGGREGATE QUERIES DISCUSSED ARE THE STANDARD RELATIONAL AGGREGATE FUNCTIONS – COUNT, MIN, MAX, SUM AND AVG

• CONVENTIONAL SYSTEMS - QUERY RETURNS SINGLE VALUE. TRAPP/AG RETURNS A BOUNDED IMPRECISE ANSWER [L , H] – THE RANGE IN WHICH THE PRECISE ANSWER IS GUARANTEED

Page 6: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG - APPROACH

• PRECISION IS QUANTIFIED AS THE WIDTH OF THE RANGE ( H – L )

• THE PRECISION CONSTRAINT IS USER SPECIFIED AND IS PROVIDED AS A CONSTANT R >= 0

• HENCE, THE QUERY IS REQUIRED TO RETURN A RESULT [L , H] SUCH THAT 0 <= H – L <= R

• NOTE THIS ALLOWS FOR BOTH EXACT AND UNBOUNDED PRECISION

Page 7: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG - APPROACH

• WHENEVER A DATA SOURCE REFRESHES A CACHE’S VALUE FOR A DATA OBJECT, ALONG WITH THE EXACT VALUE IT ALSO SENDS A PARAMETRIZED BOUND [L(t) , H(t)].

• AT t=0 BOTH FUNCTIONS EQUAL THE CURRENT VALUE

• EVENTUALLY WHEN ANOTHER REFRESH OCCURS, THE SOURCE SENDS A NEW PAIR OF BOUND FUNCTIONS

Page 8: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG - APPROACH

REFRESHES OCCUR FOR ONE OF TWO REASONS:

• VALUE-INITIATED REFRESH

THE MASTER VALUE DOES NOT SATISFY THE BOUND SPECIFIED AT A CACHE. THE DATA SOURCE IS THEN OBLIGATED TO REFRESH THE CACHE

• QUERY-INITIATED REFRESH

QUERY EXECUTED AT CACHE REQUIRES THE EXACT VALUE IN ORDER TO MEET THE PRECISION CONSTRAINT

Page 9: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG - APPROACH

BOUND AND PRECISE VALUE OVER TIME

Page 10: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP - ARCHITECTURE

Page 11: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

EXECUTING A QUERY

• STANDARD FORM OF A QUERY

SELECT AGGREGATE(T.a) WITHIN R

FROM T

WHERE PREDICATE

• AGGREGATE IS ONE OF THE AGGREGATION FUNCTIONS. R IS A NON-NEGATIVE CONSTANT SPECIFYING THE PRECISION REQUIRED. IF R IS NOT SPECIFIED THEN R = INFINITY

Page 12: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

EXECUTING A QUERY

Suppose one required to compute an aggregate over columnT.a of a cached table T. The value of T.a for eachtuple t_i is stored in the cache as a bound [L_i , H_i]

While computing the aggregate, the query processor has the option for each tuple t_i of either reading the cached bound [L_i , H_i] or refreshing t_i to obtain the master value V_i. The cost to refresh t_i is C_i

The final answer to the aggregate is a bound [L , H]

Page 13: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

EXECUTING A QUERY

1. COMPUTE INITIAL BOUNDED ANSWER AND CHECK WHETHER THE PRECISION CONSTRAINT IS MET

2. IF NOT, AN ALGORITHM CHOOSE_REFRESH IS INVOKED WHICH SELECTS THE SUBSET OF TUPLES THAT HAVE TO BE REFRESHED. QUERY-INITIATED REFRESHES ARE TRIGGERED

3. RECOMPUTE THE BOUNDED ANSWER BASED ON THE CACHE’S PARTIALLY REFRESHED COPY OF THE TABLE

Page 14: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG - ASSUMPTIONS

• TIME TO REFRESH A BOUND IS SMALL SO THE IMPRECISION INTRODUCED IS INSIGNIFICANT

• VALUE-INITIATED REFRESHES DO NOT OCCUR WHEN A QUERY IS BEING PROCESSED

• OBJECT INSERTIONS OR DELETIONS ARE PROPOGATED IMMEDIATELY TO ALL CACHES

• LEVEL OF PRECISION OFFERED DOES NOT ACCOUNT FOR ELAPSES OF TIME WHILE SENDING REFRESH MESSAGES OR IN PROCESSING A QUERY

Page 15: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

TRAPP/AG - ASSUMPTIONS

• THE SYSTEM KNOWS THE COST OF REFRESHING THE INDIVIDUAL DATA OBJECTS FROM THEIR SOURCES

• THE COST OF REFRESHING A SET OF TUPLES IS THE SUM OF THE COSTS OF REFRESHING EACH MEMBER OF THE SET

(ALTHOUGH IN PRACTICE, ALL TUPLES FROM THE SAME SOURCE MAY INCUR A COMMON COST OVERHEAD)

Page 16: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing MIN (MAX) (No Selection Predicate)

• [ L , H ] = [ min(L_i) , max(H_i) ]

COMPUTED OVER ALL TUPLES t_i IN TABLE T

• THE CHOOSE_REFRESH ALGORITHM SELECTS ALL TUPLES t_i SUCH THAT

L_i < min( H_k ) – R

HENCE THE TUPLES REFRESHED ARE THOSE WHOSE LOWER BOUND IS LESS THAN THE MINIMUM UPPER BOUND MINUS THE PRECISION CONSTRAINT

Page 17: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing SUM(No Selection Predicate)

• [ L , H ] = [ sum(L_i) , sum(H_i) ] COMPUTED OVER ALL TUPLES t_i IN TABLE T

• CHOOSE_REFRESH INVOLVES FINDING THE TUPLES NOT TO REFRESH. CALL THIS SET K.

AFTER REFRESHING ALL REQUIRED TUPLES, sum( H_i – L_i ) in T = sum( H_i – L_i ) in K

HENCE WE REQUIRE TO CHOOSE K SO AS TO

MAXIMIZE sum( C_i ) over K UNDER THE CONSTRAINT

sum( H_i – L_i ) in K <= R

Page 18: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing COUNT(No Selection Predicate)

• SINCE WE REQUIRE THAT ALL INSERTIONS AND DELETIONS BE PROPOGATED TO ALL THE DATA CACHES IMMEDIATELY, THE CARDINALITY OF THE CACHED COPY IS ALWAYS EQUAL TO THE CARDINALITY OF THE MASTER COPY.

HENCE THERE IS NO NEED FOR ANY REFRESH

Page 19: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing AVG(No Selection Predicate)

• FIRST COMPUTE COUNT

• THEN COMPUTE SUM WITH THE PRECISION CONSTRAINT

R = R*COUNT

Page 20: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

INCLUDING Selection PREDICATES

Consider a selection predicate involving at least one column of T that contains bounded values. The system can partition T into three disjoint sets:

• T- : Tuples that cannot possible satisfy the predicate• T+ : Tuples that are guaranteed to satisfy the predicate• T? : Tuples that may or may not satisfy the predicate

All algorithms for computing bounded answers and choosing tuples to refresh examine only T+ and T?

Page 21: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing MIN (MAX)(With Selection Predicate)

The bounded MIN answer as well as the CHOOSE_REFRESH algorithm are identical to the case without a selection predicate but with taking T+ union T? as the set of tuples.

Page 22: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing SUM(With Selection Predicate)

• The bounded answer [ L , H ] is computed as:

L = sum(L_i) over all t_i in T+

+

sum(L_i) over all t_i in T? with L_i < 0

H = sum(H_i) over all t_i in T+

+

sum(H_i) over all t_i in T? with H_i > 0

Page 23: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing COUNT(With Selection Predicate)

• The bounded answer [ L , H ] is computed as:

L = cardinality( T+ )

H = cardinality( T+ ) + cardinality( T? )

• The CHOOSE_REFRESH algorithm is based on the fact that refreshing a tuple in in T? is guaranteed to remove it from T?. Hence, one is required to find the

cardinality( T? ) – R cheapest tuples in T?

Page 24: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Computing AVG(With Selection Predicate)

• WITH A PREDICATE, COUNT IS A BOUNDED VALUE AS WELL AS SUM. HENCE IT IS NOT A SIMPLE MATTER OF DIVIDING THE ENDPOINTS OF THE SUM BOUND BY THE EXACT COUNT

• ONE CAN CALCULATE AN APPROXIMATE BOUND [ L , H ] BY COMPUTING SUM AND COUNT AND SETTING

L = min ( Lsum / Hcount , Lsum / Lcount ) &

H = max ( Hsum / Lcount , Hsum / Hcount )

Page 25: Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

Online References

An intra-IIT reference of this presentation is at:

Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data

The complete paper together with algorithms and proofs of correctness exists at:

http://www-db.stanford.edu/pub/papers/trapp-ag.ps