Fast Algorithms for Hierarchical Range Histogram Constructions
Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava
ACM PODS 2002
Layout
• Introduction
• Related Works
• Problem Definition
• Problem Solution
– A Sparse Interval Set System
– The Dynamic Programming algorithm
• Experimental Evaluation
• Conclusions
Introduction
• Data Warehousing and OLAP applications
– OLAP – Online analytical processing
• Data has multiple logical dimensions with natural hierarchies defined on it
• OLAP queries
– usually involve hierarchical selections on some of the dimensions
– often aggregate measure attributes
Introduction – Cont.
Histograms
• Numeric attribute value domain
• Space-efficient
• Conditions on a given dimension – hierarchical ranges
• Range estimation depends on a good solution to the histogram construction problem
The Main Idea
• Proposes fast, practical algorithms for the problem of constructing hierarchical range histograms
The Main Contributions
• A novel notion of sparse intervals
• A proposed algorithm that effectively trades space for construction time without compromising accuracy
• First practical approach to the problem
Previous Works
• V-Optimal histograms
– Minimize error for equality queries
– But… constructed by taking only equality queries into account
• Koudas et al. – a polynomial-time algorithm
– For special and general cases
– But… the polynomial has a high degree
• Gilbert et al. – a pseudo-polynomial-time algorithm, optimal for arbitrary ranges
– But… the polynomial has a high degree
Problem Definition
• An array A[1,n] of non-negative real numbers
• The average of items A[a],…,A[b]:
$A[a,b] = \frac{A[a] + \dots + A[b]}{b - a + 1}$
Histogram Definition
• A histogram of array A[1,n] using B buckets is specified by B+1 integers $b_1, \dots, b_{B+1}$ with $0 = b_1 < b_2 < \dots < b_{B+1} = n$
• Each interval $[b_i + 1, b_{i+1}]$ is a bucket
• Each $b_i$ is a bucket boundary
Histogram Definition – Cont.
• Stored as
– a series of bucket boundaries
– the average $A[b_i + 1, b_{i+1}]$ of the array values in each bucket
– the bucket sum can then be obtained
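A minimal Python sketch of this stored representation, not the authors' code: it computes the per-bucket averages from a list of boundaries. The name `bucket_averages` and the sample array are illustrative assumptions. The array is held in a 0-indexed Python list, so the bucket $[b_i + 1, b_{i+1}]$ corresponds to the slice `A[b_i:b_{i+1}]`.

```python
def bucket_averages(A, boundaries):
    """Return the average of each bucket.

    boundaries is the list b_1, ..., b_{B+1} with b_1 = 0 and b_{B+1} = len(A).
    Bucket i covers positions b_i + 1 .. b_{i+1} (1-indexed), which is the
    Python slice A[b_i:b_{i+1}].
    """
    return [sum(A[lo:hi]) / (hi - lo)
            for lo, hi in zip(boundaries, boundaries[1:])]


A = [2, 4, 6, 8, 1, 1, 1, 1]      # hypothetical data
boundaries = [0, 4, 8]            # two buckets: positions 1..4 and 5..8
averages = bucket_averages(A, boundaries)

# A bucket sum is recovered as average * bucket width.
sums = [avg * (hi - lo)
        for avg, (lo, hi) in zip(averages, zip(boundaries, boundaries[1:]))]
```

Only the boundaries and the averages need to be kept; the sums in the last line are derived on demand.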
Histograms – Cont.
• Mostly support equality queries
– “give me A[i]”
• Hierarchical range queries
Hierarchical Range Queries Definition
• A range query $R_{ij}$ asks for the sum $s_{ij} = A[i] + \dots + A[j]$
• A set S of range queries is hierarchical if for any two queries $R_{ij}$ and $R_{kl}$ in S, the ranges [i,j] and [k,l] are
– disjoint
– or contained one in the other
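The definition can be checked directly. The sketch below is an illustration only; the function name `is_hierarchical` and the representation of a query $R_{ij}$ as a tuple `(i, j)` are assumptions. It tests every pair of ranges for disjointness or containment.

```python
from itertools import combinations


def is_hierarchical(ranges):
    """Return True if every pair of ranges is disjoint or nested.

    Each range is a tuple (i, j) with i <= j, endpoints inclusive.
    """
    for (i, j), (k, l) in combinations(ranges, 2):
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        if not (disjoint or nested):
            return False
    return True


ok = is_hierarchical([(1, 8), (1, 4), (5, 8), (3, 3)])   # nested or disjoint
bad = is_hierarchical([(1, 4), (3, 6)])                  # partial overlap
```

The pairwise test is quadratic in the number of queries, which is fine for a sanity check on a logged workload.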
Hierarchical Range Queries – Cont.
• Generalize equality queries: $R_{ii} = A[i]$
• Can be displayed as a tree
– Each node u has an associated range $r_u$
– Node v is a child of node u iff $r_v \subset r_u$ and there is no w such that $r_v \subset r_w \subset r_u$
Workload Definition
• A workload W consists of
– A set S of hierarchical range queries
– A probability $p_{ij}$ for each query $R_{ij}$ in S; this probability can be obtained by monitoring and logging
• Simple probabilities model
How The Histogram Works
1. A histogram H of array A[1,n]
2. Query $R_{ij}$
3. An expected answer $s_{ij} = A[i] + \dots + A[j]$
4. Left bucket $[b_l + 1, b_{l+1}]$ such that $b_l + 1 \le i \le b_{l+1}$
5. Right bucket $[b_r + 1, b_{r+1}]$ such that $b_r + 1 \le j \le b_{r+1}$
6. Calculate the precise total of the values in the buckets between the left and right buckets
7. Estimate the sums for the portions within the left and right buckets
How The Histogram Works – Cont.
8. The sum of A in the interval $[i,j] \cap [b_l + 1, b_{l+1}]$ is estimated by $A[b_l + 1, b_{l+1}] \cdot \left| [i,j] \cap [b_l + 1, b_{l+1}] \right|$
– Uniformity assumption
9. The right bucket likewise
The Total Estimate
• The total estimate $\hat{s}_{ij}$ of $s_{ij}$
• left bucket estimation + right bucket estimation + exact sum for buckets in between
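The estimate can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: `estimate_range_sum` is a hypothetical name, and instead of locating the left and right buckets explicitly it scans all buckets and weights each average by the number of queried positions that fall in the bucket. A fully covered bucket then contributes its exact sum, and a partially covered bucket contributes the uniformity-assumption estimate.

```python
def estimate_range_sum(boundaries, averages, i, j):
    """Estimate A[i] + ... + A[j] (1-indexed, inclusive) from a histogram.

    boundaries: b_1, ..., b_{B+1}; bucket t covers positions b_t + 1 .. b_{t+1}.
    averages:   average of the array values in each bucket.
    """
    total = 0.0
    for (lo, hi), avg in zip(zip(boundaries, boundaries[1:]), averages):
        # Number of queried positions that fall inside this bucket.
        overlap = min(j, hi) - max(i, lo + 1) + 1
        if overlap > 0:
            total += avg * overlap
    return total


# Hypothetical histogram of A = [2, 4, 6, 8, 1, 1, 1, 1] with two buckets.
boundaries, averages = [0, 4, 8], [5.0, 1.0]
estimate = estimate_range_sum(boundaries, averages, 3, 6)
```

For this query the true sum is 6 + 8 + 1 + 1 = 16, while the histogram returns 2·5 + 2·1 = 12; the gap comes entirely from the uniformity assumption in the left bucket.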
Determining the average
• Construct a prefix sum array $\sum_{j=1}^{b_i} A[j]$ for all $b_i$
• Given $b_i$ and $b_{i+1}$, return the average in constant time
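A small sketch of the prefix-sum trick, with hypothetical helper names `prefix_sums` and `range_average`. For simplicity it stores the prefix sum at every position rather than only at bucket boundaries; after the linear-time preprocessing, any average is two lookups and a division.

```python
from itertools import accumulate


def prefix_sums(A):
    """P[k] = A[1] + ... + A[k] for k = 0..n (1-indexed positions, P[0] = 0)."""
    return [0] + list(accumulate(A))


def range_average(P, a, b):
    """Average of A[a..b] (1-indexed, inclusive) in constant time."""
    return (P[b] - P[a - 1]) / (b - a + 1)


A = [2, 4, 6, 8, 1, 1, 1, 1]          # hypothetical data
P = prefix_sums(A)
left_avg = range_average(P, 1, 4)     # bucket [b_1 + 1, b_2] with b_1 = 0, b_2 = 4
right_avg = range_average(P, 5, 8)    # bucket [b_2 + 1, b_3] with b_3 = 8
```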
Optimal histogram definition
• The error $e_{ij}$ of the range query $R_{ij}$ is $e_{ij} = (s_{ij} - \hat{s}_{ij})^2$
• Given a histogram H and workload W, the total expected error for estimating W is $\sum_{R_{ij}} p_{ij} e_{ij}$ over all queries $R_{ij}$ in W
Optimal Histogram Definition – Cont.
• Given W, an optimal histogram with B buckets of array A[1,n] is the histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets
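To make the definition concrete, here is a brute-force sketch that tries every histogram with at most B buckets and keeps the one with the smallest total expected error. It is exponential in B and only meant for tiny inputs; it is not the FH algorithm. All names (`estimate`, `expected_error`, `brute_force_optimal`) and the representation of a workload as a list of `((i, j), p)` pairs are assumptions made for the example.

```python
from itertools import combinations


def estimate(boundaries, A, i, j):
    """Histogram estimate of A[i] + ... + A[j] (1-indexed, inclusive)."""
    est = 0.0
    for lo, hi in zip(boundaries, boundaries[1:]):
        avg = sum(A[lo:hi]) / (hi - lo)
        overlap = min(j, hi) - max(i, lo + 1) + 1
        if overlap > 0:
            est += avg * overlap
    return est


def expected_error(boundaries, A, workload):
    """Sum over queries of p_ij * (s_ij - estimate)^2."""
    err = 0.0
    for (i, j), p in workload:
        exact = sum(A[i - 1:j])
        err += p * (exact - estimate(boundaries, A, i, j)) ** 2
    return err


def brute_force_optimal(A, B, workload):
    """Return (error, boundaries) of the best histogram with at most B buckets."""
    n = len(A)
    best = None
    for buckets in range(1, B + 1):
        # Choose the interior boundaries; 0 and n are always boundaries.
        for cuts in combinations(range(1, n), buckets - 1):
            boundaries = (0, *cuts, n)
            err = expected_error(boundaries, A, workload)
            if best is None or err < best[0]:
                best = (err, boundaries)
    return best


A = [5, 5, 5, 5, 1, 1, 1, 1]                          # hypothetical data
workload = [((i, i), 1 / 8) for i in range(1, 9)]     # uniform equality queries
best_error, best_boundaries = brute_force_optimal(A, 2, workload)
```

On this input two buckets split at position 4 reproduce the data exactly, so the search returns zero error.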
Fast Histogram (FH) Construction for Hierarchical Range Queries
• Given an array A[1,n], B buckets and workload W
• E denotes the total expected error of the optimal histogram
• Find algorithms that construct HR histograms with an error of at most E, trading space for construction time
FH construction
• Constructing a set of “sparse intervals”
– Increases the number of buckets
– Any arbitrary interval can be represented
• Dynamic programming algorithm
A Sparse Interval System
• Given an integer $r \ge 1$, set $l = n^{1/r}$
• Level 1 points: $0, \dots, n$
• Level 2 points: $0, l, 2l, 3l, \dots$
• Level j+1 points: $0, l^j, 2l^j, 3l^j, \dots$
• Last r+1 level points: $0, n$
A Sparse Interval System
• The interval [0,n] is in the sparse system S
• Any pair of level j points between adjacent level j+1 points defines an interval in S
A Sparse Interval System Example
n=8 ; r=3 ; l=2
Level 1 points: 0 1 2 3 4 5 6 7 8
Level 2 points: 0 2 4 6 8
Level 3 points: 0 4 8
Level 4 points: 0 8
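The example can be reproduced with a short sketch. It assumes that $n = l^r$ exactly, that "between adjacent level j+1 points" includes the two end points, and that an interval is a pair `(p, q)` with `p < q`; the function names are hypothetical.

```python
def level_points(n, l, level):
    """Points of the given level: the multiples of l**(level - 1) in [0, n]."""
    return list(range(0, n + 1, l ** (level - 1)))


def sparse_intervals(n, r, l):
    """All intervals (p, q), p < q, of the sparse system, assuming n == l**r."""
    intervals = {(0, n)}
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        # Each block runs between two adjacent level j+1 points.
        for start in range(0, n, block):
            pts = range(start, start + block + 1, step)
            intervals.update((p, q) for p in pts for q in pts if p < q)
    return intervals


n, r, l = 8, 3, 2
levels = {j: level_points(n, l, j) for j in range(1, r + 2)}
S = sparse_intervals(n, r, l)
```

With these parameters the system holds far fewer intervals than the 36 possible pairs over nine points; for instance (2, 4) is present but (1, 3) and (2, 6) are not, because their end points straddle a coarser point.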
Sparse Interval System Properties
• Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system
Claim
• Any interval [0,x] can be expressed as a partition of at most r intervals from the sparse system
Claim Proof
• By induction
• Induction hypothesis: any interval $[0,x]$ where $x \le l^j$ can be expressed as j intervals
• Base case: true for j=1
Claim Proof – Cont.
• Step to j+1
• Consider $l^j < x \le l^{j+1}$
• We can write the interval as $[0, t l^j]$ and $[t l^j, x]$ where t is maximal $(t \le l)$
• $[0, t l^j]$ is a valid interval in the sparse system (in level j+1; 0 and $l^{j+1}$ are adjacent)
Claim Proof – Cont.
• $[t l^j, x]$ is essentially similar to $[0, x - t l^j]$
• $x - t l^j < l^j$ since t is maximal. Therefore by induction it can be expressed by j intervals
• Total j+1
• Since $x \le l^r$, any interval $[0,x]$ can be expressed as a union of r intervals
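The induction translates into a greedy procedure: take the largest possible multiple of the coarsest step, then recurse on the remainder. The sketch below is one way to write it, assuming $n = l^r$ and inclusive block end points; `decompose_prefix` and the membership check `in_sparse_system` are hypothetical names introduced for the example.

```python
def in_sparse_system(p, q, r, l):
    """True if (p, q), p < q, is an interval of the sparse system on [0, l**r]."""
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        # Both end points are level j points lying in one block of length l**j.
        same_block = p // block == (q - 1) // block
        if p % step == 0 and q % step == 0 and same_block:
            return True
    return False


def decompose_prefix(x, r, l):
    """Split [0, x], 0 < x <= l**r, into at most r sparse intervals."""
    pieces, pos = [], 0
    for j in range(r - 1, -1, -1):
        t = (x - pos) // l ** j       # maximal t with pos + t * l**j <= x
        if t > 0:
            pieces.append((pos, pos + t * l ** j))
            pos += t * l ** j
    return pieces


pieces = decompose_prefix(7, 3, 2)    # n = 8, r = 3, l = 2
```

Each loop iteration contributes at most one piece, which is where the bound of r pieces comes from.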
Observation
• Any interval $[a,b]$ can be expressed as $2r$ intervals
• By cutting it at a point of the form $\alpha l^j$, $a \le \alpha l^j \le b$, with maximum j
• By symmetry, $[a, \alpha l^j]$ and $[\alpha l^j, b]$ can each be expressed as a disjoint union of at most r intervals
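A sketch of the cut-and-extend idea, again assuming $n = l^r$: find the coarsest grid that has a point c inside [a, b], then walk greedily from c to b and, mirrored, from c to a. The name `decompose_interval` is hypothetical, and choosing the smallest eligible cut point is a simplification made here.

```python
def decompose_interval(a, b, r, l):
    """Split [a, b], 0 <= a < b <= l**r, into at most 2r sparse intervals."""
    # Find the coarsest grid (multiples of l**j) that has a point c in [a, b].
    for j in range(r, -1, -1):
        step = l ** j
        c = -(-a // step) * step      # smallest multiple of step that is >= a
        if c <= b:
            break
    right, pos = [], c
    for k in range(j, -1, -1):        # walk from c towards b
        t = (b - pos) // l ** k
        if t > 0:
            right.append((pos, pos + t * l ** k))
            pos += t * l ** k
    left, pos = [], c
    for k in range(j, -1, -1):        # mirror image: walk from c towards a
        t = (pos - a) // l ** k
        if t > 0:
            left.append((pos - t * l ** k, pos))
            pos -= t * l ** k
    return left[::-1] + right


pieces = decompose_interval(1, 7, 3, 2)   # n = 8, r = 3, l = 2
```

For [1, 7] the cut point is 4, and each side needs two pieces, well within the 2r = 6 allowed.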
Lemma
• In a sparse set system with parameter r, the number of intervals containing a point is at most $O(r n^{2/r})$
Lemma Proof
• Consider the level 1 intervals
• There are at most $O(l^2)$ such intervals that contain a specific point
– There are l points between adjacent points of level 2
– l points can create at most $O(l^2)$ intervals
• Level j intervals behave on level j points the same as level 1 intervals on the original points
• Extend to r levels… (the r+1’th level adds one more interval)
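The counting argument can be written out as a worked bound. The constants below are deliberately loose, and the last step assumes $l = n^{1/r}$; a point on a block boundary may belong to two blocks of the same level, hence the factor 2.

```latex
\begin{align*}
\#\{\text{level-}j \text{ intervals containing } x\}
  &\le 2\,(l+1)^2 = O(l^2)
  && \text{(a left and a right end point among the } l+1 \text{ points of a block)} \\
\#\{\text{intervals of } S \text{ containing } x\}
  &\le 2\,r\,(l+1)^2 + 1
  && \text{(} r \text{ levels, plus the interval } [0,n]\text{)} \\
  &= O(r\,l^2) = O\!\left(r\,n^{2/r}\right)
  && \text{(since } l = n^{1/r}\text{)}
\end{align*}
```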
Hierarchy Representation By a Tree
• Ranges define a hierarchy based on the inclusion relationship
• T is a hierarchy representation by a tree
– Each node v of T is associated with a range $R_{ij} = [v_L, v_R]$
– The weight $w_v$ is $p_{ij}$
– The error $e_v$ is $e_{ij}$
Representation By a Binary Tree
• We allow $w_u = 0$
• If a node u had children $u_1, \dots, u_t$, transform it into a node with two children
– $u_1$
– a new node with weight 0 and children $u_2, \dots, u_t$
• The size of the tree increases only by a factor of 2
• So assume that the tree is binary
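The transformation can be sketched as follows. The `Node` class, its field names, and the choice to give each new weight-0 node the range spanned by the children it groups are assumptions made for the illustration; the slides only fix the weight of the new node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: int                      # v_L
    hi: int                      # v_R
    weight: float                # w_v
    children: list = field(default_factory=list)


def binarize(node):
    """Rewrite the tree in place so that every node has at most two children."""
    kids = [binarize(c) for c in node.children]
    while len(kids) > 2:
        # Group the two rightmost children under a new weight-0 node.
        right = kids.pop()
        left = kids.pop()
        kids.append(Node(left.lo, right.hi, 0.0, [left, right]))
    node.children = kids
    return node


def walk(node):
    """Yield every node of the tree, parents before children."""
    yield node
    for c in node.children:
        yield from walk(c)


# Hypothetical hierarchy: one root query with four disjoint child queries.
root = Node(1, 8, 0.2, [Node(1, 2, 0.2), Node(3, 4, 0.2),
                        Node(5, 6, 0.2), Node(7, 8, 0.2)])
binarize(root)
```

A node with t > 2 children gains t − 2 new nodes, so the tree at most doubles in size; the new nodes carry weight 0 and therefore do not change the expected error.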
Dynamic Programming Algorithm - FH
• Best(v,left,right,p) denotes the smallest error of the range $[v_L, v_R]$
• v – tree node associated with $[v_L, v_R]$
• left – overlapping interval on the left
• right – overlapping interval on the right
• v contains p intervals completely
• Formally, left contains $v_L$ and right contains $v_R$
FH stages
• Let the children of v be y and z with ranges $[y_L, y_R]$ and $[z_L, z_R]$
• Cases (a) + (b)
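Cases (a)+(b)
• For all possible intervals I that contain $y_R$ and $z_L$, compute
$\mathrm{cost}(I) = w_y e_y + w_z e_z + \min_{k_1} \big( \mathrm{Best}(y, left, I, p - k_1 - 1) + \mathrm{Best}(z, I, right, k_1) \big)$
• In the case that I finishes on $y_R$
$\mathrm{cost}(I) = w_y e_y + w_z e_z + \min_{k_1} \big( \mathrm{Best}(y, left, I, p - k_1) + \mathrm{Best}(z, \emptyset, right, k_1) \big)$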
FH stages – Cont.
• Return $\min_I \mathrm{cost}(I)$
• When interval I is fixed, $e_y$ and $e_z$ are automatically defined and can be computed in O(1) time.
Time complexity
• Time spent evaluating cost(I) is O(p)
• The running time depends on the number of choices of interval I
• Let C(S) be the maximum number of intervals in an interval system S that contain a particular element ($y_R$)
• If all intervals are allowed then $C(S) = O(n^2)$
Time complexity – Cont.
• The running time of the algorithm FH for each entry is $O(p\,C(S))$
• The number of entries for each tree node v is $O(B\,C(S)^2)$
– Since there are C(S)+1 choices of left (all intervals that contain $v_L$, and $\emptyset$). Similarly for right
• Work for every tree node: $O(p\,C(S) \cdot B\,C(S)^2) = O(B^2 C(S)^3)$
Time complexity – Cont.
• Total work including preprocessing is $O(n + T B^2 C(S)^3)$
• When S is a set of all possible intervals: $O(n^6 B^2 T)$
• The result matches the time complexity of the previous algorithm (for arbitrary intervals)
Time Complexity For a Sparse Set System
• S – a sparse set system with parameter r
• Run FH with 2r(B-1) buckets
• Error – less than or equal to that of the original B bucket histogram
– A histogram with B buckets can be expressed as a histogram with 2r(B-1) buckets in the sparse system
Time Complexity – Cont.
• $C(S) = r n^{2/r}$
• Set $r = 6/\epsilon$
– In time $O(n + \epsilon^{-5} B^2 n^{\epsilon} T)$ we can construct a solution with $O(B/\epsilon)$ buckets whose error is at most the error of any solution of the original problem with B buckets
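The bound follows by substituting the sparse-system quantities into the general running time. The derivation below is a reconstruction under the assumptions that the general bound is $O(n + T B'^2 C(S)^3)$ for $B'$ buckets, that $B' = 2r(B-1)$, that $C(S) = O(r n^{2/r})$, and that $\epsilon$ denotes the free parameter with $r = 6/\epsilon$.

```latex
\begin{align*}
T\,B'^2\,C(S)^3
  &= O\!\left(T \cdot (2rB)^2 \cdot \left(r\,n^{2/r}\right)^3\right)
   = O\!\left(r^5\,B^2\,n^{6/r}\,T\right) \\
r = 6/\epsilon \;\Rightarrow\;
  n^{6/r} &= n^{\epsilon}, \qquad r^5 = O\!\left(\epsilon^{-5}\right), \qquad
  B' = O(B/\epsilon) \\
\text{total time} &= O\!\left(n + \epsilon^{-5}\,B^2\,n^{\epsilon}\,T\right)
\end{align*}
```

For example, $r = 6$ corresponds to $\epsilon = 1$, which gives a construction time linear in n for fixed B and T.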
Some Notes
• Get alternate tradeoffs by constructing different sparse set systems
– Complete binary tree on [0,n]
– Allow intervals such that one end point is an ancestor of the other
– Any arbitrary interval can be expressed as a disjoint union of two intervals from the sparse set
– C(S) = O(n)
– Solution with 2B buckets in time $O(n^3 B^2 T)$
Experimental Evaluation
• FH was implemented with r=6
• Compared to an algorithm A0 presented by Gilbert et al.
– Optimized for arbitrary range queries
– For a data series of length n to be approximated with B buckets, constructs a histogram consisting of 2B buckets in time $O(n^2 B)$
– The only known algorithm with reasonable complexity
Description of Data Sets
• A: A real data set of length 1000 extracted from an AT&T operational warehouse
• B: A synthetic data set of length 2000, Zipf-distributed with skew parameter 0.5
• C: A synthetic data set of varying length, representing samples from a Gaussian distribution with mean and variance 250
Workload Description
• A normal distribution is used to assign the probabilities to a full hierarchy
• Then normalization to obtain a probability distribution
• W1 – generated by sampling N(10,10)
• W2 – generated by sampling N(10,50)
Performance Evaluation
• Accuracy and construction time
• Parameters
– Total space allowed for histogram
– Total size of the data set
Computing Accuracy
• Ask 1000 queries
• Report the total expected sum squared error of the workload execution on the histogram
Results for Data Set A
Results for Set A – Cont.
• The accuracy of FH is superior to A0
• FH is more accurate for smaller variance (W1)
• As the variance increases, the workload gets closer to uniform (which A0 is optimized for)
• A0 is linear in the space
• FH is better in construction time for the same range of space
Results for Data Set B
Results for Set B – Cont.
• Similar to A
• Accuracy improves much faster with space
– since the distribution is Zipf
• The savings in construction time for FH are dramatic
– since data set B is twice the size of A
Results for Data Set C
Results for Set C – Cont.
• Data set size increases (x axis) and total space is 20
• A0 has a plateau
– Due to the way the data is generated in the experiment (Gaussian tail)
• Quadratic trend in construction time for A0
• FH – near-linear increase in construction time
Conclusions
• The first practical approach to the problem of constructing hierarchical range histograms
• The dynamic programming algorithms effectively trade space and construction time without compromising histogram accuracy
• A novel notion of sparse intervals
Future plans
• A formal study of the dynamic properties of hierarchical range histograms
• How should one modify these histograms under data or workload modifications?
The END
Thanks for listening