Efficient Computation of Temporal
Aggregates with Range Predicates
D. Zhang*, A. Markowetz**, V. J. Tsotras*,
D. Gunopulos* and B. Seeger**
* University of California, Riverside
** Philipps Universität Marburg, Germany
Outline
• Introduction & Motivation
• Problem Decomposition
• The MVSB-tree
• Performance Results
• Conclusions
Introduction & Motivation
• Consider a collection of temporal records.
• Each record: key k, value v, time interval [t1, t2].
• E.g.: employees and their salaries over time.
• Temporal Aggregation: aggregate values over time.
• Focus on SUM/COUNT/AVG.
[Figure: example temporal records drawn as intervals in key-time space, each labeled with its value.]
Previous Work
‘Given time t, aggregate over all records that contain t’. [Tum92, KS95, YK97, GHR+, MLI00]
[Figure: the example records in key-time space with the query time t2 marked.]
E.g. the sum at t2 is 13.
‘Given interval [t1, t2], aggregate over all records that intersect [t1, t2]’. (SB-tree [YW01])
[Figure: the example records in key-time space with the query interval [t1, t2] marked.]
E.g. the sum over [t1, t2] is 28.
Introduction & Motivation
Range-Temporal Aggregation (RTA)
‘Aggregate over all records intersecting interval [t1, t2] with keys in range [k1, k2]’.
[Figure: the example records in key-time space with the query rectangle [k1, k2]x[t1, t2] marked.]
E.g. the RTA-sum over [k1, k2]x[t1, t2] is 19.
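The RTA definition can be read directly as a filter followed by a sum. The sketch below is a linear scan, not the index proposed in these slides. The `Record` type, the closed-interval convention, and the sample data are assumptions made for illustration; they are not the records shown in the figure.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    key: int
    start: int   # t1 of the record (assumed inclusive)
    end: int     # t2 of the record (assumed inclusive)
    value: int


def rta_sum(records: Iterable[Record], k1: int, k2: int, t1: int, t2: int) -> int:
    """SUM over records with key in [k1, k2] whose interval intersects [t1, t2]."""
    return sum(
        r.value
        for r in records
        # Two closed intervals intersect iff each starts before the other ends.
        if k1 <= r.key <= k2 and r.start <= t2 and r.end >= t1
    )


# Hypothetical sample data (not taken from the slides).
records = [
    Record(key=10, start=1, end=5, value=2),
    Record(key=20, start=3, end=9, value=4),
    Record(key=30, start=6, end=8, value=7),
    Record(key=40, start=2, end=4, value=5),
]

print(rta_sum(records, 15, 35, 4, 7))  # keys 20 and 30 qualify: 4 + 7 = 11
```

Every query touches every record, so the cost grows linearly with the number of records.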
Introduction & Motivation
• Example: find the AVG salary over the past ten years of all employees whose last names start with ‘B’.
• Previous approaches would need a separate index for each possible key range (inefficient).
• Alternative:
- index the records;
- selection query: ‘find all records intersecting [k1, k2]x[t1, t2]’;
- query time is O(n).
• Our solution: O(log_b n).
Problem Decomposition
• Decompose RTA into LKST and LKLT queries.
• LKST query: given k, t, aggregate over all records with keys less than k and intervals containing t.
[Figure: the example records in key-time space with k1, k2, t1, t2 marked.]
E.g. LKST(k2, t2)=11.
• LKLT query: given k, t, aggregate over all records with keys less than k and intervals ending before t.
[Figure: the example records in key-time space with k1, k2, t1, t2 marked.]
E.g. LKLT(k2, t2)=20.
Problem Decomposition
[Figure: the RTA query region [k1, k2]x[t1, t2] drawn as a combination of LKST and LKLT query regions, built up one term at a time.]
RTA([k1, k2]x[t1, t2]) = LKST(k2, t2) - LKST(k1, t2)
+ LKLT(k2, t2) - LKLT(k1, t2)
- LKLT(k2, t1) + LKLT(k1, t1)
• The RTA query is decomposed to LKST and LKLT.
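The identity can be checked by brute force on a small data set. The sketch below is only a sanity check. It assumes half-open record intervals [start, end), reads 'ending before t' as end <= t, and uses hypothetical sample records. Under these assumptions the key range of the RTA query is [k1, k2).

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    key: int
    start: int  # inclusive
    end: int    # exclusive (assumption)
    value: int


def lkst(records: List[Record], k: int, t: int) -> int:
    """Keys less than k, interval containing t."""
    return sum(r.value for r in records if r.key < k and r.start <= t < r.end)


def lklt(records: List[Record], k: int, t: int) -> int:
    """Keys less than k, interval ending at or before t."""
    return sum(r.value for r in records if r.key < k and r.end <= t)


def rta_direct(records, k1, k2, t1, t2) -> int:
    """Keys in [k1, k2), interval intersecting [t1, t2]."""
    return sum(r.value for r in records
               if k1 <= r.key < k2 and r.start <= t2 and r.end > t1)


def rta_decomposed(records, k1, k2, t1, t2) -> int:
    """The six-term decomposition from the slide."""
    return (lkst(records, k2, t2) - lkst(records, k1, t2)
            + lklt(records, k2, t2) - lklt(records, k1, t2)
            - lklt(records, k2, t1) + lklt(records, k1, t1))


# Hypothetical sample data (not taken from the slides).
records = [
    Record(10, 1, 5, 2),
    Record(20, 3, 9, 4),
    Record(30, 6, 8, 7),
    Record(40, 2, 4, 5),
]

print(rta_direct(records, 15, 45, 4, 7), rta_decomposed(records, 15, 45, 4, 7))
```

The first two lines of the formula count records that started no later than t2; the last line removes those that had already ended by t1.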
Problem Decomposition
• Both LKST and LKLT are point queries: ‘given k, t, return value’.
• An index for LKST and LKLT should:
- store points in key-time space;
- maintain a value for each point;
- support point queries.
Index Design
Model
• Assume updates come in increasing time order (transaction-time model).
• A record: key k, interval [t1, t2], value v.
• At t1, it is inserted as (k, [t1, tmax], v).
• At t2, it is updated to (k, [t1, t2], v).
The LKST index
The effect of inserting record (k, [t1, t2], v):
• at t1: +v for all points with key in [k, kmax] and time in [t1, tmax];
• at t2: -v for all points with key in [k, kmax] and time in [t2, tmax].
The LKLT index
The effect of inserting record (k, [t1, t2], v):
• at t1: no update;
• at t2: +v for all points with key in [k, kmax] and time in [t2, tmax].
Update Operation
• Common update operation for both: insert (k, t):v.
• That is: add v to all points in the rectangle from (k, t) to (kmax, tmax).
• Conclusion: an index supporting point queries and the above update can be used for both LKLT and LKST.
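To make the update model concrete, here is a minimal sketch in which a plain list stands in for the index. It is not the MVSB-tree and has none of its efficiency. The class names `NaiveDominanceIndex` and `RTAStore` are hypothetical. Integer keys are assumed, and the key bound is taken as inclusive, so the 'keys less than k1' terms use k1 - 1.

```python
class NaiveDominanceIndex:
    """List-backed stand-in supporting insert (k, t):v and point queries."""

    def __init__(self):
        self._updates = []  # (k, t, v) triples

    def insert(self, k, t, v):
        # Conceptually adds v to every point (k', t') with k' >= k and t' >= t.
        self._updates.append((k, t, v))

    def point_query(self, k, t):
        # The value at (k, t) is the sum of all updates whose corner it dominates.
        return sum(v for uk, ut, v in self._updates if uk <= k and ut <= t)


class RTAStore:
    """Two indices, one for LKST and one for LKLT."""

    def __init__(self):
        self.lkst = NaiveDominanceIndex()
        self.lklt = NaiveDominanceIndex()

    def start_record(self, k, t1, v):
        self.lkst.insert(k, t1, v)    # LKST: +v at t1; LKLT: no update at t1

    def end_record(self, k, t2, v):
        self.lkst.insert(k, t2, -v)   # LKST: -v at t2
        self.lklt.insert(k, t2, v)    # LKLT: +v at t2

    def rta_sum(self, k1, k2, t1, t2):
        lo = k1 - 1                   # integer keys assumed
        st, lt = self.lkst.point_query, self.lklt.point_query
        return (st(k2, t2) - st(lo, t2)
                + lt(k2, t2) - lt(lo, t2)
                - lt(k2, t1) + lt(lo, t1))


# Hypothetical events, applied in increasing time order.
store = RTAStore()
store.start_record(10, 1, 2)
store.start_record(40, 2, 5)
store.start_record(20, 3, 4)
store.end_record(40, 4, 5)
store.end_record(10, 5, 2)
store.start_record(30, 6, 7)
store.end_record(30, 8, 7)
store.end_record(20, 9, 4)

print(store.rta_sum(15, 40, 4, 7))  # 11
```

Replacing the list with a structure that answers `point_query` in logarithmic time would leave `RTAStore` unchanged, which is the point of reducing both query types to one update operation.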
The MVSB-tree
• A partially persistent SB-tree. It inherits features from both the SB-tree [YW01] and the MVBT [BGO+96].
[Figure: an example MVSB-tree with three roots: root1 for [1, 4), root2 for [4, 10), root3 for [10, tmax).]
The MVSB-tree
Insertion
[Figure: four snapshots of a page: the initial MVSB-tree; after inserting (20, 2):1; after inserting (10, 3):1 (conceptual view); and the same insertion handled instead by logical splitting.]
Insertion (cont.)
• To handle overflow, copy records with end=tmax to a new page.
[Figure: overflow after inserting (80, 4):1; the records with end=tmax are copied to a new page.]
• Strong overflow: limit the number of records in a new page.
[Figure: the tree after the split, with root1 for [1, 4) and root2 for [4, tmax).]
The MVSB-tree
Point Query (k, t)
• Follows a single path: the nodes containing (k, t).
• Aggregates the values found in this path.
• E.g.: PointQuery(23, 7) = 5+2 = 7.
[Figure: the example MVSB-tree with roots root1 for [tmin, 4), root2 for [4, 10), root3 for [10, tmax), used to illustrate the query.]
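The single-path query can be sketched as a loop that descends one entry per level and sums the values it meets. The `Entry` and `Node` types and the two-level tree below are hypothetical. They only mimic the shape of the example and do not reproduce the page layout in the figure, and the half-open ranges are an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

KMAX = 100  # hypothetical upper bound on keys


@dataclass
class Entry:
    k_lo: int                      # key range [k_lo, k_hi)
    k_hi: int
    t_lo: int                      # time range [t_lo, t_hi)
    t_hi: int
    value: int
    child: Optional["Node"] = None


@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)


def point_query(root: Node, k: int, t: int) -> int:
    """Sum the values of the entries containing (k, t) along one root-to-leaf path."""
    total = 0
    node: Optional[Node] = root
    while node is not None:
        entry = next(
            (e for e in node.entries
             if e.k_lo <= k < e.k_hi and e.t_lo <= t < e.t_hi),
            None,
        )
        if entry is None:          # (k, t) lies outside this node
            break
        total += entry.value
        node = entry.child         # None at a leaf entry, which ends the loop
    return total


# Hypothetical tree for the time range [4, 10).
leaf_a = Node([Entry(1, 20, 4, 10, 1)])
leaf_b = Node([Entry(20, 80, 4, 10, 2), Entry(80, KMAX, 4, 10, 3)])
root = Node([Entry(1, 20, 4, 10, 0, leaf_a), Entry(20, KMAX, 4, 10, 5, leaf_b)])

print(point_query(root, 23, 7))  # 5 (root entry) + 2 (leaf entry) = 7
```

In a structure with several roots, a query would first pick the root whose time range contains t and then proceed as above.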
The MVSB-tree
Efficiency
• Theorem: with 2 MVSBT indices, we achieve:
- RTA query: O(log_b n);
- Update: O(log_b K);
- Space: O((n/b) log_b K).
• n = number of updates;
• K = number of different keys;
• b = page capacity (in records).
Performance Results
• Sun Enterprise 250 Server; two 300 MHz UltraSPARC-II processors; Solaris 2.8; GNU C++;
• Datasets: created using the TimeIT [KS98] software and transformed to add record keys.
• Each dataset has a million records (10k unique keys; on average 100 intervals per key).
• Compare against the straightforward approach using the MVBT [BGO+96] as temporal index.
Performance Results
Index Sizes
[Figure: bar chart of index sizes (#MB) for 2MVSBT vs. naïve, varying page size (2KB, 4KB, 8KB).]
Performance Results
Query Speedup
[Figure: bar chart comparing 2MVSBT and naïve, varying query rectangle size (0.1%, 1%, 10%, 50%); the data labels 319, 151 and 763 appear on the chart.]
• Query time is averaged over 100 queries of the same query rectangle size.
Conclusions
• We addressed the range-temporal aggregation (RTA) problem;
• New index structure (MVSB-tree) for incrementally maintaining and efficiently computing RTAs;
• Query time reduced from O(n) to O(log_b n) with small space overhead;
• Open problems: Min/Max range-temporal aggregation; Valid-time environment; Multi-dimensional aggregation over objects with extents.