VLDB 2006, Seoul 1
Indexing For Function Approximation
Biswanath Panda, Mirek Riedewald, Stephen B. Pope, Johannes Gehrke, L. Paul Chew
Cornell University
Motivation
• Simulations are important in science
• Large simulations computationally infeasible
  – Driven by complex mathematical models
  – Require solution to complex differential equations
• Approximation techniques speed up simulations
  – Bounded error in the simulation
  – Approximate simulation steps using information from previous steps
Outline
• Example scientific application
  – Combustion simulation
• Function approximation problem
  – Formulation
  – Hardness
  – Algorithm
• Indexing problem
Combustion Simulation
[Figure: combustion simulation schematic. Labels: Inflow, Air, Methane, Mixing & Reaction, High Dimensional Composition Vector, Outflow, Air + Methane]
Properties Of Simulation
• Composition dimensionality
  – 9 for simple hydrogen simulations
  – >50 for complex methane simulations
• Cost of reaction function evaluation: 30 ms
• Number of function evaluations: $10^8$ to $10^{10}$
• Total simulation time
  – $10^8$ function evaluations ≈ 35 days
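The total-time figure above follows directly from the per-evaluation cost. A minimal arithmetic check, assuming evaluations run strictly one after another at 30 ms each (the function name `total_days` is made up for this illustration):

```python
# Back-of-the-envelope check of the simulation time quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60


def total_days(num_evaluations: float, seconds_per_evaluation: float = 0.030) -> float:
    """Wall-clock days needed if every evaluation costs the same and runs serially."""
    return num_evaluations * seconds_per_evaluation / SECONDS_PER_DAY


if __name__ == "__main__":
    for n in (1e8, 1e10):
        print(f"{n:.0e} evaluations: about {total_days(n):.0f} days")
```

At the upper end of the range ($10^{10}$ evaluations) the same arithmetic gives a figure one hundred times larger, which is why approximating the reaction function matters.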
Function Approximation
• Approximate the reaction function
• Approach
  – Use previous function evaluations to approximate future function evaluations
  – ISAT (In Situ Adaptive Tabulation) [Pope '97]
• Definition: ε-approximation of f(x)
  – Let $f: \mathbb{R}^m \to \mathbb{R}^n$ be a function, let $x \in \mathbb{R}^m$ and $\varepsilon \in \mathbb{R}$. $f^*(x)$ is an ε-approximation of $f(x)$ if $\| f^*(x) - f(x) \| < \varepsilon$
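Example
[Figure: plot of f. Labels: Cost]
Example
[Figure: plot of f around the point ( x, f(x) ) with query points x1 and x2 and error bounds ε. Labels: Cost, Original Cost]
• Linear approximation: $f^*(x_2) = f(x) + s \cdot (x_2 - x)$
• An ε-Local Region $R_{f,f^*}(x, \varepsilon) \subseteq \mathbb{R}^m$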
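The linear approximation f*(x2) = f(x) + s * (x2 - x) and the ε-approximation test can be sketched in one dimension, where the norm reduces to an absolute value. This is only an illustration: `math.sin` stands in for the expensive function, its derivative is used as the slope s, and the helper names are hypothetical.

```python
import math


def linear_approximation(x0: float, f_x0: float, slope: float):
    """Build f*(x) = f(x0) + slope * (x - x0) from one stored evaluation."""
    def f_star(x: float) -> float:
        return f_x0 + slope * (x - x0)
    return f_star


def is_eps_approximation(f, f_star, x: float, eps: float) -> bool:
    """True if |f*(x) - f(x)| < eps (1-D case of the norm in the definition)."""
    return abs(f_star(x) - f(x)) < eps


# Stand-in function (assumption): sin, evaluated once at x0 = 0 with slope cos(0) = 1.
f = math.sin
f_star = linear_approximation(0.0, f(0.0), math.cos(0.0))

if __name__ == "__main__":
    for x in (0.1, 0.5):
        print(x, is_eps_approximation(f, f_star, x, eps=1e-3))
```

Points near x0 pass the test and points farther away fail it; the set of points that pass is what the slides call the ε-Local Region.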
Example
[Figure: f over query points x1 to x6 with approximations f1*, f2*, f3*. Labels: Cost, Original Cost]
Example
[Figure: f over query points x1 to x6 with approximations f1*, f2*, f3*]
When should a local region be added?
Example
Each query point can be covered by several Local Regions
[Figure: f over query points x1 to x8 with approximations f1*, f2*, f3*, f4*]
Challenges
• Finding good f*s and corresponding Local Regions
• Computing a set of Local Regions
• Data management: storing Local Regions for future use
• Problem: Minimize total simulation time by computing and storing a set of Local Regions
Finding The Optimal Set Of Local Regions
• Simplified cost model
  – Both the function value and Local Region at a point can be obtained at some constant cost equal across all regions
  – Approximations have zero cost
• Offline Problem
  – Given a set $X = \{x_1, x_2, \ldots, x_n\}$ of query points, find the smallest set $L = \{l_1, l_2, \ldots, l_k\}$ of Local Regions, such that for each $x_i \in X$ there is an $l_j \in L$ which contains $x_i$
  – NP-Complete: Reduction from Geometric Covering By Discs
• Online Problem
  – No online algorithm is competitive
Algorithm Illustration
[Figure: f over query points x1 to x8 with approximations f1*, f2*, f3*, f4*]
Algorithm
• Initialize S
• Simulation issues a query point x
• Retrieve: Lookup x in S
• Local Region Found?
  – Y: Return Approximation
  – N: Evaluate function at x, then Add new region containing x to S
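The lookup / evaluate / add loop above can be sketched as follows. This is a one-dimensional illustration, not the ISAT implementation: Local Regions are modeled as intervals of a fixed, hypothetical `radius` (the slides use ellipsoids derived from a Taylor expansion), S is a plain list scanned linearly, and a fixed radius does not by itself guarantee the ε error bound.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LocalRegion:
    center: float
    value: float   # f(center), stored from the exact evaluation
    slope: float   # s in f*(x) = f(center) + s * (x - center)
    radius: float  # assumption: the region is [center - radius, center + radius]

    def contains(self, x: float) -> bool:
        return abs(x - self.center) <= self.radius

    def approximate(self, x: float) -> float:
        return self.value + self.slope * (x - self.center)


class Approximator:
    def __init__(self, f: Callable[[float], float], df: Callable[[float], float], radius: float):
        self.f, self.df, self.radius = f, df, radius
        self.regions: List[LocalRegion] = []  # the set S, initially empty
        self.evaluations = 0                  # number of exact (expensive) evaluations

    def retrieve(self, x: float) -> Optional[LocalRegion]:
        """Lookup x in S: return the first stored region containing x, if any."""
        for region in self.regions:
            if region.contains(x):
                return region
        return None

    def query(self, x: float) -> float:
        region = self.retrieve(x)
        if region is not None:
            return region.approximate(x)      # Local Region found: return approximation
        self.evaluations += 1                 # miss: evaluate the function at x
        fx = self.f(x)
        self.regions.append(LocalRegion(x, fx, self.df(x), self.radius))  # add region
        return fx
```

For example, with `Approximator(lambda x: x * x, lambda x: 2 * x, radius=0.1)`, a query at 1.0 triggers an exact evaluation, while a following query at 1.05 is answered from the stored region without evaluating the function again.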
Possible Instantiation Of Local Regions
• Local Regions can be approximated using high dimensional ellipsoids [Pope '97]
  – Based on Taylor Expansion of function
• Two step approach
  – Initial conservative approximation
  – Grow
[Figure labels: x, x1]
Example
[Figure: region around x with query points x1 and x2. Labels: ε' < ε]
Example
[Figure: region around x with points x'1 and x'2. Labels: ε' < ε]
Example
[Figure: region around x with points x'1 and x'2. Labels: ε, ε' < ε]
Updating Existing Regions
• Evaluate function at x
• Can existing region contain x?
  – Y: Grow: Update existing regions to contain x
  – N: Add new region containing x to S
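The grow-or-add decision on a miss can be sketched in one dimension. The test used here for "can an existing region contain x" is an assumption: a region is treated as growable if its stored linear approximation is already within ε of the freshly evaluated f(x). Regions are intervals rather than ellipsoids, and `handle_miss` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    center: float
    value: float   # f(center)
    slope: float   # slope of the stored linear approximation
    radius: float  # 1-D stand-in for an ellipsoid: [center - radius, center + radius]

    def contains(self, x: float) -> bool:
        return abs(x - self.center) <= self.radius

    def approximate(self, x: float) -> float:
        return self.value + self.slope * (x - self.center)


def handle_miss(regions: List[Region], x: float, fx: float, slope_at_x: float,
                eps: float, initial_radius: float) -> str:
    """After evaluating f(x) = fx on a miss, grow an existing region or add a new one."""
    for region in regions:
        # Assumed growability test: the region's approximation is within eps at x.
        if abs(region.approximate(x) - fx) < eps:
            region.radius = max(region.radius, abs(x - region.center))  # grow to cover x
            return "grow"
    regions.append(Region(x, fx, slope_at_x, initial_radius))           # add new region
    return "add"
```

Growing is a heuristic in this sketch: extending the interval to x assumes the error stays below ε everywhere in between, which the single check at x does not prove.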
Indexing Problem
• Workload
  – Retrieve: Find ellipsoid containing query point
  – Grow
    • Find ellipsoids to be grown
    • Update grown ellipsoids
  – Add: Insert a new ellipsoid
New Indexing Problem
• Shape of regions
• Updates and queries interleaved
• Additional costs: ellipsoid maintenance costs
• Overall aim: Reduce total simulation time
• Retrieve/grow/add are all optional
  – Tuning parameters at each step
Operation       Cost
Evaluation      2000
Addition        1200
Grow            10
Approximation   1
Search          1
Outline
• Example scientific application
  – Combustion simulation
• Function approximation problem
  – Formulation
  – Hardness
  – Algorithm
• Indexing problem
  – Cost structure, tuning parameters and effects
  – Index structures and experiments
Grow Effects
$C_{miss} = t_f + t_{growsearch} + I_{grow} \cdot C_{grow} + (1 - I_{grow}) \cdot C_{add}$
• Tuning Parameter: $Ell_g$
  – Limit on number of ellipsoids examined for growing
  – No pruning criteria
  – Affects
    • $t_{growsearch}$
    • Chance of finding a growable ellipsoid
• Tuning Parameter: $N_{grown}$
  – Number of ellipsoids grown per step
  – Affects
    • $C_{grow}$
    • Structure of the index (overlapping ellipsoids)
Retrieve Effects
$C_{tot} = t_{search} + I_{ret} \cdot t_{la} + (1 - I_{ret}) \cdot C_{miss}$
• Tuning Parameter: $Ell_r$
  – Limit on number of ellipsoids examined during retrieve
  – Limits how much of the index is searched
  – Affects
    • $t_{search}$
    • Chances of success of the current retrieve and also of future retrieves
Add Effects
$C_{miss} = t_f + t_{growsearch} + I_{grow} \cdot C_{grow} + (1 - I_{grow}) \cdot C_{add}$
• Tuning parameter: Indirectly controlled by retrieves and grows
  – Affects
    • Should query point be covered by an add or grow?
      (-) Computing new ellipsoids is expensive
      (-) New ellipsoids cover smaller part of the domain
      (+) May lead to better ellipsoid distribution
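The two cost formulas (Ctot for a query, Cmiss for a miss) can be evaluated per query as a small sketch. The numeric defaults are taken from the operation cost table shown earlier, but the mapping of table rows to symbols is an assumption made here for illustration: Evaluation for t_f, Addition for C_add, Grow for C_grow, Approximation for t_la, and Search for both t_search and t_growsearch.

```python
def miss_cost(grew: bool, t_f: float = 2000, t_growsearch: float = 1,
              c_grow: float = 10, c_add: float = 1200) -> float:
    """C_miss = t_f + t_growsearch + I_grow * C_grow + (1 - I_grow) * C_add."""
    i_grow = int(grew)  # indicator: 1 if an existing ellipsoid was grown, else 0
    return t_f + t_growsearch + i_grow * c_grow + (1 - i_grow) * c_add


def total_cost(retrieved: bool, grew: bool = False,
               t_search: float = 1, t_la: float = 1) -> float:
    """C_tot = t_search + I_ret * t_la + (1 - I_ret) * C_miss."""
    i_ret = int(retrieved)  # indicator: 1 if the retrieve found a covering ellipsoid
    return t_search + i_ret * t_la + (1 - i_ret) * miss_cost(grew)


if __name__ == "__main__":
    print("hit:          ", total_cost(retrieved=True))
    print("miss, grow:   ", total_cost(retrieved=False, grew=True))
    print("miss, add:    ", total_cost(retrieved=False, grew=False))
```

Under these assumed values a miss costs roughly three orders of magnitude more than a hit, which is why the retrieve and grow tuning parameters trade a little extra search time for a higher chance of avoiding an evaluation.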
Candidate Index Structures
• Bounding Box Rtree
• Point Rtree
• Ellipsoid Rtree
• Random Projection Rtree
• Binary Tree
• MRU List + Rtree
Binary Tree
Primary Retrieve
[Figure: binary tree example. Labels: A, B, C, 1, 2, q]
Binary Tree
Secondary Retrieve
[Figure: binary tree example. Labels: A, B, C, 1, 2, q]
Binary Tree
[Figure: binary tree example. Labels: A, B, C, 1, 2]
Binary Tree
Secondary Retrieve now Primary Retrieve
[Figure: binary tree example after an update. Labels: A, B, C, D, 1, 2, 3]
Effects In Action: Binary Tree
• 32 dimensional Methane simulation
• $6 \times 10^6$ queries
• Windows XP machine (2.4 GHz, 2 GB)
MRU List + Rtree
• MRU List for retrieving
  – High locality
• Rtree for searching growable ellipsoids
[Figure labels: MRU List, Rtree]
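The retrieve side of this structure can be sketched as a most-recently-used list with a cap on how many entries are examined (playing the role of the retrieve limit from the tuning-parameter slides). Everything here is illustrative: regions are 1-D `(lo, hi)` intervals rather than ellipsoids, the class and parameter names are hypothetical, and the Rtree used for growing is not modeled.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # 1-D stand-in for an ellipsoid: (lo, hi)


class MRUList:
    def __init__(self) -> None:
        self.items: List[Interval] = []  # index 0 is the most recently used region

    def add(self, region: Interval) -> None:
        """A newly added region counts as most recently used."""
        self.items.insert(0, region)

    def retrieve(self, x: float, limit: Optional[int] = None) -> Optional[Interval]:
        """Scan at most `limit` regions in MRU order; move a hit to the front."""
        n = len(self.items) if limit is None else min(limit, len(self.items))
        for i in range(n):
            lo, hi = self.items[i]
            if lo <= x <= hi:
                self.items.insert(0, self.items.pop(i))
                return self.items[0]
        return None  # miss, or a covering region lies beyond the examined prefix
```

With high query locality, consecutive queries tend to fall in the same few regions, so a hit is usually found near the front of the list; a small `limit` keeps the search time low at the price of occasionally missing a region that is stored further back.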
Effects In Action: MRU List + Rtree
• Effects very different from Binary Tree
Total Simulation Times
Index Type                 Error Tolerance
                           0.005     0.00005   0.00004
Binary Tree (tuned)        1073      10181     13100
MRU List + Rtree           1125      14000     19920
Bbox Rtree                 1201      14700     20850
Random Projection Rtree    1378      15800     22051
Binary Tree (default)      1344      29186     31200
FIFO List + Rtree          2154      33770     42900
Point Rtree                10431     >44000    -
Ellipsoidal Rtree          14328     >44000    -
Conclusion & Future Work
• Formulated the function approximation problem
• New class of applications for high dimensional indexing
• Understand index selection for function approximation
• Future work
  – Dynamic parameter settings
  – New benchmark for index structures
  – Evaluation of other index structures
  – Comparison with other function approximation techniques
Questions?