1© Krithi Ramamritham
Embedded Databases
Krithi Ramamritham
IIT Bombay
Small Devices are Proliferating
• Handhelds, Cellphones, Sensors
  – Devices are resource constrained
• Applications
  – Personal Info Management
    • E-diary
  – Enterprise Applications
    • Health-care, Micro-banking
  – Sensor (Network)s
    • Tracking, Monitoring, Aggregation
Handhelds in Healthcare
– Medical professionals store current data in the handheld
  • Sensors may be attached to devices
– Update local copy and sync with backend
– Need for information anytime and anywhere
– Decision support, need to extract data from history
– Need to aggregate information from data
Reduce communication costs: a lot of (sub)query processing / aggregation must be done on the device itself.
Sample healthcare database schema
Doctor (Docid, name, specialty,…)
Visit (Visitid, Docid, date, diagnostics,..)
Drug (Drugid, name, type)
Prescription (Visitid, Drugid, qty,..)
Q1: Who prescribed Antibiotics (to this patient) in 2004?
– Multi-table Join
Q4: Number of prescriptions per doctor and per type of drug?
– Multi-table Join with aggregations
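To make Q1 and Q4 concrete, here is a minimal sketch that builds the four relations in an in-memory SQLite database and runs both queries. SQLite only stands in for a device-resident engine; the sample rows, the ISO date strings, and the literal 'Antibiotic' as a drug type are assumptions made for illustration, and only the columns named on the slide are used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Doctor (Docid INTEGER PRIMARY KEY, name TEXT, specialty TEXT);
CREATE TABLE Visit (Visitid INTEGER PRIMARY KEY, Docid INTEGER, date TEXT, diagnostics TEXT);
CREATE TABLE Drug (Drugid INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE Prescription (Visitid INTEGER, Drugid INTEGER, qty INTEGER);
""")

# Invented sample rows (not from the slides).
conn.executemany("INSERT INTO Doctor VALUES (?, ?, ?)",
                 [(1, "Rao", "GP"), (2, "Iyer", "ENT")])
conn.executemany("INSERT INTO Visit VALUES (?, ?, ?, ?)",
                 [(10, 1, "2004-03-01", "flu"),
                  (11, 2, "2003-07-15", "otitis"),
                  (12, 1, "2004-09-09", "cough")])
conn.executemany("INSERT INTO Drug VALUES (?, ?, ?)",
                 [(100, "Amoxicillin", "Antibiotic"),
                  (101, "Paracetamol", "Analgesic")])
conn.executemany("INSERT INTO Prescription VALUES (?, ?, ?)",
                 [(10, 100, 2), (10, 101, 1), (11, 100, 1), (12, 101, 3)])

# Q1: multi-table join with selections on drug type and visit year.
Q1 = """
SELECT DISTINCT d.name
FROM Doctor d
JOIN Visit v ON v.Docid = d.Docid
JOIN Prescription p ON p.Visitid = v.Visitid
JOIN Drug g ON g.Drugid = p.Drugid
WHERE g.type = 'Antibiotic'
  AND v.date BETWEEN '2004-01-01' AND '2004-12-31'
"""

# Q4: the same joins without selections, followed by a group-by aggregate.
Q4 = """
SELECT d.Docid, g.type, COUNT(*)
FROM Doctor d
JOIN Visit v ON v.Docid = d.Docid
JOIN Prescription p ON p.Visitid = v.Visitid
JOIN Drug g ON g.Drugid = p.Drugid
GROUP BY d.Docid, g.type
ORDER BY d.Docid, g.type
"""

q1_rows = conn.execute(Q1).fetchall()
q4_rows = conn.execute(Q4).fetchall()
print(q1_rows)
print(q4_rows)
```

Both queries touch all four relations, which is why join ordering and memory allocation matter on a small device.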
Sensor Networks
• Collect data from the environment
• Subject data to various queries
  – health monitoring, habitat monitoring, defence applications
  – Continuous Aggregate Queries
• Examples
  – Current: Mica Mote: 4 MHz CPU, 4 kB RAM, 128 kB code space, 512 kB EEPROM, 50 kbps connectivity (future: matchbox sized, processor running at several MHz, data memory of several tens of Mbytes)
• Reduce communication costs -> a lot of (sub)query processing / aggregation must be done on the device itself.
Embedded Database SW
DB software that
– is directly in contact with, or is significantly affected by, the hardware that it executes on, or
– can directly influence the behavior of that hardware.
Need for Small Footprint DBMSs
Data management is important
– Increasing number of applications
– They deal with a fair amount of data
– Complex queries involving joins and aggregates
– Atomicity and Durability for data consistency
– Need temporal consistency / Synchronization
– Ease of application development
– Require Data Privacy
A device-resident DBMS is needed
Challenges
Tightly constrained
– cost, size, performance, power, etc.
– limited computing power and main memory
– limited stable storage
DB is Flash / main memory based
• Reads: similar overheads
• Writes: flash memory is orders of magnitude slower
Lightweight versions of Popular DBMSs
Deploy a reduced-footprint codebase by stripping down DB features
• e.g., flat storage, nested loop join
• limited support for consistency-preserving updates
• subset of SQL
• DB support not resource cognizant
But, resources are not uniform across devices
• Handhelds
  – Tungsten: 64 MB + up to 1 GB additional storage
  – Simputer: 32 MB + 24 MB Flash memory, Smart card
• Cell phones
  – 400 – 1600 kB +
• Sensors
  – Mica Mote: 4 kB RAM, 128 kB code space, 512 kB EEPROM
Need tailored DBMSs given
• device characteristics and
• application requirements
DELite: Overall Philosophy
Storage Management
– Reduce storage cost to a minimum
  • Aim at compactness in representation of data
  • Limited storage could preclude any additional index
– But, to speed up query processing, the storage model should try to incorporate some index information
Query Processing
– Memory constrained query processing
  • Existing systems use minimum-memory algorithms, which do not work well for complex joins and aggregates
  • Minimize writes to secondary storage
  • Efficient usage of limited main memory
– Optimal memory allocation among operators
Storage Management
• Existing storage models
  – Flat Storage
    • Tuples are stored sequentially.
    • Duplicates not eliminated
  – Pointer-based Domain Storage
    • Values partitioned into domains, which are sets of unique values
    • Tuples reference the attribute value by means of pointers
    • One domain shared among multiple attributes
[Figure: the Visit relation under flat storage, and under domain storage as a Relation whose tuples hold pointers (p, q, r, s) into a Domain of unique values]
Can we further reduce the storage cost while retaining simplicity?
Storage Management: ID Storage
An identifier for each of the domain values
– Identifier can be the ordinal value in the domain table
– Store the (smaller) identifier instead of the pointer
– Use the identifier as an offset into the domain table
– D domain values can be distinguished by identifiers of length (log2 D)/8 bytes.
[Figure: Relation R holding ID values (0, 1, 2, 1, …, n, 0, n) that index into the Domain Values table (v0, v1, …, vn); labels: Positional Indexing, Projection Index]
– ID based Storage wins over Domain Storage when pointer size > (log2 D)/8
– Above condition almost always true
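The ID scheme can be sketched in a few lines: build a domain table of unique values, store each attribute as its ordinal position in that table, and size the identifier from the number of domain values. The function names below are hypothetical, and rounding the identifier up to whole bytes is an assumption on top of the slide's (log2 D)/8 expression.

```python
def id_bytes(num_domain_values: int) -> int:
    """Whole bytes needed to distinguish num_domain_values values (rounded up)."""
    if num_domain_values < 1:
        raise ValueError("a domain needs at least one value")
    bits = max(1, (num_domain_values - 1).bit_length())  # ceil(log2 D), at least 1 bit
    return (bits + 7) // 8


def encode_column(values):
    """Return (domain_table, ids): each value is replaced by its ordinal in the domain."""
    domain, position, ids = [], {}, []
    for value in values:
        if value not in position:
            position[value] = len(domain)  # ordinal doubles as the identifier
            domain.append(value)
        ids.append(position[value])
    return domain, ids


domain, ids = encode_column(["Car", "Obg", "Car", "Car"])
decoded = [domain[i] for i in ids]  # the ID is used as an offset into the domain table
print(domain, ids, id_bytes(len(domain)))
```

With 77 drugs or 91 doctors a one-byte identifier suffices, compared with a multi-byte pointer per tuple under pointer-based domain storage.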
ID Storage …
• Extendable IDs are used. The length of the identifier grows and shrinks depending on the number of domain values
• To reduce reorganization of data, ID values are projected out from the rest of the relation and stored separately, maintaining Positional Indexing.
Why not bit identifiers?
– Storage is byte addressable.
– Packing bit identifiers in bytes increases the storage management complexity.
Choice of Storage Schemes
Sample database: Doctors – 91, Drugs – 77, Visits – 830, Prescriptions – 2155
• Flat storage – 31 K
• ID storage – 28.5 K
• ID Join Index – 18 K
Factors in the choice:
• Size of relation
• Selectivity and length of attributes
• Frequency of updates
Storage -> Query Processing
Storage Management
– Reduce storage cost to a minimum
  • Aim at compactness in representation of data
  • Limited storage could preclude any additional index
– But, to speed up query processing, the storage model should try to incorporate some index information
Query Processing
– Memory constrained query processing
  • Existing systems use minimum-memory algorithms, which do not work well for complex joins and aggregates
  • Minimize writes to secondary storage
  • Efficient usage of limited main memory
– Optimal memory allocation among operators
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Query exec plans for Q1: Who prescribed Antibiotics in 2003?
[Figure: query execution plan tree over Visit (830), Doctor (91), Pres (2155) and Drug (77), built from selection and join nodes; intermediate result sizes 270, 270, 691 and 59]
Node annotations:
– Visits in 2003
– Doctors visited in 2003
– Prescriptions given by doctors visited in 2003
– Doctors who prescribed antibiotics during visits in 2003
Schema:
– Doctor (Docid, name, specialty,…)
– Visit (Visitid, Docid, date, diagnostics,..)
– Drug (Drugid, name, type)
– Prescription (Visitid, Drugid, qty,..)
Choosing the right form of query expression and the right sequence of transformations can lead to memory and time optimization
[Figure: three query execution plan trees for Q1 over Doctor (91), Visit (830), Drug (77) and Pres (2155), annotated with intermediate cardinalities]
QEP 1 Cost = 77 + 7*2155 + 155*830 + 59*91 = 149181 (the best plan)
QEP 2 Cost = 830 + 270*91 + 270*2155 + 691*77 = 660457
QEP 3 Cost = 91*2155 + 196105*830 + 691*77 = 163016462
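The three cost expressions can be checked mechanically. The sketch below assumes each term is (tuples arriving from below) × (size of the stored relation scanned by a nested loop join), plus one initial scan where a selection comes first; that reading, and the helper name plan_cost, are assumptions. The numbers are copied from the cost expressions on the slide.

```python
# Base relation cardinalities from the slides.
DRUG, PRES, VISIT, DOCTOR = 77, 2155, 830, 91


def plan_cost(first_scan, joins):
    """first_scan: tuples read by the initial scan (0 if the plan starts with a join).
    joins: (outer_size, inner_relation_size) pairs, bottom to top of the plan."""
    return first_scan + sum(outer * inner for outer, inner in joins)


# QEP 1: select antibiotics first (77 -> 7), then join Pres, Visit, Doctor.
qep1 = plan_cost(DRUG, [(7, PRES), (155, VISIT), (59, DOCTOR)])
# QEP 2: select the visits of the year first (830 -> 270), then join Doctor, Pres, Drug.
qep2 = plan_cost(VISIT, [(270, DOCTOR), (270, PRES), (691, DRUG)])
# QEP 3: starts with the Doctor x Pres cross product (91 * 2155 = 196105 tuples).
qep3 = plan_cost(0, [(DOCTOR, PRES), (DOCTOR * PRES, VISIT), (691, DRUG)])

costs = {"QEP 1": qep1, "QEP 2": qep2, "QEP 3": qep3}
best = min(costs, key=costs.get)
print(costs, best)
```

Pushing the most selective operation (7 antibiotics out of 77 drugs) to the bottom of the plan is what keeps every later intermediate result small.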
Operator Schemes
Schemes for Join
– Nested Loop Join
– Indexed Nested Loop Join
– Hash Join
– …
Schemes for aggregation
– Nested Loop aggregation
– Buffered aggregation
Operator schemes are implemented using the demand-driven model
[Figure: join tree over Doctor, Visit, Pres and Drug]
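In a demand-driven (pull-based) model, each operator produces a tuple only when its parent asks for one. A minimal sketch using Python generators is shown below; the operator names and the dictionary row format are illustrative assumptions, not DELite's actual interfaces.

```python
def scan(relation):
    """Leaf operator: yields stored tuples one at a time."""
    for row in relation:
        yield row


def select(child, predicate):
    """Selection: pulls from its child and passes on matching tuples."""
    for row in child:
        if predicate(row):
            yield row


def nested_loop_join(outer, inner_relation, predicate):
    """Nested loop join: the inner input is a stored relation, rescanned
    for every outer tuple, so no intermediate result is materialized."""
    for outer_row in outer:
        for inner_row in inner_relation:
            if predicate(outer_row, inner_row):
                yield {**outer_row, **inner_row}


# Invented sample rows.
drugs = [{"Drugid": 1, "type": "Antibiotic"}, {"Drugid": 2, "type": "Analgesic"}]
pres = [{"Visitid": 10, "Drugid": 1}, {"Visitid": 11, "Drugid": 2},
        {"Visitid": 12, "Drugid": 1}]

plan = nested_loop_join(
    select(scan(drugs), lambda d: d["type"] == "Antibiotic"),
    pres,
    lambda d, p: d["Drugid"] == p["Drugid"],
)
first = next(plan)   # work happens only when the consumer demands a tuple
rest = list(plan)
print(first, rest)
```

The nested loop join holds only the current outer tuple in memory, which makes it the minimum-memory scheme; schemes such as hash join trade extra memory for fewer rescans.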
Joins using ID based join indices
Foreign Key - Primary Key Join Index
[Figure: the Visit DocID column stores ID values (0, 1, 2, 1, …, n, 0, n) that act as offsets to the Doctor tuples (v0, v1, …, vn)]
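Because the foreign key is stored as the ordinal position of the referenced tuple, a foreign key to primary key join reduces to an array lookup. The following sketch assumes the Doctor relation is held as a list whose position is the DocID; the sample values and the function name are invented.

```python
doctor_rows = ["Rao", "Iyer", "Sen"]   # Doctor relation; list position = DocID
visit_docid = [0, 1, 2, 1]             # Visit.DocID column, stored as IDs


def join_visit_doctor(visit_docid, doctor_rows):
    """Join Visit with Doctor by using each stored ID as an offset (no search, no hashing)."""
    return [(visit_pos, doctor_rows[doc_id])
            for visit_pos, doc_id in enumerate(visit_docid)]


joined = join_visit_doctor(visit_docid, doctor_rows)
print(joined)
```

Each visit costs one lookup rather than a scan of Doctor, so the ID column works as a join index without any additional index structure.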
Evaluation of Query Expressions
Alternatives
– Materialization: generate the results of an expression whose inputs are relations or are already computed, and materialize (store) them on disk. Repeat.
– Pipelining: pass on tuples to parent operations even as an operation is being executed
An evaluation plan defines exactly
– what scheme is used for each operation: different schemes for an operator ((indexed) nested loop join, hash join, …) have different memory usage and cost
– how the execution of the operations is coordinated.
[Figure: query execution plan tree over Doctor (91), Visit (830), Drug (77) and Pres (2155), annotated with intermediate cardinalities]
Need for informed memory allocation
If nested loop algorithms are used for every operator, the minimum amount of memory is needed to execute the plan
– Nested loop algorithms are inefficient
– Should memory usage be reduced to a minimum at the cost of performance?
– Different devices come with different memory sizes
– Query plans should make efficient use of memory
– Memory must be optimally allocated among all operators
Need to generate the best query execution plan depending on the available memory
Join approach affects memory requirements
In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.
Choice of query evaluation plans
Need for Left-deep Query Plan
– Reduce materialization; if absolutely necessary, use main memory
– Bushy trees and right-deep trees are ruled out
– Left-deep tree is most suited for pipelined evaluation
– Right operand in a left-deep tree is always a stored relation
Other Considerations
– Minimize writes to secondary storage
– Efficient usage of limited main memory
– Main memory as write buffer
– If the read:write ratio is very high, flash memory can be used as write buffer
[Figure: join tree over Doctor, Visit, Pres and Drug]
Major Issues in choice of query evaluation plan
• Order of execution of operators, and the operands for these
• Scheme (algorithm) to use for each operator
• Amount of memory to allocate for each scheme
[Figure: join tree over Doctor, Visit, Pres and Drug]
Memory Allocation: using traditional optimizer
Phase 1
• Query is first optimized to get a query plan
• Scheme for every operator is determined
Enough memory is assumed to be available for all the schemes; this may not be true for a resource-constrained device
Phase 2
• Division of memory on the basis of cost functions of the schemes
=> Traditional optimization cannot be used for DELite
Optimal Memory Allocation: 1-Phase Approach
Query optimizer is made memory cognizant
Optimizer takes into account division of memory among operators while choosing between plans
Ideally, 1-phase optimization should be done, but the optimizer is complex.
Memory Cognizant 2-Phase Approach
Optimal use of memory => selecting the best scheme for every operator
Phase 1:
Determine the optimal left-deep join order (using a dynamic programming approach) using base and intermediate relation cardinalities
Phase 2:
Divide memory among the operators
• based on the (memory, execution time) profiles of operator schemes
• such that execution time is optimized
Query exec plans for Q1Query exec plans for Q1Who prescribed Antibiotics in 2003?Who prescribed Antibiotics in 2003?
59
691
270 Drug (77)Pres (2155)
Doctor (91)
Visit (830)
270
77
Visits in 2003
Prescriptions given by doctors visited in 2003
Doctors visited in 2003
Antibiotics given by doctors visited in 2003
-- selection
-- join
Query exec plans for Q1: Who prescribed Antibiotics in 2003?
[Figure: three query execution plan trees for Q1 over Doctor (91), Visit (830), Drug (77) and Pres (2155), annotated with intermediate cardinalities]
QEP 1 Cost = 77 + 7*2155 + 155*830 + 59*91 = 149181 (the best plan)
QEP 2 Cost = 830 + 270*91 + 270*2155 + 691*77 = 660457
QEP 3 Cost = 91*2155 + 196105*830 + 691*77 = 163016462
Query exec plans for Q4: Number of prescriptions per doctor and per type of drug?
[Figure: three query execution plan trees over Doctor (91), Visit (830), Drug (77) and Pres (2155), each topped by GrpBy (Doctor.DocID, Drug.Type) and annotated with intermediate cardinalities]
QEP 1 Cost = 77*2155 + 2155*830 + 2155*91 = 2150690
QEP 2 Cost = 830*91 + 830*2155 + 2155*77 = 2030115 (the best plan)
QEP 3 Cost = 91*2155 + 196105*830 + 2155*77 = 163129190
Phase 2 Memory Division
• Various operator schemes
• Schemes conform to left-deep evaluation plan
• Schemes use different amounts of memory and have different execution times
• Evaluate the benefit of a scheme per unit of additional memory allocation
• Divide memory among the operators based on the benefit of their schemes
Benefit/Size of a scheme
Assume n schemes s1, s2, …, sn to implement an operator o
Minimum scheme for an operator is the scheme that has max. cost and needs min. memory
smin: the minimum scheme for operator o
∀i, 1 ≤ i ≤ n: Cost(si) ≤ Cost(smin), Memory(si) ≥ Memory(smin)
Every scheme is characterized by a benefit/size ratio, which represents its benefit per unit memory allocation
Benefit(si) = Cost(smin) – Cost(si)
Size(si) = Memory(si) – Memory(smin)
[Figure: Cost vs. Memory plot with points (0,c1), (m2,c2), (m3,c3), and the corresponding Benefit vs. Size plot with points (s1,b1), (s2,b2), (s3,b3)]
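The benefit and size definitions translate directly into code. In the sketch below the scheme names and the (memory, cost) numbers are invented for illustration; only the two formulas come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    memory: int   # memory needed, in arbitrary units
    cost: int     # estimated execution cost


def benefit_size(schemes):
    """Map scheme name -> (benefit, size, benefit/size) relative to the minimum scheme."""
    s_min = min(schemes, key=lambda s: s.memory)   # least memory, highest cost
    result = {}
    for s in schemes:
        benefit = s_min.cost - s.cost              # Benefit(si) = Cost(smin) - Cost(si)
        size = s.memory - s_min.memory             # Size(si) = Memory(si) - Memory(smin)
        ratio = benefit / size if size > 0 else 0.0
        result[s.name] = (benefit, size, ratio)
    return result


# Hypothetical profile of one join operator.
profile = [
    Scheme("nested_loop", memory=0, cost=1000),
    Scheme("indexed_nested_loop", memory=4, cost=400),
    Scheme("hash", memory=10, cost=200),
]
ratios = benefit_size(profile)
print(ratios)
```

In this made-up profile the hash scheme has the largest absolute benefit, but the indexed scheme gains more per unit of memory, which is the quantity the memory division works with.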
Phase 2 Memory Division
Phase 2: memory division among the operators based on the cost functions of the operators
Two approaches
– Exact memory allocation
– Heuristic memory allocation
[Figure: Cost vs. Memory plot with points (0,c1), (m2,c2), (m3,c3)]
Heuristic memory allocation
– Determine which operator gains the most per unit memory allocation and allocate memory to that operator
– Gain of every operator is determined by its best feasible scheme
– Repeat the process till memory allocation is done
Heuristic: Select the scheme that has the maximum benefit/size ratio
[Figure: Benefit vs. Size plot with points (s1,b1), (s2,b2), (s3,b3)]
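One way to read the heuristic is as a greedy loop: start every operator at its minimum scheme, then repeatedly upgrade the operator whose best feasible scheme gains the most per unit of extra memory. The sketch below follows that reading; measuring gain relative to the operator's current scheme, the tie-breaking, and the sample profiles are assumptions rather than details given on the slide.

```python
def allocate_memory(operators, budget):
    """operators: {name: [(memory, cost), ...]}; returns {name: (memory, cost)}."""
    # Every operator starts with its minimum-memory scheme.
    chosen = {op: min(schemes, key=lambda mc: mc[0]) for op, schemes in operators.items()}
    remaining = budget - sum(memory for memory, _ in chosen.values())
    if remaining < 0:
        raise ValueError("budget too small even for the minimum schemes")

    while True:
        best = None  # (ratio, operator, scheme, extra_memory)
        for op, schemes in operators.items():
            cur_memory, cur_cost = chosen[op]
            for memory, cost in schemes:
                extra, gain = memory - cur_memory, cur_cost - cost
                if extra > 0 and gain > 0 and extra <= remaining:   # feasible upgrade
                    ratio = gain / extra
                    if best is None or ratio > best[0]:
                        best = (ratio, op, (memory, cost), extra)
        if best is None:          # no feasible upgrade left
            return chosen
        _, op, scheme, extra = best
        chosen[op] = scheme
        remaining -= extra


# Hypothetical (memory, cost) profiles for two join operators.
profiles = {
    "join1": [(0, 1000), (4, 400), (10, 200)],
    "join2": [(0, 500), (6, 100)],
}
allocation = allocate_memory(profiles, budget=10)
total_cost = sum(cost for _, cost in allocation.values())
print(allocation, total_cost)
```

With a budget of 10 units the loop first gives 4 units to join1 (ratio 150), then 6 units to join2 (ratio about 67), rather than spending the whole budget on join1's hash-style scheme.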
Summary
– Selection of the best query execution plan depending on the memory available in a device
  • Resources in the handheld affect the response time in a big way
  • Response times are highest with minimum memory and lowest with maximum memory
  • Heuristic memory allocation differed from the exact algorithm at only a few points
  • Response times are higher for ID Storage due to the extra cost of projection
  • Join Index reduces the query execution time considerably.
…. Towards a toolset for synthesizing small footprint DBMSs
Other Challenges…
Update management and synchronization have to consider disconnections, mobility and communication cost
Compression, operations on compressed data
Embedded operating systems provide fewer facilities
• e.g. no multi-threading support
Better security measures are required as devices are easily stolen, damaged and lost
Updating embedded DBs maintaining Temporal Consistency
• Air traffic control
  – aircraft position, speed, direction, altitude, etc.
  – 20,000 data entities
  – validity intervals of 1 ~ 10 seconds
• Network services databases
  – network traffic management data, e.g., bandwidth / channel utilization, buffer / space usage
[Figure: Value of data item X plotted against Time (0 to 5)]
Assign Periods & Deadlines -- Problem & Goals
• Problem domain:
  – maintaining temporal validity of real-time data by periodic update transactions
• Goals: assigning periods and deadlines s.t.
  – update transactions can be guaranteed to complete by their deadlines
  – the imposed workload is minimized
Problem: Maintaining Temporal Validity of Real-Time Data
[Figure: timeline with validity intervals of length V: data refreshed at t is valid until t+V, data refreshed at t' is valid until t'+V]
V: Validity length
• Real-time data refreshed by periodic update sensor transactions
  – X has to be refreshed before its validity interval expires
  – validity duration updated upon refresh
• How to maintain the validity of data while minimizing the workload due to update transactions?
Traditional Approach: Half-Half -- Sample at twice the rate of change: P = D = V/2
Definition:
• X: Real-Time Data
• V: Validity Interval Length
• T: Transaction Updating X
• P: Period of T
• D: Deadline of T
• C: Computation Time of T
[Figure: timeline t, t+V/2, t+V with P = D = V/2]
Workload: C / P = 2C / V
Problem: Imposes unnecessarily high workload
Observation: Data validity can be guaranteed if Period + Deadline <= Validity Length
[Figure: timeline t, t+V/2, t+V with a longer period P and a shorter deadline D]
P more than V/2 & D less than V/2!
Deriving Deadlines and Periods: Intuition of More-Less Principle
• Data validity can be guaranteed if Period + Relative Deadline <= Validity Length (1)
• To reduce the workload (C/P) imposed by T without violating (1):
  – Increase period to be more than half of validity length
  – Decrease relative deadline to be less than half of validity length
• If relative deadline <= period, deadline monotonic scheduling is an optimal fixed priority scheduling algorithm
More-Less Principle: Definition
For a set of transactions {Ti}, 1 <= i <= m
• Validity Constraint (to ensure data validity): Period + Deadline <= Validity Length
• Deadline Constraint (to reduce workload): Computation Time <= Deadline <= Period
• Schedulability Constraint (by deadline monotonic): Response time of the 1st instance <= Deadline
Note: 1st instance response time is the longest response time if all transactions start at the same time
Question: Is More-Less always better than Half-Half?
More-Less Principle: P & D
Parameters
     C    V
T1   1    3
T2   2    20
Half-Half
     D    P
T1   1.5  1.5
T2   10   10
Utilization: 1/1.5 + 2/10 = 0.867
More-Less (priority order: T1 > T2)
     D    P
T1   1    2
T2   4    16
Utilization: 1/2 + 2/16 = 0.625 < 0.867
Determining deadline and period of a transaction in More-Less:
Deadline: D = Response time of the 1st instance
Period: P = Validity Length - Deadline
Does the priority order T2 > T1 produce the same P and D?
Is More-Less always better than Half-Half?
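The D and P values in the example can be reproduced with a short response-time calculation. The sketch below assumes all transactions are released together at time 0 and uses the usual fixed-point iteration for the first-instance response time under fixed priorities; the function names are hypothetical.

```python
import math


def more_less(transactions):
    """transactions: (C, V) pairs listed from highest to lowest priority.
    Returns one (D, P) pair per transaction, with D = first-instance
    response time and P = V - D."""
    assigned = []
    for i, (c, v) in enumerate(transactions):
        # (C_j, P_j) of every higher-priority transaction.
        higher = [(transactions[j][0], assigned[j][1]) for j in range(i)]
        r = c + sum(cj for cj, _ in higher)
        while True:
            nxt = c + sum(math.ceil(r / pj) * cj for cj, pj in higher)
            if nxt == r:
                break
            if nxt > v:
                raise ValueError("response time exceeds validity length")
            r = nxt
        d, p = r, v - r
        if not (c <= d <= p):
            raise ValueError("deadline constraint C <= D <= P violated")
        assigned.append((d, p))
    return assigned


def utilization(transactions, periods):
    return sum(c / p for (c, _), p in zip(transactions, periods))


tasks = [(1, 3), (2, 20)]                       # T1, T2 as (C, V), T1 has priority
ml = more_less(tasks)
ml_util = utilization(tasks, [p for _, p in ml])
hh_util = utilization(tasks, [v / 2 for _, v in tasks])
print(ml, ml_util, round(hh_util, 3))

try:
    more_less([(2, 20), (1, 3)])                # priority order T2 > T1
    reversed_ok = True
except ValueError:
    reversed_ok = False
print(reversed_ok)
```

Under these assumptions the reversed order T2 > T1 does not yield the same values: T1's first instance finishes at time 3, which equals its validity length and leaves no room for a period, so the priority order matters.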
More-Less Better than Half-Half
• Theorem: {Ti} can be scheduled by (Half-Half, any fixed priority scheduling algorithm) => {Ti} can be scheduled by (More-Less, Deadline Monotonic scheduling)
• The reverse is not true
• Question: How to determine transaction priority order s.t. load is minimized under More-Less?
Shortest Validity First
• Shortest Validity First (SVF)
  – assign priority orders to transactions
    • in the inverse order of validity interval length
    • resolve ties in favor of a transaction with less slack (V - C)
  – is optimal under certain restrictions
Shortest Validity First: Summary
• Restrictions:
  1) Ci <= min (Vj / 2)
  2) C2 - C1 <= 2(V2 - V1) (i.e., the increase of computation time is less than twice the increase in validity interval length)
• If restrictions (1) & (2) hold, SVF is optimal
• If only restriction (1) holds, SVF is near optimal
• In general, SVF is a good heuristic (shown in experiments)
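The SVF ordering rule is a two-key sort: shorter validity interval first, then less slack (V - C) on ties. The sketch below also checks restriction (1); restriction (2) is stated on the slide for a pair of transactions only, so it is left out. The tuple layout (name, C, V) and the sample values are assumptions.

```python
def shortest_validity_first(transactions):
    """transactions: (name, C, V) tuples; returns them highest priority first."""
    return sorted(transactions, key=lambda t: (t[2], t[2] - t[1]))


def restriction_1_holds(transactions):
    """Restriction (1): every computation time is at most half the smallest validity length."""
    half_min_validity = min(v for _, _, v in transactions) / 2
    return all(c <= half_min_validity for _, c, _ in transactions)


# Invented example: A and B tie on validity length, B has less slack.
sample = [("A", 1, 10), ("B", 2, 10), ("C", 1, 4)]
order = [name for name, _, _ in shortest_validity_first(sample)]
print(order, restriction_1_holds(sample))
```

The resulting list can be used as the priority order when deriving deadlines and periods under More-Less.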