Page 1: Embedded Databases

© Krithi Ramamritham

Embedded Databases

Krithi Ramamritham

IIT Bombay

Page 2: Embedded Databases

Small Devices are Proliferating

Handhelds, Cellphones, Sensors
  – Devices are resource constrained

• Applications
  – Personal Info Management
    • E-diary
  – Enterprise Applications
    • Health-care, Micro-banking
  – Sensor (Network)s
    • Tracking, Monitoring, aggregation

Page 3: Embedded Databases

Handhelds in Healthcare

  – Medical professionals store current data in the handheld
    • Sensors may be attached to devices
  – Update local copy and sync with backend
  – Need for information anytime and anywhere
  – Decision support, need to extract data from history
  – Need to aggregate information from data

Reduce communication costs -- a lot of (sub)query processing / aggregation must be done on the device itself.

Page 4: Embedded Databases

Sample healthcare database schema

Doctor (Docid, name, specialty,…)
Visit (Visitid, Docid, date, diagnostics,..)
Drug (Drugid, name, type)
Prescription (Visitid, Drugid, qty,..)

Q1: Who prescribed Antibiotics (to this patient) in 2004?
  – Multi-table Join

Q4: Number of prescriptions per doctor and per type of drug?
  – Multi-table Join with aggregations
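As a rough illustration of what Q1 and Q4 look like as multi-table joins, the sketch below builds the schema in SQLite through Python's standard `sqlite3` module. The column types, the sample rows and the ISO date format are assumptions made for the example. The slide's schema has no patient column, so the database is taken to belong to a single patient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Doctor (Docid INTEGER PRIMARY KEY, name TEXT, specialty TEXT);
CREATE TABLE Visit (Visitid INTEGER PRIMARY KEY, Docid INTEGER, date TEXT, diagnostics TEXT);
CREATE TABLE Drug (Drugid INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE Prescription (Visitid INTEGER, Drugid INTEGER, qty INTEGER);

-- Hypothetical sample rows, for illustration only.
INSERT INTO Doctor VALUES (1, 'Rao', 'GP'), (2, 'Iyer', 'ENT');
INSERT INTO Visit VALUES (10, 1, '2004-03-01', 'fever'),
                         (11, 2, '2003-07-15', 'cold'),
                         (12, 1, '2004-09-20', 'cough');
INSERT INTO Drug VALUES (100, 'Amoxicillin', 'Antibiotic'),
                        (101, 'Paracetamol', 'Analgesic');
INSERT INTO Prescription VALUES (10, 100, 1), (10, 101, 2), (11, 100, 1), (12, 101, 1);
""")

# Q1: who prescribed antibiotics in 2004? (four-table join)
Q1 = """
SELECT DISTINCT d.name
FROM Doctor d
JOIN Visit v ON v.Docid = d.Docid
JOIN Prescription p ON p.Visitid = v.Visitid
JOIN Drug g ON g.Drugid = p.Drugid
WHERE g.type = 'Antibiotic'
  AND v.date >= '2004-01-01' AND v.date < '2005-01-01'
"""

# Q4: number of prescriptions per doctor and per type of drug (join + aggregation)
Q4 = """
SELECT d.Docid, g.type, COUNT(*)
FROM Prescription p
JOIN Visit v ON v.Visitid = p.Visitid
JOIN Doctor d ON d.Docid = v.Docid
JOIN Drug g ON g.Drugid = p.Drugid
GROUP BY d.Docid, g.type
ORDER BY d.Docid, g.type
"""

q1_rows = conn.execute(Q1).fetchall()
q4_rows = conn.execute(Q4).fetchall()
print(q1_rows)
print(q4_rows)
```

Both queries touch all four relations, which is why join order and per-operator memory matter on a constrained device.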

Page 5: Embedded Databases

Sensor Networks

• Collect data from the environment
• Subject data to various queries
  – health monitoring, habitat monitoring, defence applications
  – Continuous Aggregate Queries
• Examples
  – Current: Mica Mote – 4MHz cpu, 4kB RAM, 128kB code space, 512 kB EEPROM, 50kbps connectivity (future: match box sized, processor running at several MHz, data memory of several tens of Mbytes)
• Reduce communication costs -> a lot of (sub)query processing / aggregation must be done on the device itself.

Page 6: Embedded Databases

Embedded Database SW

DB Software that

is directly in contact with, or

is significantly affected by,

the hardware that it executes on, or

can directly influence the behavior of that hardware.

Page 7: Embedded Databases

Need for Small Footprint DBMSs

Data management is important
  – Increasing number of applications
  – They deal with a fair amount of data
  – Complex queries involving joins and aggregates
  – Atomicity and Durability for data consistency
  – Need temporal consistency / Synchronization
  – Ease of application development
  – Require Data Privacy

A device resident DBMS is needed

Page 8: Embedded Databases

Challenges

Tightly constrained
  – cost, size, performance, power, etc.
  – limited computing power and main memory
  – limited stable storage

DB is Flash / main memory based
  • Reads – similar overheads
  • Writes – flash memory orders of magnitude slower

Page 9: Embedded Databases

Lightweight versions of Popular DBMSs

Deploy reduced footprint codebase by stripping down db features
  • e.g., flat storage, nested loop join
  • limited support for consistency preserving updates
  • subset of SQL
  • DB support not resource cognizant

Page 10: Embedded Databases

But, resources are not uniform across devices

• Handhelds
  – Tungsten - 64MB + up to 1GB additional storage
  – Simputer - 32MB + 24 MB Flash memory, Smart card
• Cell phones
  – 400 – 1600 kB +
• Sensors
  – Mica Mote – 4kB RAM, 128kB code space, 512 kB EEPROM

Need tailored DBMSs given
  • device characteristics and
  • application requirements

Page 11: Embedded Databases

DELite: Overall Philosophy

Storage Management
  – Reduce storage cost to a minimum
    • Aim at compactness in representation of data
    • Limited storage could preclude any additional index
  – But, to speed up query processing, storage model should try to incorporate some index information

Query Processing
  – Memory constrained query processing
    • existing systems use Minimum memory algorithms -- do not work well for complex joins and aggregates
    • Minimize writes to secondary storage
    • Efficient usage of limited main memory
  – Optimal memory allocation among operators

Page 12: Embedded Databases

Storage Management

• Existing storage models
  – Flat Storage
    • Tuples are stored sequentially.
    • Duplicates not eliminated
  – Pointer-based Domain Storage
    • Values partitioned into domains which are sets of unique values
    • Tuples reference the attribute value by means of pointers
    • One domain shared among multiple attributes

[Figure: the Visit relation under flat storage and under pointer-based domain storage (Relation tuples pointing into a Domain of unique values)]

Can we further reduce the storage cost while retaining simplicity?

Page 13: Embedded Databases

Storage Management: ID Storage

An identifier for each of the domain values
  – Identifier can be the ordinal value in the domain table
  – Store the (smaller) identifier instead of the pointer
  – Use the identifier as an offset into the domain table
  – D domain values can be distinguished by identifiers of length log2(D) / 8 bytes (rounded up to whole bytes).

[Figure: Relation R stores ID values (0, 1, 2, 1, …, n) that index the Domain Values v0, v1, …, vn; labels: Positional Indexing, Projection Index]

  – ID based Storage wins over Domain Storage when pointer size > log2(D) / 8
  – Above condition almost always true
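A minimal sketch of ID storage for one attribute, assuming a Python list stands in for the domain table and identifiers are packed as fixed-width little-endian bytes. The function names and the sample column are hypothetical.

```python
def id_width_bytes(domain_size: int) -> int:
    """Whole bytes needed to tell `domain_size` values apart (at least one byte)."""
    bits = max(1, (domain_size - 1).bit_length())
    return (bits + 7) // 8


def encode_column(values):
    """Split a column into a domain table and one ordinal ID per tuple position."""
    domain, index, ids = [], {}, []
    for v in values:
        if v not in index:          # first time this value is seen
            index[v] = len(domain)  # its ID is its ordinal in the domain table
            domain.append(v)
        ids.append(index[v])
    return domain, ids


def value_at(domain, packed, width, position):
    """Positional lookup: read the ID at `position`, use it as an offset into the domain."""
    raw = packed[position * width:(position + 1) * width]
    return domain[int.from_bytes(raw, "little")]


specialties = ["Cardio", "ENT", "Cardio", "Cardio", "GP", "ENT"]  # hypothetical data
domain, ids = encode_column(specialties)
width = id_width_bytes(len(domain))
packed = b"".join(i.to_bytes(width, "little") for i in ids)

print(domain, ids, width, len(packed))
print(value_at(domain, packed, width, 4))
```

With three distinct values a one-byte identifier suffices, whereas a pointer would typically take several bytes per tuple.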

Page 14: Embedded Databases

ID Storage …

• Extendable IDs are used. Length of the identifier grows and shrinks depending on the number of domain values
• To reduce reorganization of data, ID values are projected out from the rest of the relation and stored separately maintaining Positional Indexing.

Why not bit identifiers?
  – Storage is byte addressable.
  – Packing bit identifiers in bytes increases the storage management complexity.

Page 15: Embedded Databases

Choice of Storage Schemes

• Flat storage – 31K
• ID storage – 28.5 K
• ID Join Index – 18 K

Doctors – 91, Drugs – 77, Visits – 830, Prescriptions – 2155

• Size of relation
• Selectivity and length of attributes
• Frequency of updates

Page 16: Embedded Databases

Storage -> Query Processing

Storage Management
  – Reduce storage cost to a minimum
    • Aim at compactness in representation of data
    • Limited storage could preclude any additional index
  – But, to speed up query processing, storage model should try to incorporate some index information

Query Processing
  – Memory constrained query processing
    • existing systems use Minimum memory algorithms -- do not work well for complex joins and aggregates
    • Minimize writes to secondary storage
    • Efficient usage of limited main memory
  – Optimal memory allocation among operators

Page 17: Embedded Databases

Basic Steps in Query Processing

1. Parsing and translation
2. Optimization
3. Evaluation

Page 18: Embedded Databases

Query exec plans for Q1: Who prescribed Antibiotics in 2003?

[Figure: query execution plan over Drug (77), Pres (2155), Doctor (91) and Visit (830), with numeric labels 77, 270, 270, 691 and 59. Legend: selection, join. Annotations:]
  – Visits in 2003
  – Doctors visited in 2003
  – Prescriptions given by doctors visited in 2003
  – Doctors who prescribed antibiotics during visits in 2003

Schema:
  – Doctor (Docid, name, specialty,…)
  – Visit (Visitid, Docid, date, diagnostics,..)
  – Drug (Drugid, name, type)
  – Prescription (Visitid, Drugid, qty,..)

Page 19: Embedded Databases

Choosing right form of query expression and right sequence of transformations can lead to memory and time optimization

[Figure: three query execution plans (QEP 1, QEP 2, QEP 3) over Doctor (91), Visit (830), Drug (77) and Pres (2155); numeric labels include 7, 59, 77, 151, 270, 691, 830 and 196105]

QEP 1 Cost = 77 + 7*2155 + 155*830 + 59*91 = 149181 (The best plan)
QEP 2 Cost = 830 + 270*91 + 270*2155 + 691*77 = 660457
QEP 3 Cost = 91*2155 + 196105*830 + 691*77 = 163016462
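The three cost expressions can be re-evaluated with a few lines of Python. The sketch assumes each product reads as (tuples on the outer side) × (tuples scanned on the inner side), and that a bare term such as 77 or 830 is a single scan. That reading is an interpretation of the slide, not something it states.

```python
def plan_cost(terms):
    """Sum of per-operator costs; each term is (outer_tuples, inner_tuples)."""
    return sum(outer * inner for outer, inner in terms)


# Terms copied from the slide; a lone scan is written as (n, 1).
plans = {
    "QEP 1": [(77, 1), (7, 2155), (155, 830), (59, 91)],
    "QEP 2": [(830, 1), (270, 91), (270, 2155), (691, 77)],
    "QEP 3": [(91, 2155), (196105, 830), (691, 77)],
}

costs = {name: plan_cost(terms) for name, terms in plans.items()}
best = min(costs, key=costs.get)

for name, cost in costs.items():
    print(name, cost)
print("best:", best)
```

On these figures QEP 3 costs roughly a thousand times as much as QEP 1, almost entirely because of the 196105-tuple intermediate result.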

Page 20: Embedded Databases

Operator Schemes

Schemes for Join
  – Nested Loop Join
  – Indexed Nested Loop Join
  – Hash Join
  – …

Schemes for aggregation
  – Nested Loop aggregation
  – Buffered aggregation

Operator schemes implemented using the demand-driven model

[Figure: join tree over Doctor, Visit, Pres and Drug]
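The demand-driven model can be sketched with Python generators: each operator produces a tuple only when its parent asks for one, so no intermediate result is stored. This is a generic illustration, not DELite's code. The operator names, dictionary-based tuples and sample rows are assumptions.

```python
def scan(relation):
    """Leaf operator: hand out stored tuples one at a time."""
    for row in relation:
        yield row


def select(child, predicate):
    """Pass on only the tuples that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def nested_loop_join(outer, inner_relation, condition):
    """Pipelined join: the inner (right) input is a stored relation,
    rescanned for every outer tuple, so nothing is materialized."""
    for o in outer:
        for i in scan(inner_relation):
            if condition(o, i):
                yield {**o, **i}


# Hypothetical sample data.
drugs = [{"Drugid": 1, "type": "Antibiotic"}, {"Drugid": 2, "type": "Analgesic"}]
pres = [{"Visitid": 10, "Drugid": 1}, {"Visitid": 11, "Drugid": 2}, {"Visitid": 12, "Drugid": 1}]
visits = [{"Visitid": 10, "Docid": 7, "year": 2003},
          {"Visitid": 11, "Docid": 8, "year": 2003},
          {"Visitid": 12, "Docid": 8, "year": 2002}]
doctors = [{"Docid": 7, "name": "Rao"}, {"Docid": 8, "name": "Iyer"}]

# Left-deep plan: ((sigma(Drug) join Pres) join Visit) join Doctor
plan = select(scan(drugs), lambda d: d["type"] == "Antibiotic")
plan = nested_loop_join(plan, pres, lambda o, p: o["Drugid"] == p["Drugid"])
plan = nested_loop_join(plan, visits,
                        lambda o, v: o["Visitid"] == v["Visitid"] and v["year"] == 2003)
plan = nested_loop_join(plan, doctors, lambda o, d: o["Docid"] == d["Docid"])

names = [row["name"] for row in plan]  # pulling from the root drives the whole pipeline
print(names)
```

Because every right-hand input is a stored relation, the only working memory is the one tuple in flight per operator, which is why nested loops form the minimum-memory baseline.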

Page 21: Embedded Databases

Joins using ID based join indices

Foreign Key - Primary Key Join Index

[Figure: the Visit DocID column holds ID values (0, 1, 2, 1, …, n) that index positions 0 … n of the Doctor tuples v0, v1, …, vn]
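One reading of the figure is that the foreign key column Visit.DocID stores the ordinal position of the matching Doctor tuple, so the join reduces to an array lookup with no search. The sketch below illustrates that reading. The data and the function name are hypothetical.

```python
# Doctor tuples; the position in the list plays the role of the ID.
doctors = ["Rao", "Iyer", "Das"]

# Visit.DocID stored as IDs, one per Visit tuple (positional indexing).
visit_docid = [0, 1, 2, 1]


def join_visit_doctor(visit_docid, doctors):
    """Foreign key / primary key join: one direct lookup per Visit tuple."""
    for visit_pos, doc_id in enumerate(visit_docid):
        yield visit_pos, doctors[doc_id]


joined = list(join_visit_doctor(visit_docid, doctors))
print(joined)
```

The join makes one pass over Visit and never scans Doctor. That matches the Summary's remark that the join index reduces query execution time considerably.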

Page 22: Embedded Databases

Evaluation of Query Expressions

Alternatives
  – Materialization: generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.
  – Pipelining: pass on tuples to parent operations even as an operation is being executed

An evaluation plan defines exactly what scheme is used for each operation and how the execution of the operations is coordinated.
  – different schemes for an operator – (indexed) nested loop join, hash join, …
  – have different memory usage and cost

[Figure: query execution plan over Doctor (91), Visit (830), Drug (77) and Pres (2155) with numeric labels 7, 151, 59, 59 and 830]

Page 23: Embedded Databases

Need for informed memory allocation

If nested loop algorithms are used for every operator, minimum amount of memory is needed to execute the plan
  – Nested loop algorithms are inefficient
  – Should memory usage be reduced to a minimum at the cost of performance?
  – Different devices come with different memory sizes
  – Query plans should make efficient use of memory
  – Memory must be optimally allocated among all operators

Need to generate the best query execution plan depending on the available memory

Page 24: Embedded Databases

Join approach affects memory requirements

In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

Page 25: Embedded Databases

Choice of query evaluation plans

Need for Left-deep Query Plan
  – Reduce materialization, if absolutely necessary use main memory
  – Bushy trees and right-deep trees are ruled out
  – Left deep tree is most suited for pipelined evaluation
  – Right operand in a left-deep tree is always a stored relation

Other Considerations
  – Minimize writes to secondary storage
  – Efficient usage of limited main memory
  – Main memory as write buffer
  – If read:write ratio very high, flash memory can be used as write buffer

[Figure: join tree over Doctor, Visit, Pres and Drug]

Page 26: Embedded Databases

Major Issues in choice of query evaluation plan

• Order of execution of operators and operands for these
• Scheme (algorithm) to use for each operator
• Amount of memory to allocate for each scheme

[Figure: join tree over Doctor, Visit, Pres and Drug]

Page 27: Embedded Databases

Memory Allocation: using traditional optimizer

Phase 1
  • Query is first optimized to get a query plan
  • Scheme for every operator is determined

Enough Memory is assumed to be available for all the schemes --- may not be true for a resource constrained device

Phase 2
  • Division of memory on the basis of cost functions of the schemes

=> Traditional optimization cannot be used for DELite

Page 28: Embedded Databases

Optimal Memory Allocation: 1-Phase Approach

• Query optimizer is made memory cognizant
• Optimizer takes into account division of memory among operators while choosing between plans

Ideally, 1-phase optimization should be done --- but the optimizer is complex.

Page 29: Embedded Databases

Memory Cognizant 2-Phase Approach

Optimal use of memory => selecting the best scheme for every operator

Phase 1:

Determine the optimal left-deep join order (using a dynamic programming approach) using base and intermediate relation cardinalities

Phase 2:

Divide memory among the operators
  • based on the (memory, execution time) profiles of operator schemes
  • such that execution time is optimized

Page 30: Embedded Databases

Query exec plans for Q1: Who prescribed Antibiotics in 2003?

[Figure: query execution plan over Drug (77), Pres (2155), Doctor (91) and Visit (830), with numeric labels 77, 270, 270, 691 and 59. Legend: selection, join. Annotations:]
  – Visits in 2003
  – Doctors visited in 2003
  – Prescriptions given by doctors visited in 2003
  – Antibiotics given by doctors visited in 2003

Page 31: Embedded Databases

Query exec plans for Q1: Who prescribed Antibiotics in 2003?

[Figure: three query execution plans (QEP 1, QEP 2, QEP 3) over Doctor (91), Visit (830), Drug (77) and Pres (2155); numeric labels include 7, 59, 77, 151, 270, 691, 830 and 196105]

QEP 1 Cost = 77 + 7*2155 + 155*830 + 59*91 = 149181 (The best plan)
QEP 2 Cost = 830 + 270*91 + 270*2155 + 691*77 = 660457
QEP 3 Cost = 91*2155 + 196105*830 + 691*77 = 163016462

Page 32: Embedded Databases

Query exec plans for Q4: Number of prescriptions per doctor and per type of drug?

[Figure: three query execution plans over Doctor (91), Visit (830), Drug (77) and Pres (2155), each topped by GrpBy (Doctor.DocID, Drug.Type); numeric labels include 830, 2155 and 196105]

QEP 1 Cost = 77*2155 + 2155*830 + 2155*91 = 2150690
QEP 2 Cost = 830*91 + 830*2155 + 2155*77 = 2030115 (The best plan)
QEP 3 Cost = 91*2155 + 196105*830 + 2155*77 = 163129190

Page 33: Embedded Databases

Phase 2 Memory Division

• Various operator schemes
• Schemes conform to left-deep evaluation plan
• Schemes use different amounts of memory and have different execution times
• Evaluate the benefit of a scheme per unit of additional memory allocation
• Divide memory among the operators based on the benefit of their schemes

Page 34: Embedded Databases

Benefit/Size of a scheme

Assume n schemes s1, s2, …, sn to implement an operator o

Minimum scheme for an operator is the scheme that has max. cost and needs min. memory

smin -- the minimum scheme for operator o

∀i, 1≤i≤n : Cost(si) ≤ Cost(smin), Memory(si) ≥ Memory(smin)

Every scheme is characterized by a benefit/size ratio which represents its benefit per unit memory allocation

Benefit(si) = Cost(smin) – Cost(si)

Size(si) = Memory(si) – Memory(smin)

[Figure: two plots. Cost versus Memory with points (0,c1), (m2,c2), (m3,c3); Benefit versus Size with points (s1,b1), (s2,b2), (s3,b3)]
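The two definitions translate directly into code. In the sketch below the scheme names, memory figures and costs are hypothetical numbers chosen for illustration. The minimum scheme gets a ratio of 0 because its Size is 0.

```python
from typing import NamedTuple


class Scheme(NamedTuple):
    name: str
    memory: int   # memory required, in arbitrary units (assumed)
    cost: float   # estimated execution cost (assumed)


def benefit_size(schemes):
    """Return {name: (benefit, size, benefit/size)} relative to the minimum scheme."""
    s_min = min(schemes, key=lambda s: s.memory)
    table = {}
    for s in schemes:
        benefit = s_min.cost - s.cost      # Benefit(si) = Cost(smin) - Cost(si)
        size = s.memory - s_min.memory     # Size(si) = Memory(si) - Memory(smin)
        ratio = benefit / size if size > 0 else 0.0
        table[s.name] = (benefit, size, ratio)
    return table


join_schemes = [
    Scheme("nested loop", memory=1, cost=1000.0),
    Scheme("indexed nested loop", memory=5, cost=400.0),
    Scheme("hash", memory=21, cost=100.0),
]

table = benefit_size(join_schemes)
for name, row in table.items():
    print(name, row)
```

Here the hash join has the largest absolute benefit, but the indexed nested loop join gives more benefit per unit of memory.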

Page 35: Embedded Databases

Phase 2 Memory Division

Phase 2 memory division among the operators based on the cost functions of the operators

Two approaches
  – Exact memory allocation
  – Heuristic memory allocation

[Figure: Cost versus Memory with points (0,c1), (m2,c2), (m3,c3)]

Page 36: Embedded Databases

Heuristic memory allocation

  – Determine which operator gains the most per unit memory allocation and allocate memory to that operator
  – Gain of every operator is determined by its best feasible scheme
  – Repeat the process till memory allocation is done

Heuristic: Select the scheme that has the maximum benefit/size ratio

[Figure: Benefit versus Size with points (s1,b1), (s2,b2), (s3,b3)]
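A greedy sketch of the heuristic under stated assumptions: every operator starts on its minimum-memory scheme, a scheme is "feasible" if the extra memory it needs fits in what remains, and after an upgrade the gain is measured from the operator's current scheme. The slide does not spell out these details, and the operator names and (memory, cost) profiles are hypothetical.

```python
def allocate(operators, total_memory):
    """operators: {name: [(memory, cost), ...]}. Returns {name: (memory, cost)}."""
    # Start every operator on its minimum-memory scheme.
    chosen = {op: min(schemes) for op, schemes in operators.items()}
    remaining = total_memory - sum(m for m, _ in chosen.values())
    if remaining < 0:
        raise ValueError("not enough memory even for the minimum schemes")

    while True:
        best = None  # (gain per unit memory, operator, scheme)
        for op, schemes in operators.items():
            cur_mem, cur_cost = chosen[op]
            for mem, cost in schemes:
                extra = mem - cur_mem
                if 0 < extra <= remaining and cost < cur_cost:
                    ratio = (cur_cost - cost) / extra
                    if best is None or ratio > best[0]:
                        best = (ratio, op, (mem, cost))
        if best is None:          # no feasible upgrade is left
            return chosen
        _, op, scheme = best
        remaining -= scheme[0] - chosen[op][0]
        chosen[op] = scheme


operators = {
    "join1": [(1, 1000), (5, 400), (21, 100)],
    "join2": [(1, 500), (3, 450), (11, 100)],
    "groupby": [(1, 300), (9, 60)],
}

allocation = allocate(operators, total_memory=20)
print(allocation)
```

With 20 units available, join1 is upgraded first (gain 150 per unit), then join2 (gain 40 per unit). The 3 units left over are too few for any further upgrade, so groupby stays on its minimum scheme.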

Page 37: Embedded Databases

Summary

  – Selection of best query execution plan depending on memory available in a device
    • Resources in the handheld affect the response time in a big way
    • Response times highest with minimum memory and least with maximum memory
    • Heuristic memory allocation differed from exact algorithm in a few points only
    • Response times more for ID Storage due to extra cost in projection
    • Join Index reduces the query execution time considerably.

…. Towards a toolset for synthesizing small footprint DBMSs

Page 38: Embedded Databases

Other Challenges…

Update management and synchronization have to consider disconnections, mobility and communication cost

Compression, operations on compressed data

Embedded Operating System provides fewer facilities
  • e.g. no multi-threading support

Better security measures are required as devices are easily stolen, damaged and lost

Page 39: Embedded Databases

Updating embedded DBs maintaining Temporal Consistency

• Air traffic control
  – aircraft position, speed, direction, altitude, etc.
  – 20,000 data entities
  – validity intervals of 1 ~ 10 seconds
• Network services databases
  – network traffic management data, e.g., bandwidth / channel utilization, buffer / space usage

[Figure: Value of data item X plotted against Time (0 to 5)]

Page 40: Embedded Databases

Assign Periods & Deadlines -- Problem & Goals

• Problem domain:
  – maintaining temporal validity of real-time data by periodic update transactions
• Goals: assigning periods and deadlines s.t.
  – update transactions can be guaranteed to complete by their deadlines
  – the imposed workload is minimized

Page 41: Embedded Databases

Problem : Maintaining Temporal Validity of Real-Time Data

[Figure: timeline marking the validity length V from t to t+V and from t' to t'+V; V : Validity length]

• Real-time data refreshed by periodic update sensor transactions
  – X has to be refreshed before its validity interval expires
  – validity duration updated upon refresh
• How to maintain the validity of data while minimizing the workloads due to update transactions ?

Page 42: Embedded Databases

Traditional Approach: Half-Half -- Sample at twice the rate of change: P = D = V/2

Definition:
  • X : Real-Time Data
  • V : Validity Interval Length
  • T : Trans Updating X
  • P : Period of T
  • D : Deadline of T
  • C : Computation Time of T

[Figure: timeline t, t+V/2, t+V with P = D]

Workload : C / P = 2C / V

Problem : Imposes unnecessarily high workload

Observation : Data validity can be guaranteed if Period + Deadline <= Validity Length

[Figure: timeline t, t+V/2, t+V with P more than V/2 & D less than V/2 !]

Page 43: Embedded Databases

Deriving Deadlines and Periods: Intuition of More-Less Principle

• Data validity can be guaranteed if Period + Relative Deadline <= Validity Length (1)
• To reduce the workload (C/P) imposed by T without violating (1) :
  – Increase period to be more than half of validity length
  – Decrease relative deadline to be less than half of validity length
• If relative deadline <= period, deadline monotonic scheduling is an optimal fixed priority scheduling alg

Page 44: Embedded Databases

More-Less Principle: Definition

For a set of transactions {Ti}, 1 <= i <= m
• Validity Constraint (to ensure data validity) : Period + Deadline <= Validity Length
• Deadline Constraint (to reduce workload) : Computation Time <= Deadline <= Period
• Schedulability Constraint (by deadline monotonic) : Response time of the 1st instance <= Deadline

Note: 1st instance response time is the longest response time if all transactions start at same time

Question: Is More-Less always better than Half-Half ?

Page 45: Embedded Databases

More-Less Principle: P & D

Parameters
        C    V
  T1    1    3
  T2    2    20

Half-Half
        D    P
  T1    1.5  1.5
  T2    10   10
Utilization : 1/1.5 + 2/10 = 0.867

More-Less (priority order: T1 > T2)
        D    P
  T1    1    2
  T2    4    16
Utilization : 1/2 + 2/16 = 0.625 < 0.867

Determining deadline and period of a transaction in More-Less:
  Deadline: D = Response time of the 1st instance;
  Period : P = Validity Length - Deadline;

Does the priority order T2 > T1 produce same P and D?

Is more-less always better than half-half ?
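The More-Less numbers in the table can be reproduced with the usual fixed-priority response-time iteration, R = C + Σ ⌈R / Pj⌉ · Cj over higher-priority transactions j. The sketch assumes all transactions start together, as in the note on the definition slide. The function names are hypothetical.

```python
import math


def more_less(tasks):
    """tasks: [(name, C, V), ...] in decreasing priority order.
    Returns {name: (D, P)} with D = first-instance response time and P = V - D."""
    assigned = {}
    higher = []  # (C, P) of higher-priority transactions
    for name, c, v in tasks:
        r = c
        while True:
            nxt = c + sum(math.ceil(r / p) * ch for ch, p in higher)
            if nxt == r:
                break
            if nxt > v:
                raise ValueError(f"{name}: response time exceeds validity length")
            r = nxt
        d, p = r, v - r
        if d > p:  # Deadline Constraint: C <= D <= P
            raise ValueError(f"{name}: deadline {d} exceeds period {p}")
        assigned[name] = (d, p)
        higher.append((c, p))
    return assigned


def utilization(tasks, periods):
    return sum(c / periods[name] for name, c, _ in tasks)


tasks = [("T1", 1, 3), ("T2", 2, 20)]  # (name, C, V) from the Parameters table

ml = more_less(tasks)
ml_util = utilization(tasks, {n: p for n, (_, p) in ml.items()})
hh_util = utilization(tasks, {n: v / 2 for n, _, v in tasks})

print(ml, round(ml_util, 3), round(hh_util, 3))
```

Under this sketch the reversed order T2 > T1 is rejected: T1's first instance would finish at time 3, which is its whole validity length, so no period is left. This suggests the priority order does change P and D.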

Page 46: Embedded Databases

More-Less Better than Half-Half

• Theorem: {Ti} can be scheduled by (Half-Half, any fixed priority scheduling alg) => {Ti} can be scheduled by (More-Less, Deadline Monotonic scheduling)
• The reverse is not true
• Question: How to determine transaction priority order s.t. load is minimized under More-Less ?

Page 47: Embedded Databases

Shortest Validity First

• Shortest Validity First (SVF)
  – assign orders to transactions
    • in the inverse order of validity interval length
    • resolve ties in favor of a transaction with less slack (V - C)
  – is optimal under certain restrictions

Page 48: Embedded Databases

Shortest Validity First: Summary

• Restrictions:

1) Ci <= min (Vj /2)

2) C2 - C1 <= 2(V2 - V1) (i.e., the increase of computation time is less than twice the increase in validity interval length),

• If restrictions (1) & (2) hold, SVF is optimal

• If only restriction (1) holds, SVF is near optimal

• In general, SVF is a good heuristic (shown in experiments)
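A small sketch of the SVF ordering and the two restrictions. It assumes restriction (2) is meant to hold for every pair of transactions taken in SVF order, which the slide does not state explicitly. The transaction set and function names are hypothetical.

```python
from itertools import combinations


def svf_order(tasks):
    """tasks: [(name, C, V), ...]. Highest priority first:
    shortest validity first, ties broken by smaller slack V - C."""
    return sorted(tasks, key=lambda t: (t[2], t[2] - t[1]))


def restriction_1(tasks):
    """Every Ci <= min over j of Vj / 2."""
    return max(c for _, c, _ in tasks) <= min(v for _, _, v in tasks) / 2


def restriction_2(tasks):
    """C2 - C1 <= 2 (V2 - V1) for each pair taken in SVF order (assumed reading)."""
    ordered = svf_order(tasks)
    return all(c2 - c1 <= 2 * (v2 - v1)
               for (_, c1, v1), (_, c2, v2) in combinations(ordered, 2))


tasks = [("A", 1, 6), ("B", 2, 20), ("C", 2, 6)]  # hypothetical (name, C, V)

order = [name for name, _, _ in svf_order(tasks)]
print(order, restriction_1(tasks), restriction_2(tasks))
```

A and C tie on validity length, and C comes first because its slack (4) is smaller than A's (5). The resulting order could then be fed to a More-Less assignment of deadlines and periods.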