Upload
whitney-havill
View
219
Download
0
Tags:
Embed Size (px)
Citation preview
V. Megalooikonomou
Distributed Databases
(based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU)
Temple University – CIS Dept.CIS616– Principles of Data Management
Overview Problem – motivation Design issues Query processing – semijoins transactions (recovery, conc.
control)
Problem – definition
distributed DB: stored in many connected cites DB typically geographically separated separately administered transactions are differentiated:
Local Global
possibly different DBMSs, DB schemas (heterogeneous)
LA NY
Problem – definition
LA NY
EMPEMPLOYEE
connect to LA; exec sql select * from EMP; ...
connect to NY; exec sql select * from EMPLOYEE; ...
now:
DBMS1 DBMS2
Problem – definition
LA NY
EMP
EMPLOYEE
connect to D-DBMS; exec sql select * from EMPL;ideally:
DBMS1 DBMS2
D-DBMSD-DBMS
Pros + Cons Pros
data sharing reliability & availability autonomy (local) speed up of query processing
Cons software development cost more bugs increased processing overhead (msg)
Overview Problem – motivation Design issues Query processing – semijoins transactions (recovery, conc.
control)
Design of Distr. DBMS Homogeneous distr. DBs
Identical DBMS Same DB Schema Aware of one another Agree to cooperate in transaction processing
Heterogeneous distr. DBs Different DBMS Different DB Schema May not be aware of one another May provide limited facilities for cooperation in
transaction processing
Design of Distr. DBMS replication (several copies of a
table at different sites) fragmentation (horizontal; vertical;
hybrid) or both…
Design of Distr. DBMS Replication: a copy of a relation is
stored in two or more sites Pros and cons
Availability Increased parallelism (possible
minimization of movement of data among sites)
Increased overhead on update (replicas should be consistent)
Design of Distr. DBMS
ssn name address
123 smith wall str.
... ... ...
234 johnson sunset blvd
horiz.
fragm.
vertical fragm.Fragmentation:
• keep tuples/attributes at the sites where they are used the most• ensure that the table can be reconstructed
Transparency & autonomy
Issues/goals: naming and local autonomy replication transparency fragmentation transparency location transparencyi.e.:
Problem – definition
LA NY
EMP
EMPLOYEE
connect to D-DBMS; exec sql select * from EMPL;ideally:
DBMS1 DBMS2
D-DBMSD-DBMS
Overview Problem – motivation Design issues Query processing – semijoins transactions (recovery, conc.
control)
Distributed Query processing
issues (additional to centralized q-opt) cost of transmission
parallelism / overlap of delays
(cpu, disk, #bytes-transmitted, #messages-transmitted)
minimize elapsed time?
or minimize resource consumption?
Distributed Query processing
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S2
S3SUPPLIER Join SHIPMENT = ?
semijoins choice of plans? plan #1: ship SHIP -> S1; join; ship ->
S3 plan #2: ship SHIP->S3; ship SUP->S3;
join ... others?
Distr. Q-opt – semijoins
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S2
S3SUPPLIER Join SHIPMENT = ?
Semijoins
Idea: reduce the tables before shipping
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S3SUPPLIER Join SHIPMENT = ?
Semijoins
Idea: reduce the tables before shipping
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S3SUPPLIER Join SHIPMENT = ?
(s1,s2,s5,s11)
Semijoins – e.g.: suppose each attr. is 4 bytes Q: transmission cost (#bytes) for
semijoinSHIPMENT’ = SHIPMENT semijoin
SUPPLIER
Semijoins
Idea: reduce the tables before shipping
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S3SUPPLIER Join SHIPMENT = ?
(s1,s2,s5,s11)
4 bytes
Semijoins – e.g.: suppose each attr. is 4 bytes Q: transmission cost (#bytes) for
semijoinSHIPMENT’ = SHIPMENT semijoin
SUPPLIER A: 4*4 bytes
Semijoins – e.g.: suppose each attr. is 4 bytes Q1: give a plan, with semijoin(s) Q2: estimate its cost (#bytes
shipped)
Semijoins – e.g.: A1:
reduce SHIPMENT to SHIPMENT’ SHIPMENT’ -> S3 SUPPLIER -> S3 do join @ S3
Q2: cost?
Semijoins
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S3
(s1,s2,s5,s11)
4 bytes
4 bytes 4 bytes
4 bytes
Semijoins – e.g.: A2:
4*4 bytes - reduce SHIPMENT to SHIPMENT’ 3*8 bytes - SHIPMENT’ -> S3 4*8 bytes - SUPPLIER -> S3 0 bytes - do join @ S3
72 bytes TOTAL
Other plans?
P2: reduce SHIPMENT to SHIPMENT’ reduce SUPPLIER to SUPPLIER’ SHIPMENT’ -> S3 SUPPLIER’ -> S3
A brilliant idea: two-way semijoins
(not in book, not in final exam) reduce both relations with one more
exchange: [Kang, ’86] ship back the list of keys that didn’t
match CAN NOT LOSE! (why?) further improvement:
or the list of ones that matched – whatever is shorter!
Two-way Semijoins
s# ...
s1
s2
s5
s11
S1
SUPPLIER
s# p#
s1 p1
s2 p1
s3 p5
s2 p9
SHIPMENT
S3
(s1,s2,s5,s11)
(s5,s11)
S2
Overview Problem – motivation Design issues Query processing – semijoins transactions (recovery, conc.
control)
Transactions – recovery Problem: e.g., a transaction moves
$100 from NY $50 to LA, $50 to Chicago 3 sub-transactions, on 3 systems how to guarantee atomicity (all-or-
none)? Observation: additional types of
failures (links, servers, delays, time-outs ....)
Transactions – recovery Problem: e.g., a transaction moves
$100 from NY -> $50 to LA, $50 to Chicago
Distributed recovery Step 2: execute a commit protocol, e.g., “2 phase commit” when a transaction T completes
execution (i.e., when all sites at which T has executed inform the transaction coordinator Ci that T has completed) Ci starts the 2PC protocol
->
Distributed recovery Many, many additional details
(what if the coordinator fails? what if a link fails? etc)
and many other solutions (e.g., 3-phase commit)
Overview Problem – motivation Design issues Query processing – semijoins transactions (recovery, conc.
control)
Distributed deadlocksLA NY
T1,la
T2,la
T1,ny
T2,ny
• cites need to exchange wait-for graphs
• clever algorithms, to reduce # messages