Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Distributed Database SystemsFall 2018
IntroductionSL01
I Syllabus and Course ProjectI Data Independence and Distributed ComputingI Definition of Distributed DatabasesI Promises of Distributed DatabasesI Technical Problems to be StudiedI Architectural Models
DDBS18, SL01 1/69 M. Böhlen
This Course/1
Syllabus1. Introduction2. Distributed Database Design3. Distributed Query Processing4. Distributed Query Optimization5. Distributed Transaction Management
DDBS18, SL01 2/69 M. Böhlen
This Course/2
I Location: BIN 2.A.01I Time: Monday 14:00 - 15:45I Course page http://www.ifi.uzh.ch/dbtgI I assume you have followed an introductory database course before
(relational model; algebra; SQL; query processing; etc).I The course is based on the textbook Principles of Distributed
Database Systems from Tamer Öszu and Patrick Valduriez,Springer, 3rd edition, 2010.
DDBS18, SL01 3/69 M. Böhlen
Two Generals’ Problem/1
I Task: Generals A and B must negotiate a time to attack.I Rules:
I A and B must attack at the same timeI A and B synchronize through messengersI Messengers can get lost
source: Zoltan Fern, Stanford University, cs347 notes
DDBS18, SL01 4/69 M. Böhlen
Two Generals’ Problem/2
I How many messages are needed, so that both know for sure when toattack?
source: Zoltan Fern, Stanford University, cs347 notes
DDBS18, SL01 5/69 M. Böhlen
Two Generals’ Problem/3
I We must relax rules to get a solution.I Probalistic approach with retransmissions.
source: Zoltan Fern, Stanford University, cs347 notes
I What is the probability that B attacks at time t?I B attacks at t if he receives message to do so no later than t.I Each message delivered with probability p.I One message per time unit.
DDBS18, SL01 6/69 M. Böhlen
Two Generals’ Problem/4
I a(1) = pI a(2) = p + (1 − p)pI a(3) = p + (1 − p)p + (1 − p)2pI a(4) = p + (1 − p)p + (1 − p)2p + (1 − p)3p
0
0.2
0.4
0.6
0.8
1
1 1.5 2 2.5 3 3.5 4 4.5 5
time
Probability of successful transmission
0.90.80.70.5
DDBS18, SL01 7/69 M. Böhlen
Data Processing Systems/1
Two seminal contributions:
I S. Ghemawat, H. Gobioff, S. Leung, The Google file system,SOSP 2003: A scalable, distributed, fault-tolerant file systemtailored for data-intensive applications, running on inexpensivecommodity hardware
I J. Dean, S. Ghemawat, MapReduce: Simplified Data Processingon Large Clusters, OSDI 2004: A simple programming model andan implementation for processing large data sets on computingclusters.
DDBS18, SL01 8/69 M. Böhlen
Data Processing Systems/2Google File System (GFS):I Master manages metadataI Files broken into chunks (typically 64MB)I Chunks are replicated across three machinery for fault-toleranceI Data transfer happens directly between clients and chunk servers.
DDBS18, SL01 9/69 M. Böhlen
Data Processing Systems/3MapReduce: hides details of distributed programmingI Computations are expressed with two functions: Map and Reduce.I The Map function produces a set of intermediate key/value pairs.I The MapReduce framework groups all key/value pairs with the same
key and passes them to the Reduce function.I The Reduce function receives an intermediate key with its set of
associated values and merges them.
source: Sherif Sakr, IfI summer school 2016
DDBS18, SL01 10/69 M. Böhlen
Data Processing Systems/4Hadoop (= HDFS + MapReduce + ...)I open-source SW stack for data-intensive distributed applicationsI processes very large amount of unstructured and complex data.I runs on a large number of machines that don’t share memory/disks.I clones the Google’s GFS and MapReduce framework.
source: Sherif Sakr, IfI summer school 2016
DDBS18, SL01 11/69 M. Böhlen
Data Processing Systems/5
source: Sherif Sakr, IfI summer school 2016
I Hive: SQL over Apache HadoopI Impala: open source, native analytic database for Apache HadoopI Tajo: distributed data warehouse system for Apache Hadoop.I Giraph: iterative graph processing system built for high scalabilityI Hama: analytics using the Bulk Synchronous Parallel (BSP) modelI Mahout: an environment for scalable machine learning applications
DDBS18, SL01 12/69 M. Böhlen
Distributed DBS versus MapReduce/1(MapReduce: A major step backwards [David DeWitt, 2008])
I MapReduce is a giant step backward in the programming paradigmfor large-scale data intensive applications becauseI Schemas are good.I Separation of the schema from the application is good.I High-level access languages are good.
I MapReduce is a sub-optimal implementation, in that it uses bruteforce instead of indexing.I MapReduce has no indexes and therefore has only brute force as a
processing option.I MapReduce systems deal poorly with skew.I Reducers that run multiple reduce tasks simultaneously induce many
disk seeks.
DDBS18, SL01 13/69 M. Böhlen
Distributed DBS versus MapReduce/2(MapReduce: A major step backwards [David DeWitt, 2008])
I MapReduce misses most of the features that are routinely includedin current DBMS.I Bulk loader to transform input data in files into a desired format and
load it into a DBMSI Indexing as noted aboveI Updates to change the data in the data baseI Transactions to support parallel update and recovery from failures
during updateI Integrity constraints to help keep garbage out of the data baseI Referential integrity to help keep garbage out of the data baseI Views so the schema can change without having to rewrite the
application program
DDBS18, SL01 14/69 M. Böhlen
Data Processing Systems/6
I Thus, MapReduce is not the ulimate cure to all your datamanagement problems.
I Todays data processing systems are domain-specific, optimized andvertically focused systems.
source: Sherif Sakr, IfI summer school 2016
DDBS18, SL01 15/69 M. Böhlen
Data Processing Systems/7
I Apache Spark is a fast, general engine for large scale dataprocessing on a computing cluster (new engine for Hadoop).
I Developed initially at UC Berkeley in Scala.I RDD (Resilient Distributed Dataset), an in-memory data
abstraction, is the fundamental unit of data in Spark.I Resilient: if data in memory is lost, it can be recreated from the
lineage (how it was computed).I Distributed: stored in memory across the cluster.I Dataset: data can come from a file or be created programmatically.I Spark programming consists of performing operations (e.g., Map,
Filter) on RDDs.
DDBS18, SL01 16/69 M. Böhlen
Data Processing Systems/8
I Apache Flink is a distributed in-memory data processing frameworkwhich represents an alternative to the MapReduce framework thatsupports both of batch and realtime processing.
I Flink has originated from the Stratosphere research project that wasstarted at the Technical University of Berlin.
I Instead of the map and reduce abstractions, Flink uses a directedgraph approach that leverages in-memory storage for improving theperformance of the runtime execution.
I True streaming capabilities: Execute everything as streamsI Native iterative executionI DataSet API, DataStream API, Table API
DDBS18, SL01 17/69 M. Böhlen
Data Processing Systems/9
I Pregel from Google is the first BSP (Bulk Synchronous Parallel)based implementation for graph processing
I Communication through message passing (usually sent alongoutgoing edges from each vertex) + Shared-Nothing graphprogramming.
I Vertex-centric computation, each vertex:I Receives messages sent in the previous superstepI Executes the same user-defined functionI Modifies its valueI If active, sends messages to other vertices (next superstep)I Votes to halt if it has no further work to do
DDBS18, SL01 18/69 M. Böhlen
Data Processing Systems/10
source: Sherif Sakr, IfI summer school 2016
DDBS18, SL01 19/69 M. Böhlen
Database Processing Systems/11
I Column Stores: MonetDB, Greenplum, C-Store (Vertica)I Non-relational Databases: BigTable, Cassandra, HBase, SpannerI Document/XML Stores: MongoDB, CouchDB, BaseX,I Key-value Stores: Berkeley DB, NoSQL, Tokyo CabinetI Graph Databases: Neo4J, InfiniteGraph, Virtuoso,I Object Databases: GemStone, db4o, ZODB,I Cluster Computing: SparkI Graph Processing: Pregel, Giraph, GraphLab, GraphXI Event Processing: Storm, Flink, S4
DDBS18, SL01 20/69 M. Böhlen
Database Courses
I Datenbankpraktikum,A. Geppert, Tuesday 16.15-18.00, FS
I Distributed Database Systems (DDBS),M. Böhlen, Monday 14.00-15.45, HS even years
I XML and Databases (TSDM),C. Türker, Thursday 8.00-9.45, FS
I Data Warehousing (DW),A. Geppert, Tuesday 16.15-18.00, FS even years
I Temporal and Spatial Data Management (TSDM),M. Böhlen, Monday 14.00-15.45, HS odd years
DDBS18, SL01 21/69 M. Böhlen
Course Project/1
I hear and forgetI learn and rememberI do and understand
I Goal: in-depth understanding of a selected part of the courseI Task: Implement a solution from a research paperI Outcome: Running implementation and 5-10 page report
DDBS18, SL01 22/69 M. Böhlen
Course Project/2
Choose one of the following papers:I O. Polychroniou, R. Sen, and K. A. Ross, Track Join: Distributed
Joins with Minimal Network Traffic, ACM SIGMOD InternationalConference on Management of Data, pp. 1483-1494, 2014.
I C. Mohan, B. Lindsay, and R. Obermarck, TransactionManagement in the R* Distributed Database ManagementSystem, ACM Transactions on Database Systems, pages 378-396,1986.
I L. Michael, W. Nejdl, O. Papapetrou, W. Siberski, Improvingdistributed join efficiency with extended bloom filteroperations, 21st International Conference on Advanced InformationNetworking and Applications (AINA 2007), pp. 187-194, 2007.
DDBS18, SL01 23/69 M. Böhlen
Course Project/3
I Course project shall be solved in group of 2 people.I 4th week, October 8:
I Presentation of problem in terms of a numeric exampleI Plan for implementation (components, SW, functionality)I Plan for evaluation
I 9th week, November 12:I Presentation of solution (problem definition and solution)I Strong and weak points of implementationI Demonstration of implementation
I 14th week, December 17:I hand in of project report and implementation (source code, data)I send pdf by email to Kevin Wellenzohn ([email protected])
DDBS18, SL01 24/69 M. Böhlen
Schedule
17.09 Introduction to Distributed Database Systems24.09 Distributed Database Design01.10 Distributed Database Design08.10 Course project: plan for implementation and example15.10 Distributed Database Design22.10 Semantic Data Control29.10 Query Processing05.11 Distributed Query Optimization12.11 Course project: demonstration of implementation19.11 Distributed Query Optimization26.11 Distributed Transactions and Concurrency03.12 Distributed Transactions and Concurrency10.12 Distributed Transactions and Concurrency17.12 Conclusion
DDBS18, SL01 25/69 M. Böhlen
Logistics
I There is an oral exam at the end of the course.I The exam takes place January 7, 2019 starting at 14:00.I The oral exam lasts half an hour in total.I There will be questions related to your project (implementation and
understanding).I There will be a question about a selected topic (drawn randomly)
from the course.I It is important that you illustrate the answer to your question
through an appropriately chosen example.I Grade: 50% project, 50% course
DDBS18, SL01 26/69 M. Böhlen
Data Independence/1
I In the old days, programs stored data in regular filesI Each program has to maintain its own data
I huge overheadI error-prone
DDBS18, SL01 27/69 M. Böhlen
Data Independence/2
I The development of DBMSs helped to fully achieve dataindependence.
I Provide centralized and controlled data maintenance and access.I Application is immune to physical and logical file organization.
DDBS18, SL01 28/69 M. Böhlen
Data Independence/3
I Distributed database system is the union of what appear to be twodiametrically opposed approaches to data processing: databasesystems and computer networkI Computer networks promote a mode of work that goes against
centralizationI Key issues to understand this combination
I The most important objective of DB technology is integration notcentralization
I Integration is possible without centralization, i.e., integration ofdatabases and networking does not mean centralization
I Goal of distributed database systems: achieve data integration anddata distribution transparency
DDBS18, SL01 29/69 M. Böhlen
Distributed Computing
I A distributed computing system is a collection of autonomousprocessing elements that are interconnected by a computer network.
I The elements cooperate in order to perform the assigned task.I The term “distributed” is used very broadly. The exact meaning of
the word depends on the context. What can be distributed?I Processing logicI FunctionsI DataI Control
I Classification of distributed systems with respect to various criteriaI Degree of coupling, i.e., how closely the processing elements are
connected; e.g., measured as ratio of amount of data exchanged toamount of local processing; weak coupling, strong coupling
I Interconnection structure; e.g., point-to-point connection betweenprocessing elements, common interconnection channel
I Synchronization; synchronous, asynchronous
DDBS18, SL01 30/69 M. Böhlen
Definition of DDB and DDBMS/1
I A distributed database (DDB) is a collection of multiple, logicallyinterrelated databases distributed over a computer network
I A distributed database management system (DDBMS) is thesoftware that manages the DDB and provides an access mechanismthat makes this distribution transparent to the users
I DDBS (distributed database system) = DDB + DDBMSI Assumptions
I Data stored at a number of sites each site logically consists of asingle processor
I Processors at different sites are interconnected by a computer network(we do not consider multiprocessors in DDBS, cf. parallel systems)
I DDB is a database, not a collection of files. Data is logically related;structured into multiple files; accessed through common interface.
I DDBMS is a collection of DBMSs.
DDBS18, SL01 31/69 M. Böhlen
Definition of DDB and DDBMS/2
DDBS18, SL01 32/69 M. Böhlen
Definition of DDB and DDBMS/3I Example: Database consists of 3 relations employees,
projects, and assignment which are partitioned and stored atdifferent sites (fragmentation).
I We study the problems with queries, transactions, concurrency, andreliability.
DDBS18, SL01 33/69 M. Böhlen
What is not a DDBS?
I The following systems are different from (though related to)distributed DB systems
Shared Memory Shared Disk
Shared Nothing Central Databases
DDBS18, SL01 34/69 M. Böhlen
Promises of DDBSs
Distributed Database Systems provide the following advantages:I Transparency of distributed and replicated dataI Higher reliabilityI Improved performanceI Easier system expansion
DDBS18, SL01 35/69 M. Böhlen
Promises of DDBSs – Transparency/1 01.5
TransparencyI Refers to the separation of the higher-level semantics of the system
from the lower-level implementation issuesI A transparent system “hides” implementation details from users.I A fully transparent DBMS provides high-level support for the
development of complex applications.
(a) User wants to see one databa-se
(b) Programmer sees many databases
DDBS18, SL01 36/69 M. Böhlen
Promises of DDBSs – Transparency/2
I The goal of transparency is to provide a system where users do nothave to worry about representation, fragmentation, location, andreplication of data.
I Various forms of transparency can be distinguished for DDBSs:I Network transparency (also called distribution transparency)
I Location transparencyI Naming transparency
I Replication transparencyI Fragmentation transparency
I Horizontal fragmentationI Vertical fragmentation
DDBS18, SL01 37/69 M. Böhlen
Promises of DDBSs – Transparency/3 01.1
I Network (or distribution) transparency allows a user to perceivea DDBS as a single, logical entity
I The user is protected from the operational details of the network (oreven does not know about the existence of the network)
I Two types of network transparency can be identified: locationtransparency and naming transparency.
I The user does not need to know the location of data items and acommand used to perform a task is independent from the location ofthe data and the site the task is performed (location transparency)
I A unique name is provided for each object in the database (namingtransparency)I In absence of this, users are required to embed the location name as
part of an identifier
DDBS18, SL01 38/69 M. Böhlen
Promises of DDBSs – Transparency/4 01.2
Common (suboptimal) ways to get naming transparency at theapplication level:I Solution 1: Create a central name server; however, this results in
I loss of some local autonomyI central site may become a bottleneckI low availability (if the central site fails remaining sites cannot create
new objects)
I Solution 2: Prefix object with identifier of site that created itI e.g., branch created at site S1 might be named S1.BRANCHI Also need to identify each fragment and its copiesI e.g., copy 2 of fragment 3 of Branch created at site S1 might be
referred to as S1.BRANCH.F3.C2I Difficult for userI Relocation of objects becomes difficult
DDBS18, SL01 39/69 M. Böhlen
Promises of DDBSs – Transparency/5
I Replication transparency ensures that the user is not involved inthe managment of copies of some data
I The user should even not be aware about the existence of replicas,rather should work as if there exists a single copy of the data
I Replication of data is needed for various reasonsI e.g., increased efficiency for read-only data access
DDBS18, SL01 40/69 M. Böhlen
Promises of DDBSs – Transparency/6 01.4
I Fragmentation transparency ensures that the user is not aware ofand is not involved in the fragmentation of the data
I The user is not involved in finding query processing strategies overfragments or formulating queries over fragmentsI The evaluation of a query that is specified over an entire relation but
now has to be performed on top of the fragments requires anappropriate query evaluation strategy
I Fragmentation is commonly done for reasons of performance,availability, and reliability
I Two fragmentation alternativesI Horizontal fragmentation: divide a relation into a subsets of tuplesI Vertical fragmentation: divide a relation by columns
DDBS18, SL01 41/69 M. Böhlen
Promises of DDBSs – Reliability
Higher reliabilityI Replication of components and data should make DDBMS more
reliable.I No single points of failureI e.g., a broken communication link or processing element does not
bring down the entire systemI Distributed transaction processing guarantees the consistency of the
database and concurrency
DDBS18, SL01 42/69 M. Böhlen
Promises of DDBSs – Performance
Improved performanceI Proximity of data to its points of use
I Reduces remote access delaysI Requires some support for fragmentation and replication
I Parallelism in executionI Inter-query parallelismI Intra-query parallelism
I Update and read-only queries influence the design of DDBSssubstantiallyI If mostly read-only access is required, as much as possible of the data
should be replicatedI Writing becomes more complicated with replicated data
DDBS18, SL01 43/69 M. Böhlen
Promises of DDBSs – Extendability
Easier system expansionI Issue is database scalingI Emergence of microprocessor and workstation technologies
I Network of workstations much cheaper than a single mainframecomputer.
I Automatic data sharing has become cheap (coordination involvinghumans is expensive).
I Increasing database size (a central system might not even befeasible).
DDBS18, SL01 44/69 M. Böhlen
Technical Problems to be Studied/1
I Distributed database designI How to fragment the data?I Partitioned data vs. replicated data?
I Distributed query processingI Design algorithms that analyze queries and convert them into a series
of data manipulation operationsI Distribution of data, communication costs, etc. has to be consideredI Find optimal query plans
DDBS18, SL01 45/69 M. Böhlen
Technical Problems to be Studied/2 01.3
I Distributed concurrency controlI Synchronization of concurrent accesses such that the integrity of the
DB is maintainedI Integrity of multiple copies of (parts of) the DB have to be
considered (mutual consistency)
I ReliabilityI How to make the system resilient to failuresI Atomicity and durability
I Many techniques and solutions for DDBMSs build on standardDBMS solutions and extend these to distributed settings.
DDBS18, SL01 46/69 M. Böhlen
Architecture
I Architecture: The architecture of a system defines its structure:I the components of the system are identified;I the function of each component is specified;I the interrelationships and interactions among the components are
defined.I Applies both for computer systems as well as for software systems,
e.g,I division into modules, description of modules, etc.I architecture of a computer
I There is a close relationship between the architecture of a system,standardisation efforts, and a reference model.
DDBS18, SL01 47/69 M. Böhlen
ANSI/SPARC Architecture of DBMSI ANSI/SPARC architecture is based on dataI 3 views of data: external view, conceptual view, internal viewI Defines a total of 43 interfaces
DDBS18, SL01 48/69 M. Böhlen
Example/1I Conceptual schema: Provides enterprise view of entire database
RELATION EMP [KEY = {ENO}ATTRIBUTES = {
ENO : CHARACTER(9)ENAME: CHARACTER(15)TITLE: CHARACTER(10)
}]
RELATION PAY [KEY = {TITLE}ATTRIBUTES = {
TITLE: CHARACTER(10)SAL : NUMERIC(6)
}]
RELATION PROJ [KEY = {PNO}ATTRIBUTES = {
PNO : CHARACTER(7)PNAME : CHARACTER(20)BUDGET: NUMERIC(7)LOC : CHARACTER(15)
}]
RELATION ASG [KEY = {ENO,PNO}ATTRIBUTES = {
ENO : CHARACTER(9)PNO : CHARACTER(7)RESP: CHARACTER(10)DUR : NUMERIC(3)
}]
DDBS18, SL01 49/69 M. Böhlen
Example/2
I Internal schema: Describes the storage details of the relations.I Relation EMP is stored on an indexed fileI Index is defined on the key attribute ENO and is called EMINXI A HEADER field is used that might contain flags (delete, update,
etc.)
INTERNAL_REL EMPL [INDEX ON E# CALL EMINXFIELD = {HEADER: BYTE(1)E# : BYTE(9)ENAME : BYTE(15)TIT : BYTE(10)
}]
Conceptual schema:RELATION EMP [KEY = {ENO}ATTRIBUTES = {ENO : CHARACTER(9)ENAME: CHARACTER(15)TITLE: CHARACTER(10)
}]
DDBS18, SL01 50/69 M. Böhlen
Example/3
I External view: Specifies the view of different users/applications
I Application 1: Calculates the payroll payments for engineers
CREATE VIEW PAYROLL (ENO, ENAME, SAL) ASSELECT EMP.ENO,EMP.ENAME,PAY.SALFROM EMP, PAYWHERE EMP.TITLE = PAY.TITLE
I Application 2: Produces a report on the budget of each project
CREATE VIEW BUDGET(PNAME, BUD) ASSELECT PNAME, BUDGETFROM PROJ
DDBS18, SL01 51/69 M. Böhlen
Architectural Models for DDBSs/1I Architectural Models for DDBSs (or more generally for multiple
DBMSs) can be classified along three dimensions:I AutonomyI DistributionI Heterogeneity
DDBS18, SL01 52/69 M. Böhlen
Architectural Models for DDBSs/2
I Autonomy: Refers to the distribution of control (not of data) andindicates the degree to which individual DBMSs can operateindependently.I Tight integration: a single-image of the entire database is available to
any user who wants to share the information (which may reside inmultiple DBs); realized such that one data manager is in control ofthe processing of each user request.
I Semiautonomous systems: individual DBMSs can operateindependently, but have decided to participate in a federation tomake some of their local data sharable.
I Total isolation: the individual systems are stand-alone DBMSs, whichknow neither of the existence of other DBMSs nor how to comunicatewith them; there is no global control.
DDBS18, SL01 53/69 M. Böhlen
Architectural Models for DDBSs/3
I Autonomy has different dimensionsI Design autonomy: each individual DBMS is free to use the data
models and transaction management techniques that it prefers.I Communication autonomy: each individual DBMS is free to decide
what information to provide to the other DBMSsI Execution autonomy: each individual DBMS can execture the
transactions that are submitted to it in any way that it wants to.
DDBS18, SL01 54/69 M. Böhlen
Architectural Models for DDBSs/4
I Distribution: Refers to the physical distribution of data overmultiple sites.I No distribution: No distribution of data at allI Client/Server distribution:
I Data are concentrated on the server, while clients provide applicationenvironment/user interface
I First attempt to distributionI Peer-to-peer distribution (also called full distribution):
I No distinction between client and server machineI Each machine has full DBMS functionality
DDBS18, SL01 55/69 M. Böhlen
Architectural Models for DDBSs/5 01.6
I Heterogeneity: Refers to heterogeneity of the components atvarious levelsI hardwareI communicationsI operating systemI DB components (e.g., data model, query language, transaction
management algorithms)
DDBS18, SL01 56/69 M. Böhlen
Client-Server Architecture/1
I General idea is to divide the functionality of the database systeminto two classes:I server functions: mainly data management, including query
processing, optimization, transaction management, etc.I client functions: might also include some data management functions
(consistency checking, transaction management, etc.) not just userinterface
I Client and server refer to actual machines (i.e., different machineswhere the SW runs).
I Leads to a more efficient division of the work.I Yields a two-tier architecture (a further distribution of the
functionality leads to n-tier architectures with application servers).I Different types of client/server architecture
I Multiple client/single serverI Multiple client/multiple server
DDBS18, SL01 57/69 M. Böhlen
Client-Server Architecture/2
I One server, many clientsI Similar to centralized DBMSsI differences are execution of
transactions and cache management.
DDBS18, SL01 58/69 M. Böhlen
Client-Server Architecture/3I Database server, application serversI General application server; heavier clients
DDBS18, SL01 59/69 M. Böhlen
Client-Server Architecture/401.7
I Many database and application serversI Specialized application servers; lighter clients
DDBS18, SL01 60/69 M. Böhlen
Peer-to-Peer Architecture/1I Data organizational view (architecture with equal peers).I Provides transparency since it is an extension of the ANSI/SPARC
architecture.I Local internal schema (LIS)
I Describes the local physical data organization (which often is differenton each machine)
I Local conceptual schema (LCS)I Describes logical data organiza-
tion at each siteI Required since the data are
fragmented and replicatedI Global conceptual schema (GCS)
I Describes the global logical viewof the data
I Union of the LCSsI External schema (ES)
I Describes the user/application view of the dataDDBS18, SL01 61/69 M. Böhlen
Peer-to-Peer Architecture/2
I Detailed components of adistributed peer-to-peerdatabase systems.
I User and data processor arepresent on each machine (this isorganizational separation; notfunctional separation as forclient/server)
I User processor handlesinteractions withusers/applications.
I Data processor handles storageand data processing.
DDBS18, SL01 62/69 M. Böhlen
Multi-DBMS Architecture/1
I Individual DBMSs are fully autonomous and have no concept ofcooperation.
I “Data integration system” is an often used term for such systems.I Fundamental difference to peer-to-peer DBMS is in the definition of
the global conceptual schema (GCS)I In a MDBMS the GCS represents only the collection of some of the
local databases that each local DBMS want to share.I The GCS is a (proper) subset of the union of the LCSs (no complete
view exists)I This leads to the question, whether the GCS should even exist in a
MDBMS.I Two different architecutre models:
I Models with a GCSI Models without GCS
DDBS18, SL01 63/69 M. Böhlen
Multi-DBMS Architecture/2I Model with a GCS
I GCS is the union of parts of the LCSsI Local DBMS define their own views on the local DB
DDBS18, SL01 64/69 M. Böhlen
Multi-DBMS Architecture/3I Model without a GCS
I The local DBMSs present to the multi-database layer the part of theirlocal DB they are willing to share.
I External views are defined on top of LCSs
DDBS18, SL01 65/69 M. Böhlen
Multi-DBMS Architecture/4I Full-fledged DBMSs, each managing a different database.I SW layer on top that allows to access various databases.I For individual DBMSs the SW layer is simply yet another application.I Implemented through mediator/wrapper technology.
DDBS18, SL01 66/69 M. Böhlen
Conclusion/1
I A distributed database (DDB) is a collection of logically interrelateddatabases distributed over a computer network
I Data is stored at a number of sites, the sites are connected by anetwork.
I DDBS supports the relational model. DDBS is not a remote filesystem.
I Transparent system ‘hides’ implementation details from usersI Distribution/network transparencyI Replication transparencyI Fragmentation transparency
I Higher reliability; improved performance; easier system expansionI Programming a distributed database involves:
I Distributed database designI Distributed query processingI Distributed concurrency control
DDBS18, SL01 67/69 M. Böhlen
Conclusion/2
I Architecture defines the structure of the system. There are differentways to define the architecture: e.g., based on components or data
I DDBS might be based on identical components (homogeneoussystems) or different components (heterogeneous systems)
I ANSI/SPARC architecture defines external, conceptual, and internalschemas
I There are three orthogonal implementation dimensions for DDBS:level of distribution, autonomity, and heterogeinity
I Different architectures are discussed:I Client-Server SystemsI Peer-to-Peer SystemsI Multi-DBMS
DDBS18, SL01 68/69 M. Böhlen
Course project, October 10
I October 10, 14:00 - end:I no lectureI group project presentations of about 20 minutes in my office (BIN
2.E.13)I Structure of presentations:
I Presentation of the problem with a numeric exampleI Plan for implementation (components, HW, SW, functionality)I (plan for evaluation)
I October 3: Send email to [email protected] with the followinginformation: group members, emails, possible constraints for theOctober 10 meeting
I A plan with the exact slots for the presentations will be distributed(web page).
DDBS18, SL01 69/69 M. Böhlen