CS346: Advanced Databases
Graham Cormode G.Cormode@warwick.ac.uk
Distributed Databases, BASE, CAP & NoSQL
Outline
Chapter: “Distributed Databases” in Elmasri and Navathe
¨ What are distributed databases?
¨ Architectural choices
¨ ACID vs BASE
¨ Consistency, Availability, Partition tolerance: CAP
¨ NoSQL systems
Why?
¨ As data gets larger, must move to distributed data management
¨ Tech companies (Google, Facebook etc.) rely on distributed data
Distributed Databases
¨ When data gets large and processing is slow, use distribution
  – A distributed database (DDB) managed by a distributed DBMS
  – Goal: split the processing into smaller pieces and spread them
¨ DDB technology combines databases with OS/Networks
  – Manage concurrent access to replicated data
¨ DDB is quite different to e.g. the world-wide web
  – Similarities: many machines, distributed around the world
  – Different: each website is (mostly) independent of others
      Facebook and YouTube are managed independently
  – However: many large websites use DDB technology
      Facebook can be seen as a massive distributed database
Distributed Databases: Pros and Cons
¨ DDB can be (in principle) more available
  – If one machine fails, others can take over
¨ DDB can (in principle) be faster
  – Parallelize computation, combine results
¨ DDB is (in principle) easier to expand
  – Just add more machines/storage
¨ “In principle” isn’t always the case
  – DDB is more complicated to manage
  – Performance/availability may worsen in unpredictable ways
Additional functionality of DDB
The DDB has additional or expanded roles to perform:
¨ Keeping track of data distribution: where’s my data?
¨ Distributed query processing: break up a query into pieces
¨ Distributed transaction management: data items are distributed
¨ Replicated data management: keep distributed copies of the data
¨ Distributed database recovery: manage machine failures
¨ Security: manage security of distributed data
¨ Distributed catalog management: keep the metadata
We saw some of these issues in Hadoop/MapReduce
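As an illustration of distributed query processing, the sketch below breaks an average-salary query into per-site pieces and combines the partial results at a coordinator. The site names, the sample rows and the functions `local_partial` and `average_salary` are all hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments of an EMPLOYEE relation, one list of rows per site.
SITES = {
    "site_a": [{"name": "Ann", "dno": 1, "salary": 30000},
               {"name": "Bob", "dno": 1, "salary": 40000}],
    "site_b": [{"name": "Cat", "dno": 2, "salary": 50000}],
    "site_c": [{"name": "Dan", "dno": 3, "salary": 20000},
               {"name": "Eve", "dno": 3, "salary": 60000}],
}


def local_partial(rows):
    """Runs at one site: return (sum, count) for the local fragment only."""
    return sum(r["salary"] for r in rows), len(rows)


def average_salary(sites):
    """Coordinator: send the sub-query to every site, then combine results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_partial, sites.values()))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(average_salary(SITES))  # 200000 / 5 = 40000.0
```

Each site returns only a (sum, count) pair, so the coordinator never needs to see the individual rows.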
Distributed Architectures
¨ Many possible levels of sharing:
  – Shared memory: multiple processors (cores) share disk, memory
  – Shared disk: multiple cores share disk, but have separate memory
  – Shared nothing: no common storage, communicate over network
¨ ‘Shared nothing’ is the model for large distributed systems
  – Hadoop follows a shared nothing architecture
¨ Shared nothing pros and cons:
  – Can be slower: network is slower than local disk (is it? fibre is fast)
  – Easy to expand: add more machines to the network
  – Allows fragmentation (sharding): breaking the database into pieces
Fragmentation and Replication
¨ How to split the data up among sites?
  – Horizontal fragmentation: subset of tuples on each machine
      E.g. break up the EMPLOYEE relation by Dno
  – Vertical fragmentation: different columns on each machine
      Name, Bdate, Address on one, Ssn, Salary, Dno on another
  – Mixed: break up by both horizontal and vertical
¨ How to replicate data around the system?
  – No replication: a unique copy exists
  – Fully replicated: data is copied everywhere
  – Partial replication: in between these two extremes
      E.g. HDFS, default number of replicas is 3
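The fragmentation and replication choices above can be sketched on a toy EMPLOYEE relation. The sample rows, site names and function names are hypothetical. Two assumptions go beyond the slide: the key Ssn is repeated in every vertical fragment so that fragments can be rejoined, and replicas are placed round-robin (real systems such as HDFS use their own placement policies).

```python
# Hypothetical sample rows of the EMPLOYEE relation.
EMPLOYEE = [
    {"ssn": "111", "name": "Ann", "bdate": "1980-01-01",
     "address": "1 High St", "salary": 30000, "dno": 1},
    {"ssn": "222", "name": "Bob", "bdate": "1975-05-05",
     "address": "2 Low Rd", "salary": 40000, "dno": 1},
    {"ssn": "333", "name": "Cat", "bdate": "1990-09-09",
     "address": "3 Mid Ave", "salary": 50000, "dno": 2},
]


def horizontal_fragments(rows, column):
    """Horizontal fragmentation: group whole tuples by a column value."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragment(rows, columns, key="ssn"):
    """Vertical fragmentation: keep some columns, plus the key (assumption)."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in rows]


def place_replicas(fragment_ids, sites, replicas=3):
    """Partial replication: put each fragment on `replicas` distinct sites."""
    if replicas > len(sites):
        raise ValueError("need at least as many sites as replicas")
    placement = {}
    for i, frag in enumerate(sorted(fragment_ids)):
        placement[frag] = [sites[(i + j) % len(sites)] for j in range(replicas)]
    return placement


by_dno = horizontal_fragments(EMPLOYEE, "dno")
personal = vertical_fragment(EMPLOYEE, ["name", "bdate", "address"])
payroll = vertical_fragment(EMPLOYEE, ["salary", "dno"])
print(place_replicas(by_dno.keys(), ["s1", "s2", "s3", "s4"]))
```

Setting `replicas=1` gives no replication, and `replicas=len(sites)` gives full replication.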
ACID vs BASE systems
¨ Recall the ACID properties of transactions
  – Atomicity, Consistency, Isolation, Durability
¨ Not every system requires this level of guarantee
  – Can trade off guarantees for performance
¨ “BASE”: Basically Available, Soft-State, Eventually Consistent (coined by Eric Brewer, co-founder of Inktomi, 2000)
  – A weaker set of requirements
  – Drop consistency and isolation to improve availability, performance
  – Suits distributed settings without much competition for resources
¨ ACID vs BASE is a spectrum of possible design points
  – “Real internet systems are a mixture of ACID and BASE subsystems”
CAP concepts
¨ Consistency: all processes/transactions see the same data
  – Equivalent to having a single, up-to-date copy of the data
  – Not easy to provide, hence much effort on concurrency
¨ Availability: is the system up and responsive to requests?
  – All processes can find some version of the data they need
  – Formally: does every request receive a response (allowing failures)
¨ Partition-tolerance: what happens when the network breaks?
  – Network partition: something breaks and the network divides
      E.g. a router fails/crashes: messages can’t traverse the router
  – Does the system still operate even if messages are lost?
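To make the three concepts concrete, here is a toy model of two replicas joined by a link that can be cut. The class `TwoSiteStore` and its "CP"/"AP" modes are hypothetical names for this sketch. It is not a real protocol, only a way to see what each choice gives up during a partition.

```python
class Replica:
    """One copy of a single data item."""

    def __init__(self):
        self.value = None


class TwoSiteStore:
    """Two replicas joined by a link that can be cut (a network partition)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, site, value):
        if not self.partitioned:
            # Link is up: update both copies, so they stay identical.
            for replica in self.replicas:
                replica.value = value
            return
        if self.mode == "CP":
            # Refuse rather than let the copies diverge: availability is lost.
            raise RuntimeError("unavailable during partition")
        # AP: accept the write locally; the other copy is now stale.
        self.replicas[site].value = value

    def read(self, site):
        return self.replicas[site].value


ap = TwoSiteStore("AP")
ap.write(0, "x")
ap.partitioned = True
ap.write(0, "y")
print(ap.read(0), ap.read(1))  # the two sites now disagree: y x
```

In "CP" mode the same write during the partition raises an error instead, so both sites keep returning the last agreed value.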
Points of Comparison
¨ Consistency: strong (ACID) or weak consistency (BASE)?
  – Weak: processes can see operations in different orders
  – Weak: synchronization points bring processes into agreement
¨ Eventual consistency: system eventually reaches a consistent state
  – If no updates are made to an item, then reads will give same value
¨ Compared to ACID, the BASE approach is:
  – More focused on availability of resources
  – Tolerates approximate answers rather than exact
  – More aggressive (optimistic concurrency control)
  – Aims to be simpler, faster
  – Provides ‘best effort’ rather than guarantees
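Eventual consistency can be sketched with two replicas that accept writes independently and later reconcile. The reconciliation rule used here, last-writer-wins on a caller-supplied timestamp, is one common choice and an assumption of this sketch, not something the slides prescribe. The names `Replica` and `synchronize` are hypothetical.

```python
class Replica:
    """A replica that stores, per key, the newest (timestamp, value) seen."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Assumption: timestamps are unique, so "newest" is well defined.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def synchronize(a, b):
    """Synchronization point: exchange entries both ways, newest wins."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("x", "old", timestamp=1)
r2.write("x", "new", timestamp=2)
print(r1.read("x"), r2.read("x"))  # replicas disagree: old new
synchronize(r1, r2)
print(r1.read("x"), r2.read("x"))  # after syncing, both agree: new new
```

Between synchronization points, reads at different replicas may return different values; once updates stop and the replicas have synchronized, every read returns the same value.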
The CAP Conjecture / Theorem
¨ Brewer made a famous “CAP conjecture” in 2000
  – Consistency, Availability, Partition Tolerance: pick any two
  – I.e. it is impractical to build a distributed system with all three
¨ Lynch and Gilbert “proved” a CAP theorem in 2002
  – For a specific set of distributed scenarios
¨ An example of a ‘pick two’ (from three) choice
  – For university: Good grades, enough sleep or a social life
  – For products: fast, good or cheap
Consequences of CAP Theorem
Obtain different results from different choices:
¨ Forfeit partition tolerance (obtain consistency and availability)
  – E.g. traditional centralized DBMS
¨ Forfeit availability (obtain partition tolerance and consistency)
  – E.g. distributed databases, protocols based on majority agreement
¨ Forfeit consistency (obtain partition tolerance and availability)
  – E.g. emerging NoSQL systems
¨ These concepts cut across many aspects of computer science:
  – The OS and network provide availability, but no consistency
  – Databases are better at consistency than availability
  – Distributed databases want both
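The "forfeit availability" option can be illustrated with a simple majority-agreement scheme: a request succeeds only if more than half of the replicas are reachable, so any two successful operations overlap in at least one replica. This is a simplified sketch with hypothetical names (`QuorumStore`, `up`), a single client, and version numbers standing in for a real agreement protocol.

```python
class QuorumStore:
    """N replicas; reads and writes need a majority of them to be reachable."""

    def __init__(self, n=3):
        self.n = n
        self.quorum = n // 2 + 1
        self.replicas = [{} for _ in range(n)]  # key -> (version, value)
        self.up = [True] * n                    # which replicas are reachable

    def _reachable(self):
        live = [i for i in range(self.n) if self.up[i]]
        if len(live) < self.quorum:
            # Consistency is kept by refusing the request (no availability).
            raise RuntimeError("no majority reachable: request refused")
        return live

    def _latest(self, key, live):
        entries = [self.replicas[i].get(key, (0, None)) for i in live]
        return max(entries, key=lambda entry: entry[0])

    def read(self, key):
        live = self._reachable()
        return self._latest(key, live)[1]

    def write(self, key, value):
        live = self._reachable()
        version = self._latest(key, live)[0] + 1
        for i in live:
            self.replicas[i][key] = (version, value)


store = QuorumStore(3)
store.write("x", 1)
store.up[2] = False      # replica 2 is cut off
store.write("x", 2)      # still succeeds: 2 of 3 replicas reachable
store.up[2] = True
store.up[0] = False      # a different replica is cut off
print(store.read("x"))   # the overlap replica holds the newest version: 2
```

With two of three replicas unreachable, both `read` and `write` raise an error: the system stays consistent but is not available.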
NoSQL systems
¨ NoSQL systems drop support for the full relational model
  – Do not provide the same level of reliability/availability
  – Do not necessarily support rich languages like SQL
  – Aim to have simpler design, better scaling via distribution
¨ Often support analysis via query language or MapReduce on top
  – Systems primarily support data storage and retrieval
CS910 Foundations of Data Analytics13
Types of NoSQL systems
¨ Key-value store: stores and retrieves data in the form (key, value)
  – E.g. store demographic data (values) for each user (by key)
  – Data is distributed, and replicated for resilience, e.g. Memcached
¨ Column store: stores data organized by column (instead of row)
  – Allows faster access to particular entries when data is sparse
  – Implemented in HBase (database component of Hadoop system)
¨ Document store: to store and retrieve document data
  – E.g. to store information for very large websites (Amazon, eBay)
  – Each “document” can be an arbitrary collection of information
  – Examples include MongoDB (Apache Cassandra is often listed here too, but is usually classed as a column-family store)
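A key-value store that distributes and replicates its data can be sketched in a few lines. Everything here is hypothetical and simplified: the class `ShardedKV`, hashing the key to pick a first node, and copying to the next nodes in order. It does not describe how Memcached or any other named system is implemented.

```python
import hashlib


class ShardedKV:
    """Toy key-value store: keys are hashed to a node and copied to neighbours."""

    def __init__(self, num_nodes=4, replicas=2):
        if replicas > num_nodes:
            raise ValueError("cannot have more replicas than nodes")
        self.nodes = [{} for _ in range(num_nodes)]  # one dict per machine
        self.replicas = replicas

    def _owners(self, key):
        # A stable hash, so the same key always maps to the same nodes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + j) % len(self.nodes) for j in range(self.replicas)]

    def put(self, key, value):
        for i in self._owners(key):
            self.nodes[i][key] = value

    def get(self, key, failed=()):
        """Return the value from the first owner that has not failed."""
        for i in self._owners(key):
            if i not in failed and key in self.nodes[i]:
                return self.nodes[i][key]
        raise KeyError(key)


kv = ShardedKV()
kv.put("user:42", {"age": 31, "country": "UK"})
primary = kv._owners("user:42")[0]
print(kv.get("user:42", failed={primary}))  # served by the second copy
```

The interface is only `put` and `get` by key, which is what makes this kind of store easy to spread over many machines.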
NoSQL systems: pros and cons
¨ NoSQL systems are highly popular at the moment
  – Scale to truly massive amounts of data
  – Allow analytics on top via MapReduce/Hadoop
  – Can be very fast to retrieve data
¨ But they also have limitations
  – Systems still under development, hard to make use of
  – Some quite primitive: just provide data storage/retrieval
  – Currently have to write and debug code to implement applications
  – Can be overkill when your data is not massive
Summary
¨ Motivations for Distributed Databases
¨ Architectural choices for distributed databases
  – What is shared? How much replication?
¨ ACID/BASE (Basically Available, Soft-State, Eventually Consistent)
¨ Consistency, Availability, Partition tolerance: CAP
  – Pick any two
¨ NoSQL systems: key-value, column, document store
Recommended reading: Brewer’s PODC’00 Keynote
Chapter: “Distributed Databases” in Elmasri and Navathe