A presentation showing some NoSQL databases and Apache Cassandra in detail
Databases
Eduard Tudenhöfner
Overview
● Why NoSQL?
● Classification
● CAP Theorem
● BASE vs ACID
● Cassandra in Action
● Summary
Why NoSQL?
● original intention: modern web-scale DBs
  ○ amount of data drastically increased
  ○ data in the web is less structured
● higher requirements regarding performance
● some problems are easier to solve without the relational approach
● scaling out & running on commodity HW is much cheaper than scaling up
Typical Characteristics
● non-relational
● horizontally scalable
● flexible schema
● easy replication support
● simple API
● eventually consistent -> BASE principle
Classification
source: http://blog.octo.com/wp-content/uploads/2012/07/QuadrantNoSQL.png
Classification
source: http://www.sics.se/~amir/files/download/dic/NoSQL%20Databases.pdf
Key/Value Stores
● data model: collection of key/value pairs
● keys and values can be complex compounds
● based on Amazon’s Dynamo Paper
● designed to handle massive load
Key/Value Stores
● no complex query filters
● all joins must be in the code
● easy to distribute across cluster
● very predictable performance -> O(1)
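As a minimal sketch of the two points above (constant-time lookups, joins done in application code), the following uses plain Python dicts as a stand-in for a key/value store. The key names and the `user_with_city` helper are hypothetical and only serve as illustration.

```python
# Hypothetical in-memory key/value "stores": a dict stands in for the cluster.
users = {"user:1": {"name": "Ada", "city_id": "city:7"}}
cities = {"city:7": {"name": "Vienna"}}


def get(store, key):
    """Single-key lookup; average O(1) for a hash-based store."""
    return store.get(key)


def user_with_city(user_key):
    """A 'join' done in application code: two independent key lookups."""
    user = get(users, user_key)
    if user is None:
        return None
    city = get(cities, user["city_id"])
    return {"name": user["name"], "city": city["name"] if city else None}


result = user_with_city("user:1")
```

The store itself never combines the two collections; the application issues one lookup per key and merges the results.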
Wide Column Stores
● Tables are similar to RDBMS, but semi-structured
● based on Google’s BigTable
● Rows can have arbitrary columns
Wide Column Stores -> BigTable
● <RowKey, ColumnKey, Timestamp> triple as key for lookups, inserts, deletes
● ColumnKey uses syntax family:qualifier
● arbitrary columns on a row-by-row basis
● does not support a relational model
  ○ no table-wide integrity constraints
  ○ no multi-row transactions
source: http://research.google.com/archive/bigtable.html
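The <RowKey, ColumnKey, Timestamp> addressing can be sketched as a nested map. This is a simplified in-memory illustration, not BigTable's actual storage format; the class name `WideColumnTable` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy table addressed by (row key, 'family:qualifier', timestamp)."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column_key, timestamp, value):
        self._rows[row_key][column_key].append((timestamp, value))

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row_key, {}).get(column_key, [])
        candidates = [
            (ts, value) for ts, value in versions
            if timestamp is None or ts <= timestamp
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda tv: tv[0])[1]


table = WideColumnTable()
table.put("com.example", "anchor:home", 1, "Home")
table.put("com.example", "anchor:home", 2, "Start")
# A different row may carry completely different columns.
table.put("com.other", "contents:html", 1, "<html>")
```

Each row only stores the columns that were actually written to it, which is what "arbitrary columns on a row-by-row basis" means in practice.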
Document Stores
● inspired by Lotus Notes
● central concept of a Document
● Documents encapsulate/encode data in some format/encoding
● Encodings:
  ○ XML, YAML, JSON, BSON, PDF
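A minimal sketch of the document idea, assuming JSON as the encoding and a dict as the collection. The `insert` and `find` helpers are hypothetical and do not correspond to any particular document store's API.

```python
import json

collection = {}  # document id -> encoded document (JSON text)


def insert(doc_id, document):
    """Encode the document and store it under its id."""
    collection[doc_id] = json.dumps(document)


def find(field, value):
    """Return the ids of documents whose top-level `field` equals `value`."""
    return sorted(
        doc_id
        for doc_id, raw in collection.items()
        if json.loads(raw).get(field) == value
    )


# Documents in one collection do not need to share the same fields.
insert("p1", {"type": "post", "title": "Why NoSQL?", "tags": ["nosql", "cap"]})
insert("p2", {"type": "post", "title": "BASE vs ACID"})
insert("u1", {"type": "user", "name": "alice"})
```

Unlike a plain key/value store, the store can look inside the encoded value, which is what makes queries on document fields possible.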
Document Stores
source: http://www.mongodb.org/
Graph Databases
● based on Graph Theory -> G = (V, E)
● designed for data that is well represented in a graph
  ○ social networks, public transport links, network topologies, road maps
● nodes, edges, properties are used to represent and store data
● graph relationships are queryable
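To make G = (V, E) and "queryable relationships" concrete, here is a small sketch using a public-transport-style graph. The stop names and edges are invented for the example, and a breadth-first search stands in for what a graph database would offer through its query language.

```python
from collections import deque

# E: undirected edges between stops (hypothetical data); V is derived from E.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def shortest_path(start, goal):
    """Breadth-first search; returns a path with the fewest edges, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


route = shortest_path("A", "D")
```

The query follows relationships directly from node to node instead of joining tables, which is why this kind of data maps well onto a graph model.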
Graph Databases
source: http://www.neo4j.org/
Graph Databases
source: http://en.wikipedia.org/wiki/Graph_database
CAP Theorem
source: http://blog.nahurst.com/visual-guide-to-nosql-systems
ACID
● Atomicity
  ○ all-or-nothing approach
● Consistency
  ○ DB will be in a consistent state before & after a transaction
● Isolation
  ○ transaction will behave as if it’s the only operation being performed upon the DB
● Durability
  ○ once a transaction is committed, it is durably preserved
● CA-Systems are ACID-Systems
BASE
● an application that works basically all the time, does not have to be consistent all the time, but will be in some known state eventually
● Basically Available
  ○ achieved by using a highly distributed approach
● Soft State
  ○ state of the system is always “soft” due to eventual consistency
● Eventual Consistency
  ○ at some point in the future, the data will be consistent
  ○ no guarantees are made about when this will occur
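Eventual consistency can be illustrated with two replicas that accept writes independently and converge once they exchange state. This is a simplified sketch that assumes last-write-wins conflict resolution by timestamp; the `Replica` class and `sync` function are hypothetical.

```python
class Replica:
    """Toy replica storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Last write wins: only accept the write if it is newer.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange state both ways; the newer timestamp wins on each key."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], timestamp=1)
r2.write("cart", ["book", "pen"], timestamp=2)

stale = r1.read("cart")  # r1 has not seen the newer write yet (soft state)
sync(r1, r2)             # after the exchange, both replicas agree
```

Between the second write and the `sync` call, a reader may see either value depending on which replica it asks; that window is what "no guarantees about when" refers to.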
BASE vs ACID
source: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
Cassandra
● initially created by Facebook for Inbox Search
● distributed, horizontally scalable database
● high availability
● very flexible data model○ data might be structured, semi-structured, unstructured
● commercial support through DataStax
Cassandra - Design
● all nodes are equally important
● no Single-Point-of-Failure
● no central controller
● no master/slave relationships
● every node knows how to route requests and where the data lives
source: http://cassandra.apache.org/
Scales Linearly
source: http://www.datastax.com
Uses Consistent Hashing
Murmur3Partitioner generates hash
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeHashing_c.html
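A minimal sketch of consistent hashing with one token per node. Murmur3 is not in the Python standard library, so MD5 from `hashlib` is used here purely as a stand-in hash; the `HashRing` class and node names are hypothetical and do not reflect Cassandra's internals.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a 64-bit token (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Each node owns the token range ending at its own token."""

    def __init__(self, nodes):
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"row-{i}" for i in range(1000)]
before = HashRing(["n1", "n2", "n3"])
after = HashRing(["n1", "n2", "n3", "n4"])

# Keys whose owner changed when n4 joined the ring.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

The point of the technique shows up in `moved`: when a node joins, only the keys that fall into the new node's range change owner, and all of them move to the new node. With a naive `hash(key) % node_count` scheme, most keys would move instead.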
Writes are very fast
● All writes are sequential
● no reading & seeking before a write
● Each of the N nodes will perform the following upon receiving the RowMutation message:
  ○ Append write to the commit log
  ○ Update in-memory Memtable data structure
  ○ Write is done!
● If Memtable gets full, it’s flushed to disk (SSTable)
source: http://www.roman10.net/how-apache-cassandra-write-works/
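The write steps above can be sketched for a single node as follows. This is a toy model under stated assumptions: lists and dicts stand in for on-disk files, `memtable_limit` is a hypothetical threshold, and cleanup of the commit log after a flush is not modelled.

```python
class Node:
    """Toy model of the write path on one node."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log (stand-in for a file)
        self.memtable = {}     # in-memory structure
        self.sstables = []     # immutable, sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append
        self.memtable[key] = value            # 2. update the memtable
        # The write is done at this point; no read or seek was needed.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Write the memtable out as a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = Node(memtable_limit=2)
node.write("b", 2)
node.write("a", 1)  # memtable is now full and gets flushed
node.write("c", 3)
```

Both steps on the hot path are appends or in-memory updates, which is the reason the slide describes writes as very fast.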
Write Requests
● Client requests can go to any node in the cluster because all nodes are peers
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html
write consistency level is configurable
Write Requests
● Cassandra chooses one Coordinator per remote data center to handle requests to replicas
● the coordinator only needs to forward the write request (WR) to one node in each remote data center
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsWrite.html
Read Requests
● Two different types of Read Requests
  ○ direct read request (RR)
  ○ background read repair request (RRR)
● number of replicas contacted by a RR is determined by Consistency Level
● RRR are sent to any additional nodes that did not get a direct RR
● RRR ensure consistency
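The interplay of direct reads and read repair can be sketched as follows. This is a strongly simplified model: replicas are plain dicts mapping a key to a (timestamp, value) pair, the repair runs synchronously instead of in the background, and the function names are hypothetical.

```python
def newest(entries):
    """Pick the (timestamp, value) entry with the highest timestamp, if any."""
    present = [e for e in entries if e is not None]
    return max(present, key=lambda tv: tv[0]) if present else None


def read(replicas, key, consistency_level):
    """Answer from the first `consistency_level` replicas, then repair all."""
    # Direct read requests (RR): only as many replicas as the level demands.
    direct = replicas[:consistency_level]
    answer = newest(r.get(key) for r in direct)

    # Read repair (RRR): compare every replica and overwrite stale copies.
    latest = newest(r.get(key) for r in replicas)
    if latest is not None:
        for replica in replicas:
            if replica.get(key) != latest:
                replica[key] = latest

    return answer[1] if answer else None


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
first = read(replicas, "k", consistency_level=1)   # may still be stale
second = read(replicas, "k", consistency_level=1)  # repaired by the first read
```

With a low consistency level the first read can return stale data, but the repair it triggers brings all replicas up to date for later reads.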
Read Requests
source: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureClientRequestsRead_c.html
2 of the 3 replicas for the given row must respond to fulfill the read request
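The "2 of 3" figure matches a quorum read with a replication factor of 3, assuming that is the consistency level the diagram depicts. A small sketch of the arithmetic, with hypothetical helper names:

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_cl: int, read_cl: int, replication_factor: int) -> bool:
    """True if every read set overlaps every write set in at least one replica."""
    return write_cl + read_cl > replication_factor


needed = quorum(3)
```

If writes and reads both use a quorum, the replicas that answer a read always include at least one that acknowledged the latest write, because the two sets together exceed the replication factor.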
CQL
● very similar to SQL
● does not support JOINs / subqueries
● no referential integrity
● no cascading operations
We denormalize the data because joins are not performant in a distributed system
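Denormalization here means storing the same data once per query pattern, so that each read is a single lookup. The sketch below uses Python dicts as stand-ins for two tables; the table names, the `add_video` helper, and the sample data are hypothetical.

```python
# One "table" per query pattern instead of one normalized table plus joins.
videos_by_user = {}  # user -> list of video titles
videos_by_tag = {}   # tag  -> list of video titles


def add_video(user, title, tags):
    """Write the same fact into every table that a query will read from."""
    videos_by_user.setdefault(user, []).append(title)
    for tag in tags:
        videos_by_tag.setdefault(tag, []).append(title)


add_video("alice", "Intro to CQL", ["cassandra", "cql"])
add_video("bob", "CAP in 5 minutes", ["cap"])
add_video("alice", "Consistent Hashing", ["cassandra"])

# Each query is a single-key read; no join across nodes is required.
by_user = videos_by_user["alice"]
by_tag = videos_by_tag["cassandra"]
```

The cost is paid on the write side: every insert touches several tables, and the application is responsible for keeping them in step.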
CQL
no index, no service :)
CQL - Collections
● CQL introduced collections to columns
  ○ list
  ○ map
  ○ set
● Add new collections to the previous example
CQL - Collections
Cassandra vs MySQL (50GB)
● MySQL
  ○ writes avg: ~300ms
  ○ reads avg: ~350ms
● Cassandra
  ○ writes avg: ~0.12ms
  ○ reads avg: ~15ms
source: http://www.odbms.org/wp-content/uploads/2013/11/cassandra.pdf
Summary
● elastic scaling (scaling out instead of up)
● huge amounts of data can be handled while maintaining high throughput rates
● requires fewer DBAs and management resources
  ○ automatic repairs/data distribution
  ○ simpler data models
● better economics
  ○ cost per GB is much lower than for RDBMS due to clusters of commodity HW
  ○ we handle more data with less money
● flexible data models
  ○ very relaxed or even non-existent data model restrictions
  ○ changes to data model are much cheaper
Summary
● might not be mature enough for enterprises
● compatibility issues regarding standards
  ○ each DB has its own API
  ○ not easy to switch to another NoSQL DB
● search support is not the same as in RDBMS
● easier to find experienced RDBMS experts than NoSQL experts
Which DB for which purpose?
● NoSQL is an alternative
  ○ addresses certain limitations of the relational DB world
● depends on characteristics of data
  ○ if data is well structured -> relational DB might be better
  ○ if data is very complex -> might be difficult to map it to the relational model
● depends on volatility of the data model
  ○ what if schema changes daily?
● relational DBs still have their pluses
  ○ relational model / transactions / query language
  ○ should be used when multi-row transactions and strict consistency are required
Thank you! - Questions?