Database scalability Jonathan Ellis

What Every Developer Should Know About Database Scalability

DESCRIPTION

Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955

Page 1: What Every Developer Should Know About Database Scalability

Database scalability

Jonathan Ellis

Page 2: What Every Developer Should Know About Database Scalability

Classic RDBMS persistence

Data

Index

Page 3: What Every Developer Should Know About Database Scalability

Disk is the new tape*

• ~8ms to seek

– ~4ms on expensive 15k rpm disks
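
A back-of-the-envelope sketch of what those seek times imply. It assumes every random operation costs exactly one seek and ignores rotational latency and transfer time, so the numbers are an idealized upper bound rather than a benchmark. The function name is made up for illustration.

```python
def max_random_ops_per_second(seek_ms: float) -> float:
    """Idealized ceiling on random I/O per second when each op costs one seek."""
    return 1000.0 / seek_ms

print(max_random_ops_per_second(8))  # ordinary disk, ~8ms seek
print(max_random_ops_per_second(4))  # 15k rpm disk, ~4ms seek
```

Under these assumptions a single disk tops out at a few hundred random operations per second, which is why the rest of the deck cares so much about avoiding seeks.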

Page 4: What Every Developer Should Know About Database Scalability
Page 5: What Every Developer Should Know About Database Scalability

What scaling means

[Graph: performance plotted against money]

Page 6: What Every Developer Should Know About Database Scalability

Performance

• Latency

• Throughput

Page 7: What Every Developer Should Know About Database Scalability

Two kinds of operations

• Reads

• Writes

Page 8: What Every Developer Should Know About Database Scalability

Caching

• Memcached

• Ehcache

• etc

DB

cache

Page 9: What Every Developer Should Know About Database Scalability

Cache invalidation

• Implicit

• Explicit
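
The slide does not define the two terms. One common reading is that implicit invalidation means entries expire on their own after a time-to-live, while explicit invalidation means the code that changes the data also deletes the cached copy. A minimal sketch under that assumption, using a hypothetical in-memory `TinyCache` rather than a real Memcached or Ehcache client:

```python
import time

class TinyCache:
    """In-memory stand-in for a cache server (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expiry time or None)

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # implicit invalidation: the entry timed out
            return None
        return value

    def delete(self, key):
        self._items.pop(key, None)  # explicit invalidation: the writer removes it
```

With a TTL the application tolerates stale reads for up to `ttl` seconds; with `delete` it pays for an extra cache call on every write but readers see the change on their next miss.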

Page 10: What Every Developer Should Know About Database Scalability

Cache set invalidation

get_cached_cart(cart=13, offset=10, limit=10)

get('cart:13:10:10')

?

Page 11: What Every Developer Should Know About Database Scalability

Set invalidation 2

prefix = get('cart_prefix:13')

get(prefix + ':10:10')

del('cart_prefix:13')

http://www.aminus.org/blogs/index.php/2007/12/30/memcached_set_invalidation?blog=2
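
The three calls above can be fleshed out roughly as follows. This is a sketch, not the code from the talk or the linked post: a plain dict stands in for the memcached client, the prefix is a random token, and `load_page` and `invalidate_cart` are hypothetical names. The idea is that every page key for a cart embeds the cart's current prefix, so deleting the single prefix key makes all cached pages unreachable at once.

```python
import uuid

cache = {}  # plain dict standing in for a memcached client (assumption)

def _cart_prefix(cart_id):
    """Return the current prefix for a cart, creating one if it is missing."""
    key = f"cart_prefix:{cart_id}"
    prefix = cache.get(key)
    if prefix is None:
        prefix = uuid.uuid4().hex  # fresh prefix: old page keys are never asked for again
        cache[key] = prefix
    return prefix

def get_cached_cart(cart_id, offset, limit, load_page):
    """Fetch one page of a cart, calling load_page(cart_id, offset, limit) on a miss."""
    key = f"{_cart_prefix(cart_id)}:{offset}:{limit}"
    page = cache.get(key)
    if page is None:
        page = load_page(cart_id, offset, limit)
        cache[key] = page
    return page

def invalidate_cart(cart_id):
    """Invalidate every cached page of the cart with a single delete."""
    cache.pop(f"cart_prefix:{cart_id}", None)
```

The orphaned page entries are not removed here. With a real memcached server they would simply age out through its normal eviction, at the cost of one extra `get` per lookup to fetch the prefix.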

Page 12: What Every Developer Should Know About Database Scalability

Replication

Page 13: What Every Developer Should Know About Database Scalability

Types of replication

• Master → slave

– Master → slave → other slaves

• Master ↔ master

– multi-master

Page 14: What Every Developer Should Know About Database Scalability

Types of replication 2

• Synchronous

• Asynchronous

Page 15: What Every Developer Should Know About Database Scalability

Synchronous

• Synchronous = slow(er)

• Complexity (e.g. 2pc)

• PGCluster

• Oracle

Page 16: What Every Developer Should Know About Database Scalability

Asynchronous master/slave

• Easiest

• Failover

• MySQL replication

• Slony, Londiste, WAL shipping

• Tungsten

Page 17: What Every Developer Should Know About Database Scalability

Asynchronous multi-master

• Conflict resolution

– O(N^3) or O(N^2) as you add nodes

– http://research.microsoft.com/~gray/replicas.ps

• Bucardo

• MySQL Cluster

Page 18: What Every Developer Should Know About Database Scalability

Achtung!

• Asynchronous replication can lose data if the master fails

Page 19: What Every Developer Should Know About Database Scalability

“Architecture”

• Primarily about how you cope with failure scenarios

Page 20: What Every Developer Should Know About Database Scalability

Replication does not scale writes

Page 21: What Every Developer Should Know About Database Scalability

Scaling writes

• Partitioning aka sharding

– Key / horizontal

– Vertical

– Directed

Page 22: What Every Developer Should Know About Database Scalability

Partitioning

Page 23: What Every Developer Should Know About Database Scalability

Key based partitioning

• PK of “root” table controls destination

– e.g. user id

• Retains referential integrity
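
A minimal sketch of key-based routing, assuming a fixed list of shards and a hash-then-modulo scheme. The slides do not say which mapping function to use, and the shard names here are hypothetical. Rows in child tables are routed by the id of the user who owns them, so a user and everything hanging off that user land on the same node.

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names

def shard_for_user(user_id: int) -> str:
    """Map a root-table primary key (here a user id) to a shard."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

def shard_for_row(owner_user_id: int) -> str:
    """Child rows follow their owning user, which keeps foreign keys local."""
    return shard_for_user(owner_user_id)
```

One consequence of the modulo step is that changing `len(SHARDS)` remaps most keys, so adding a node means moving most of the data.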

Page 24: What Every Developer Should Know About Database Scalability

Example: blogger.com

Users

Blogs

Comments

Page 25: What Every Developer Should Know About Database Scalability

Example: blogger.com

Users

Blogs

Comments

Comments'

Page 26: What Every Developer Should Know About Database Scalability

Vertical partitioning

• Tables on separate nodes

• Often a table that is too big to keep with the other tables eventually gets too big for a single node

Page 27: What Every Developer Should Know About Database Scalability

Growing is hard

Page 28: What Every Developer Should Know About Database Scalability

Directed partitioning

• Central db that knows what server owns a key

• Makes adding machines easier

• Single point of failure
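
A toy version of the central lookup, to make the tradeoff concrete. `ShardDirectory` is a hypothetical class, and the placement rule (new keys go to the server holding the fewest keys) is an assumption chosen for brevity; the slides do not prescribe one.

```python
class ShardDirectory:
    """Toy central directory mapping each key to the server that owns it."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._owner = {}  # key -> server name

    def add_server(self, server):
        # Existing assignments stay put; only new keys can land on the new server.
        self._servers.append(server)

    def lookup(self, key):
        if key not in self._owner:
            # Assign a new key to the server that currently owns the fewest keys.
            counts = {s: 0 for s in self._servers}
            for s in self._owner.values():
                counts[s] += 1
            self._owner[key] = min(self._servers, key=lambda s: counts[s])
        return self._owner[key]
```

Adding a machine moves no existing data, which is the "easier" part. Every request now depends on this directory being reachable, which is the single point of failure.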

Page 29: What Every Developer Should Know About Database Scalability

Partitioning

Page 30: What Every Developer Should Know About Database Scalability

Partitioning with replication

Page 31: What Every Developer Should Know About Database Scalability

What these have in common

• Ad hoc

• Error-prone

• Manpower-intensive

Page 32: What Every Developer Should Know About Database Scalability

To summarize

• Scaling reads sucks

• Scaling writes sucks more

Page 33: What Every Developer Should Know About Database Scalability

Distributed* databases

• Data is automatically partitioned

• Transparent to application

• Add capacity without downtime

• Failure tolerant

*Like Bigtable, not Lotus Notes

Page 34: What Every Developer Should Know About Database Scalability

Two famous papers

• Bigtable: A distributed storage system for structured data, 2006

• Dynamo: Amazon's highly available key-value store, 2007

Page 35: What Every Developer Should Know About Database Scalability

The world doesn't need another half-assed key/value store

(See also Olin Shivers' 100% and 80% solutions)

Page 36: What Every Developer Should Know About Database Scalability

Two approaches

• Bigtable: “How can we build a distributed database on top of GFS?”

• Dynamo: “How can we build a distributed hash table appropriate for the data center?”

Page 37: What Every Developer Should Know About Database Scalability

Bigtable architecture

Page 38: What Every Developer Should Know About Database Scalability
Page 39: What Every Developer Should Know About Database Scalability

Lookup in Bigtable

Page 40: What Every Developer Should Know About Database Scalability

Dynamo

Page 41: What Every Developer Should Know About Database Scalability

Eventually consistent

• Amazon: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

• eBay: http://queue.acm.org/detail.cfm?id=1394128

Page 42: What Every Developer Should Know About Database Scalability

Consistency in a BASE world

• If W + R > N, you are 100% consistent

• W=1, R=N

• W=N, R=1

• W=Q, R=Q where Q = ⌊N / 2⌋ + 1
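
The reasoning behind the rule: a write is acknowledged by W of the N replicas and a read consults R of them. If W + R > N, the two sets cannot be disjoint, so every read touches at least one replica that holds the latest acknowledged write. A small sketch that checks the three configurations above for N = 3, plus one that fails; the function names are illustrative only.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (integer division)."""
    return n // 2 + 1

def overlaps(w: int, r: int, n: int) -> bool:
    """True when any read set of size r must intersect any write set of size w."""
    return w + r > n

n = 3
q = quorum(n)
for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
    print(f"N={n} W={w} R={r} overlap={overlaps(w, r, n)}")
```

The choice between the three is about where to pay: W=1, R=N makes writes cheap and reads expensive, W=N, R=1 does the reverse, and the quorum setting splits the cost while tolerating a minority of unavailable replicas on both paths.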

Page 43: What Every Developer Should Know About Database Scalability

Cassandra

Page 44: What Every Developer Should Know About Database Scalability

Memtable / SSTable

Commit log

Disk

Page 45: What Every Developer Should Know About Database Scalability

ColumnFamilies

keyA column1 column2 column3

keyC column1 column7 column11

Column

Byte[] Name

Byte[] Value

I64 timestamp
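
A rough model of the structure on this slide, written as plain Python rather than Cassandra's actual API. The `reconcile` helper goes beyond what the slide shows: it illustrates the usual role of the timestamp, which is to let the newer of two versions of a column win. How ties are broken here is an arbitrary choice for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # I64 on the slide; the unit is chosen by the client

def reconcile(a: Column, b: Column) -> Column:
    """Pick the version with the higher timestamp (ties keep `a` in this sketch)."""
    return a if a.timestamp >= b.timestamp else b

# Rows in one ColumnFamily need not share the same columns.
column_family = {
    b"keyA": {n: Column(n, b"", 1) for n in (b"column1", b"column2", b"column3")},
    b"keyC": {n: Column(n, b"", 1) for n in (b"column1", b"column7", b"column11")},
}

old = Column(b"column1", b"v1", timestamp=1)
new = Column(b"column1", b"v2", timestamp=2)
print(reconcile(old, new).value)
```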

Page 46: What Every Developer Should Know About Database Scalability

LSM write properties

• No reads

• No seeks

• Fast

• Atomic within ColumnFamily
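
A toy log-structured store that shows where the first three properties come from. It is a sketch under heavy simplification, not Cassandra's implementation: Python lists stand in for files, there is no compaction, and the per-ColumnFamily atomicity from the last bullet is not modeled. `ToyLSMStore` and `memtable_limit` are made-up names.

```python
class ToyLSMStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = {}     # in-memory table of recent writes
        self.sstables = []     # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append, no seek
        self.memtable[key] = value            # overwrite blindly, no read of old data
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the whole memtable out as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            for k, v in table:
                if k == key:
                    return v
        return None
```

The cost moves to the read path: a read may have to consult the memtable and several runs, which is consistent with reads being the slower operation in a design like this.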

Page 47: What Every Developer Should Know About Database Scalability

vs MySQL with 50GB of data

• MySQL

– ~300ms write

– ~350ms read

• Cassandra

– ~0.12ms write

– ~15ms read

• Achtung!

Page 48: What Every Developer Should Know About Database Scalability

Classic RDBMS persistence

Data

Index

Page 49: What Every Developer Should Know About Database Scalability

Questions