Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
SIGOPS 07




Copyright 2008 by CEBT

Index

Introduction

NoSQL

CAP

Genealogy

Dynamo

Partition

Replication

Versioning

Handoff

Evaluation


Why Non-Relational DB?

Relational DBs are hard to scale

Replication – Scaling by Duplication

Partitioning (Sharding) – Scaling by Division

Some relational features are not needed

Update / Delete

Join

Transactions (ACID)

Fixed Schema


Desired Characteristics

High Scalability

Ability to add nodes incrementally to support more users and data

High Availability

Data should be available even as some nodes go down

High Performance

Operations should return quickly

Consistency

Do not need strong consistency

Durability

Data should be persisted to disk and not just kept in volatile memory

Deployment Flexibility

Modeling Flexibility

Key-Value pairs, Graphs

Query Flexibility

Multi-Gets, Range Queries


CAP Theorem (1/2)

You can’t have it all


CAP Theorem (2/2)


NoSQL Genealogy


Dynamo Motivation

Build a distributed storage system

Scale

Simple: key-value

Highly available

Guarantee Service Level Agreements (SLA)

ACID vs BASE

Basically Available

Soft-state

Eventual consistency


Dynamo Motivation

Application can deliver its functionality in a bounded time

Every dependency in the platform needs to deliver its functionality with even tighter bounds

Example

Service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second
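
The SLA example above is a percentile bound, which can be checked against measured latencies. The following is a minimal sketch; the function name `meets_sla` and the sample data are illustrative assumptions, not part of Dynamo.

```python
def meets_sla(latencies_ms, bound_ms=300.0, fraction=0.999):
    """Return True if at least `fraction` of requests finished within `bound_ms`."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for t in latencies_ms if t <= bound_ms)
    return within / len(latencies_ms) >= fraction


# Hypothetical sample: 10,000 requests, 5 of them slower than the bound.
samples = [20.0] * 9995 + [450.0] * 5
print(meets_sla(samples))
```

Note that the bound is stated at the 99.9th percentile rather than as a mean, so a small number of very slow requests is enough to violate it.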


Consideration

Sacrifice strong consistency for availability

Conflict resolution is executed during read instead of write, i.e. “always writeable”

Other principles

Principle | Effect
Incremental scalability | Storage hosts can be added one at a time without undue impact on the system
Symmetry | Every node has the same set of responsibilities as its peers
Decentralization | Focus on peer-to-peer techniques
Heterogeneity | Work must be distributed according to the capabilities of the nodes


Summary of techniques used in Dynamo

Problem | Technique | Advantage
Partitioning | Consistent Hashing | Incremental Scalability
High Availability for writes | Vector clocks | Version size is decoupled from update rates
Handling temporary failures | Sloppy Quorum and Hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information


Partition Algorithm

Consistent hashing

The output range of a hash function is treated as a fixed circular space or “ring”

Virtual Nodes

Each node can be responsible for more than one virtual node
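
The ring and virtual nodes described above can be sketched as follows. This is an illustration under stated assumptions, not Dynamo's implementation: the class `HashRing`, the `#vnode` labelling scheme, and the default of 8 virtual nodes per host are hypothetical. MD5 is used because the paper describes hashing keys with MD5 to a 128-bit identifier.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a point on the ring [0, 2**128) using MD5."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical consistent-hashing ring with virtual nodes."""

    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._points = []   # sorted ring positions of all virtual nodes
        self._owner = {}    # ring position -> physical node name

    def add_node(self, node: str) -> None:
        # Each physical node is placed on the ring at several positions.
        for i in range(self.vnodes_per_node):
            point = ring_position(f"{node}#vnode{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first virtual node clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, ring_position(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing()
for name in ("A", "B", "C"):
    ring.add_node(name)
print(ring.lookup("shopping-cart:42"))
```

When a node joins, the only keys that change owner are those that now fall just before one of the new node's virtual nodes, which is the incremental scalability property listed for consistent hashing.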


Replication

Optimistic Replication

Changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated

Each data item is replicated at N hosts

Preference list

The list of nodes that are responsible for storing a particular key
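
A preference list can be derived by walking the ring clockwise from the key's position and collecting the first N distinct physical hosts. The sketch below assumes that approach; the helper names `build_ring` and `preference_list` and the virtual-node labelling are hypothetical.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a point on the ring using MD5."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


def build_ring(nodes, vnodes_per_node=8):
    """Return a sorted list of (position, physical_node) virtual-node entries."""
    return sorted(
        (ring_position(f"{node}#vnode{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )


def preference_list(ring, key, n=3):
    """Walk clockwise from the key and collect the first n distinct physical nodes."""
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, ring_position(key))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:      # skip further virtual nodes of a host already chosen
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen


ring = build_ring(["A", "B", "C", "D", "E"])
print(preference_list(ring, "shopping-cart:42", n=3))
```

Skipping repeated hosts matters because, with virtual nodes, the next N positions on the ring may belong to fewer than N physical machines.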


Data Versioning

A put() call may return to its caller before the update has been applied at all the replicas

A get() call may return many versions of the same object

Challenge

An object having distinct version sub-histories, which the system will need to reconcile in the future.

Solution

Uses vector clocks in order to capture causality between different versions of the same object


Vector Clock

A vector clock is a list of (node, counter) pairs

Every version of every object is associated with one vector clock

If every counter in the first object’s clock is less than or equal to the corresponding counter in the second clock, then the first is an ancestor of the second and can be forgotten
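
The ancestor rule can be sketched with clocks stored as dictionaries from node name to counter. The function names below are hypothetical, and treating a missing node as counter 0 is an assumption of this sketch. The node names Sx, Sy and Sz follow the version-evolution example in the Dynamo paper.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or a descendant of clock `b`.

    A node missing from a clock is treated as having counter 0.
    """
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum of two clocks, used when reconciling versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def prune_ancestors(versions):
    """Keep only (clock, value) pairs not strictly dominated by another version."""
    kept = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            kept.append((clock, value))
    return kept


d1 = increment({}, "Sx")             # written via Sx
d2 = increment(d1, "Sx")             # overwritten via Sx
d3 = increment(d2, "Sy")             # updated via Sy
d4 = increment(d2, "Sz")             # concurrently updated via Sz
d5 = increment(merge(d3, d4), "Sx")  # client reconciles, write handled by Sx

print(prune_ancestors([(d2, "v2"), (d3, "v3"), (d4, "v4")]))
```

Here `d2` is an ancestor of both `d3` and `d4` and is dropped, while `d3` and `d4` are concurrent, so both survive until a client reconciles them into `d5`.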


Vector clock example


Sloppy Quorum

NRW Configuration

W=N R=1: Read optimized strong consistency

W=1 R=N: Write optimized strong consistency

W+R <= N: Weak eventual consistency. A read might not see latest update

W+R > N: Strong consistency through quorum assembly. A read will see at least one copy of the most recent update

In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency

Typical Values of (N,R,W) for Amazon Apps are (3,2,2)
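
The W+R > N rule is a pigeonhole argument: if read and write sets together exceed N replicas, they must share at least one. The sketch below checks this by brute force. The function name is hypothetical, and it assumes a strict quorum over a fixed set of N replicas, which is an idealization because a sloppy quorum may place writes on other nodes.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Check every possible read set of size r against every write set of size w.

    Returns True only if each pair shares at least one replica.
    """
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 3, 1), (3, 1, 1)]:
    print((n, r, w), quorums_always_overlap(n, r, w), r + w > n)
```

For each configuration the brute-force result agrees with the test `r + w > n`, so the typical (3,2,2) setting guarantees that a read set and a write set intersect, while (3,1,1) does not.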


Hinted handoff

Assume N = 3. When A is temporarily down or unreachable during a write, send the replica to D

D is hinted that the replica belongs to A and will deliver it to A when A has recovered

Again: “always writeable”
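
The A-to-D handoff above can be sketched as a toy in-memory model. Everything here is a simplifying assumption for illustration: the `Node` class, the `write` and `deliver_hints` functions, and the fixed `ring_order` list standing in for the key's position on the ring.

```python
class Node:
    """Hypothetical storage node with a separate area for hinted replicas."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}    # key -> value for replicas this node owns
        self.hinted = []   # (intended_owner, key, value) held on behalf of others


def write(key, value, ring_order, nodes, n=3):
    """Write to the first n nodes in ring order.

    For each of those owners that is down, store a hinted replica on the
    next healthy node further along the ring instead.
    """
    owners = ring_order[:n]
    fallbacks = [name for name in ring_order[n:] if nodes[name].up]
    for owner in owners:
        if nodes[owner].up:
            nodes[owner].store[key] = value
        elif fallbacks:
            stand_in = nodes[fallbacks.pop(0)]
            stand_in.hinted.append((owner, key, value))


def deliver_hints(node, nodes):
    """Hand hinted replicas back to owners that are reachable again."""
    remaining = []
    for owner, key, value in node.hinted:
        if nodes[owner].up:
            nodes[owner].store[key] = value
        else:
            remaining.append((owner, key, value))
    node.hinted = remaining


nodes = {name: Node(name) for name in "ABCD"}
nodes["A"].up = False
write("cart", "v1", ["A", "B", "C", "D"], nodes)   # D keeps a hint for A
nodes["A"].up = True
deliver_hints(nodes["D"], nodes)                   # D returns the replica to A
print(nodes["A"].store, nodes["D"].hinted)
```

The write succeeds even though A is down, which is what keeps the store “always writeable”; D's copy is temporary and is dropped once A has it.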


Other techniques

Replica synchronization

Merkle hash tree

Membership and Failure Detection

Gossip

Physical Storage

Berkeley DB transactional store, MySQL, In-memory buffer
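
For the replica synchronization item above, a Merkle hash tree lets two replicas compare a key range by exchanging hashes from the root downward and descending only into subtrees that differ. The sketch below is illustrative only: the function names, the use of SHA-256, and the power-of-two leaf count are assumptions, not details taken from Dynamo.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_merkle(leaves):
    """Return tree levels from leaf hashes (level 0) up to the root (last level).

    `leaves` is a list of byte strings; its length must be a power of two.
    """
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("number of leaves must be a non-zero power of two")
    levels = [[_h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(a, b):
    """Descend from the root, following only subtrees whose hashes differ."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []                   # roots match: the replicas are in sync
    suspects = [0]
    for level in range(top - 1, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if a[level][child] != b[level][child]
        ]
    return suspects


replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=v2", b"k3=STALE", b"k4=v4"]
print(differing_leaves(build_merkle(replica_a), build_merkle(replica_b)))
```

When the roots match no further data is exchanged, so replicas that are already in sync cost only one hash comparison.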


Implementation

Java

Local persistence component allows for different storage engines to be plugged in:

Berkeley Database (BDB) Transactional Data Store: objects on the order of tens of kilobytes

MySQL: larger objects (more than tens of kilobytes)

BDB Java Edition, etc.


Evaluation


Evaluation


3/18

Cassandra

– Bigtable + Dynamo

– Partitioning, Replication, Membership

Consensus on Transaction Commit

– Paxos Commit / Two-Phase Commit

3/25

Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore

– Data Replication

Comparison of the features of other NoSQL systems

– Pros and cons, query languages, etc.