Dynamo & Bigtable CSCI 2270, Spring 2011 Irina Calciu Zikai Wang

Dynamo & Bigtable - Brown University · 2011-03-03 · Dynamo & Bigtable CSCI 2270, Spring 2011 Irina Calciu Zikai Wang. Dynamo Amazon's highly available key-value store. Amazon's


Dynamo & Bigtable

CSCI 2270, Spring 2011

Irina Calciu, Zikai Wang

Dynamo

Amazon's highly available key-value store

Amazon's E-commerce Platform

Hundreds of services (recommendations, order fulfillment, fraud detection, etc.)

Millions of customers at peak time

Tens of thousands of servers in geographically distributed data centers

Reliability (always-on experience)

Fault Tolerance

Scalability, Elasticity

Why not RDBMS?

Most Amazon services only need reads and writes by primary key

RDBMS's complex querying and management functionalities are unnecessary and expensive

Available replication technologies are limited and typically choose consistency over availability

Not easy to scale out databases or use smart partitioning schemes for load balancing

System Assumptions & Requirements

Query model: no need for relational schema, simple read/write operations based on primary key are enough

ACID Properties: Weak consistency (in exchange for high availability), no isolation, only single key updates

Efficiency: function on commodity hardware infrastructure, be able to meet stringent SLAs on latency and throughput

Other assumptions: non-hostile operation environment, no security related requirements

Design considerations

Optimistic replication & eventual consistency

Always writable & resolve update conflicts during reads

Applications are responsible for conflict resolution

Incremental scalability

Symmetry

Decentralization

Heterogeneity

Architecture Highlights

Partitioning

Replication

Versioning

Membership

Failure Handling

Scaling

API / Operators

get(key) returns:

one object, or a list of objects with conflicting versions

a context

put(key, context, object):

finds the correct locations

writes the replicas to disk

the context contains metadata about the object
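
The get/put contract above can be sketched in a few lines. This is a toy, single-process illustration, not Amazon's API: the class name `ToyStore` is hypothetical, and the context is simplified to an integer version counter standing in for the real metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for a Dynamo-style store."""
    data: dict = field(default_factory=dict)  # key -> list of (context, object)

    def get(self, key):
        """Return (objects, context): every current version plus an opaque context."""
        versions = self.data.get(key, [])
        objects = [obj for _, obj in versions]
        context = max((ctx for ctx, _ in versions), default=0)
        return objects, context

    def put(self, key, context, obj):
        """Store obj as the successor of the version(s) the caller read."""
        # Versions at or below the supplied context are superseded.
        # Anything newer was written concurrently and is kept as a sibling.
        kept = [(c, o) for c, o in self.data.get(key, []) if c > context]
        self.data[key] = kept + [(context + 1, obj)]


store = ToyStore()
store.put("cart:42", 0, ["book"])
versions, ctx = store.get("cart:42")          # one version, no conflict
store.put("cart:42", ctx, versions[0] + ["pen"])
```

Two writers that both read the same context and then call `put` leave two sibling versions behind; the next `get` returns both, and the application decides how to merge them.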

Partitioning

variant of consistent hashing similar to Chord

each node gets keys between its predecessor and itself

accounts for heterogeneity of nodes using virtual nodes

the system scales incrementally

load balancing
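
A minimal sketch of consistent hashing with virtual nodes, assuming MD5 as the ring hash and a hypothetical `Ring` class; the number of virtual nodes per physical node is the knob that accounts for heterogeneity.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hashing ring with virtual nodes."""

    def __init__(self):
        self._points = []  # sorted list of (position, physical node)

    def add_node(self, node: str, vnodes: int = 8) -> None:
        # A more powerful machine can be given more virtual nodes.
        for i in range(vnodes):
            bisect.insort(self._points, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if p[1] != node]

    def owner(self, key: str) -> str:
        # The first point clockwise from hash(key) owns the key (with wrap-around).
        idx = bisect.bisect_left(self._points, (ring_hash(key), ""))
        return self._points[idx % len(self._points)][1]


ring = Ring()
for name in ("A", "B", "C"):
    ring.add_node(name)
ring.add_node("D", vnodes=16)  # assumed to be a larger machine
print(ring.owner("cart:42"))
```

Adding a node only moves the keys that now fall on the new node's points; every other key keeps its owner, which is what makes scaling incremental.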

Replication

Versioning

put operation can always be executed

eventual consistency

reconciled using vector clocks

if automatic reconciliation not possible, the system returns a list of versions to the client
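
Vector-clock reconciliation can be sketched as follows. Clocks are modeled as plain dicts from node name to counter, and the helper names `descends` and `reconcile` are illustrative, not taken from the slides.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions that are strict ancestors of another version.

    versions is a list of (clock, value) pairs. More than one survivor
    means the writes were concurrent and the client must resolve them.
    """
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            survivors.append((clock, value))
    return survivors


d2 = ({"Sx": 2}, "D2")
d3 = ({"Sx": 2, "Sy": 1}, "D3")   # written via node Sy on top of D2
d4 = ({"Sx": 2, "Sz": 1}, "D4")   # written via node Sz on top of D2
print(reconcile([d2, d3, d4]))    # D2 is an ancestor; D3 and D4 conflict
```

Here D2 is discarded automatically, while D3 and D4 are both returned because neither clock descends from the other.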

Versioning

Executing a read / write

coordinator node = first node to store the key

put operation: written to W nodes (with the coordinator's vector clock)

get operation: coordinator reconciles R versions or sends conflicting versions to the client

if R + W > N (preference list size): quorum-like system

usually R and W are each set below N to decrease latency
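
Why R + W > N behaves like a quorum can be checked by brute force: the condition guarantees that every possible read set shares at least one replica with every possible write set. The function name `always_overlap` is illustrative.

```python
from itertools import combinations


def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every R-subset of N replicas intersects every W-subset."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 2), (3, 1, 1)]:
    print((n, r, w), r + w > n, always_overlap(n, r, w))
```

For each configuration the two booleans agree: a read is guaranteed to see the latest acknowledged write exactly when R + W > N. Smaller R or W lowers latency for that operation at the cost of this guarantee.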

Hinted Handoff

the N nodes to which a request is sent are not always the first N nodes in the preference list, if there are failures

instead, a node can temporarily store a key for another node and give it back when that node comes back up
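
A rough sketch of how write targets might be chosen under hinted handoff. The function `pick_targets` and the list-based ring order are assumptions made for illustration; the hint simply names the node the replica was intended for.

```python
def pick_targets(ring_order, down, n):
    """Return a list of (node, hint) pairs for one write.

    ring_order: nodes in preference order for the key.
    down:       set of nodes currently unreachable.
    hint:       None for a normal replica, otherwise the intended owner
                that the stand-in node should hand the data back to.
    """
    intended = ring_order[:n]
    missing = [node for node in intended if node in down]
    targets = [(node, None) for node in intended if node not in down]
    for node in ring_order[n:]:
        if not missing:
            break
        if node not in down:
            targets.append((node, missing.pop(0)))
    return targets


print(pick_targets(["A", "B", "C", "D", "E"], down={"B"}, n=3))
# → [('A', None), ('C', None), ('D', 'B')]
```

Node D stores the replica with a hint for B and would deliver it once B is reachable again.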

Replica Synchronization

compute Merkle tree for each key range

periodically check that key ranges are consistent between nodes
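
A minimal sketch of the Merkle-tree comparison, assuming SHA-256 and string keys and values. It compares only the root hashes; a fuller version would walk down mismatching subtrees to find the exact keys that differ.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items: dict) -> bytes:
    """Root hash over one key range, given as {key: value} with string entries."""
    level = [_h(f"{k}={v}".encode()) for k, v in sorted(items.items())]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "stale", "k3": "v3"}
print(merkle_root(replica_a) == merkle_root(replica_b))  # → False
```

Equal roots mean the two replicas hold identical data for the range, so nothing needs to be transferred.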

Membership

Ring join / leave propagated via gossip protocol

Logical partitions avoided using seed nodes

When a node joins, the keys it becomes responsible for are transferred to it by its peers

Summary

Durability vs. Performance

Durability vs. Performance

Conclusion

Combines different techniques to provide a single highly available system

An eventually consistent system can be used in production with demanding applications

Performance, durability and consistency are balanced by tuning the parameters N, R, W

Bigtable

A distributed storage system for structured data

Applications and Requirements

wide applicability for a variety of systems

scalability

high performance

high availability

Data Model

key / value pair structure

added support for sparse semi-structured data

key: <row key, column key, timestamp>

value: uninterpreted array of bytes

example: Webtable

Data Model

multidimensional map

lexicographic order by row key

row access is atomic

row range dynamically partitioned (tablet)

can achieve good locality of data

e.g. webpages stored by reversed domain

static column families, variable columns

timestamps used to index different versions
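
The data model can be pictured as a map keyed by (row key, column key, timestamp). The sketch below uses a plain dict and hypothetical helpers (`reverse_domain`, `put`, `read_latest`); the row and column names are illustrative Webtable-style values.

```python
def reverse_domain(host: str) -> str:
    """Reverse hostname components so pages of one domain sort together."""
    return ".".join(reversed(host.split(".")))


table = {}  # (row key, column key, timestamp) -> uninterpreted bytes


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, column, timestamp)] = value


def read_latest(row: str, column: str):
    """Return the value with the highest timestamp for this cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None


row = reverse_domain("www.cnn.com")            # "com.cnn.www"
put(row, "contents:", 3, b"<html>v3")
put(row, "contents:", 5, b"<html>v5")
put(row, "anchor:cnnsi.com", 9, b"CNN")
print(read_latest(row, "contents:"))           # → b'<html>v5'
```

Because rows are ordered lexicographically, `com.google.maps` and `com.google.www` end up next to each other, which is the locality benefit of reversed domains.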

API / Operators

create / delete table

create / delete column families

change metadata (cluster / table / column family)

single-row transactions

use cells as integer counters

execute client-supplied scripts on the servers

Architecture at a Glance

GFS & Chubby

GFS

Google's distributed file system

Scalable, fault-tolerant, with high aggregate performance

Stores logs and tablets (SSTables)

Chubby

Distributed coordination service

Highly available, persistent

Data model follows the directory tree structure of file systems

Membership maintenance (the master & tablet servers)

Location of root tablet of METADATA table (bootstrap)

Schema information, access control lists

The Master

Detecting addition and expiration of tablet servers

Assigning tablets to tablet servers

Balancing tablet-server load

Garbage collection of GFS files

Handling schema changes

Performance bottleneck?

Tablet Servers

Manage a set of tablets

Handle users' read/write requests for those tablets

Split tablets that have grown too large

Tablet servers' in-memory structures

Two-level cache (scan & block)

Bloom filters

Memtables

SSTables (if requested)

Architecture at a Glance

Locate a Tablet: METADATA Table

METADATA table stores tablet locations of user tables

Row key of METADATA table encodes table ID + end row

Clients cache tablet locations
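
Keying METADATA rows by table ID plus end row means a tablet lookup is a sorted search: the first entry whose end row is at or after the requested row is the tablet that holds it. The sketch below assumes inclusive end rows and uses made-up table and server names.

```python
import bisect

# Hypothetical METADATA contents: ((table ID, end row), tablet server), sorted by key.
metadata = [
    (("users", "g"), "tabletserver-1"),
    (("users", "p"), "tabletserver-2"),
    (("users", "\uffff"), "tabletserver-3"),  # last tablet: sentinel end row
]
metadata_keys = [key for key, _ in metadata]


def locate(table_id: str, row: str) -> str:
    """Return the server holding the tablet whose row range contains row."""
    idx = bisect.bisect_left(metadata_keys, (table_id, row))
    if idx == len(metadata_keys) or metadata_keys[idx][0] != table_id:
        raise KeyError(f"no tablet for {table_id!r}")
    return metadata[idx][1]


print(locate("users", "alice"))  # → tabletserver-1
print(locate("users", "mallory"))  # → tabletserver-2
```

A client would cache the result and go back to METADATA only when the cached location turns out to be stale.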

Assign a Tablet

For tablet servers:

Each tablet is assigned to one tablet server

Each tablet server manages several tablets

For the master:

Keep track of live tablet servers with Chubby

Keep track of current assignment of tablets

Assign unassigned tablets to tablet servers, taking load balancing into account

Read/Write a Tablet (1)

Persistent state of a tablet includes a tablet log and SSTables

Updates are committed to a tablet log that stores redo records

The memtable, an in-memory sorted buffer, stores the latest updates

SSTables store older updates

Read/Write a Tablet (2)

Write operation

Write to commit log, commit it, write to memtable

Group commit

Read operation

Read on a merged view of memtable and SSTables
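
The write and read paths above can be sketched with a toy `Tablet` class. It is a simplification under stated assumptions: the log is a Python list rather than a GFS file, SSTables are plain dicts, and group commit is not modeled.

```python
class Tablet:
    """Hypothetical single-tablet model: commit log + memtable + SSTables."""

    def __init__(self):
        self.log = []        # redo records, append-only
        self.memtable = {}   # most recent updates, held in memory
        self.sstables = []   # immutable tables, oldest first

    def write(self, key, value):
        self.log.append((key, value))  # the redo record is logged first
        self.memtable[key] = value     # then the update becomes visible

    def read(self, key):
        # Merged view: the newest source that knows the key wins.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


tablet = Tablet()
tablet.sstables.append({"a": "old", "b": "kept"})
tablet.write("a", "new")
print(tablet.read("a"), tablet.read("b"))  # → new kept
```

After a crash, replaying `log` would rebuild the memtable, which is why the log write comes before the memtable update.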

Compactions

Minor compaction

Write the current memtable into a new SSTable on GFS

Less memory usage, faster recovery

Merging compaction

Periodically merge a few SSTables and the memtable into a new SSTable

Simplify merged view for reads

Major compaction

Rewrite all SSTables into exactly one SSTable

Reclaim resources used by deleted data

Deleted data disappears in a timely fashion
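
A minimal sketch of minor and major compaction over dict-based SSTables. The `TOMBSTONE` marker for deletions and both function names are assumptions for illustration.

```python
TOMBSTONE = object()  # hypothetical marker written when a key is deleted


def minor_compaction(memtable: dict, sstables: list):
    """Freeze the memtable into a new SSTable; return (empty memtable, SSTables)."""
    new_sstable = dict(sorted(memtable.items()))
    return {}, sstables + [new_sstable]


def major_compaction(sstables: list) -> list:
    """Rewrite all SSTables (oldest first) into one, dropping deleted data."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)  # newer tables overwrite older entries
    live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
    return [live]


memtable = {"b": 2, "c": TOMBSTONE}
sstables = [{"a": 1, "c": 3}]
memtable, sstables = minor_compaction(memtable, sstables)
print(len(sstables))               # → 2
print(major_compaction(sstables))  # → [{'a': 1, 'b': 2}]
```

Only the major compaction can drop the tombstone for `c`, because only then is it certain that no older SSTable still holds a value the tombstone must hide.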

Optimizations (1)

Locality groups

Group column families that are typically accessed together

Generate a separate SSTable for each locality group

Specify in-memory locality groups (METADATA:location)

More efficient reads

Compression

Control whether SSTables for a locality group are compressed

Speed vs. space, network transmission cost

Locality influences the compression ratio

Optimizations (2)

Two-level cache for read performance

Scan cache: caches accessed key-value pairs

Block cache: caches accessed SSTable blocks

Bloom filters

Created for SSTables in certain locality groups

Identify whether an SSTable might contain the queried data

Commit-log implementation

Single commit log per tablet server

Co-mingles mutations for different tablets

Decreases the number of log files

Complicates the recovery process
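
A small Bloom filter sketch, showing how a tablet server could skip SSTables that cannot contain a key. The bit-array size, number of hash functions and SHA-256-based hashing are assumptions chosen for brevity, not Bigtable's actual parameters.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("com.cnn.www", "com.google.maps"):
    bloom.add(row_key)
print(bloom.might_contain("com.cnn.www"))  # → True
```

A `False` answer is definitive, so the SSTable need not be read from disk; a `True` answer only means the SSTable has to be checked.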

Optimizations (3)

Speeding up tablet recovery

Two minor compactions when moving a tablet between tablet servers

Reduce uncompacted state in the commit log

Exploiting immutability

SSTables are immutable

No synchronization for reads

Writes generate new SSTables

Copy-on-write for memtables

Tablets are allowed to share SSTables

Evaluation

Number of operations per second per tablet server

Evaluation

Aggregate number of operations per second

Applications

Click Table

Summary Table

One table storing raw imagery, served from disk

User data

Row: userid

Each group can add their own user column

Lessons Learned

1. many types of failures, not just network partitions

2. add new features only if needed

3. improve the system by careful monitoring

4. keep the design simple

Conclusion

Bigtable has been used in production since April 2005

used extensively by several Google projects

"unusual interface" compared to the traditional relational model

It has empirically shown its performance, availability and elasticity

Dynamo vs. Bigtable