
Page 1

Scaling Memcache at Facebook

Presenter: Rajesh Nishtala (rajesh.nishtala@fb.com)
Co-authors: Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani

Page 2

Infrastructure Requirements for Facebook

1. Near real-time communication

2. Aggregate content on-the-fly from multiple sources

3. Be able to access and update very popular shared content

4. Scale to process millions of user requests per second

Page 3

Design Requirements

Support a very heavy read load
•  Over 1 billion reads / second
•  Insulate backend services from high read rates

Geographically Distributed

Support a constantly evolving product
•  System must be flexible enough to support a variety of use cases
•  Support rapid deployment of new features

Persistence handled outside the system
•  Support mechanisms to refill after updates

Page 4

Need more read capacity

•  Two orders of magnitude more reads than writes

•  Solution: Deploy a few memcache hosts to handle the read capacity

•  How do we store data?
• Demand-filled look-aside cache
• Common case is data is available in the cache

[Diagram: Web Server → 1. Get(key) → Memcache; 2. Miss(key); 3. DB lookup against the Database; 4. Set(key) back into Memcache]
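A minimal sketch of the demand-filled look-aside read path shown above; `cache` and `db` stand in for a memcached client and the storage tier, and their method names (`get`, `set`, `query`) are illustrative assumptions, not a specific API.

```python
def lookaside_get(cache, db, key):
    """Demand-filled look-aside read: try the cache first,
    fall back to the database and refill on a miss."""
    value = cache.get(key)      # 1. Get(key)
    if value is not None:       # common case: cache hit
        return value
    value = db.query(key)       # 2-3. miss -> DB lookup (hypothetical db API)
    cache.set(key, value)       # 4. Set(key) so later readers hit
    return value
```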

Page 5

Handling updates

•  Memcache needs to be invalidated after DB write

•  Prefer deletes to sets
• Idempotent
• Demand-filled

•  Up to the web application to specify which keys to invalidate after a database update

[Diagram: Web Server → 1. Database update → Database; 2. Delete → Memcache]
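A matching sketch of the write path: update persistent storage, then delete (rather than set) the cached key so the next reader demand-fills it. As before, the method names are placeholders.

```python
def update_record(cache, db, key, new_value):
    """Write path: update persistent storage, then invalidate the cache.
    Deletes are idempotent, so retrying after a failure is safe."""
    db.update(key, new_value)   # 1. database update (hypothetical db API)
    cache.delete(key)           # 2. delete; the next get() demand-fills
```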

Page 6

Memcache

Page 7

While evolving our system we prioritize two major design goals.

• Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered.

• We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.

Page 8

Roadmap

• Single front-end cluster
– Read-heavy workload
– Wide fanout
– Handling failures

• Multiple front-end clusters
– Controlling data replication
– Data consistency

• Multiple Regions
– Data consistency

Page 9

Single front-end cluster

• Reducing Latency: focusing on the memcache client
– serialization
– compression
– request routing
– request batching

Page 10

• Parallel requests and batching
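One way to picture the batching: the web application builds a DAG of data dependencies for a page and issues one multi-get per DAG level rather than one round trip per key. `get_many` below matches pymemcache's API; representing each DAG level as a plain list of keys is an illustrative assumption.

```python
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

def fetch_levels(levels):
    """Fetch a page's data level by level: keys within a level are
    independent, so each level costs a single batched round trip."""
    results = {}
    for keys in levels:  # e.g. [["user:1"], ["friends:1", "feed:1"]]
        results.update(client.get_many(keys))  # one multiget per DAG level
    return results
```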

Page 11

• Client-server communication:
– embed the complexity of the system into a stateless client rather than in the memcached servers
– we rely on UDP for get requests, and on TCP via mcrouter for set and delete requests
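For illustration, a get over UDP can be issued with a plain socket: memcached's UDP frame starts with an 8-byte header (request id, sequence number, total datagrams, reserved), followed by the usual ASCII command. A minimal sketch, treating errors and timeouts as misses as the clients do:

```python
import socket
import struct

def udp_get(host, key, request_id=0, port=11211):
    """Send a memcached ASCII 'get' over UDP and return the raw reply.
    The 8-byte UDP frame header is: request id, sequence number,
    total datagrams in the message, and a reserved field."""
    header = struct.pack("!HHHH", request_id, 0, 1, 0)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(0.05)   # gets are best-effort over UDP
    sock.sendto(header + f"get {key}\r\n".encode(), (host, port))
    try:
        data, _ = sock.recvfrom(65536)
        return data[8:]     # strip the frame header from the reply
    except socket.timeout:
        return None         # treat errors/timeouts as a cache miss
```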

Page 12

• Incast congestion:
– Memcache clients implement flow-control mechanisms to limit incast congestion.
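The paper describes a sliding-window mechanism similar to TCP congestion control, bounding how many outstanding requests a client may have in flight. A much-simplified sketch (the growth and decrease rules here are illustrative assumptions):

```python
class SlidingWindow:
    """Simplified incast flow control: grow the window slowly on
    success, shrink it on timeout, never exceed `limit` outstanding
    requests at once."""
    def __init__(self, limit=32):
        self.window = 1.0
        self.limit = limit
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < int(self.window)

    def on_send(self):
        self.outstanding += 1

    def on_success(self):
        self.outstanding -= 1
        # additive-style growth, capped at the configured limit
        self.window = min(self.limit, self.window + 1.0 / self.window)

    def on_timeout(self):
        self.outstanding -= 1
        self.window = max(1.0, self.window / 2)  # multiplicative decrease
```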

Page 13

• Reducing Load
– Leases: a 64-bit token handed out by memcache to a client on a read miss; it can be used to solve two problems: stale sets and thundering herds.
– Stale sets: a stale set occurs when a web server sets a value in memcache that does not reflect the latest value that should be cached. This can occur when concurrent updates to memcache get reordered.
– Thundering herd: happens when a specific key undergoes heavy read and write activity.
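A sketch of how leases close the stale-set window. Memcached's real mechanism extends the existing get/set commands; the class and method names below are assumptions, and the rate-limiting of token issue that tames thundering herds is elided.

```python
import time

class LeasingCache:
    """Toy lease mechanism: a miss hands out a token bound to the key;
    a set is accepted only if it presents the token still on record,
    and a delete voids any outstanding token (preventing stale sets)."""
    def __init__(self):
        self.data = {}
        self.tokens = {}   # key -> currently valid lease token

    def get(self, key):
        if key in self.data:
            return self.data[key], None
        token = time.monotonic_ns()   # stand-in for a 64-bit token
        self.tokens[key] = token
        return None, token            # miss: caller should fill via set()

    def set(self, key, value, token):
        if self.tokens.get(key) != token:  # token voided by a delete
            return False                   # drop the (possibly stale) set
        del self.tokens[key]
        self.data[key] = value
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.tokens.pop(key, None)    # invalidate outstanding leases
```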

Page 14

• Stale values:
– When a key is deleted, its value is transferred to a data structure that holds recently deleted items, where it lives for a short time before being flushed. A get request can return a lease token or data that is marked as stale.

• Memcache Pools: we designate one pool (named wildcard) as the default and provision separate pools for keys whose residence in wildcard is problematic.
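Pool selection is essentially key-based routing in the client. A sketch of the idea; the prefix-to-pool mapping, the pool name other than wildcard, and the host lists are all made up for illustration:

```python
POOLS = {
    "wildcard": ["mc1:11211", "mc2:11211"],  # default pool
    "small":    ["mc3:11211"],               # hypothetical dedicated pool
}
PREFIX_TO_POOL = {"cfg:": "small"}           # hypothetical mapping

def pick_pool(key):
    """Route a key to its designated pool, defaulting to wildcard."""
    for prefix, pool in PREFIX_TO_POOL.items():
        if key.startswith(prefix):
            return POOLS[pool]
    return POOLS["wildcard"]
```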

Page 15

• Replication Within Pools: we choose to replicate a category of keys within a pool when
– the application routinely fetches many keys simultaneously
– the entire data set fits in one or two memcached servers
– the request rate is much higher than what a single server can manage
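With a replicated category, any replica can serve a whole batched read, while invalidations must reach every replica. A minimal sketch, assuming pymemcache-style client objects (`get_many`, `delete`):

```python
import random

def replicated_get(replicas, keys):
    """Reads: one replica serves the whole batch, so the request
    rate is spread evenly across replicas."""
    return random.choice(replicas).get_many(keys)

def replicated_delete(replicas, key):
    """Invalidations must reach every replica to stay consistent."""
    for client in replicas:
        client.delete(key)
```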

Page 16

• Handling Failures
– There are two scales at which we must address failures:
• a small number of hosts are inaccessible due to a network or server failure
• a widespread outage that affects a significant percentage of the servers within the cluster

– Gutter: a small pool of idle machines that temporarily takes over for the small number of failed hosts (see the sketch below)
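Gutter in one sketch: only a *failure* of the primary server (not an ordinary miss) redirects the request to the Gutter pool, and Gutter entries get a short expiration so they die out quickly. The client objects, exception types, and the 10-second TTL are illustrative assumptions.

```python
def get_with_gutter(primary, gutter, db, key):
    """Failover read: a failed primary is covered by the small
    Gutter pool, insulating the databases from the miss storm."""
    try:
        value = primary.get(key)
        if value is None:                     # ordinary miss: fill primary
            value = db.query(key)
            primary.set(key, value)
        return value
    except (TimeoutError, ConnectionError):   # host down or unreachable
        value = gutter.get(key)
        if value is None:
            value = db.query(key)             # hypothetical storage lookup
            gutter.set(key, value, expire=10) # short TTL limits staleness
        return value
```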

Page 17

Multiple front-end clusters: Regions

• Region:
– We split our web and memcached servers into multiple front-end clusters. These clusters, along with a storage cluster that contains the databases, define a region.
– We trade replication of data for more independent failure domains, tractable network configuration, and a reduction of incast congestion.

Page 18

Databases invalidate caches

•  Cached data must be invalidated after database updates

•  Solution: Tail the MySQL commit log and issue deletes based on transactions that have been committed
• Allows caches to be resynchronized in the event of a problem

[Diagram: McSqueal tails the commit log on the MySQL storage server and sends deletes to the memcache (MC) hosts in Front-End Clusters #1-#3]
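The McSqueal pipeline in miniature: a daemon follows committed transactions, extracts the memcache keys embedded alongside the SQL statements, and issues the deletes. Everything below (the log-reading interface and the comment format carrying the keys) is an illustrative assumption:

```python
import re

KEY_PATTERN = re.compile(r"/\* mc-keys: ([\w:,]+) \*/")  # assumed annotation

def tail_commit_log(log_lines, mc_client):
    """McSqueal-style invalidation: committed statements carry the
    memcache keys they affect; extract them and issue deletes."""
    for stmt in log_lines:          # stand-in for tailing the commit log
        m = KEY_PATTERN.search(stmt)
        if not m:
            continue
        for key in m.group(1).split(","):
            mc_client.delete(key)   # in production, routed via mcrouter
```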

Page 19

Invalidation pipeline: too many packets

• Aggregating deletes reduces packet rate by 18x

• Makes configuration management easier

• Each stage buffers deletes in case a downstream component is down

[Diagram: McSqueal daemons on each DB feed deletes through memcache routers, which fan them out to the MC hosts in each front-end cluster]
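The 18x packet-rate reduction comes from batching many deletes into one message per destination. A sketch of the buffering stage; the batch size, flush trigger, and retry behavior are made up for illustration:

```python
from collections import defaultdict

class DeleteAggregator:
    """Buffer deletes per destination router and flush in batches,
    so many invalidations share one packet. The buffer also rides
    out short downstream outages: a failed flush is re-queued."""
    def __init__(self, send_batch, batch_size=64):
        self.send_batch = send_batch      # callable: (dest, [keys]) -> None
        self.batch_size = batch_size
        self.pending = defaultdict(list)  # dest -> buffered keys

    def delete(self, dest, key):
        self.pending[dest].append(key)
        if len(self.pending[dest]) >= self.batch_size:
            self.flush(dest)

    def flush(self, dest):
        keys, self.pending[dest] = self.pending[dest], []
        try:
            self.send_batch(dest, keys)
        except ConnectionError:
            # downstream is down: keep the deletes buffered for retry
            self.pending[dest] = keys + self.pending[dest]
```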

Page 20

• Regional Pools
– We can reduce the number of replicas by having multiple front-end clusters share the same set of memcached servers. We call this a regional pool.

• Cold Cluster Warmup
– A system called Cold Cluster Warmup mitigates this by allowing clients in the “cold cluster” to retrieve data from the “warm cluster” rather than from persistent storage.
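Cold Cluster Warmup as a client-side read path: a miss in the cold cluster is retried against the warm cluster before touching the databases. Client objects are placeholders, and the paper's safeguards (such as a two-second delete hold-off to avoid racing invalidations) are elided here.

```python
def cold_get(cold, warm, db, key):
    """During warmup, a cold-cluster miss is served from the warm
    cluster's cache instead of hammering persistent storage."""
    value = cold.get(key)
    if value is not None:
        return value
    value = warm.get(key)      # retrieve from the warm cluster
    if value is None:
        value = db.query(key)  # only then fall back to the databases
    cold.set(key, value)       # fill the cold cluster for future reads
    return value
```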

Page 21

Across Regions: Geographically distributed clusters

• One region to hold the master databases and the other regions to contain read-only replicas

• Advantages
– putting web servers closer to end users can significantly reduce latency
– geographic diversity can mitigate the effects of events such as natural disasters or massive power failures
– new locations can provide cheaper power and other economic incentives

Page 22

Geographically distributed clusters

[Diagram: one Master region replicating to Replica regions]

Page 23

Writes in a non-master region: database updates go directly to the master

• Race between DB replication and a subsequent DB read

[Diagram: 1. a web server writes to the master DB; 2. deletes the key from memcache; 3. MySQL replication propagates to the replica DB while another web server misses in memcache and reads from the (possibly not-yet-updated) replica DB; 4. it sets a potentially stale value into memcache. Race!]

Page 24

Remote markers: set a special flag that indicates whether a race is likely

Write path (web server in a non-master region):
1. Set remote marker
2. Write to master
3. Delete from memcache
4. MySQL replication
5. Delete remote marker

Read-miss path:
if marker set
    read from master DB
else
    read from replica DB
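The remote-marker protocol above as straight-line code. The `regional_cache` object holding the marker and the `key + ":remote"` marker format are illustrative assumptions; in practice the replication stream carries the marker deletion (steps 4-5).

```python
def marked_write(regional_cache, master_db, local_cache, key, value):
    """Non-master write: leave a remote marker so readers know
    replication may not have caught up yet."""
    regional_cache.set(key + ":remote", 1)  # 1. set remote marker
    master_db.update(key, value)            # 2. write to master; replication
                                            #    later deletes the marker (4-5)
    local_cache.delete(key)                 # 3. delete from memcache

def marked_read(regional_cache, local_cache, master_db, replica_db, key):
    """Read-miss path: the marker redirects the refill to the master DB."""
    value = local_cache.get(key)
    if value is not None:
        return value
    if regional_cache.get(key + ":remote"):  # race likely: replica stale
        return master_db.query(key)
    return replica_db.query(key)
```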

Page 25

Single Server Improvements

• Performance Optimizations:
– allow the internal hash table to expand automatically (rehash)
– replace the single global lock used under multithreading with finer-grained locks
– give each thread its own UDP port to avoid contention
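To make the finer-grained-locking point concrete, here is a lock-striping sketch (a common way to realize it; the paper does not specify this exact scheme): each of N stripes guards a slice of the table, so threads touching different stripes do not contend.

```python
import threading

class StripedDict:
    """Finer-grained locking in miniature: one lock per stripe
    instead of one global lock over the whole hash table."""
    def __init__(self, stripes=16):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.shards = [dict() for _ in range(stripes)]

    def _slot(self, key):
        return hash(key) % len(self.shards)

    def get(self, key):
        i = self._slot(key)
        with self.locks[i]:
            return self.shards[i].get(key)

    def set(self, key, value):
        i = self._slot(key)
        with self.locks[i]:
            self.shards[i][key] = value
```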

Page 26

Page 27

Page 28

• Adaptive Slab Allocator
– Allows memory to move between slab classes: when the item about to be evicted from a slab class was used more recently than 20% above the average across classes, memory is taken from the least recently used slab class and handed to the class that needs it. The overall effect approximates a single, global LRU.

• The Transient Item Cache
– Some keys are set with short expiration times; after such a key expires but before it drifts to the LRU tail, the memory it occupies is wasted. These keys are therefore placed in a circular buffer of linked lists (effectively a circular array indexed by expiration time, with a linked list in each slot); once per second, the items in the head bucket are deleted.

• Software Upgrades
– Memcache data is kept in System V shared memory regions, so the software on a machine can be upgraded without losing the cached data.
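A sketch of the transient item cache's circular buffer; the bucket horizon and the use of Python sets are illustrative assumptions (items with TTLs beyond the horizon would stay in the ordinary LRU).

```python
class TransientItemCache:
    """Circular buffer of buckets indexed by seconds-until-expiry:
    short-lived keys go into the bucket for their expiration second,
    and once per second the head bucket is evicted eagerly instead
    of waiting for the items to drift to the LRU tail."""
    def __init__(self, horizon=60):
        self.buckets = [set() for _ in range(horizon)]
        self.head = 0                  # bucket for the current second

    def add(self, key, ttl_seconds):
        # assumes ttl_seconds < horizon; longer TTLs use the normal LRU
        slot = (self.head + ttl_seconds) % len(self.buckets)
        self.buckets[slot].add(key)

    def tick(self, cache):
        """Call once per second: evict everything expiring now."""
        for key in self.buckets[self.head]:
            cache.delete(key)
        self.buckets[self.head] = set()
        self.head = (self.head + 1) % len(self.buckets)
```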

Page 29

Memcache Workload

• Fanout

Page 30

• Response size

Page 31

• Pool Statistics

Page 32

• Invalidation Latency

Page 33

Conclusion

• Separating cache and persistent storage systems allows us to independently scale them

• Features that improve monitoring, debugging and operational efficiency are as important as performance

• Managing stateful components is operationally more complex than stateless ones. As a result, keeping logic in a stateless client helps iterate on features and minimize disruption.

• Simplicity is vital