Upload
lizbeth-quinn
View
222
Download
0
Embed Size (px)
Citation preview
Scaling Memcacheat Facebook
Presenter: Rajesh Nishtala ([email protected])Co-authors: Hans Fugal, Steven Grimm, MarcKwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy,Mike Paleczny, Daniel Peek, Paul Saab, David Stafford,Tony Tung, Venkateshwaran Venkataramani
Infrastructure Requirementsfor Facebook
1. Near real-time communication
2. Aggregate content on-the-fly frommultiple sources
3. Be able to access and update very popularshared content
4. Scale to process millions of user requestsper second
Design Requirements
Support a very heavy read load• Over 1 billion reads / second
• Insulate backend services from high read rates
Geographically Distributed
Support a constantly evolving product• System must be flexible enough to support a variety of use cases
• Support rapid deployment of new features
Persistence handled outside the system• Support mechanisms to refill after updates
Need more read capacity
Database
DatabaseDatabase
Memcache
4. Set (key)2. Miss (key)
3. DB lookup
• Two orders of magnitudemore reads than writes
• Solution: Deploy a fewmemcache hosts to handlethe read capacity
• How do we store data?
• Demand-filled look-aside cache
• Common case is data is
available in the cache
Web Server
1. Get (key)
Handling updates
• Memcache needs to beinvalidated after DB write
• Prefer deletes to sets• Idempotent
• Demand filled
• Up to web application
to specify which keysto invalidate afterdatabase update
Database
Memcache
2. Delete1. Database
update
Web Server
Memcache
While evolving our system we prioritize two major design goals.
• Any change must impact a userfacing or operational issue. Optimizations that have limited scope are rarely considered
• We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.
Roadmap
• Single front-end cluster–Read heavy workload
–Wide fanout
–Handling failures
• Multiple front-end clusters–Controlling data replication
–Data consistency
• Multiple Regions–Data consistency
Single front-end cluster
• Reducing Latency:focusing on the memcache client
–serialization–compression–request routing–request routing–request batching
• Parallel requests and batching
• Client-server communication:–embed the complexity of the system into a stateless client rather th
an in the memcached servers– We rely on UDP for get requests ,TCP via mcrouter for set,delete r
equests.
• Incast congestion:–Memcacheclients implement flowcontrol
mechanisms to limit incast congestion.
• Reducing Load– Leases: 是一个 64-bit的 Token,由MemCache在读请求时分配给 Client,可以用来解决两个问题: stale sets 和 thundering herds.
– stale sets :A stale set occurs when a web server sets a value in memcache that does not reflect the latest value that should be cached. This can occur when concurrent updates to memcacheget reordered
– Thundering herd :happens when a specific key undergoes heavy read and write activity
• Stale values:–When a key is deleted, its value is transferred to a data
structure that holds recently deleted items, where it lives for a short time before being flushed.A get request can return a lease token or data that is marked as stale.
• Memcache Pools: We designate one pool (named wildcard) as the default and provision separate pools for keys whose residence in wildcard is problematic.
• Replication Within Pools,We choose to replicate a cat
egory of keys within a pool :– the application routinely fetches many keys simultaneousl
y– the entire data set fits in one or two memcached
servers– the request rate is much higher than what
a single server can manage
• Handling Failures –There are two scales at which we must address failures:• a small number of hosts are inaccessible due to a network or server failure •a widespread outage that affects a significant percentage of the servers within the cluster
–Gutter
Multiple front-end clusters;Region
• Region:–We split our web and memcached servers into multiple
front-end clusters. These clusters, along with a storage cluster that contain the databases, define a region.– We trade replication of data for more independent fail
ure domains, tractable network configuration, and a reduction of incast congestion
Databases invalidate caches
• Cached data must be invalidated after database updates
• Solution: Tail the mysql commit log and issue deletes basedon transactions that have been committed• Allows caches to be resynchronized in the event of a problem
Front-End Cluster #1
Web Server
MC MC MC MC
Commit Log
MySQLStorage Server
Front-End Cluster #2
Web Server
MC MC MC
Front-End Cluster #3
Web Server
MC MC MC MC
McSqueal
Invalidation pipelineToo many packets
• Aggregating deletes reducespacket rate by 18x
• Makes configurationmanagement easier
• Each stage buffers deletes in
case downstream component isdown
McSqueal
DB
McSqueal
DB
McSqueal
DB
MC MC MC MC
Memcache
Routers
MC MC MC
Memcache
Routers
MC MC MC MC
Memcache
Routers
Memcache Routers
• Regional Pools–We can reduce the number of replicas by having multipl
e frontend clusters share the same set of memcached servers. We call this aregional pool .
• Cold Cluster Warmup–A system called Cold Cluster Warmup mitigates this by allo
wing clients in the “cold cluster” to retrieve data from the “warm cluster” rather than the persistent storage.
Across Regions: Geographically distributed clusters
• One region to hold the master databases and the other regions to contain read-only replicas;
• Advantages –putting web servers closer to end users can significantly
reduce latency–geographic diversity can mitigate the effects of events s
uch as natural disasters or massive power failures– new locations can provide cheaper power and other ec
onomic incentives
Replica Master
Geographically distributed clusters
Replica
Writes in non-masterDatabase update directly in master
• Race between DB replication and subsequent DB read
ReplicaDB
MemcacheMasterDB
1. Write to master
WebServer
3. Read from DB
(get missed)2. Delete from mc
WebServer
4. Set potentially
state value to
3. MySQL replication
memcache
Race!
ReplicaDB
Memcache
Web Server
MasterDB
2. Write to master
3. Delete from
memcache
5. Delete remote
marker
4. Mysql replication
Remote markersSet a special flag that indicates whether a race is likely
Read miss path:If marker set
read from master DBelse
read from replica DB
1. Set remotemarker
Single Server Improvements
• Performance Optimizations:–允许内部使用的 hashtable自动扩张 (rehash)–由原先多线程下的全局单一的锁改进为更细粒度的
锁–每个线程一个 UDP来避免竞争
• Adaptive Slab Allocator– 允许不同 Slab Class互换内存。当发现要删除的 item对应的 Slab Cl
ass最近使用的频率超过平均的 20%,将会把最近最少使用的 Slab Class的内存交互给需要的 Slab Class使用。整体思路类似一个全局的 LRU。
• The Transient Item Cache– 某些 key的过期时间会设置的较短,如果这些 key已经过期但是还没达到 LRU尾部时,它所占用的内存在这段时间就会浪费。因此可以把这些 key单独放在一个由链表组成的环形缓冲(即相当于一个循环数组,数组的每个元素是个链表,以过期时间作为数组的下标),每隔 1s,会将缓冲头部的数据给删除。
• Software Upgrades– Memcache的数据是保存在 System V的共享内存区域,方便机器上的软件升级。
Memcache Workload• Fanout
• Response size
• Pool Statistics
• Invalidation Latency
Conclusion• Separating cache and persistent storage systems allows us to i
ndependently scale them• Features that improve monitoring, debugging and operational
efficiency are as important as performance• Managing stateful components is operationally more complex
than stateless ones. As a result keeping logic in a stateless client helps iterate on features and minimize disruption.
• Managing stateful components is operationally more complex than stateless ones. As a result keeping logic in a stateless client helps iterate on features and minimize disruption.
• Simplicity is vital