Enhancing the Scalability of Memcached
Rajiv Kapoor, Intel Corporation, Sep 19 2012


Page 1: [B5]memcached scalability-bag lru-deview-100

Enhancing the Scalability of Memcached

Rajiv Kapoor, Intel Corporation Sep 19 2012

Page 2: [B5]memcached scalability-bag lru-deview-100

Content

• What is Memcached
• Usage model
• Measuring performance
• Baseline performance & scalability
• Performance root cause
• Base transaction flow
• Optimization goals, design considerations
• Optimized transaction flow
• Optimization details
• Optimized version performance
• Summary

2

Page 3: [B5]memcached scalability-bag lru-deview-100

What is Memcached?

• Open Source distributed memory caching system
  − Typically serves as a cache for persistent databases
  − In-memory key-value store for quick data access
  − For a particular “key”, a “value” is stored/deleted/retrieved, etc.
  − Provides a networked data caching API that is simple to use and set up
• Used by many companies with web-centric businesses
• Most common usage model - web data caching
  − Original data resides in a persistent database
  − Database queries are expensive
  − Memcached caches the data to provide low-latency access
  − Helps reduce the load on the database
• Computational cache
• Temporary object store

3

Page 4: [B5]memcached scalability-bag lru-deview-100

Web data caching usage model

• Memcached tier acts as a cache for the database tier
  − Cache is spread over several memcached servers
• Client requests the “value” associated with a “key” (flow sketched below)
• A “GET” request for the “key” is sent to memcached
• If “key” is found
  − Memcached returns the “value” for “key”
• If “key” is not found
  − The persistent database is queried for “key”
  − The “value” from the database is returned to the client
  − A “SET” request is sent to memcached with “key” & “value”
• Key-value pair stays in the cache unless
  − It is evicted because of cache LRU policies
  − It is explicitly removed by a “DELETE” request
• Typical operations
  − GET, SET, DELETE, STATS, REPLACE, etc.
• Most frequent transaction is “GET”
  − Impacts performance of the most common use cases
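A minimal sketch of this cache-aside flow is below. mc_get(), mc_set() and db_query() are hypothetical stand-ins for a memcached client and a database driver (not a specific library's API), with toy bodies so the sketch compiles and runs standalone.

```c
/* Cache-aside lookup sketch. mc_get(), mc_set() and db_query() are
 * hypothetical stand-ins for a memcached client and a database driver;
 * error handling and key expiry are omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *cached = NULL;   /* toy single-entry "cache" so this compiles standalone */

static char *mc_get(const char *key)  { (void)key; return cached ? strdup(cached) : NULL; }
static void  mc_set(const char *key, const char *val) { (void)key; free(cached); cached = strdup(val); }
static char *db_query(const char *key) { (void)key; return strdup("value-from-db"); }

static char *lookup(const char *key)
{
    char *value = mc_get(key);      /* 1. GET from memcached                 */
    if (value)
        return value;               /* 2. hit: low-latency return            */
    value = db_query(key);          /* 3. miss: ask the persistent database  */
    if (value)
        mc_set(key, value);         /* 4. SET so the next GET hits the cache */
    return value;
}

int main(void)
{
    char *v1 = lookup("user:42");   /* first call misses and populates the cache */
    char *v2 = lookup("user:42");   /* second call is served from the cache      */
    printf("%s / %s\n", v1, v2);
    free(v1); free(v2);
    return 0;
}
```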

4

Page 5: [B5]memcached scalability-bag lru-deview-100

Measuring performance

• Measure performance of the most important transaction - “get”
• Best performance = max “get” Requests Per Second (RPS) under SLA
  − SLA (Service Level Agreement): average “get” latency <= 1 ms
• Measurement configuration is “client-server”
  − Run memcached on one or more servers
  − Run load generator(s) on client(s) to send requests to the memcached servers
  − The load generator keeps track of transactions and reports results
• Process (sketched below)
  − Load generator sends “set” requests to prime the cache with key-value pairs
  − For incremental RPS in a range, do the following until avg latency > 1 ms:
    − Send random-key “gets” for 60 secs, calculate average latency
• S/W and H/W configuration
  − Open Source Memcached V 1.6, base and optimized
  − Open Source Mcblaster load generator
  − Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory
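In outline, the sweep looks like the sketch below. run_load() is a hypothetical stand-in for what the load generator (e.g. Mcblaster) does internally; the RPS range, step and toy latency model are illustrative only.

```c
/* Sketch of the SLA sweep: increase the offered "get" rate until the
 * measured average latency exceeds 1 ms, and report the highest rate
 * that still met the SLA. */
#include <stdio.h>

static double run_load(int rps, int duration_sec)
{
    (void)duration_sec;
    return rps / 1.0e6;              /* toy latency model so the sketch runs */
}

int main(void)
{
    const double sla_ms = 1.0;       /* SLA: average "get" latency <= 1 ms */
    int best_rps = 0;

    for (int rps = 50000; rps <= 2000000; rps += 50000) {   /* illustrative range */
        double avg_ms = run_load(rps, 60);                  /* 60-second run       */
        if (avg_ms > sla_ms)
            break;                   /* SLA violated: stop the sweep        */
        best_rps = rps;              /* highest rate still within the SLA   */
    }
    printf("max GET RPS under 1 ms SLA: %d\n", best_rps);
    return 0;
}
```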

5

Page 6: [B5]memcached scalability-bag lru-deview-100

Baseline performance & core scalability

•  Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory

•  Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF

6

No scalability beyond 3 cores, degrades beyond 4

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory, Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF

Page 7: [B5]memcached scalability-bag lru-deview-100

Performance root cause

• Profile during “gets” shows lots of time spent in locks
• Drill-down into the code shows coarse-grained global cache locks
  − Held for most of a thread’s execution time
• Removing the global locks & measuring “gets” showed substantial improvement
  − Unsafe, done only as a proof of concept
• “Top” shows unbalanced CPU core utilization; possibilities are:
  − Sub-optimal network packet handling and distribution
  − Thread migration between cores

7

Page 8: [B5]memcached scalability-bag lru-deview-100

Transaction flow

8

• Incoming requests from clients
• Libevent distributes them to memcached threads
  − # of memcached threads = # of cores
  − No thread affinity
• Threads do key hashing in parallel
• Hash table processing to
  − Find a place for a new item (key-value pair)
  − Find the location of an existing item
• LRU processing to maintain cache policy
  − Move the item to the front of the list, indicating it was most recently accessed
• A global cache lock around hash table and LRU processing (sketched below)
  − Serializes all transactions on all threads
  − This is the key bottleneck to scalability
• Final responses handled in parallel
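The baseline critical section can be pictured roughly as follows. Names and signatures are illustrative, not the exact memcached 1.6 symbols; the point is that the lookup and the LRU update share one global mutex.

```c
/* Simplified view of the baseline critical section: one global mutex
 * guards both the hash-table lookup and the LRU update, so every
 * worker thread serializes here no matter which key it touches. */
#include <pthread.h>
#include <stddef.h>

typedef struct item { char key[64]; void *value; } item;

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Toy stubs so the sketch compiles; the real code walks the hash chain
 * and splices the item to the head of the LRU list. */
static item *hash_find(const char *key) { (void)key; return NULL; }
static void  lru_touch(item *it)        { (void)it; }

item *get_item(const char *key)
{
    pthread_mutex_lock(&cache_lock);    /* the single global lock          */
    item *it = hash_find(key);          /* hash table processing           */
    if (it)
        lru_touch(it);                  /* LRU update under the same lock  */
    pthread_mutex_unlock(&cache_lock);
    return it;                          /* response handling is parallel   */
}
```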

Page 9: [B5]memcached scalability-bag lru-deview-100

The hash table

• The hash table is arranged as an array of buckets
• Each bucket has a singly linked list as a hash chain
• The hashed key is used to find the bucket it belongs in
• The item (key-value pair) is then inserted into / retrieved from the hash chain of that bucket (sketched below)
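A minimal sketch of that layout. The function names echo memcached's assoc.c, but the hash function, table size and signatures here are simplified illustrations rather than the real code.

```c
/* Hash table as an array of buckets, each holding a singly linked
 * hash chain. */
#include <stdlib.h>
#include <string.h>

#define HASHPOWER 16
#define NBUCKETS  (1u << HASHPOWER)

typedef struct item {
    struct item *h_next;      /* next item in this bucket's hash chain */
    char        *key;
    void        *value;
} item;

static item *buckets[NBUCKETS];             /* the array of hash chains */

static unsigned int hash(const char *key)   /* toy hash, stand-in for the real one */
{
    unsigned int h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h;
}

item *assoc_find(const char *key)
{
    /* Hash the key to pick a bucket, then walk that bucket's chain. */
    for (item *it = buckets[hash(key) & (NBUCKETS - 1)]; it; it = it->h_next)
        if (strcmp(it->key, key) == 0)
            return it;
    return NULL;
}

void assoc_insert(item *it)
{
    unsigned int b = hash(it->key) & (NBUCKETS - 1);
    it->h_next = buckets[b];                /* push onto the front of the chain */
    buckets[b] = it;
}
```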

9

Page 10: [B5]memcached scalability-bag lru-deview-100

The LRU

• LRU - Least Recently Used cache management scheme
  − The cache is a finite amount of memory - old items are evicted to make room for new ones
  − The LRU policy determines the eviction order of cache items
  − The oldest active cache item is evicted first
• Uses a doubly linked list for quick manipulation (sketched below)
  − The head has the most recently used item
  − A GET for an item removes it from its current position and moves it to the head
  − On eviction the tail is checked for the oldest item
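An illustrative sketch of the bookkeeping a GET and an eviction trigger on such a doubly linked list (not the exact memcached code):

```c
/* Doubly linked LRU list: head = most recently used, tail = eviction
 * candidate. */
#include <stddef.h>

typedef struct item {
    struct item *prev, *next;   /* LRU list links */
} item;

static item *lru_head, *lru_tail;

static void lru_unlink(item *it)
{
    if (it->prev) it->prev->next = it->next; else lru_head = it->next;
    if (it->next) it->next->prev = it->prev; else lru_tail = it->prev;
    it->prev = it->next = NULL;
}

static void lru_push_head(item *it)
{
    it->next = lru_head;
    it->prev = NULL;
    if (lru_head) lru_head->prev = it; else lru_tail = it;
    lru_head = it;
}

/* On a GET hit: move the item to the head (most recently used). */
void lru_touch(item *it)
{
    lru_unlink(it);
    lru_push_head(it);
}

/* On eviction: take the oldest item from the tail. */
item *lru_evict_oldest(void)
{
    item *victim = lru_tail;
    if (victim) lru_unlink(victim);
    return victim;
}
```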

10

Page 11: [B5]memcached scalability-bag lru-deview-100

Why the global lock

• Linked lists are used in both the hash table & the LRU
• Corruption can occur if the lock is removed (illustrated below)
  − Example: two nearby items being removed concurrently
  − Higher chance of corruption in the LRU because of the doubly linked list
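To make the hazard concrete, here is an illustrative sketch (not memcached code) of how two unsynchronized removals of adjacent nodes corrupt a doubly linked list; the interleaving in the comments is one possible schedule.

```c
/* Why unsynchronized list surgery corrupts the LRU. Given the list
 * A <-> B <-> C <-> D, suppose thread 1 removes B while thread 2
 * removes C, with no lock:
 *
 *   thread 1: remove(B)  reads B->prev == A, B->next == C
 *   thread 2: remove(C)  reads C->prev == B, C->next == D
 *   thread 1: A->next = C; C->prev = A;
 *   thread 2: B->next = D; D->prev = B;   // writes through the stale view
 *
 * Afterwards A->next still reaches the "removed" node C and D->prev
 * points at the "removed" node B, so once those items are freed or
 * reused, later traversals follow dangling pointers. The global lock
 * prevents the two removals from overlapping. */
typedef struct node { struct node *prev, *next; } node;

void list_remove(node *n)     /* middle-node removal; safe only under the lock */
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}
```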

11

Page 12: [B5]memcached scalability-bag lru-deview-100

Optimization goals, design considerations

• Goals
  − Must scale well with larger core counts
  − Hash distribution should have little effect on performance
    − Same performance accessing 1 unique key or 100k unique keys
  − Changes to the LRU must maintain/increase hit rates
    − ~90% with the test data set
• Implementation considerations
  − Any lock removal or reduction should be safe
  − No additional data should be added to cache items
    − Millions to billions of cache items in a fully populated instance
    − A single extra 64-bit field would reduce usable memory considerably, leading to a reduced hit rate
  − Focus on GETs for best performance
    − Most memcached instances are read dominated
    − The new design should account for this and optimize for read traffic
  − Transaction ordering not guaranteed – just like the original implementation

12

Page 13: [B5]memcached scalability-bag lru-deview-100

Optimized transaction flow

13

Original
• Global lock serializes hash table and LRU operations

Optimized
• Non-blocking GETs using a “Bag” LRU scheme
• Better parallelization for SET/DELETE with striped locks

Page 14: [B5]memcached scalability-bag lru-deview-100

SET/DEL optimization - parallel hash table

• Uses striped locks instead of a global lock
  − Fine-grained collection of locks instead of a single global lock
• Makes use of a fixed-size, shared collection of locks for the entire hash table
  − Allows for a highly scalable hash table solution
  − Fixed overhead
• Number of locks is a power of two so the owning lock can be determined quickly
  − Bitwise-AND the bucket index with (number of locks - 1) to determine the lock (sketched below)
• Not used for GETs
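A sketch of the lock-selection step, assuming a power-of-two lock count; the stripe count and names are illustrative, not taken from the optimized source.

```c
/* Striped locks for the hash table: a fixed, power-of-two sized array
 * of mutexes shared by all buckets. Because the count is a power of
 * two, picking a lock is a single bitwise AND with (count - 1). */
#include <pthread.h>

#define N_STRIPES 4096u                      /* must be a power of two */
static pthread_mutex_t stripe_locks[N_STRIPES];

void stripes_init(void)
{
    for (unsigned i = 0; i < N_STRIPES; i++)
        pthread_mutex_init(&stripe_locks[i], NULL);
}

static pthread_mutex_t *lock_for_bucket(unsigned int bucket)
{
    /* Power-of-two count: bucket & (N_STRIPES - 1) selects the stripe. */
    return &stripe_locks[bucket & (N_STRIPES - 1)];
}

/* SET/DELETE path: lock only the stripe covering this bucket, so
 * operations on buckets under different stripes proceed in parallel. */
void hash_set_locked(unsigned int bucket)
{
    pthread_mutex_t *l = lock_for_bucket(bucket);
    pthread_mutex_lock(l);
    /* ... insert/replace/delete the item in this bucket's chain ... */
    pthread_mutex_unlock(l);
}
```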

14

Page 15: [B5]memcached scalability-bag lru-deview-100

SET/DEL optimization - parallel hash table (cont.)

• Each lock services Z buckets
• The number of locks (and hence Z) is chosen to balance parallelism against lock-maintenance overhead
• Multiple buckets can be manipulated in parallel

15

Page 16: [B5]memcached scalability-bag lru-deview-100

GET optimization – removing the global lock

• No global lock during hash table processing for a GET
• With no global lock, two situations must be handled
  − Expansion of the hash table during a GET
    − The hash table expands if there are a lot more items than the buckets can handle
  − SET/DEL of an item during a GET
• Handling hash table expansion during a GET
  − If expanding, wait for it to finish before looking up the hash chain
  − If not expanding, find the data in the hash chain and return it
• Handling a SET/DEL during a GET
  − If the hash table is expanding, wait for it to finish before modifying the hash chain
  − Modify pointers in the right order using atomic operations to ensure correct hash chain traversal for GETs (sketched below)
• A GET may still happen while the item is being modified (SET/DEL/REPLACE)
  − Is that a problem?
  − No, as long as traversal is correct, because operation order is not guaranteed anyway
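One way to sketch the "modify pointers in the right order" idea for inserting at the head of a hash chain, using C11 atomics. This illustrates the ordering constraint only (insertion, not unlinking) and is not the exact code in the optimized branch; names are illustrative.

```c
/* Inserting into a hash chain while lockless GETs may be traversing it:
 * fully initialize the new item and its next pointer first, then
 * publish it with a single atomic store of the bucket head. A reader
 * either sees the old head (missing the new item, acceptable because
 * ordering is not guaranteed) or the new head with a valid next
 * pointer -- never a half-linked chain. */
#include <stdatomic.h>
#include <string.h>

typedef struct item {
    struct item *h_next;
    const char  *key;
    void        *value;
} item;

/* Bucket head is an atomic pointer so GETs can read it without a lock. */
static _Atomic(item *) bucket_head;

void chain_insert(item *new_it)           /* caller holds the stripe lock */
{
    /* 1. Link the new item to the current head (not yet visible). */
    new_it->h_next = atomic_load_explicit(&bucket_head, memory_order_relaxed);
    /* 2. Publish: release ordering makes the item's fields visible
     *    before any reader can follow the new head pointer. */
    atomic_store_explicit(&bucket_head, new_it, memory_order_release);
}

item *chain_find(const char *key)         /* lockless GET path */
{
    for (item *it = atomic_load_explicit(&bucket_head, memory_order_acquire);
         it != NULL; it = it->h_next)
        if (strcmp(it->key, key) == 0)
            return it;
    return NULL;
}
```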

16

Page 17: [B5]memcached scalability-bag lru-deview-100

GET optimization – Parallel Bag LRU

• Replaces the original doubly linked list LRU
• Basic concept is to group items with similar time stamps into “bags”
  − As before, no ordering is guaranteed
• Has all the functionality of the original LRU
• Re-uses the original item data structure – no additions
• SET to a bag uses an atomic compare-and-swap operation
• GET from a bag is lockless
• DEL requests do nothing to the Bag LRU
• LRU cleanup is delegated to a “cleaner thread”
  − Acts like “garbage collection/cleanup”
  − Evicts expired items quickly
  − Handles item cleanup from deletes
  − Reorders cache items based on update time
  − Adds additional bags as needed

17

Page 18: [B5]memcached scalability-bag lru-deview-100

Parallel Bag LRU details – Bag Array

18

Original LRU

Bag LRU (data layout sketched below)
• A list of bags in chronological order
• Bags have a list of items
• The newest bag has recently allocated or accessed items
• An alternate bag is used by the cleaner thread to avoid lock contention on inserts to the newest bag
• The bag head has pointers to the oldest and newest bags for quick access
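A rough sketch of the data layout this describes. All names are illustrative; the actual structures are in the bagLRU branch on GitHub linked at the end of the deck.

```c
/* Bag LRU layout sketch: a chronological list of bags, each bag
 * holding a singly linked list of items. The bag head tracks the
 * newest and oldest bags; the alternate newest bag gives the cleaner
 * thread a place to insert without contending on the newest bag. */
#include <time.h>

typedef struct item item;
typedef struct bag  bag;

struct item {
    item   *bag_next;     /* next item in this bag's singly linked list */
    bag    *my_bag;       /* bag this item currently claims to be in    */
    time_t  last_access;  /* updated lazily by GET                      */
};

struct bag {
    bag    *newer;        /* next bag in chronological order            */
    item   *head;         /* items inserted into this bag               */
    time_t  opened;       /* creation time of the bag                   */
};

struct bag_head {
    bag *oldest;          /* where the cleaner thread starts            */
    bag *newest;          /* where SETs and refreshed GETs point items  */
    bag *alternate;       /* used by the cleaner to avoid contention    */
};
```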

Page 19: [B5]memcached scalability-bag lru-deview-100

Parallel Bag LRU details – Bags

• Each bag has a singly linked list of cache items
• A SET causes the new item to be inserted into the “newest bag”
• A GET updates the item’s timestamp & its pointer to point to the “newest bag” (sketched below)
• Evictions are handled by the cleaner thread
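A sketch of those two paths using C11 atomics and minimal stand-in structs (illustrative, not the exact bagLRU code): the SET path pushes onto the newest bag with a compare-and-swap loop, while the GET path does no list surgery at all.

```c
/* SET path: push the item onto a bag's list with a CAS loop, so
 * concurrent SETs never need a lock. GET path: just refresh the
 * timestamp and point the item at the newest bag; the cleaner thread
 * physically relocates it later. */
#include <stdatomic.h>
#include <time.h>

typedef struct item item;
typedef struct bag  bag;
struct item { _Atomic(item *) bag_next; bag *my_bag; time_t last_access; };
struct bag  { _Atomic(item *) head; };

static bag  initial_bag;
static bag *newest_bag = &initial_bag;   /* maintained by the cleaner thread */

void bag_insert(bag *b, item *it)        /* lock-free SET into a bag */
{
    item *old_head;
    do {
        old_head = atomic_load(&b->head);
        atomic_store(&it->bag_next, old_head);
    } while (!atomic_compare_exchange_weak(&b->head, &old_head, it));
    it->my_bag = b;
}

void bag_touch(item *it)                 /* lockless GET bookkeeping */
{
    it->last_access = time(NULL);        /* refresh recency                 */
    it->my_bag = newest_bag;             /* logically move to the newest    */
                                         /* bag; the cleaner relocates it   */
}
```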

19

Page 20: [B5]memcached scalability-bag lru-deview-100

Parallel Bag LRU – Cleaner Thread

• Periodically does housekeeping on the Bag LRU (outlined below)
  − Currently every 5 secs
• Starts cleaning from the oldest bag’s oldest item
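In outline, the cleaner thread's loop might look like the sketch below. The helper functions are hypothetical stand-ins for the real bagLRU housekeeping; only the 5-second period comes from the slide.

```c
/* Cleaner thread sketch: every 5 seconds, walk bags from oldest to
 * newest, evicting expired items, completing deletes, relocating items
 * whose my_bag pointer says they belong in a newer bag, and opening a
 * fresh newest bag when needed. */
#include <pthread.h>
#include <unistd.h>
#include <stdbool.h>

/* Hypothetical helpers; no-op bodies so the sketch compiles standalone. */
static bool bags_need_cleaning(void)     { return false; }
static void clean_oldest_bags(void)      { /* evict expired / deleted items    */ }
static void relocate_touched_items(void) { /* move items toward the newest bag */ }
static void add_new_bag_if_needed(void)  { /* keep a fresh bag for new inserts */ }

static void *cleaner_main(void *arg)
{
    (void)arg;
    for (;;) {
        if (bags_need_cleaning()) {
            clean_oldest_bags();
            relocate_touched_items();
        }
        add_new_bag_if_needed();
        sleep(5);                    /* period quoted on the slide */
    }
    return NULL;
}

/* Started once at server init, e.g.:
 *   pthread_t tid;
 *   pthread_create(&tid, NULL, cleaner_main, NULL);
 */
```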

20

Page 21: [B5]memcached scalability-bag lru-deview-100

Optimizations - Misc

• Used thread affinity to bind 1 memcached thread per core (sketched below)
• Configured the NIC driver to evenly distribute incoming packets over CPUs
  − 1 NIC queue per logical CPU, affinitized to a logical CPU
• Irqbalance and iptables services turned off
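For the affinity piece, a sketch of pinning one worker thread per core with pthread_setaffinity_np (a glibc extension); the 1:1 thread-to-core mapping mirrors the configuration in the slides, and the core numbering is illustrative.

```c
/* Pin each memcached worker thread to its own core so threads do not
 * migrate between cores. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_thread_to_core(pthread_t thread, int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);                     /* allow only this core */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Typical use right after creating worker i:
 *   pin_thread_to_core(worker_tid[i], i);
 * NIC receive queue i's interrupts are then steered to the same core
 * (e.g. via /proc/irq/<n>/smp_affinity), with irqbalance disabled so
 * the kernel does not undo the placement. */
```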

21

Page 22: [B5]memcached scalability-bag lru-deview-100

Optimized performance & core scaling

•  Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory

•  Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF

22

Linear scaling with optimizations

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory, Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF

Page 23: [B5]memcached scalability-bag lru-deview-100

Server capacity

•  Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory

23

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory, Intel® Turbo Boost Technology OFF/ON, Intel® Hyper-Threading Technology OFF/ON

Overall 900% gains vs. baseline
Turbo and HT boost performance by 31%

Page 24: [B5]memcached scalability-bag lru-deview-100

Efficiency and hit rate

• Hit rate measured with a synthetic benchmark increased slightly
  − At ~90% - similar to that of the original version
• Efficiency (transactions per Watt) increased by 3.4x
  − Mostly due to much higher RPS for little increase in power
  − Power draw would be less in a production environment

24

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory, Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology ON

•  Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory

Page 25: [B5]memcached scalability-bag lru-deview-100

Summary

• Base core/thread scalability is hampered by locks
  − No throughput scaling beyond 3 cores, degradation beyond 4
• Lockless “GETs” with the Bag LRU improve scalability
  − Linear up to the measured 16 cores
  − No increase in average latency
  − No loss in hit rate (~90%)
  − Same performance for random and hot/repeated keys
• Striped locks parallelize hash table access for SET/DEL
• Bag LRU source code is available on GitHub
  − https://github.com/rajiv-kapoor/memcached/tree/bagLRU

25

Page 26: [B5]memcached scalability-bag lru-deview-100

Thank You

Page 27: [B5]memcached scalability-bag lru-deview-100

Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. •  A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in

personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

•  Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

•  The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

•  Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

•  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number.

•  Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. •  Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be

obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm •  [Add any code names from previous pages] and other code names featured are used internally within Intel to identify products

that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user

•  Intel, [Add words with TM or R from previous pages..ie Xeon, Core, etc] and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

•  *Other names and brands may be claimed as the property of others. •  Copyright ©2012 Intel Corporation.

Page 28: [B5]memcached scalability-bag lru-deview-100

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Page 29: [B5]memcached scalability-bag lru-deview-100

Legal Disclaimer •  Built-In Security: No computer system can provide absolute security under all conditions. Built-in security features

available on select Intel® Core™ processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon configuration. Consult your PC manufacturer for more details.

•  Enhanced Intel SpeedStep® Technology - See the Processor Spec Finder at http://ark.intel.com or contact your Intel representative for more information.

•  Intel® Hyper-Threading Technology (Intel® HT Technology) is available on select Intel® Core™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support Intel HT Technology, visit http://www.intel.com/info/hyperthreading.

•  Intel® 64 architecture requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t

•  Intel® Turbo Boost Technology requires a system with Intel Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo

•  Other Software Code Disclaimer Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Page 30: [B5]memcached scalability-bag lru-deview-100

Risk Factors The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including supply constraints and other disruptions affecting customers; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. Intel is in the process of transitioning to its next generation of products on 22nm process technology, and there could be execution and timing issues associated with these changes, including products defects and errata and lower than anticipated manufacturing yields. The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. The majority of Intel’s non-marketable equity investment portfolio balance is concentrated in companies in the flash memory market segment, and declines in this market segment or changes in management’s plans with respect to Intel’s investments in this market segment could result in significant impairment charges, impacting restructuring charges as well as gains/losses on equity investments and interest and other. 
Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent Form 10-Q, Form 10-K and earnings release.

Rev. 5/4/12

Page 31: [B5]memcached scalability-bag lru-deview-100

Summary

Memcached is a popular key-value caching service used by web service delivery companies to reduce the latency of serving data to consumers and to reduce the load on back-end database servers. It has a scale-out architecture that easily supports increasing throughput by simply adding more memcached servers, but at the individual server level, scaling up to higher core counts is less rewarding. In this talk we introduce optimizations that break through such scalability barriers and allow all cores in a server to be used effectively. We explain the new algorithms implemented to achieve an almost 6x increase in throughput while maintaining a 1 ms average latency SLA, by utilizing concurrent data structures, a new cache replacement policy, and network optimizations.

31

Page 32: [B5]memcached scalability-bag lru-deview-100

Optimized transaction flow

32