Caching and tuning fun for high scalability
Wim Godden, Cu.be Solutions
Who am I ?
Wim Godden (@wimgtr)
Founder of Cu.be Solutions (http://cu.be)
Open Source developer since 1997
Developer of OpenX, PHPCompatibility, Nginx SCL, ...
Speaker at PHP and Open Source conferences
Who are you ?
Developers ?
System/network engineers ?
Managers ?
Caching experience ?
Goals of this tutorial
Everything about caching and tuning
A few techniques
How-to
How-NOT-to
Increase reliability, performance and scalability
5 visitors/day -> 5 million visitors/day
(Don't expect miracle cure !)
LAMP
Architecture
Test page
3 DB-queries :
select firstname, lastname, email from user where user_id = 5;
select title, createddate, body from article order by createddate desc limit 5;
select title, createddate, body from article order by score desc limit 5;
Page just outputs result
Our base benchmark
Apachebench = useful enough
Result ?
                 Single webserver      Proxy
                 Static     PHP        Static     PHP
Apache + PHP     3900       17.5       6700       17.5
Limit : CPU, network or disk
Limit : database
Caching
What is caching ?
CACHE
What is caching ?
x = 5, y = 2 -> n = 50
Same result -> CACHE
select * from article join user on article.user_id = user.id order by created desc limit 10
Doesn't change all the time
Theory of caching
GET /page -> Page
$data = get('key') -> Cache returns false
if ($data == false) : select data from table -> DB
$data = returned result
set('key', $data) -> Cache
Theory of caching
DB
Cache
HIT
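The get / miss / query / set flow above might be sketched as follows. This is a minimal illustration, not the speaker's code: ArrayCache and getData() are hypothetical names, and the array-backed cache just stands in for Memcache, APC or whatever storage you pick.

```php
<?php
// Hypothetical in-memory cache standing in for Memcache, APC, etc.
final class ArrayCache
{
    private $items = [];

    // Returns the cached value, or false on a miss (as on the slide).
    public function get(string $key)
    {
        return array_key_exists($key, $this->items) ? $this->items[$key] : false;
    }

    public function set(string $key, $value): void
    {
        $this->items[$key] = $value;
    }
}

// Hypothetical helper: $loadFromDb stands in for "select data from table".
function getData(ArrayCache $cache, string $key, callable $loadFromDb)
{
    $data = $cache->get($key);
    if ($data === false) {          // miss: go to the source...
        $data = $loadFromDb();
        $cache->set($key, $data);   // ...and store the result for next time
    }
    return $data;                   // on a hit the DB is never touched
}

$cache = new ArrayCache();
$dbCalls = 0;
$query = function () use (&$dbCalls): array {
    $dbCalls++;                     // counts how often we "hit the database"
    return ['title' => 'Hello', 'body' => '...'];
};

$first  = getData($cache, 'article:5', $query);  // miss, runs the query
$second = getData($cache, 'article:5', $query);  // hit, served from cache
```

Using false as the miss marker follows the slide, but it means a legitimately cached false can't be told apart from a miss.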
Caching goals - 1st goal
Reduce # of concurrent requests
Reduce the load
Caching serves 3 purposes. Firstly, to reduce the number of requests, or the load, at the source of information, which can be a database server, a content repository, or anything else.
Caching goals - 2nd goal
Secondly, you want to improve the response time of each request. If you request a page that takes 50ms to load without caching and you get 10 hits/second, you won't be able to serve those requests with 5 Apache processes. If you could cache some of the data used on the page, you might be able to return the page in 20ms. That doesn't just improve the user experience, it also reduces the load on the webserver, since the number of concurrent connections is a lot lower : connections are closed faster, so you can handle more connections and, as a result, more hits on the same machine. If you don't cache, you need more Apache processes, which eat memory and other system resources and, as a result, also cause context switching.
Some figures
Pageviews : 5000 (4000 on 10 pages)
Avg. loading time : 200ms
Cache 10 pages
Avg. loading time : 20ms -> Total avg. loading time : 56ms
Worth it ?
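The 56ms figure is a weighted average over cached and uncached pageviews. A quick check using only the numbers from the slide (the variable names are just for illustration) :

```php
<?php
// Figures from the slide.
$totalViews  = 5000;
$cachedViews = 4000;   // pageviews that land on the 10 cached pages
$uncachedMs  = 200;    // avg. loading time without cache
$cachedMs    = 20;     // avg. loading time when served from cache

$uncachedViews = $totalViews - $cachedViews;

// Weighted average : (4000 * 20 + 1000 * 200) / 5000
$totalAvgMs = ($cachedViews * $cachedMs + $uncachedViews * $uncachedMs) / $totalViews;

printf("Total avg. loading time: %d ms\n", $totalAvgMs);
```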
Caching goals - 3rd goal
Send less data across the network / Internet
You benefit : lower bill from upstream provider
Users benefit : faster page load
Wait a second... that's a frontend problem !
True, but remember : the backend is transmitting it !
More tuning : reduce the amount of data that needs to be sent over the network or the Internet. This benefits you, as application provider, because you have less traffic on your upstream connection. It's also better for the end user, because there is less data to download. Of course this is partly a frontend matter, which we'll discuss near the end of the tutorial.
Caching techniques
#1 : Store entire pages
Company Websites
Blogs
Full pages that don't change
Render -> store in cache -> retrieve from cache
The first way to cache in a web application, or actually more commonly a website, is to cache entire pages. This used to be a very popular caching technique, back in the days when pages were very static.
It's still popular for things like company websites, blogs, and any site that has pages that don't change a lot.
Basically, you just render the page once, put it in a cache, and from that moment onwards, you just retrieve it from the cache.
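A rough sketch of render once, store, then retrieve, using PHP output buffering. The cachedPage() helper and the plain array used as storage are assumptions for the sake of the example; a real site would put the HTML in a file, Memcache or a reverse proxy.

```php
<?php
// Hypothetical helper: renders a page once and serves the stored copy afterwards.
// $store is a plain array standing in for a real cache backend.
function cachedPage(array &$store, string $url, callable $render): string
{
    if (isset($store[$url])) {
        return $store[$url];            // retrieve from cache
    }
    ob_start();                         // capture everything the page echoes
    $render();
    $html = (string) ob_get_clean();
    $store[$url] = $html;               // store in cache
    return $html;
}

$store = [];
$renders = 0;
$aboutPage = function () use (&$renders): void {
    $renders++;                         // counts real renders
    echo '<h1>About us</h1>';
};

$a = cachedPage($store, '/about', $aboutPage);  // rendered
$b = cachedPage($store, '/about', $aboutPage);  // served from cache
```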
Caching techniques
#2 : Store parts of a page
Most common technique
Usually a small block in a page
Best effect : reused on lots of pages
Can be inserted on dynamic pages
Store part of a page. This is probably the most common and the best way to cache data. Basically, you take a piece of data :
- data from the database
- the result of a calculation
- an aggregation of two feeds
- parsed data from a CSV file on an NFS share located on the other side of the world
- data that was stored on a USB stick your kid is now chewing on
What I mean is : it doesn't matter where the data came from. It's part of a page, usually a block on a page, and you want to save time by not having to get that data from its original source every time. Instead of saving the entire page, which can have multiple dynamic parts, some of which can't be cached because they are really dynamic (like the current time), you store a small block. When you render the page, all you do is get that small block from the cache, place it in the dynamic page and output it.
Caching techniques
#3 : Store SQL queries
SQL query cache
Limited in size
Store the output of SQL queries.
Now, who of you knows what SQL query caching is, the MySQL query cache for example ?
Basically, the MySQL query cache is a cache which stores the output of recently run SQL queries. It's built into MySQL, it's... not enabled by default everywhere, it depends on your distribution.
And it speeds up queries by a huge margin. Disabling it is something I never do, because you gain a lot by having it enabled.
However, there are a few limitations. First of all, the query cache is limited in size.
Caching techniques
#3 : Store SQL queries
SQL query cache
Limited in size
Resets on every insert/update/delete
Server and connection overhead
Goal :
not to get rid of DB
free up DB resources for more hits !
Better :
store processed data instead of raw data
store group of objects
But basically, one of the big drawbacks of the MySQL query cache is that every time you do an insert, update or delete on a table, the entire query cache for queries referencing that table is erased.
Another drawback is that you still need to connect to the MySQL server and you still need to go through a lot of the core of MySQL to get your results.
So, storing the output of SQL queries in a separate cache, being Memcache or one of the other tools we're going to see in a moment, is actually not a bad idea. Also because, if you have a big website, you will still get quite a bit of load on your MySQL database. So anything that takes the load off the database and moves it to where you have more resources available is a good idea.
Better : store returned object or group of objects
Caching techniques
#4 : Store complex PHP results
Not just calculations
CPU intensive tasks :
Config file parsing
XML file parsing
Loading CSV in an array
Save resources -> more resources available
Another caching technique I want to mention is storing the result of complex PHP processes.
- You might think about some kind of calculation, but when I mention calculation, people tend to think about getting data from here and there and then summing it.
- That's not what I mean. By complex PHP processes I mean things like parsing configuration files, parsing XML files, loading CSV data into an array, converting multiple XML files into an object structure, and so on.
- The end result of those complex PHP processes can be cached, especially if the data from which we started doesn't change a lot. That way you can save a lot of system resources, which can be used for other things.
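As an illustration of caching the end result of a parsing step, here is a small sketch for an INI-style config. The parsedConfig() helper, the array used as cache and the key scheme (a hash of the raw text) are all assumptions, not something from the talk.

```php
<?php
// Hypothetical helper: caches the parsed form of an INI-style config string.
// The cache key includes a hash of the raw text, so changed input is re-parsed.
function parsedConfig(array &$cache, string $rawIni, int &$parseCount): array
{
    $key = 'config:' . md5($rawIni);
    if (!isset($cache[$key])) {
        $parseCount++;                                   // only counted on a miss
        $cache[$key] = parse_ini_string($rawIni, true);  // the expensive step
    }
    return $cache[$key];
}

$cache = [];
$parseCount = 0;
$raw = "[db]\nhost = localhost\nport = 3306\n";

$config = parsedConfig($cache, $raw, $parseCount);   // parsed
$again  = parsedConfig($cache, $raw, $parseCount);   // taken from cache
```

With a real file you would probably key on the path plus its modification time instead, so you don't have to read the file just to compute the key.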
Caching techniques
#xx : Your call
Only limited by your imagination !
When you have data, think :
Creating time ?
Modification frequency ?
Retrieval frequency ?
There's plenty of other types of data to store in cache. The only limit there is your imagination.
All you need to think of is :
- I have this data
- how long did it take me to create it ?
- how often does it change ?
- how often will it be retrieved ?
That last bit can be a difficult thing to balance out, but we'll get back to that later.
How to find cacheable data
New projects : start from 'cache everything'
Existing projects :
Check page loading times
Look at MySQL slow query log
Make a complete query log (don't forget to turn it off !) -> use Percona Toolkit (pt-query-digest)
Databases - pt-query-digest
# Profile
# Rank Query ID           Response time    Calls R/Call  Apdx V/M   Item
# ==== ================== ================ ===== ======= ==== ===== ==========
#    1 0x543FB322AE4330FF 16526.2542 62.0%  1208 13.6806 1.00  0.00 SELECT output_option
#    2 0xE78FEA32E3AA3221     0.8312 10.3%  6412  0.0001 1.00  0.00 SELECT poller_output poller_item
#    3 0x211901BF2E1C351E     0.6811  8.4%  6416  0.0001 1.00  0.00 SELECT poller_time
#    4 0xA766EE8F7AB39063     0.2805  3.5%   149  0.0019 1.00  0.00 SELECT wp_terms wp_term_taxonomy wp_term_relationships
#    5 0xA3EEB63EFBA42E9B     0.1999  2.5%    51  0.0039 1.00  0.00 SELECT UNION wp_pp_daily_summary wp_pp_hourly_summary
#    6 0x94350EA2AB8AAC34     0.1956  2.4%    89  0.0022 1.00  0.01 UPDATE wp_options
# MISC 0xMISC                 0.8137 10.0%  3853  0.0002   NS   0.0
time spent per query pattern
how many queries of that query pattern
Databases - pt-query-digest
# Query 2: 0.26 QPS, 0.00x concurrency, ID 0x92F3B1B361FB0E5B at byte 14081299
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: | _^ |
# Time range: 2011-12-28 18:42:47 to 19:03:10
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          1     312
# Exec time     50      4s     5ms    25ms    13ms    20ms     4ms    12ms
# Lock time      3    32ms    43us   163us   103us   131us    19us    98us
# Rows sent     59  62.41k     203     231  204.82  202.40    3.99  202.40
# Rows examine  13  73.63k     238     296  241.67  246.02   10.15  234.30
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     59  62.41k     203     231  204.82  202.40    3.99  202.40
# Bytes sent    53  24.85M  46.52k  84.36k  81.56k  83.83k   7.31k  79.83k
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0  21.63k      71      71      71      71       0      71
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin  40  11.77k      34      44   38.62   38.53    1.87   38.53
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0
# Boolean:
# Full scan    100% yes,   0% no
# String:
# Databases    wp_blog_one (264/84%), wp_blog_tw (36/11%)... 1 more
# Hosts
# InnoDB trxID 86B40B (1/0%), 86B430 (1/0%), 86B44A (1/0%)... 309 more
# Last errno   0
# Users        wp_blog_one (264/84%), wp_blog_two (36/11%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms  ################################################################
# 100ms
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_one ` LIKE 'wp_options'\G
#    SHOW CREATE TABLE `wp_blog_one `.`wp_options`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'\G
Caching storage - MySQL query cache
Use it
Don't rely on it
Good if you have :
lots of reads
few different queries
Bad if you have :
lots of insert/update/delete
lots of different queries
OK, let's talk about where cached data can be stored.
I already mentioned MySQL query cache.
Turn it on. But don't rely on it too heavily,
especially if you have data that changes often.
The problem with SQL query caching
select id, name from someTable where x = 5; -> uncached
select id, name from someTable where x = 5; -> cached
update someTable set name="Jim" where x = 10;
select id, name from someTable where x = 5; -> uncached
Imagine :
500 selects/sec, 10 updates/min -> 10 cache purges per min
50 selects/sec, 10 updates/sec -> 10 cache purges per sec
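To make the purge behaviour concrete, here is a toy model of table-level invalidation. It is not how MySQL implements the query cache internally; ToyQueryCache and its methods are made up purely to mirror the uncached / cached / update / uncached sequence above.

```php
<?php
// Toy model of table-level invalidation; not MySQL's actual implementation.
final class ToyQueryCache
{
    private $byTable = [];   // cached results, grouped by table name
    public $purges = 0;      // how many times a write wiped cached results

    public function fetch(string $table, string $sql)
    {
        return $this->byTable[$table][$sql] ?? null;   // null means uncached
    }

    public function store(string $table, string $sql, $result): void
    {
        $this->byTable[$table][$sql] = $result;
    }

    // Any insert/update/delete on $table drops every cached result for it,
    // even when the changed row is unrelated to the cached queries.
    public function write(string $table): void
    {
        if (!empty($this->byTable[$table])) {
            $this->purges++;
        }
        unset($this->byTable[$table]);
    }
}

$qc  = new ToyQueryCache();
$sql = 'select id, name from someTable where x = 5';

$before = $qc->fetch('someTable', $sql);                     // uncached
$qc->store('someTable', $sql, [['id' => 1, 'name' => 'Ann']]);
$cached = $qc->fetch('someTable', $sql);                     // cached
$qc->write('someTable');                                     // update ... where x = 10
$after  = $qc->fetch('someTable', $sql);                     // uncached again
```

The update touched a row with x = 10, yet the cached result for x = 5 is gone too. That is why a write-heavy table gets little out of this kind of cache.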
Caching storage - Database memory tables
Tables stored in memory
In MySQL : memory/heap table
Not the same as a temporary table :
memory tables are persistent
temporary tables are session-specific
Faster than disk-based tables
Can be joined with disk-based tables
But :
default 16MByte limit
master-slave = trouble
if you don't need join -> overhead of DB software
So : don't use it unless you need to join
I said I was going to discuss some do's and don'ts... This one falls under the category don't. There's a second database mechanism for "caching", at least some people use it for that purpose. It's called database memory tables. MySQL has such a storage type : it's called a memory or a heap table. And basically it allows you to store data in tables that are stored in memory. Don't confuse it with a temporary table, which is only valid for your connection. This is actually a persistent table, well, persistent meaning that it will survive after you disconnect, but it won't survive a server reboot, because it's in-memory only. Advantages of this storage type are that it's faster than disk-based tables and you can join it with disk-based tables. However, there's a default limit of 16MByte per table and it can be troublesome getting it to work on a master-slave setup. So my advice is : don't use it.
Caching storage - Opcode caching
DO !
Alright, next. Opcode caching... this is definitely a DO. There are a few opcode caches out there. Now what is opcode caching ? Basically, when you run a PHP file, the PHP is converted by the PHP compiler to what is called bytecode. This code is then executed by the PHP engine and that produces the output. Now, if your PHP code doesn't change a lot, which normally it shouldn't while your application is live, there's no reason for the PHP compiler to convert your source code to bytecode over and over again, because basically it's just doing the same thing every time. So an opcode cache caches the bytecode, the compiled version of your source code. That way, it doesn't have to compile the code unless your source code changes. This can have a huge performance impact.
Caching storage - Opcode caching
APC
De-facto standard
Will be in PHP core in 5.4 ? 5.5 ? 6.0 ?
PECL or packages
eAccelerator
Zend Accelerator
X-Cache
WinCacheForPhp
APC is the most popular one and will probably be included in one of the next few releases. Might be 5.4, but there's still a lot of discussion about that. I'm guessing we probably won't see it before 5.5 or who knows 6.0, if that ever comes out. To enable APC, all you have to do is install the module, which can be done using PECL or through your distribution's package management system. Then make sure apc is enabled in php.ini and you're good to go. The other opcode caches are eAccelerator, which is sort of slightly outdated now, although it does in some cases produce a better performance. But since APC will be included in the PHP core, I'm not sure if it's going to survive for very long anymore. Then there's Zend Accelerator, which is built into Zend Server. Basically, it's similar to APC in terms of opcode caching functionality, but it's just bundled with the Zend products.
Caching storage - Opcode caching
APC
De-facto standard until 5.4
PECL or packages
Zend Optimizer+
Built-in with PHP 5.5
eAccelerator
PHP : 42.18 req/sec
PHP + APC : 206.20 req/sec
There's also a thing called X-Cache. I must admit I've never tried it. Could be good, but it's pretty hard to find decent information about it. And there's also a cache for Windows called WinCacheForPhp... has anyone tried it ?
Opcode caching on its own is of course not useful to store specific data, but it will improve your PHP performance. Also, it reduces memory usage, since compiling the PHP code requires additional memory. So it's a kind of caching that falls under the tuning category ;-)
Caching storage - Disk
Data with few updates : good
Caching SQL queries : preferably not
DON'T use NFS or other network file systems
high latency
possible problem for sessions : locking issues !
Caching storage - Memory disk (ramdisk)
Usually faster than physical disk
But : OS file caching makes difference minimal (if you have enough memory)
Slightly better than using local disk is using a local memory disk or a ramdisk. Advantage : slightly faster, on the other hand if you're using Linux the standard file caching system will cache recently accessed files anyway, so there might not be a big performance impact when comparing to standard disk caching.
Caching storage - Disk / ramdisk
Overhead : filesystem
Limited number of files per directory -> subdirectories
Local : 5 webservers -> 5 local caches
How will you keep them synchronized ? Don't say NFS or rsync !
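A local disk cache with subdirectories could look roughly like this. FileCache, its directory layout (two levels taken from an md5 of the key) and the TTL check based on file modification time are assumptions for illustration. It writes to a temp file and renames it, so a reader never sees a half-written file, but it does nothing about keeping several webservers in sync.

```php
<?php
// Hypothetical file-per-key cache. Directory layout and TTL handling are assumptions.
final class FileCache
{
    private $dir;

    public function __construct(string $dir)
    {
        $this->dir = rtrim($dir, '/');
    }

    private function path(string $key): string
    {
        $hash = md5($key);
        // Two levels of subdirectories keep the number of files per directory low.
        return sprintf('%s/%s/%s/%s.cache', $this->dir, substr($hash, 0, 2), substr($hash, 2, 2), $hash);
    }

    public function set(string $key, $value): void
    {
        $path = $this->path($key);
        if (!is_dir(dirname($path))) {
            mkdir(dirname($path), 0777, true);
        }
        // Write to a temp file in the same directory, then rename over the target.
        $tmp = $path . '.' . uniqid('', true) . '.tmp';
        file_put_contents($tmp, serialize($value));
        rename($tmp, $path);
    }

    // Returns the value, or null when missing or older than $ttl seconds.
    public function get(string $key, int $ttl)
    {
        $path = $this->path($key);
        if (!is_file($path) || time() - filemtime($path) > $ttl) {
            return null;
        }
        return unserialize(file_get_contents($path));
    }
}

$cache = new FileCache(sys_get_temp_dir() . '/demo-cache-' . getmypid());
$cache->set('top-articles', ['First', 'Second']);
$hit  = $cache->get('top-articles', 60);   // fresh entry
$miss = $cache->get('unknown-key', 60);    // never stored
```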
Caching storage - APC variable cache
More than an opcode cache (PHP 5.5 : use APCu)
Store user data in memory
apc_add / apc_store to add/update
apc_fetch to retrieve
apc_delete
Fast -> huge performance impact
Session support !
Downside :
local storage -> hard to scale
restart Apache -> cache = empty
The biggest downside however is that, just like with disk cache, it stores its data locally, which means it's great if you have only 1 server, but as soon as you move to an architecture with 2 webservers, you can't use it for sessions anymore and you'll have to find a way to keep your cache synchronized, which will in fact cause a lot of overhead.
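For completeness, a small sketch of the fetch / store pattern with the variable cache. It uses the APCu function names (apcu_fetch, apcu_store), the counterparts of apc_fetch / apc_store on PHP 5.5 and later. The userCacheFetch() wrapper, the key and the array fallback are assumptions : the fallback is only there so the example also runs where APCu isn't loaded or isn't enabled, for instance on the command line.

```php
<?php
// Sketch: use APCu when the extension is loaded and enabled, otherwise fall back
// to a per-process array so the example still runs (e.g. on the CLI).
function userCacheFetch(string $key, callable $compute, int $ttl = 300)
{
    static $fallback = [];

    $useApcu = function_exists('apcu_enabled') && apcu_enabled();
    if ($useApcu) {
        $value = apcu_fetch($key, $found);   // $found tells a miss from a stored false
        if ($found) {
            return $value;
        }
        $value = $compute();
        apcu_store($key, $value, $ttl);      // local to this server only
        return $value;
    }

    if (!array_key_exists($key, $fallback)) {
        $fallback[$key] = $compute();
    }
    return $fallback[$key];
}

$computed = 0;
$expensive = function () use (&$computed): array {
    $computed++;                             // counts real computations
    return ['menu' => ['Home', 'Blog', 'Contact']];
};

$key = 'demo:menu:' . getmypid();            // hypothetical key
$one = userCacheFetch($key, $expensive);
$two = userCacheFetch($key, $expensive);
```

Either way the data lives on one machine only, which is exactly the scaling problem described above.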