Caching and tuning fun for high scalability


Wim Godden, Cu.be Solutions

Who am I ?

Wim Godden (@wimgtr)

Founder of Cu.be Solutions (http://cu.be)

Open Source developer since 1997

Developer of OpenX, PHPCompatibility, Nginx SCL, ...

Speaker at PHP and Open Source conferences

Who are you ?

Developers ?

System/network engineers ?

Managers ?

Caching experience ?

Goals of this tutorial

Everything about caching and tuning

A few techniques :

How-to

How-NOT-to

Increase reliability, performance and scalability

5 visitors/day -> 5 million visitors/day

(Don't expect a miracle cure !)

LAMP

Architecture

Test page

3 DB-queries :

select firstname, lastname, email from user where user_id = 5;

select title, createddate, body from article order by createddate desc limit 5;

select title, createddate, body from article order by score desc limit 5;

Page just outputs result

Our base benchmark

Apachebench = useful enough

Result ?

                  Single webserver       Proxy
                  Static      PHP        Static      PHP

Apache + PHP      3900        17.5       6700        17.5

Limit : CPU, network or disk

Limit : database

Caching

What is caching ?

CACHE

What is caching ?

x = 5, y = 2

n = 50

Same result

CACHE

select * from article join user on article.user_id = user.id order by created desc limit 10

Doesn't change all the time

Theory of caching

DB

Cache

GET /page

$data = get('key') -> false

if ($data == false) :
    select data from table
    $data = returned result
    set('key', $data)

Page

Theory of caching

DB

Cache

HIT
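
The get / miss / query / set flow from the two slides above can be sketched in a few lines. This is only an illustration in Python (the talk itself is about PHP); the dict stands in for an external cache such as Memcache, and query_database and get_page_data are hypothetical names, not part of any real library.

```python
# Sketch of the get / miss / query / set flow; all names are hypothetical.
cache = {}      # stands in for an external cache such as Memcache
db_calls = 0    # counts how often the "database" is actually hit


def query_database(key):
    """Pretend to run 'select data from table' (the slow original source)."""
    global db_calls
    db_calls += 1
    return f"result for {key}"


def get_page_data(key):
    data = cache.get(key)            # $data = get('key')
    if data is None:                 # miss : if ($data == false)
        data = query_database(key)   # select data from table
        cache[key] = data            # set('key', $data)
    return data                      # either the hit or the freshly stored value


first = get_page_data("latest_articles")    # miss, goes to the DB
second = get_page_data("latest_articles")   # hit, served from the cache
```

The second call never reaches the database, which is exactly the HIT case on the slide.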

Caching goals - 1st goal

Reduce # of concurrent requests

Reduce the load

Caching serves 3 purposes :

- Firstly, to reduce the number of requests or the load at the source of information, which can be a database server, content repository, or anything else.

Caching goals - 2nd goal

Secondly, you want to improve the response time of each request. If you request a page that takes 50ms to load without caching and you get 10 hits/second, you won't be able to serve those requests with 5 Apache processes. If you could cache some of the data used on the page, you might be able to return the page in 20ms. That doesn't just improve the user experience, it also reduces the load on the webserver, since the number of concurrent connections is a lot lower : connections are closed faster, so you can handle more connections and, as a result, more hits on the same machine. If you don't : more Apache processes are needed, which will eat memory and system resources and, as a result, will also cause context switching.

Some figures

Pageviews : 5000 (4000 on 10 pages)

Avg. loading time : 200ms

Cache 10 pages

Avg. loading time : 20ms -> Total avg. loading time : 56ms

Worth it ?
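
To see where the 56ms comes from : it's simply the weighted average of the cached and uncached pageviews. A quick check in Python, using only the figures from the slide (the variable names are mine) :

```python
# Figures taken from the slide; variable names are illustrative.
pageviews = 5000        # total pageviews
cached_views = 4000     # pageviews that land on the 10 cached pages
uncached_ms = 200       # avg. loading time without cache
cached_ms = 20          # avg. loading time of a cached page

uncached_views = pageviews - cached_views
total_avg_ms = (cached_views * cached_ms + uncached_views * uncached_ms) / pageviews
```

So caching just 10 pages brings the overall average from 200ms down to 56ms.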

Caching goals - 3rd goal

Send less data across the network / Internet

You benefit -> lower bill from upstream provider

Users benefit -> faster page load

Wait a second... that's a frontend problem !

True, but remember : the backend is transmitting it !

More tuning : reduce the amount of data that needs to be sent over the network or the Internet. This benefits you, as application provider, because you have less traffic on your upstream connection. It's also better for the end user, because there will be less data to download. Of course this is partly a frontend matter, which we'll discuss near the end of the tutorial.

Caching techniques

#1 : Store entire pages

Company Websites

Blogs

Full pages that don't change

Render -> store in cache -> retrieve from cache

The first way to cache in a web application, or actually more commonly a website, is to cache entire pages. This used to be a very popular caching technique, back in the days when pages were very static.

It's still popular for things like company websites, blogs, and any site that has pages that don't change a lot.

Basically, you just render the page once, put it in a cache, and from that moment onwards, you just retrieve it from the cache.

Caching techniques

#2 : Store parts of a page

Most common technique

Usually a small block in a page

Best effect : reused on lots of pages

Can be inserted on dynamic pages

Store part of a page. This is probably the most common and the best way to cache data.

Basically what you do is take a piece of data :
- data from the database
- the result of a calculation
- an aggregation of two feeds
- parsed data from a CSV-file on an NFS share located on the other side of the world
- it could even be data that was stored on a USB stick your kid is now chewing on.

What I mean is : it doesn't matter where the data came from. It's part of a page, usually a block on a page, and you want to save time by not having to get that data from its original source every time again.

So instead of saving the entire page, which can have multiple dynamic parts, some of which can't be cached because they are really dynamic (like the current time), we store a small block. When we render the page, all we do is get the small block from the cache, place it in the dynamic page and output it.
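
A minimal sketch of that idea, in Python for illustration only : one block is built once and cached, while the rest of the page (here the current time) stays dynamic. The names fragment_cache, build_top_articles and cached_fragment are hypothetical, and the dict stands in for whatever cache you actually use.

```python
import time

fragment_cache = {}   # stands in for a real cache backend
build_count = 0       # how often the expensive block was really built


def build_top_articles():
    """Pretend this block needs an expensive query or feed aggregation."""
    global build_count
    build_count += 1
    return "<ul><li>Article A</li><li>Article B</li></ul>"


def cached_fragment(key, builder):
    # Build the block only when it is not in the cache yet.
    if key not in fragment_cache:
        fragment_cache[key] = builder()
    return fragment_cache[key]


def render_page():
    now = time.strftime("%H:%M:%S")   # really dynamic, never cached
    block = cached_fragment("top_articles", build_top_articles)
    return f"<p>Time: {now}</p>{block}"


page1 = render_page()
page2 = render_page()   # reuses the cached block, only the time is recomputed
```

The same cached block can be dropped into any number of different pages, which is where this technique pays off most.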

Caching techniques

#3 : Store SQL queries

SQL query cache : limited in size

Store the output of SQL queries.

Now, who of you knows what SQL query caching is, the MySQL query cache for example ?

Basically, the MySQL query cache is a cache which stores the output of recently run SQL queries. It's built into MySQL, it's... not enabled by default everywhere, it depends on your distribution.

And it speeds up queries by a huge margin. Disabling it is something I never do, because you gain a lot by having it enabled.

However, there are a few limitations :

- First of all, the query cache is limited in size.

Caching techniques

#3 : Store SQL queries

SQL query cache : limited in size

Resets on every insert/update/delete

Server and connection overhead

Goal : not to get rid of DB

free up DB resources for more hits !

Better : store processed data instead of raw data

store group of objects

But one of the big drawbacks of the MySQL query cache is that every time you do an insert, update or delete on a table, the entire query cache for queries referencing that table is erased.

Another drawback is that you still need to connect to the MySQL server and you still need to go through a lot of the core of MySQL to get your results.

So, storing the output of SQL queries in a separate cache, be it Memcache or one of the other tools we're going to see in a moment, is actually not a bad idea. Also because, if you have a big website, you will still get quite a bit of load on your MySQL database. So anything that takes the load off the database and moves it to where you have more resources available is a good idea.

Better : store returned object or group of objects
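
A rough sketch of "store processed data instead of raw data", in Python for illustration. It uses an in-memory SQLite database as a stand-in for MySQL and a dict as a stand-in for the cache; the table contents and the top_articles helper are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for the demo).
db = sqlite3.connect(":memory:")
db.execute("create table article (title text, score integer)")
db.executemany(
    "insert into article values (?, ?)",
    [("First", 3), ("Second", 9), ("Third", 5)],
)

cache = {}   # stands in for an external cache


def top_articles(limit):
    key = f"top_articles:{limit}"
    if key not in cache:
        rows = db.execute(
            "select title, score from article order by score desc limit ?",
            (limit,),
        ).fetchall()
        # Cache the processed, ready-to-render group of objects,
        # not the raw result rows.
        cache[key] = [{"title": t, "label": f"{t} ({s} points)"} for t, s in rows]
    return cache[key]


top = top_articles(2)
```

On a hit you skip both the query and the processing step, so the database is free to handle other requests.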

Caching techniques

#4 : Store complex PHP results

Not just calculations

CPU intensive tasks :

Config file parsing

XML file parsing

Loading CSV in an array

Save resources -> more resources available

Another caching technique I want to mention is storing the result of complex PHP processes.

- You might think about some kind of calculation, but when I mention calculation, people tend to think about getting data from here and there and then summing them.

- That's not what I mean. By complex PHP processes I mean things like parsing configuration files, parsing XML files, loading CSV-data in an array, converting multiple XML-files into an object structure, and so on.

- End result of those complex PHP processes can be cached, especially if the data from which we started doesn't change a lot. That way you can save a lot of system resources, which can be used for other things.
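
As an illustration of caching the end result of such a process, here is a small Python sketch that parses CSV data into a lookup table only once. The CSV contents and the load_countries name are invented for the example, and functools.lru_cache is just an in-process stand-in for a real cache.

```python
import csv
import io
from functools import lru_cache

# Hypothetical file contents, inlined so the example is self-contained.
CSV_TEXT = "code,country\nBE,Belgium\nNL,Netherlands\n"

parse_count = 0   # how often the CSV was really parsed


@lru_cache(maxsize=None)
def load_countries(raw):
    """Parse the CSV text into a dict; the result is cached per input text."""
    global parse_count
    parse_count += 1
    reader = csv.DictReader(io.StringIO(raw))
    return {row["code"]: row["country"] for row in reader}


countries = load_countries(CSV_TEXT)
again = load_countries(CSV_TEXT)   # served from the cache, no parsing
```

Note that this kind of in-process cache only lives as long as the process itself; it's the parsed result you would normally push into a shared cache.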

Caching techniques

#xx : Your call

Only limited by your imagination !

When you have data, think :

Creating time ?

Modification frequency ?

Retrieval frequency ?

There's plenty of other types of data to store in cache. The only limit there is your imagination.

All you need to think of is :
- I have this data
- how long did it take me to create it ?
- how often does it change ?
- how often will it be retrieved ?

That last bit can be a difficult thing to balance out, but we'll get back to that later.
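
One very rough way to weigh those three questions against each other is a back-of-the-envelope estimate. The formula below is my own simplification, not something from the slides : it assumes every change forces exactly one rebuild and it ignores the cost of the cache lookup itself.

```python
def time_saved_ms(creation_ms, retrievals_per_hour, changes_per_hour):
    """Estimate the creation time saved per hour by caching one item.

    Simplifying assumptions: each change forces one rebuild, and a cache
    lookup costs nothing.
    """
    rebuilds = min(changes_per_hour, retrievals_per_hour)
    return (retrievals_per_hour - rebuilds) * creation_ms


# Expensive, rarely changing, often retrieved : a good candidate.
hot = time_saved_ms(creation_ms=200, retrievals_per_hour=3600, changes_per_hour=6)

# Changes more often than it is read : caching gains nothing.
cold = time_saved_ms(creation_ms=200, retrievals_per_hour=10, changes_per_hour=60)
```

The numbers are made up; the point is only that creation time, modification frequency and retrieval frequency have to be looked at together.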

How to find cacheable data

New projects : start from 'cache everything'

Existing projects :

Check page loading times

Look at MySQL slow query log

Make a complete query log (don't forget to turn it off !) -> use Percona Toolkit (pt-query-digest)

Databases - pt-query-digest

# Profile
# Rank Query ID           Response time    Calls R/Call  Apdx V/M   Item
# ==== ================== ================ ===== ======= ==== ===== ==========
#    1 0x543FB322AE4330FF 16526.2542 62.0%  1208 13.6806 1.00  0.00 SELECT output_option
#    2 0xE78FEA32E3AA3221     0.8312 10.3%  6412  0.0001 1.00  0.00 SELECT poller_output poller_item
#    3 0x211901BF2E1C351E     0.6811  8.4%  6416  0.0001 1.00  0.00 SELECT poller_time
#    4 0xA766EE8F7AB39063     0.2805  3.5%   149  0.0019 1.00  0.00 SELECT wp_terms wp_term_taxonomy wp_term_relationships
#    5 0xA3EEB63EFBA42E9B     0.1999  2.5%    51  0.0039 1.00  0.00 SELECT UNION wp_pp_daily_summary wp_pp_hourly_summary
#    6 0x94350EA2AB8AAC34     0.1956  2.4%    89  0.0022 1.00  0.01 UPDATE wp_options
# MISC 0xMISC                 0.8137 10.0%  3853  0.0002   NS   0.0

time spent per query pattern

how many queries of that query pattern

Databases - pt-query-digest

# Query 2: 0.26 QPS, 0.00x concurrency, ID 0x92F3B1B361FB0E5B at byte 14081299
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: | _^ |
# Time range: 2011-12-28 18:42:47 to 19:03:10
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          1     312
# Exec time     50      4s     5ms    25ms    13ms    20ms     4ms    12ms
# Lock time      3    32ms    43us   163us   103us   131us    19us    98us
# Rows sent     59  62.41k     203     231  204.82  202.40    3.99  202.40
# Rows examine  13  73.63k     238     296  241.67  246.02   10.15  234.30
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     59  62.41k     203     231  204.82  202.40    3.99  202.40
# Bytes sent    53  24.85M  46.52k  84.36k  81.56k  83.83k   7.31k  79.83k
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0  21.63k      71      71      71      71       0      71
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin  40  11.77k      34      44   38.62   38.53    1.87   38.53
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0
# Boolean:
# Full scan    100% yes,   0% no
# String:
# Databases    wp_blog_one (264/84%), wp_blog_tw (36/11%)... 1 more
# Hosts
# InnoDB trxID 86B40B (1/0%), 86B430 (1/0%), 86B44A (1/0%)... 309 more
# Last errno   0
# Users        wp_blog_one (264/84%), wp_blog_two (36/11%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms  ################################################################
# 100ms
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_one ` LIKE 'wp_options'\G
#    SHOW CREATE TABLE `wp_blog_one `.`wp_options`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'\G

Caching storage - MySQL query cache

Use it

Don't rely on it

Good if you have :

lots of reads

few different queries

Bad if you have :

lots of insert/update/delete

lots of different queries

OK, let's talk about where cached data can be stored.

I already mentioned MySQL query cache.

Turn it on, but don't rely on it too heavily, especially if you have data that changes often.

The problem with SQL query caching

select id, name from someTable where x = 5; -> uncached

select id, name from someTable where x = 5; -> cached

update someTable set name="Jim" where x = 10;

select id, name from someTable where x = 5; -> uncached

Imagine :

500 select/sec, 10 updates/min -> 10 cache purges per min

50 select/sec, 10 updates/sec -> 10 cache purges per sec
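
The select / select / update / select sequence above can be mimicked with a toy model. This is not how MySQL implements its query cache internally; it's just a Python sketch of the behaviour described on the slide, where any write to a table purges every cached query on that table. TableQueryCache and its methods are made-up names.

```python
class TableQueryCache:
    """Toy model: any write to a table drops all cached queries on that table."""

    def __init__(self):
        self.entries = {}   # query text -> (table, result)

    def select(self, table, query, run):
        if query in self.entries:
            return self.entries[query][1], "cached"
        result = run()                      # really execute the query
        self.entries[query] = (table, result)
        return result, "uncached"

    def write(self, table):
        # Purge every entry that references the written table.
        self.entries = {q: e for q, e in self.entries.items() if e[0] != table}


qc = TableQueryCache()
query = "select id, name from someTable where x = 5"


def run():
    return [(1, "Bob")]   # made-up result row


states = []
states.append(qc.select("someTable", query, run)[1])
states.append(qc.select("someTable", query, run)[1])
qc.write("someTable")     # update someTable set name="Jim" where x = 10
states.append(qc.select("someTable", query, run)[1])
```

The update touched a row with x = 10, yet the cached result for x = 5 is gone as well. That's why a write-heavy table gets almost no benefit from the query cache.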

Caching storage - Database memory tables

Tables stored in memory

In MySQL : memory/heap table

Not a temporary table :

memory tables are persistent

temporary tables are session-specific

Faster than disk-based tables

Can be joined with disk-based tables

But : default 16MByte limit

master-slave = trouble

if you don't need join -> overhead of DB software

So : don't use it unless you need to join

I said I was going to discuss some do's and don'ts... This one falls under the category "don't".

There's a second database mechanism for "caching", or at least some people use it for that purpose : database memory tables. MySQL has such a storage type : it's called a memory or heap table. Basically it allows you to store data in tables that are kept in memory.

Don't confuse it with a temporary table, which is only valid for your connection. This is actually a persistent table. Well, persistent meaning that it will survive after you disconnect, but it won't survive a server reboot, because it's in-memory only.

The advantages of this storage type are that it's faster than disk-based tables and that you can join it with disk-based tables. On the downside, there's a default limit of 16MByte per table and it can be troublesome getting it to work in a master-slave setup.

So my advice is : don't use it, unless you really need the join.

Caching storage - Opcode caching

DO !

Alright, next. Opcode caching... this is definitely a DO. There are a few opcode caches out there.

Now what is opcode caching ? Basically, when you run a PHP file, the PHP source is converted by the PHP compiler to what is called bytecode. This code is then executed by the PHP engine and that produces the output.

Now, if your PHP code doesn't change a lot, which normally it shouldn't while your application is live, there's no reason for the PHP compiler to convert your source code to bytecode over and over again, because it's just doing the same thing every time. So an opcode cache caches the bytecode, the compiled version of your source code. That way, it doesn't have to compile the code unless your source code changes. This can have a huge performance impact.
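
To make the idea concrete, here's an analogy in Python rather than PHP : compile a source file once, keep the compiled code object, and only recompile when the file's modification time changes. This is not how APC works internally, just a sketch of the principle; load_compiled and compiled_cache are hypothetical names.

```python
import os
import tempfile

compiled_cache = {}   # path -> (mtime, compiled code object)
compile_count = 0     # how often the source was really compiled


def load_compiled(path):
    """Return compiled code for path, recompiling only if the file changed."""
    global compile_count
    mtime = os.stat(path).st_mtime_ns
    entry = compiled_cache.get(path)
    if entry is None or entry[0] != mtime:
        with open(path, encoding="utf-8") as fh:
            code = compile(fh.read(), path, "exec")
        compile_count += 1
        compiled_cache[path] = (mtime, code)
    return compiled_cache[path][1]


# Write a tiny script to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as fh:
    fh.write("result = 6 * 7\n")
    script = fh.name

namespace = {}
exec(load_compiled(script), namespace)   # first run : compiles
exec(load_compiled(script), namespace)   # second run : reuses the bytecode
os.remove(script)
```

The script runs twice but is compiled only once, which is the whole point of an opcode cache.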

Caching storage - Opcode caching

APC

De-facto standard

Will be in PHP core in 5.4 ? 5.5 ? 6.0 ?

PECL or packages

eAccelerator

Zend Accelerator

X-Cache

WinCacheForPhp

APC is the most popular one and will probably be included in one of the next few releases. Might be 5.4, but there's still a lot of discussion about that. I'm guessing we probably won't see it before 5.5 or, who knows, 6.0, if that ever comes out.

To enable APC, all you have to do is install the module, which can be done using PECL or through your distribution's package management system. Then make sure apc is enabled in php.ini and you're good to go.

The other opcode caches are eAccelerator, which is slightly outdated now, although it does in some cases produce better performance. But since APC will be included in the PHP core, I'm not sure if it's going to survive for very long. Then there's Zend Accelerator, which is built into Zend Server. Basically, it's similar to APC in terms of opcode caching functionality, but it's bundled with the Zend products.

Caching storage - Opcode caching

APC

De-facto standard until 5.4

PECL or packages

Zend Optimizer+

Built-in with PHP 5.5

eAccelerator

PHP          42.18 req/sec
PHP + APC    206.20 req/sec

There's also a thing called X-Cache. I must admit I've never tried it. Could be good, but it's pretty hard to find decent information about it. And there's also a cache for Windows called WinCacheForPhp... has anyone tried it ?

Opcode caching on its own is of course not useful to store specific data, but it will improve your PHP performance. It also reduces memory usage, since compiling the PHP code requires additional memory. So it's a kind of caching that falls under the tuning category ;-)

Caching storage - Disk

Data with few updates : good

Caching SQL queries : preferably not

DON'T use NFS or other network file systems :

high latency

possible problem for sessions : locking issues !

Caching storage - Memory disk (ramdisk)

Usually faster than physical disk

But : OS file caching makes the difference minimal (if you have enough memory)

Slightly better than using local disk is using a local memory disk or ramdisk. Advantage : it's slightly faster. On the other hand, if you're using Linux, the standard file caching system will cache recently accessed files anyway, so there might not be a big performance difference compared to standard disk caching.

Caching storage - Disk / ramdisk

Overhead : filesystem

Limited number of files per directory -> subdirectories

Local

5 webservers -> 5 local caches

How will you keep them synchronized ? Don't say NFS or rsync !
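
For the "limited number of files per directory" point, a common trick is to spread cache files over subdirectories derived from a hash of the key. Below is a small Python sketch of that; the two-level layout and the function names are my own choices for the example, and it does nothing about the synchronization problem between servers.

```python
import hashlib
import os
import tempfile


def cache_path(root, key):
    """Map a cache key to root/ab/cd/<md5>, so no directory grows too large."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest)


def disk_set(root, key, value):
    path = cache_path(root, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(value)


def disk_get(root, key):
    try:
        with open(cache_path(root, key), encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        return None   # cache miss


root = tempfile.mkdtemp()             # throwaway cache directory for the demo
disk_set(root, "article:5", "cached body")
value = disk_get(root, "article:5")
missing = disk_get(root, "article:6")
```

With two levels of 256 subdirectories each, even millions of cache files end up as a handful of files per directory.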

Caching storage - APC variable cache

More than an opcode cache (PHP 5.5 : use APCu)

Store user data in memory

apc_add / apc_store to add/update

apc_fetch to retrieve

apc_delete

Fast -> huge performance impact

Session support !

Downside :

local storage -> hard to scale

restart Apache -> cache = empty

The biggest downside however is that, just like with disk cache, it stores its data locally, which means it's great if you have only 1 server, but as soon as you move to an architecture with 2 webservers, you can't use it for sessions anymore and you'll have to find a way to keep your cache synchronized, which will in fact cause a lot of overhead.
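
The local-storage problem is easy to demonstrate with a small Python stand-in. The class below only imitates the add / store / fetch / delete behaviour listed on the slide (add refuses to overwrite, store always overwrites); it is not the real APC API, and LocalVariableCache is a made-up name. Each instance plays the role of one webserver's private cache.

```python
class LocalVariableCache:
    """In-process stand-in for a per-server variable cache (not the real APC)."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Like apc_add : only stores when the key is not there yet.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def store(self, key, value):
        # Like apc_store : always adds or overwrites.
        self._data[key] = value
        return True

    def fetch(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        return self._data.pop(key, None) is not None


web1 = LocalVariableCache()   # cache on webserver 1
web2 = LocalVariableCache()   # cache on webserver 2

web1.store("session:abc", {"user_id": 5})
seen_on_web1 = web1.fetch("session:abc")
seen_on_web2 = web2.fetch("session:abc")   # the other server knows nothing
```

As soon as a load balancer sends the same visitor to the second server, the session is simply not there.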
