MySQL ZFS Best Practices


NEELAKANTH NADGIR'S BLOG


Tuesday May 26, 2009

MySQL Innodb ZFS Best Practices

One of the cool things about talking about MySQL performance with ZFS is
that there is not much tuning to be done. Tuning ZFS is considered evil,
but it is a necessity at times. In this blog I will describe some of the
tunings that you can apply to get better performance with ZFS, and point
out performance bugs which, when fixed, will remove the need for some of
these tunings.

For the impatient, here is the summary. See below for the reasoning behind
these recommendations and some gotchas.

1. Match ZFS recordsize with Innodb page size (16KB for Innodb
   datafiles, and 128KB for Innodb log files).
2. If you have a write heavy workload, use a Separate ZFS Intent Log.
3. If your database working set size does not fit in memory, you can get a
   big boost by using an SSD as L2ARC.
4. When using storage devices with battery backed caches, or when
   comparing ZFS with other filesystems, turn off the cache flush.
5. Prefer to cache within MySQL/Innodb over the ZFS Adaptive
   Replacement Cache (ARC).
6. Disable ZFS prefetch.
7. Disable Innodb double write buffer.

Let's look at all of them in detail.

WHAT Match ZFS recordsize with Innodb page size (16KB for
datafiles, and 128KB for Innodb log files).

HOW zfs set recordsize=16k tank/db

WHY

The biggest boost in performance can be obtained by matching the ZFS
recordsize with the size of the IO. Since an Innodb page is 16KB in size,
most read IO is 16KB (except for some prefetch IO, which can get
coalesced). The default recordsize for ZFS is 128KB. The mismatch between
the read size and the ZFS recordsize can result in severely inflated IO:
if you issue a 16KB read and the data is not already in the ARC, you have
to read 128KB of data to get it. ZFS cannot do a smaller read because the
checksum is calculated for the whole block, and the whole block has to be
read to verify data integrity.

The other reason to match the IO size and the ZFS recordsize is the
read-modify-write penalty. With a ZFS recordsize of 128KB, when Innodb
modifies a page and the ZFS record is not already in memory, the record
needs to be read in from disk and modified before it is written back.
This increases the IO latency significantly. Luckily, matching the ZFS
recordsize with the IO size removes both problems.

For the Innodb log files, the writes are usually sequential and vary in
size. By using a ZFS recordsize of 128KB you amortize the cost of
read-modify-write.
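The read inflation described above is simple arithmetic. Here is a minimal sketch, assuming every read misses the ARC and fits inside a single record. The function name read_amplification is made up for this illustration.

```shell
# KB fetched from disk per KB requested when a read misses the ARC.
# Arguments: ZFS recordsize in KB, application IO size in KB.
# Assumes the IO fits inside a single record.
read_amplification() {
    recordsize_kb=$1
    io_kb=$2
    if [ "$io_kb" -ge "$recordsize_kb" ]; then
        echo 1
    else
        echo $(( recordsize_kb / io_kb ))
    fi
}

read_amplification 128 16   # default recordsize, Innodb page reads
read_amplification 16 16    # recordsize matched to the page size
```

With the default recordsize, every 16KB page miss pulls 128KB from disk, 8 times the data requested. With a matched recordsize the ratio is 1.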

NOTE

You need to set the recordsize before creating the database files. Files
that already exist keep the recordsize they were created with, so if you
have already created the files, you need to copy them to get the new
recordsize. You can use the stat command to check the recordsize of a
file (GNU stat reports it as IO Block:).
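The ordering in the note can be sketched as follows. The dataset names tank/db/data and tank/db/log and the helper rewrite_file are hypothetical, and the parent dataset tank/db is assumed to exist. The zfs commands are shown as comments because they need a real pool, and MySQL should be shut down before any datafile is copied.

```shell
# On a real system (hypothetical dataset names), set the recordsize first:
#   zfs create -o recordsize=16k  tank/db/data   # Innodb datafiles
#   zfs create -o recordsize=128k tank/db/log    # Innodb log files
#   zfs get recordsize tank/db/data tank/db/log

# Rewrite an existing file by copying it, so that its blocks are
# allocated again using the dataset's current recordsize.
# Only run this while mysqld is stopped.
rewrite_file() {
    src=$1
    tmp="$src.rewrite.$$"
    cp -p "$src" "$tmp" && mv "$tmp" "$src"
}

# Usage (hypothetical path):
#   rewrite_file /tank/db/data/ibdata1
```

One way to keep datafiles and logs on different recordsizes is to point innodb_data_home_dir and innodb_log_group_home_dir in my.cnf at the two datasets.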

WHAT If you have a write heavy workload, use a separate intent log
(slog).

HOW zpool add tank log c4t0d0 c4t1d0

WHY

Write latency is extremely critical for many MySQL workloads. Typically,
a query will read some data, do some calculations, update some data and
then commit the transaction. To commit, the Innodb log has to be updated.
Many transactions can be committing at the same time, so it is very
important that this "wait" for commit be short. Luckily, in ZFS,
synchronous writes can be accelerated by using the Separate Intent Log.
In our tests with Sysbench read-write, we have seen around 10-20%
improvement with the slog.

NOTE

If your query execution involves a physical read from disk, the time for
the write may not be that important. Be sure to check this suggestion
with your real workload.

Until Bug 6574286 is fixed, you cannot remove a slog.

Innodb actually issues multiple kinds of writes (log write, dataspace
write, insert buffer write). Of these, the most critical one is the
Innodb log write. The slog feature is pool wide, and thus some writes
(like dataspace writes) which need not go to the slog still do. This will
be fixed via Bug 6832481 ZFS separate intent log bypass property.

It is also possible that during ZFS transaction sync time, the ZFS IO
queue (35 deep) can get full. This means that a write has to wait for a
slot to become empty. Bug 6471212: need reserved I/O scheduler slots to
improve I/O latency of critical ops solves this using reserved slots.
Bug 6721168 slog latency impacted by I/O scheduler during spa_sync is
also worth checking out.
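Because a slog could not be removed at the time of writing, it is worth previewing the change before committing to it. Here is a minimal sketch. The run helper and the APPLY variable are conventions invented for this example, the pool name tank is assumed, and the device names are the ones from the HOW line. zpool add -n prints the resulting layout without modifying the pool.

```shell
# Print each command; execute it only when APPLY=1 is set.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

POOL=tank   # assumed pool name

run zpool add -n "$POOL" log c4t0d0 c4t1d0   # preview only
run zpool add "$POOL" log c4t0d0 c4t1d0      # add the two log devices
run zpool status "$POOL"                     # log devices show under "logs"
```

Run it once as is to review the commands, then again with APPLY=1 in the environment to execute them.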

WHAT L2ARC (or Level 2 ARC)

HOW zpool add tank cache c4t0d0

WHY

If your database does not fit in memory, every time you miss the database
cache you have to read a block from disk. This cost is quite high with
regular disks. You can minimize the database cache miss latency by using
one or more SSDs as a level-2 cache, or L2ARC. Depending on your database
working set size, memory and L2ARC size, you may see several orders of
magnitude improvement in performance.


WHAT When it is safe, turn off ZFS cache flush

HOW The ZFS Evil Tuning Guide has more information about setting
this tunable. Refer to it for the best way to achieve this.

WHY

ZFS is designed to work reliably with disks that have write caches. Every
time it needs data to be stored persistently on disk, it issues a cache
flush command to the disk. Disks with battery backed caches need not do
anything (i.e. the cache flush command is a nop), and many storage
devices interpret the command correctly and do the right thing when they
receive it. However, there are still a few storage systems which do not
interpret the cache flush command correctly and flush their battery
backed cache anyway. For such storage systems, preventing ZFS from
sending the cache flush command results in a big reduction in IO latency.
In our tests with Sysbench read-write, we saw a 30% improvement in
performance.

NOTE

Setting this tunable on a system without a battery backed cache can cause
inconsistencies in case of a crash.

When comparing ZFS with filesystems that blindly enable the write cache,
be sure to set this to get a fair comparison.

WHAT Prefer to cache within MySQL/Innodb over the ARC.

HOW Via my.cnf and by limiting the ARC size

WHY

You have multiple levels of caching when you are using MySQL/Innodb with
ZFS. Innodb has its own buffer pool and ZFS has the ARC. Both of them
make independent decisions on what to cache and what to flush, so it is
possible for both of them to cache the same data. By caching inside
Innodb, you get a much shorter (and faster) code path to the data.
Moreover, when the Innodb buffer pool is full, a miss in the Innodb
buffer pool can lead to flushing of a dirty buffer, even if the data was
cached in the ARC. This leads to unnecessary writes. Even though the ARC
dynamically shrinks and expands in response to memory pressure, it is
more efficient to just limit it. In our tests, we have found that it is
better (7-200%) to cache inside Innodb rather than ZFS.

NOTE

The ARC can be tuned to cache everything, just metadata, or nothing on a
per filesystem basis. See below for tuning advice about this.
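Here is a minimal sketch of sizing the two caches. The split is a hypothetical example (a 2GB ARC cap next to a 10GB Innodb buffer pool), not a recommendation. zfs_arc_max is the /etc/system tunable described in the ZFS Evil Tuning Guide, and innodb_buffer_pool_size is the my.cnf setting for the buffer pool. The script only prints the lines; it does not edit any file.

```shell
ARC_MAX_GB=2        # hypothetical ARC cap
BUFFER_POOL_GB=10   # hypothetical Innodb buffer pool size

# Line for /etc/system (takes effect after a reboot); the value is in bytes.
arc_max_line() {
    printf 'set zfs:zfs_arc_max = 0x%x\n' $(( $1 * 1024 * 1024 * 1024 ))
}

arc_max_line "$ARC_MAX_GB"

# Line for the [mysqld] section of my.cnf.
echo "innodb_buffer_pool_size = ${BUFFER_POOL_GB}G"

# Per-filesystem alternative (hypothetical dataset name), on a real pool:
#   zfs set primarycache=metadata tank/db/data
```

Measure with your own workload before settling on a split. The 7-200% range quoted above shows how much the benefit varies.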

WHAT Disable ZFS Prefetch.

HOW In /etc/system: set zfs:zfs_prefetch_disable = 1

WHY

Most filesystems implement some kind of prefetch. ZFS prefetch detects
linear (increasing and decreasing), strided, and multiblock strided IO
streams and issues prefetch IO when it will help performance. These
prefetch IOs have a lower priority than regular reads and are generally
very beneficial. ZFS also has a lower level prefetch (commonly called
vdev prefetch) to help with spatial locality of data.

In Innodb, rows are stored in order of the primary index. Innodb issues
two kinds of prefetch requests: one is triggered while accessing
sequential pages, and the other is triggered via random access in an
extent. While issuing prefetch IO, Innodb assumes that the file is laid
out in the order of the primary key. This is not true for ZFS. We have
yet to investigate the impact of Innodb prefetch.

It is well known that OLTP workloads access data in a random order and
hence do not benefit from prefetch. Thus we recommend that you turn off
ZFS prefetch.

NOTE

If you have changed the primary cache caching strategy to just cache
metadata, you will not trigger file level prefetch.

If you have set recordsize to 16k, you will not trigger the lower level
prefetch.

WHAT Disable Innodb Double write buffer.

HOW skip-innodb_doublewrite in my.cnf

WHY

Innodb uses a double write buffer for safely updating pages in a
tablespace. Innodb first writes the changes to the double write buffer
before updating the data page. This is to prevent partial writes. Since
ZFS does not allow partial writes, you can safely turn off the double
write buffer. In our tests with Sysbench read-write, we saw a 5%
improvement in performance.
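Here is a minimal sketch of the corresponding configuration. The function name doublewrite_off_fragment is made up for this example. The fragment is printed rather than written to a file, so it can be reviewed before being merged into my.cnf.

```shell
# Print a my.cnf fragment that disables the Innodb double write buffer.
doublewrite_off_fragment() {
    cat <<'EOF'
[mysqld]
skip-innodb_doublewrite
EOF
}

doublewrite_off_fragment

# After restarting mysqld, the setting can be checked with:
#   mysql -e "SHOW VARIABLES LIKE 'innodb_doublewrite'"
# which should report OFF.
```

Only apply this where every Innodb datafile lives on ZFS, since the reasoning above depends on ZFS never exposing a partially written page.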


Posted at 01:21PM May 26, 2009 by Neelakanth Nadgir in MySQL |
