
ZFS ARC Cache and ZIL


Sun Microsystems 2

ZFS Readzilla and Logzilla

Claudia Hildebrandt
System Engineer/Consultant
Sun Microsystems GmbH

Sun Microsystems 3

Hybrid Storage Pools

• Hybrid Storage Pools
> Pools with SSDs
> SSDs as write flash accelerators and separate log devices for the ZIL, aka Logzilla
> SSDs as read flash accelerators, aka Readzilla

• OpenStorage at this time
> 18 GB Logzilla
> 100 GB Readzilla

Sun Microsystems 4

Logzilla

• ZFS uses the ZFS Intent Log (ZIL) to meet POSIX synchronous write requirements

• By default, the ZIL allocates blocks within the main storage pool

• Better performance with a separate ZIL (slog): the ZIL is allocated on separate devices such as a dedicated disk, an SSD, or NVRAM

• # zpool add <pool_name> log <log_device1> <log_device2>

• Note: use mirrored log devices; RAID-Z is not supported for log devices
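
For example, a minimal sketch of attaching a mirrored slog, assuming a pool named tank and two SSDs seen as c4t0d0 and c4t1d0 (all names are illustrative, not from this deck):

# zpool add tank log mirror c4t0d0 c4t1d0
# zpool status tank

zpool status then lists the two devices under a separate "logs" section of the pool.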

Sun Microsystems 5

Readzilla

• Also known as the L2ARC: a secondary caching tier between the DRAM cache (the ZFS ARC) and disk

• ZFS ARC – the ZFS Adjustable Replacement Cache
> By default it stores ZFS data and metadata from all active storage pools in physical memory, using as much as possible while leaving about 1 GB of RAM free

> The ZFS ARC consumes free memory as long as it is available and releases memory to the system only when another application requests it

> With Readzilla, data leaving RAM can be cached on the read-optimized device as long as there is free space on it
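
A corresponding sketch for Readzilla: adding a read cache (L2ARC) device, again assuming a pool named tank and an SSD seen as c4t2d0 (names are illustrative):

# zpool add tank cache c4t2d0
# zpool iostat -v tank 5

zpool iostat -v shows the cache device in its own "cache" section together with its read and write activity.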

Sun Microsystems 6

ZFS Features
“All or nothing”

“Always consistent”

“Pooled storage Model”

“Self healing”

Sun Microsystems 7

ZFS Architecture

Sun Microsystems 8

ARC Overview and Purpose
• ZFS does not use the page cache like UFS does (except for mmap(2))

• Adaptive Replacement Cache
> Based on Megiddo & Modha (IBM), FAST 2003 – “ARC: A Self-Tuning, Low Overhead Replacement Cache”
> The ZFS ARC differs slightly in implementation
– ZFS: variable cache size and variable-sized contents, plus non-evictable contents

• The DMU uses the ARC to cache data objects, indexed by DVA

• 1 ARC per system

• 2 LRU (Least Recently Used) caches plus history
> Recency (MRU) and frequency (MFU)
– ARC data survives a large file scan
> 1c of cache and 1c of history (c = cache size)

Sun Microsystems 9

Adjustable Replacement Cache (ARC)

• Central point of memory management for the SPA
> Able to evict buffers in response to memory pressure

• Dynamic, adaptive and self-tuning
> The cache adjusts based on the I/O workload

• Scan-resistant

Sun Microsystems 10

6 states of arc_buf
• ARC_anon:
> Buffers not associated with a DVA
> They hold dirty block copies before they are written to storage
> They are counted as part of ARC_mru

• ARC_mru: recently used and currently cached

• ARC_mru_ghost: recently used, no longer in the cache

• ARC_mfu: frequently used and currently cached

• ARC_mfu_ghost: frequently used, no longer cached

• ARC_l2c_only: exists only in the L2ARC

• Ghost caches contain only ARC buffer headers, not data

Sun Microsystems 11

ARC Diagram of Caches

• MRU = Most Recently Used, MFU = Most Frequently Used. Both lists plus the ghost caches together are twice the size of the cache c

• The ARC adapts c (the cache size) and p (the target size of the MRU portion) in response to the workload

• ARC parameters are initialised to:
arc_c_min = MAX(1/32 of all mem, 64 MB)
arc_c_max = MAX(3/4 of all mem, all but 1 GB)
arc_c = MIN(1/8 physmem, 1/8 VM size)
arc_p = arc_c / 2

[Diagram: the ARC of size c split into MRU (size p) and MFU (size c - p), each with a ghost cache alongside]

Sun Microsystems 12

How it works

[Diagram: the ARC of total size c is split into an MRU list of size p and an MFU list of size c - p; each list has a matching ghost cache (MRU ghost, MFU ghost), and buffers age from the MRU end to the LRU end of each list]

Sun Microsystems 13

claudia@frodo:~/Downloads$ pfexec ./arc_summary.pl

System Memory:
        Physical RAM:  4052 MB
        Free Memory:   2312 MB
        LotsFree:      63 MB

ZFS Tunables (/etc/system):

ARC Size:
        Current Size:            772 MB (arcsize)
        Target Size (Adaptive):  3039 MB (c)
        Min Size (Hard Limit):   379 MB (zfs_arc_min)
        Max Size (Hard Limit):   3039 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:   50%   1519 MB (p)
        Most Frequently Used Cache Size: 49%   1519 MB (c-p)
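
The same figures can also be read straight from the arcstats kstat; a sketch assuming a Solaris/OpenSolaris system where the ZFS module exports zfs:0:arcstats (values are reported in bytes):

$ kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:p
$ kstat -p zfs:0:arcstats:c_min zfs:0:arcstats:c_max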

Sun Microsystems 14

Data is read

[Diagram: an arc_read request arrives for buffer A; the ARC of size c is shown with its MRU (= p) and MFU (= c - p) lists and their ghost caches]

Sun Microsystems 15

Data buffer is in MRU

[Diagram: buffer A has been inserted into the MRU list]

Sun Microsystems 16

Same data buffer read again

[Diagram: another arc_read request arrives for buffer A while it is still in the MRU list]

Sun Microsystems 17

Data buffer moves in MFU

[Diagram: buffer A moves from the MRU list into the MFU list]

Sun Microsystems 18

Cache fills up

[Diagram: new buffers B, C, D, E and F fill the MRU list while A remains in the MFU list]

Sun Microsystems 19

MRU data buffer is read again

[Diagram: an arc_read request arrives for buffer D, which is currently in the MRU list]

Sun Microsystems 20

MFU list is dynamically adjusted

[Diagram: buffer D moves from the MRU list into the MFU list]

Sun Microsystems 21

Data buffer in MFU is read again

[Diagram: an arc_read request arrives for buffer B, which is already in the MFU list]

Sun Microsystems 22

Data buffer moves to the 1st position

[Diagram: the re-read buffer moves to the head (most recently used end) of the MFU list]

Sun Microsystems 23

ARC Caches in Action

• If evicting during a cache insert, then:
> 1. Inserting into MRU & MRU < p: arc_evict(MFU)
> 2. Inserting into MRU & MRU > p: arc_evict(MRU)
> 3. Inserting into MFU & MFU < (c-p): arc_evict(MRU)
> 4. Inserting into MFU & MFU > (c-p): arc_evict(MFU)

• Buffers change state (i.e. cache) in response to access
> If the current state is MRU and at least ARC_MINTIME (62 ms) has passed since the last access, the new state is MFU
> All other repeated accesses result in a state of MFU
– Exception: prefetching in MRU or the ghost caches results in MRU
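
The per-list hit counters make these state changes visible; a sketch assuming the arcstats kstat is available on the system:

$ kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
$ kstat -p zfs:0:arcstats:mru_hits zfs:0:arcstats:mfu_hits
$ kstat -p zfs:0:arcstats:mru_ghost_hits zfs:0:arcstats:mfu_ghost_hits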

Sun Microsystems 24

Least recently used data buffer evicted

[Diagram: an arc_read request for a new buffer G arrives while the cache is full; the least recently used buffer is evicted from the MRU list and only its header (e) is kept in the MRU ghost cache]

Sun Microsystems 25

Least frequently used data buffer evicted

[Diagram: as further new buffers arrive, the least frequently used buffer is evicted from the MFU list and only its header (a) is kept in the MFU ghost cache]

Sun Microsystems 26

ARC Adapting and Adjusting
• Adapting... adapting to the workload
> When adding new content:
– If (hit in MRU_Ghost) then increase p
– If (hit in MFU_Ghost) then decrease p
– If (arc_size within (2 * maxblocksize) of c) then increase c

• Adjusting... adjusting contents to fit
> When shrinking or reclaiming:
– If (MRU > p) then arc_evict(MRU)
– If (MRU + MRU_Ghost > c) then arc_evict(MRU_Ghost)
– If (arc_size > c) then arc_evict(MFU)
– If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)
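
To watch p and c adapt while a workload runs, the kstat can be sampled at an interval; a sketch (values in bytes, printed every 5 seconds):

$ kstat -p zfs:0:arcstats:c zfs:0:arcstats:p 5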

Sun Microsystems 27

Data buffer not in cache

[Diagram: an arc_read request arrives for a buffer that is no longer in the MRU or MFU lists; only headers of previously evicted buffers remain in the ghost caches]

Sun Microsystems 28

ARC adaptive self tuning

[Diagram: after the hit in a ghost cache, the ARC adapts p, shifting the balance between the MRU and MFU portions of the cache]

Sun Microsystems 29

ARC is too small

[Diagram: arc_read requests keep arriving for buffers that have already been evicted and now exist only as headers in the ghost caches; frequent ghost hits like this indicate that the ARC is too small for the working set]

Sun Microsystems 30

ARC Reclaiming
• Reclaim... reclaiming kernel memory
> Every second (or sooner if adapting, or on a kmem callback)
> Check the VM parameters: freemem, lotsfree, needfree, desfree
> If required:
– Set arc_no_grow – suspends ARC adaptation and growth
– Set the aggressive reclaim policy, which triggers an ARC shrink
– Shrink by MAX(1/32 of the current size, VM needfree), down to arc_c_min
– Call arc_adjust() to adjust (i.e. evict) the cache contents to the new sizes
– Call kmem_cache_reap_now() on the ZIO buffer caches

• Megiddo/Modha said: “We think of ARC as dynamically, adaptively and continually balancing between recency and frequency - in an online and self-tuning fashion - in response to evolving and possibly changing access patterns”
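
If the ARC must be capped below its default maximum, for example to leave memory to a database, the zfs_arc_max tunable shown in the arc_summary output earlier can be set in /etc/system; a hedged sketch, the 2 GB value is purely illustrative and a reboot is required:

set zfs:zfs_arc_max = 0x80000000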

Sun Microsystems 31

L2ARC

• Enhances the ARC
• A second cache layer between main memory and disk, typically on SSD
• Boosts random read performance
• Devices used can be:
> Short-stroked disks
> Solid state disks
> Devices with lower read latency than the pool disks

Sun Microsystems 32

L2ARC – How does it populate the cache?

• The L2ARC attempts to cache data from the ARC before it is evicted
> There is no eviction path from the ARC into the L2ARC

• A kernel thread scans the eviction lists of the MRU/MFU and copies the buffers to the L2ARC devices
> Refer to l2arc_feed_thread()
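
L2ARC feed activity and hit rates are exported through the same arcstats kstat; a sketch assuming the l2_* statistics are present in this build:

$ kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses zfs:0:arcstats:l2_size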

Sun Microsystems 33

L2ARC – Tuning

• The performance of the L2ARC can be tweaked with a number of tunables, which may be necessary for different workloads:
> l2arc_write_max: maximum bytes written per interval
> l2arc_noprefetch: skip caching prefetched buffers
> l2arc_headroom: number of max-device-writes worth of buffers to precache
> l2arc_feed_secs: seconds between L2ARC writes
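
These are kernel variables in the zfs module and can be set in /etc/system like other ZFS tunables; a hedged sketch, the values are illustrative only and defaults differ between builds (reboot required):

set zfs:l2arc_noprefetch = 0
set zfs:l2arc_write_max = 0x1000000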

Sun Microsystems 34

ZIL – ZFS Intent Log

• Filesystems buffer write requests and sync them to storage periodically to improve performance

• On power loss, filesystems can become corrupted and/or lose data
> Corruption is solved by TXG commits – always-consistent on-disk state

• Applications that require data to be flushed to the stable pool by the time a system call returns use synchronous semantics
> Open the file with O_DSYNC
> Flush buffered contents with fsync(3C)

• The ZIL provides these synchronous semantics for ZFS

Sun Microsystems 35

ZIL Operational Overview
• The ZFS Intent Log (ZIL) saves in-memory transaction records of the system calls that change the file system, with enough information to replay them

• ZFS operations are organized by the DMU as transactions. Whenever a DMU transaction is opened, a ZIL transaction is opened as well
> A log record holds one system call transaction
> A log block can hold many log records, and blocks are chained together
> Log blocks are dynamically allocated and freed as needed
> a) ZIL blocks are freed on TXG commit by the DMU (discard)
> b) or flushed to stable storage due to synchronous requirements, e.g. fsync(3C) or O_DSYNC

• In the event of a power failure or panic, the transactions are replayed from the ZIL

• 1 ZIL per file system

Sun Microsystems 36

ZIL

• The ZIL resides in memory or on disk

• The ZIL gathers in-memory transactions of system calls and pushes the list out to a per-filesystem on-disk log

• ZIL logs are written to disk in variable block sizes
> min. 4 KB, max. 128 KB

Sun Microsystems 37

Separate ZIL

• Enables the use of limited-capacity but fast block devices such as NVRAM and SSDs

• ZIL allocation from the main pool leads to pool fragmentation

• Performance improvement
> Databases and NFS rely on speed and need assurance that the data is not lost
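
Whether a separate log device is actually absorbing the synchronous writes can be checked with per-vdev statistics; a sketch assuming a pool named tank with a slog attached (name illustrative):

# zpool iostat -v tank 5

During fsync(3C)/O_DSYNC-heavy workloads the write operations should show up against the device listed in the "logs" section.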

Sun Microsystems 38

ZFS Hybrid Storage Pool

Sun Microsystems 39

OpenStorage – 7000 Series

• Logzilla devices – 18 GB flash-based SSDs backed up by a supercapacitor> 10,000 write IOPS

• Readzilla devices – up to 6 100 GB read optimized SSDs> 50 -100 micro seconds

Sun Microsystems 40

Sun Storage 7000 Unified Storage System

Sun Microsystems 41

NEW

• Since 2009.03
> Triple-parity RAID-Z (RAID-Z3)
> Triple mirroring storage profile
> Enhanced iSCSI support
> InfiniBand support
> Improved management

Sun Microsystems 42

Links Hybrid Storage Pools

http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future
http://blogs.sun.com/ahl/entry/hsp_goes_glossy

Demo: Storage Simulator
http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp?intcmp=2992

Sun Microsystems 43

Thank you very much

Claudia Hildebrandt
[email protected]
