Page 1: Understanding Disk I/O By Charles Pfeiffer (888) 235-8916 CJPfeiffer@RemoteControlDBA.com

Understanding Disk I/O

By Charles Pfeiffer

(888) 235-8916

CJPfeiffer@RemoteControlDBA.com

www.RemoteControlDBA.com

Agenda

Arrive      0900 – 0910
Section 1   0910 – 1000
Break       1000 – 1010
Section 2   1010 – 1100
Break       1100 – 1110
Section 3   1110 – 1200
Break       1200 – 1330
Section 4   1330 – 1420
Break       1420 – 1430
Section 5   1430 – 1520
Break       1520 – 1530
Q&A         1530 – 1630

Section 1

General Information
RAID
Throughput v. Response Time

Who Is This Guy?

Been an independent consultant for 11 years
Sun Certified Systems Administrator
Oracle Certified Professional
Taught Performance and Optimization class at Learning Tree
Taught UNIX Administration class at Virginia Commonwealth University
Primarily focus on complete system performance analysis and tuning

What Is He Talking About?

Disks are horrible!
– Disks are slow!
– Disks are a real pain to tune properly!
    Multiple interfaces and points of bottlenecking!
    What is the best way to tune disk IO? Avoid it!
– Disks are sensitive to minor changes!
– Disks don’t play well in the SAN Box!
– You never get what you pay for!
– Thankfully, disks are cheap!

What Is He Talking About? (continued)

Optimize IO for specific data transfers
– Small IO is easy, based on response time
    Improved with parallelism, depending on IOps
    Improved with better quality disks
– Large IO is much more difficult
    Increase transfer size. Larger IO slows response time!
    Spend money on quantity, not quality. Stripe wider!
You don’t get what you expect (label spec)
– You don’t even come close!

Where Do Vendors Get The Speed Spec From?

160 MBps capable does not mean 160 MBps sustained
– Achieved in optimal conditions
    Perfectly sized and contiguous disk blocks
    Streamlined disk processing
– Achieved via a disk-to-disk transfer
    No OS or FileSystem

What Do I Need To Know?

What is good v. bad?
What are realistic expectations in different cases?
How can you get the real numbers for yourself?
What should you do to optimize your IO?

Why Do I Care?

IO is the slowest part of the computer
IO improves slower than other components
– CPU performance doubles every year or two
– Memory and disk capacity double every year or two
– Disk IO Throughput doubles every 10 to 12 years!
A cheap way to gain performance
– Disks are bottlenecks!
– Disks are cheap. SANs are not, but disk arrays are!

What Do Storage Vendors Say?

Buy more controllers
– Sure, if you need them
– How do you know what you need?
– Don’t just buy them to see if it helps
Buy more disks
– Average SAN disk performs at < 1%
– 50 disks performing at 1% = ½ disk
– Try getting 20 disks to perform at 5% instead (= 1 whole disk)

What Do Storage Vendors Say? (continued)

Buy more cache
– Sure, but it’s expensive
– Get all you can get out of the cheap disks first
Fast response time is good
– Not if you are moving large amounts of data
– Large transfers shouldn’t get super-fast response time
– Fast response time means you are doing small transfers

What Do Storage Vendors Say? (continued)

Isolate the IO on different subsystems
– Just isolate the IO on different disks
    Disks are the bottleneck, not controllers, cache, etc.
– Again, expensive. Make sure you are maximizing the disks first.

What Do Storage Vendors Say? (continued)

Remove hot spots
– Yes, but don’t do this blindly!
– Contiguous blocks reduce IOps
– Balance contention (waits) v. IOps (requests) carefully!
RAID-5 is best
– No, it’s not. It’s just easier for them!

The Truth About SAN

SAN = scalability
– Yeah, but internal disk capacity has caught up
SAN != easy to manage
SAN = performance
– Who told you that lie?
– SAN definitely != performance

The Truth About SAN (continued)

But I can stripe wider and I have cache, so performance must be good
– You share IO with everyone else
– You have little control over what is on each disk
    Hot Spots v. Fragmentation
    Small transfer sizes
    Contention

How Should I Plan?

What do you need?
– Quick response for small data sets
– Move large chunks of data fast
– A little of both
Corvettes v. Dump Trucks
– Corvettes get from A to B fast
– Dump Trucks get a ton of dirt from A to B fast

RAID Performance Penalties

Loss of performance for RAID overhead
Applies against each disk in the RAID
The penalties are:
– RAID-0 = None
– RAID-1, 0+1, 10 = 20%
– RAID-2 = 10%
– RAID-3, 30 = 25%
– RAID-4 = 33%
– RAID-5, 50 = 43%
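
The penalty list can be kept as a small lookup table. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the penalty is applied as a straight percentage reduction of each disk's raw IOps.

```python
# Per-disk RAID overhead penalties as listed on the slide
# (fraction of each disk's performance lost to RAID overhead).
RAID_PENALTY = {
    "0": 0.00,
    "1": 0.20, "0+1": 0.20, "10": 0.20,
    "2": 0.10,
    "3": 0.25, "30": 0.25,
    "4": 0.33,
    "5": 0.43, "50": 0.43,
}


def raid_factor(level):
    """Fraction of raw per-disk performance left after the RAID penalty."""
    return 1.0 - RAID_PENALTY[level]


def effective_disk_iops(raw_iops, level):
    """Hypothetical helper: one disk's IOps after the penalty is applied."""
    return raw_iops * raid_factor(level)
```

For example, `effective_disk_iops(100, "5")` leaves roughly 57 of a disk's 100 raw IOps, while the same disk in RAID-10 keeps about 80.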

Popular RAID Configurations

RAID-0 (Stripe or Concatenation)
– Don’t concatenate unless you have to
– No fault-tolerance, great performance, cheap
RAID-1 (Mirror)
– Great fault-tolerance, no performance gain, expensive
RAID-5 (Stripe With Parity)
– Medium fault-tolerance, low performance gain, cheap

Popular RAID Configurations (continued)

RAID-0+1 (Two or more stripes, mirrored)
– Great performance/fault-tolerance, expensive
RAID-10 (Two or more mirrors, striped)
– Great performance/fault-tolerance, expensive
– Better than RAID-0+1
– Not all hardware/software offer it yet

RAID-10 Is Better Than RAID-0+1

Given: six disks
– RAID-0+1
    Stripe disks one through three (Stripe A)
    Stripe disks four through six (Stripe B)
    Mirror stripe A to stripe B
    Lose disk two. Stripe A is gone
    Requires you to rebuild the stripe

RAID-10 Is Better Than RAID-0+1

– RAID-10
    Mirror disk one to disk two
    Mirror disk three to disk four
    Mirror disk five to disk six
    Stripe all six disks
    Lose disk two. Just disk two is gone
    Only requires you to rebuild disk two as a submirror
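
The six-disk comparison can be written down as a tiny model. This is a sketch of the slides' reasoning only, with a hypothetical function name and disks numbered from 1 as in the example. It does not model any particular volume manager.

```python
from typing import List


def disks_to_rebuild(layout: str, failed_disk: int, stripe_width: int = 3) -> List[int]:
    """Return the disks that must be rebuilt after one disk fails."""
    if layout == "0+1":
        # Disks 1..3 form stripe A and 4..6 form stripe B; the mirror sits on top.
        # A stripe has no redundancy, so one failed disk takes out its whole stripe.
        first = ((failed_disk - 1) // stripe_width) * stripe_width + 1
        return list(range(first, first + stripe_width))
    if layout == "10":
        # Disks are mirrored in pairs first, then striped;
        # only the failed submirror has to be rebuilt.
        return [failed_disk]
    raise ValueError("layout must be '0+1' or '10'")
```

Losing disk two gives `[1, 2, 3]` for RAID-0+1 and `[2]` for RAID-10, which is the whole argument for RAID-10.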

The Best RAID For The Job

Throughput Is Opposite Of Response Time

Common Throughput Speeds (MBps)

Serial = 0.014
IDE = 16.7, Ultra IDE = 33
USB1 = 1.5, USB2 = 60
Firewire = 50
ATA/100 = 100, SATA = 150, Ultra SATA = 187.5

Common Throughput Speeds (MBps) (continued)

FW SCSI = 20, Ultra SCSI = 40, Ultra2 SCSI = 80, Ultra160 SCSI = 160
Ultra320 SCSI = 320
Gb Fiber = 120, 2Gb Fiber = 240, 4Gb Fiber = 480

Expected Throughput

Vendor specs are maximum (burst) speeds
You won’t get burst speeds consistently
– Except for disk-to-disk with no OS (e.g. EMC BCV)
So what should you expect?
– Fiber = 80% as best-case in ideal conditions
– SCSI = 70% as best-case in ideal conditions
– Disk = 60% as best-case in ideal conditions
– But even that is before we get to transfer size
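
As a quick illustration of the best-case percentages, the sketch below scales a vendor burst figure down to a realistic ceiling. The function and table names are hypothetical. The result is still an upper bound, because transfer size has not been considered yet.

```python
# Best-case share of the vendor burst speed, per the slide.
BEST_CASE_FACTOR = {"fiber": 0.80, "scsi": 0.70, "disk": 0.60}


def expected_mbps(burst_mbps, kind):
    """Best-case sustained MBps for a given burst spec and interface type."""
    return burst_mbps * BEST_CASE_FACTOR[kind]
```

An Ultra320 SCSI bus would top out near `expected_mbps(320, "scsi")`, about 224 MBps, and a 160 MBps disk near 96 MBps.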

BREAK

See you in 10 minutes

Section 2

Transfer Size
Mkfile
Metrics

Transfer Size

Amount of data moved in one IO
Must be contiguous block IO
– Fragmentation carries a large penalty!
Device IOps limits restrict throughput
Maximum transfer size allowed is different for different file systems and devices
Is Linux good or bad for large IO?

Transfer Size Limits

Controllers = Unlimited
Disks and W2K3 NTFS = 2 MB
– Remember the vendor Speed Spec
W2K NTFS, VxFS and UFS = 1 MB

Transfer Size Limits (continued)

NT NTFS and ext3 = 512 KB
ext2 = 256 KB
FAT16 = 128 KB
Old Linux = 64 KB
FAT = 32 KB

So Linux Is Bad?!

Again, what are you using the server for?
– Transactional (OLTP) DB = fine
– Web server, small file share = fine
– DW, large file share = Might be a problem!

Good Transfer Sizes

Small IO / Transactional DB
– Should be 8K to 128K
– Tend to average 8K to 32K
Large IO / Data Warehouse
– Should be 64K to 1M
– Tend to average 16K to 64K
    Not very proportional compared to Small IO!
    And it takes some tuning to get there!

Find Your AVG Transfer Size

iostat -exn (from a live Solaris server)

                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    2.8    1.1  570.7  365.3  0.0  0.1    2.9   19.0   1   3   0   0   0   0 d10

– Average transfer size = (kr/s + kw/s) / (r/s + w/s)
– (570.7 + 365.3) / (2.8 + 1.1) = 240K
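
The same arithmetic as a small function, in case you want to run it over many samples. The function name is hypothetical. The inputs are the four iostat columns used in the formula.

```python
def avg_transfer_kb(reads_per_s, writes_per_s, kb_read_per_s, kb_written_per_s):
    """Average transfer size in KB: (kr/s + kw/s) / (r/s + w/s)."""
    ops = reads_per_s + writes_per_s
    if ops == 0:
        return 0.0  # idle device, no transfers to average
    return (kb_read_per_s + kb_written_per_s) / ops


# Values from the d10 sample line above; prints about 240.
print(round(avg_transfer_kb(2.8, 1.1, 570.7, 365.3)))
```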

Find Your AVG Transfer Size (continued)

PerfMon

Find Your AVG Transfer Size (continued)

AVG Disk Bytes / AVG Disk Transfers
– Allow PerfMon to run for several minutes
– Look at the average field for Disk Bytes/sec
– Look at the average field for Disk Transfers/sec

The mkfile Test

Simple, low-overhead write of a contiguous (as much as possible) empty file
– Really is no comparison! Get cygwin/SFU on Windows to run the same test
‘time mkfile 100m /mountpoint/testfile’
– Real is the total elapsed (wall-clock) time
– Sys is CPU time spent in the kernel on the command’s behalf (working with the hardware, writing blocks)
– User is CPU time spent in the command’s own user-space code

The mkfile Test (continued)

User time should be minimal
– CPU time spent in user space, outside the kernel
    Not interacting with hardware
– Waiting for user input is not user time. A prompt (like one to overwrite a file) adds to real time instead, so avoid prompts during the test

The mkfile Test (continued)

System time should be 80% of real time
– Time in system space in the kernel
    Interacting with hardware
    Doing what you want, reading from disk, etc.
Real – (System + User) = WAIT
– Any time not directly accounted for by the kernel is time spent waiting for a resource
– Usually this is waiting for disk access
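
Where mkfile is not available, a rough stand-in can be sketched in Python. This is not mkfile: the function names, the 1 MB chunk size, and the fsync at the end are all assumptions made for the illustration. It only shows how real, user, and system time combine into the WAIT figure.

```python
import os
import tempfile
import time


def timed_write(path, size_bytes, chunk=1024 * 1024):
    """Write size_bytes of zeros to path and return (real, user, sys) in seconds."""
    buf = b"\0" * chunk
    cpu_before = os.times()
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(buf[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the device
    real = time.perf_counter() - start
    cpu_after = os.times()
    return real, cpu_after.user - cpu_before.user, cpu_after.system - cpu_before.system


def wait_seconds(real, user, sys_time):
    """Real - (System + User) = WAIT, clamped at zero for timer noise."""
    return max(real - (user + sys_time), 0.0)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        real, user, sys_time = timed_write(os.path.join(d, "testfile"), 100 * 1024 * 1024)
        print("real %.2f user %.2f sys %.2f wait %.2f"
              % (real, user, sys_time, wait_seconds(real, user, sys_time)))
```

Point the path at the filesystem you want to test rather than a temporary directory. The interpreter adds user-time overhead that mkfile does not have, so treat the numbers as relative.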

The mkfile Test (continued)

Common causes for waits
– Resource contention (disk or non-disk)
– Disks are too busy
    Need wider stripes
    Not using all of the disks in a stripe
– Disks repositioning
    Many small transfers due to fragmentation
    Bad block/stripe/transfer sizes

The Right Block Size

Smaller for small IO, bigger for large IO
– The avg size of data written to disk per individual write
– In most cases you want to be at one extreme
    As big as you can for large IO / as small as you can for small IO
    Balance performance v. wasted space. Disks are cheap!
Is there an application block size?
– OS block size should be <= app block size

More iostat Metrics

iostat -exn (from a live Solaris server)

                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    2.8    1.1  570.7  365.3  0.0  0.1    2.9   19.0   1   3   0   0   0   0 d10

– %w (wait) = 1. Should be <= 10.
– %b (busy) = 3. Should be <= 60.
– asvc_t = 19 (ms response). Most argue that this should be <= 5, 10 or 20 with today’s technology. Again, response v. throughput.
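
The thresholds can be checked mechanically. The sketch below assumes the 15-column `iostat -exn` layout shown in the sample and uses the loosest asvc_t limit (20 ms) as its default. The function names are hypothetical.

```python
IOSTAT_FIELDS = [
    "r/s", "w/s", "kr/s", "kw/s", "wait", "actv", "wsvc_t", "asvc_t",
    "%w", "%b", "s/w", "h/w", "trn", "tot", "device",
]


def parse_iostat_line(line):
    """Turn one data line of `iostat -exn` output into a dict of column values."""
    values = line.split()
    if len(values) != len(IOSTAT_FIELDS):
        raise ValueError("unexpected number of columns")
    row = dict(zip(IOSTAT_FIELDS, values))
    return {k: (v if k == "device" else float(v)) for k, v in row.items()}


def iostat_warnings(row, max_wait_pct=10, max_busy_pct=60, max_asvc_ms=20):
    """Return a list of threshold violations for one device."""
    warnings = []
    if row["%w"] > max_wait_pct:
        warnings.append("%s: %%w above %s" % (row["device"], max_wait_pct))
    if row["%b"] > max_busy_pct:
        warnings.append("%s: %%b above %s" % (row["device"], max_busy_pct))
    if row["asvc_t"] > max_asvc_ms:
        warnings.append("%s: asvc_t above %s ms" % (row["device"], max_asvc_ms))
    return warnings
```

The d10 sample passes with the defaults. Tightening `max_asvc_ms` to 5 flags its 19 ms service time, which is the response v. throughput trade-off again.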

iostat On Windows

Not so easy
– PerfMon can get you %b
    Physical Disk > % Disk Time
– Not available in cygwin or SFU
– So what do you do for %w or asvc_t?
    Not much
    You can ID wait issues as demonstrated later
    Depend on the array/SAN tools

vmstat Metrics

vmstat

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  0 163608  77620      0      0    3    1     1     0    5    11  1  3 96  0

– b+w = (blocked/waiting) processes
– Should be <= # of logical CPUs
– us(er) v. sy(stem) CPU time

vmstat Metrics (continued)

Is low CPU idle bad?
– Low is not 0
– Idle cycles = money wasted
– Need to be able to process all jobs at peak
– Don’t need to be able to process all jobs at peak and have idle cycles for show!
– Better off watching the run/wait/block queues
– Run queue should be <= 4 * # of logical CPUs
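
The two queue rules (b+w at most the number of logical CPUs, run queue at most four per logical CPU) are easy to script. A minimal sketch with a hypothetical function name, taking the r, b and w columns from vmstat:

```python
def vmstat_warnings(r, b, w, logical_cpus):
    """Check the run queue and blocked/waiting queue against the rules of thumb."""
    warnings = []
    if b + w > logical_cpus:
        warnings.append("blocked/waiting processes exceed logical CPU count")
    if r > 4 * logical_cpus:
        warnings.append("run queue exceeds 4 per logical CPU")
    return warnings
```

On builds that fold w into b (as the Cygwin port does), pass 0 for w.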

vmstat On Windows

Cygwin works (b/w consolidated to b)

vmstat On Windows (continued)

PerfMon
– System time = (100% – idle time) – user time

vmstat On Windows (continued)

PerfMon
– Run Queue is per processor (<= 4)
– Block/Wait queue is blocking queue length

Additional Metrics

Do not swap!
– On UNIX you should never swap
    Use your native OS commands to verify
    Don’t trust vmstat
– On Windows some swap is OK
    Use PerfMon to check Pages/sec (should be <= 100)
    Use ‘free’ in cygwin

Additional Metrics (continued)

Network IO issues will make your server appear slow
‘netstat -in’ displays errors/collisions
– Collisions are common on auto-negotiate networks
– Hard set the switch and server link speed/mode
Use ‘net statistics workstation’ on Windows

BREAK

See you in 10 minutes

Section 3

Measuring Oracle IO
IO Factors/Equations
Striping A Stripe

Measuring Oracle IO

Install Statspack
– @?/rdbms/admin/spcreate
Schedule snapshots
– @?/rdbms/admin/spauto
Take your own snapshots
– exec statspack.snap;
Get a report
– @?/rdbms/admin/spreport
– Everybody gets a report

Measuring Oracle IO (continued)

Read the report
– Instance Efficiency Percentages
    Buffer hit %
    Execute to Parse %
    In-memory sort %
– Top 5 Timed Events
    Db file sequential read is usually at the top and is in the most need of tuning

Measuring Oracle IO (continued)

Queries
– Check Elapsed Time / Executions to find the long-running queries
– Don’t forget to tune semi-fast queries that are executed many times
Tablespace/Datafile IO
– Physical reads
– Identify hot spots
– May need to move/add files

Measuring Oracle IO (continued)

Memory Advisories
– Buffer cache
– PGA
– Shared Pool

IO Performance Factors

Controller overhead = 0.3 ms
Burst controller/disk speed = varies. Vendor spec.
Average Transfer Size = varies. Can be anything between the block size and the lesser of device/FS/OS limitation
Average Seek Time = varies. Vendor spec. Most range between 1 and 10 ms

IO Equations

Controller Transfer Time (ms) = <avg. transfer size> / <burst controller speed> + <controller overhead>
Controller IOps Limit = 1000 / <controller transfer time>
Controller Transfer Rate = <controller iops limit> * <avg. transfer size>

IO Equations (continued)

Rotational Delay (ms) = 1000 / (RPM / 30) (half a revolution, e.g. 3 ms at 10,000 RPM)
IO Time (ms) = <avg. transfer size> / <disk burst speed> + <avg. seek time> + <rotational delay>
Disk IOps Limit = 1000 / <io time> * <RAID factor>
Disk Transfer Rate = <disk iops limit> * <avg. transfer size>
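
The controller and disk equations can be sketched as follows. Several things here are assumptions rather than statements from the slides: transfer sizes are in KB and burst speeds in MBps (so KB / MBps is treated as milliseconds), rotational delay is taken as half a revolution (30000 / RPM ms), and the RAID factor is read as 1 minus the RAID penalty. All input numbers are hypothetical.

```python
def controller_transfer_time_ms(xfer_kb, burst_mbps, overhead_ms=0.3):
    """Time for one controller transfer: size / burst speed + controller overhead."""
    return xfer_kb / burst_mbps + overhead_ms


def rotational_delay_ms(rpm):
    """Average rotational delay: half a revolution, i.e. 30000 / RPM ms."""
    return 30000.0 / rpm


def disk_io_time_ms(xfer_kb, burst_mbps, seek_ms, rpm):
    """Time for one disk IO: transfer + average seek + rotational delay."""
    return xfer_kb / burst_mbps + seek_ms + rotational_delay_ms(rpm)


def iops_limit(io_time_ms, raid_factor=1.0):
    """IOs per second for a given per-IO time; raid_factor assumed to be 1 - penalty."""
    return 1000.0 / io_time_ms * raid_factor


def transfer_rate_kb_per_s(iops, xfer_kb):
    """Transfer rate = IOps limit * average transfer size (KB per second)."""
    return iops * xfer_kb


# Hypothetical example: 64K transfers, Ultra320 controller,
# 160 MBps disk with 5 ms seek at 10,000 RPM in a RAID-10 (factor 0.80).
ctrl_ms = controller_transfer_time_ms(64, 320)
disk_ms = disk_io_time_ms(64, 160, seek_ms=5, rpm=10000)
ctrl_iops = iops_limit(ctrl_ms)
disk_iops = iops_limit(disk_ms, raid_factor=0.80)
print(ctrl_iops, disk_iops, transfer_rate_kb_per_s(disk_iops, 64))
```

Seek and rotational delay dominate the disk side, which is why a disk ends up far below its label speed at modest transfer sizes.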

IO Equations (continued)

Optimal Disks Per Controller = <controller IOps limit> / <disk IOps limit>
NOT controller speed spec / disk speed spec
IOps weigh heavier against disks than against controllers
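
A short sketch contrasting the two ratios. The function names and example numbers are hypothetical, and rounding down to a whole disk is an assumption on my part.

```python
import math


def optimal_disks_per_controller(controller_iops_limit, disk_iops_limit):
    """IOps-based ratio, rounded down to whole disks."""
    return math.floor(controller_iops_limit / disk_iops_limit)


def naive_disks_per_controller(controller_speed_mbps, disk_speed_mbps):
    """The speed-spec ratio the slide warns against."""
    return controller_speed_mbps / disk_speed_mbps


# A controller good for 2000 IOps and disks good for 119 IOps each:
print(optimal_disks_per_controller(2000, 119))   # IOps-based answer
print(naive_disks_per_controller(320, 160))      # label-spec answer
```

With these made-up figures the IOps ratio suggests 16 disks per controller while the label specs suggest only 2, because each IO costs a disk far more time than it costs the controller.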

IO Equations (continued)

Stripe Size = (<block size> * <app multiblock read/write count>) / <# of data disks in the stripe>
  or <max transfer size> / <# of data disks in the stripe>
What if I have nested stripes? (Don’t!)
– Outer Stripe Size = (<block size> * <app multiblock read/write count>) / <# of inner stripes in the outer stripe>
  or <max transfer size> / <# of inner stripes in the outer stripe>
– Inner Stripe Size = <outer stripe size> / <data disks in the inner stripe>

Striping A Stripe

Nested stripes must be planned carefully
– The wrong stripe sizes can lead to degraded performance and wasted space
Assume we have 16 disks
– The backend is configured as four RAID-5 luns, each one containing four disks
– We want to stripe the four luns into one large volume on the OS with DiskSuite
Set Block Size high (e.g. 8K) and assume 32 for multiblock count

Striping A Stripe (continued)

The outer stripe size should = 64K
– 8K * 32 / <number of inner stripes (4) in the outer stripe>
The inner stripe size should = 16K
– <outer stripe size (64K)> / <number of disks (4) in the inner stripe>
Can’t always be dead on
– Round down to the next available size
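
The stripe-size arithmetic for this example can be sketched as below. The function names are hypothetical. Treating the "or max transfer size" alternative as a cap (use whichever IO size is smaller) is my reading of the formula, not something the slides state.

```python
def stripe_size_kb(block_kb, multiblock_count, columns, max_transfer_kb=None):
    """Stripe size = (block size * multiblock count) / columns.

    If max_transfer_kb is given, the IO size is capped at it first (assumption).
    """
    io_kb = block_kb * multiblock_count
    if max_transfer_kb is not None:
        io_kb = min(io_kb, max_transfer_kb)
    return io_kb / columns


def round_down_to_available(size_kb, available_kb):
    """Pick the largest available stripe size that does not exceed size_kb."""
    candidates = [s for s in available_kb if s <= size_kb]
    if not candidates:
        raise ValueError("no available stripe size is small enough")
    return max(candidates)


outer = stripe_size_kb(8, 32, columns=4)   # four inner stripes in the outer stripe
inner = outer / 4                          # four disks in each inner stripe
print(outer, inner)
```

If the volume manager only offered 16K, 32K, 64K and 128K, a computed 48K would be rounded down to 32K by `round_down_to_available(48, [16, 32, 64, 128])`.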

Striping A Stripe (continued)

We throw out parity disks and just use data disks for the illustrations in this example
Whiteboard

Striping A Stripe (continued)

We need to write 256K of data
– Data is divided into 64K chunks
– Each 64K chunk is handed to one column in the outer stripe (a column represents an inner stripe set)
– Each 64K chunk is divided into 16K chunks
– Each 16K chunk is written to one column (one disk) in the inner stripe
– Perfect fit. All disks are used equally.

Striping A Stripe (continued)

64K Outer Stripe Size Diagram – 16K to each inner stripe

Striping A Stripe (continued)

Same scenario, but use a 32K outer stripe size with the 16K inner stripe size
Data divided into 32K chunks
Each 32K chunk handed to one column in the outer stripe
Each 32K chunk divided into two 16K chunks

Striping A Stripe (continued)

The 16K chunks are written to two disks
You lose up to half of the performance value for the write and for future reads.

Striping A Stripe (continued)

32K Outer Stripe Size Diagram – 16K to each inner stripe

Striping A Stripe (continued)

Same scenario, 128K outer stripe size
Data is divided into two 128K chunks
Third and fourth RAID-5 sets (inner stripe columns) are never hit
Data fits nicely within the other two RAID sets
– 128K divided into 16K chunks
– Two chunks written to each of four disks

Striping A Stripe (continued)

128K Outer Stripe Size Diagram – 16K to each inner stripe

Striping A Stripe (continued)

So you lost the use of half of the RAID-5 sets in your outer stripe
But you made good use of the other two
What if the outer stripe size had been 256K?
– Lose the use of all but one RAID set
– Basically, only use four of the 16 disks
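
The four scenarios can be compared with a small calculation. This follows the slides' simplified per-chunk view (how many outer columns a single write reaches, and how many disks each outer chunk spreads across). It is not an address map for DiskSuite or any other volume manager, and the function name is hypothetical.

```python
import math


def nested_stripe_usage(write_kb, outer_kb, inner_kb, outer_cols=4, inner_disks=4):
    """Return (outer columns hit, disks used per outer chunk) for one write."""
    outer_chunks = math.ceil(write_kb / outer_kb)
    cols_hit = min(outer_cols, outer_chunks)
    inner_chunks = math.ceil(outer_kb / inner_kb)   # 16K pieces per outer chunk
    disks_per_chunk = min(inner_disks, inner_chunks)
    return cols_hit, disks_per_chunk


for outer_kb in (32, 64, 128, 256):
    print(outer_kb, nested_stripe_usage(256, outer_kb, 16))
```

Only the 64K outer stripe reaches all four RAID sets and all four disks in each. 32K halves the disks used per chunk, while 128K and 256K leave two and three RAID sets idle.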

BREAK

See you in 10 minutes

Section 4

Oracle Disk Layout
Tuning
RamSan

Oracle Disk Layout

Many (myself included) say stripe wide
– Don’t do so at the expense of other good practices
– Separation of IO is as/more important than striping IO
    Depends on the type of IO
    Depends on the parallelism of the application
Stay away from ASM!
– Oracle loves to push/sell it
– Requires an extra DB
    ASM DB must be online for you to start your DB
– You lose control over what goes where

Oracle Disk Layout (continued)

Striping is good, but make sure you retain control
– You need to know what is on each disk. This theory kills the big SAN concept
– Redo logs should be on their own independent disks, even at the expense of striping, because they are perfectly sequential
– Tables and Indexes should be separated and striped very wide on their own set of disks
    If you have multiple high IO tablespaces then each of them should be contained on its own subset of disks
– Control files should be isolated and striped minimally (to conserve disks)

Disk Device Cache

Write Cache v. Read Cache
– Writers block writers
– Writers block readers
– Readers block writers
– Readers block readers
– Cache it all!
Cache is available in many places
    Disk, Controller, FileSystem, Kernel
    Don’t double-cache one and zero-cache the other

Disk Device Cache (continued)

Don’t double-cache reads if you have a lot of memory for buffering on the host. Use the disk system cache for writes.
– You read the same data many times, so it is easy to cache at the host
– Reads are faster than writes. We know where the blocks to read are located. We have to plan where to store the blocks for a write.

Sequential v. Random Optimization

Sequential IO is 10 times faster than Random IO
– Reorg/Defrag often to make data sequential
Cache writes to improve sequential layout percentage
Cache reads to aid with the performance of Random IO

Sequential v. Random Optimization (continued)

Random IO requires more disk seeks and more IOps
– Use small transfer/stripe/block sizes
– # of disks is less important
– Use disks with fast seek time
Sequential IO requires more throughput and streaming disks
– Use large transfer/stripe/block sizes
– Use a lot of disks
– Use disks with better RPM

Tune Something

Kernel Parameters
– MAXPHYS – maximum transfer size limit
    Yes, there is a limit that keeps you from reaching the maximum potential of the filesystem and/or disk device when you want to
    Who thought that was a good idea?
    Set it to 1M, which is the hard maximum

Tune Something (continued)

Kernel Parameters
– sd_max_throttle – Number of IO requests allowed to wait in queue for a busy device.
    Should be set to 256 / <number of luns>
– sd_io_time – Amount of time an IO request can wait before timing out.
    Should be set to 120 / <number of controllers>
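
The two sizing rules are simple divisions. The sketch below computes them and formats the result as Solaris `/etc/system` lines. Rounding down with integer division and the function names are assumptions. Check the exact `set sd:...` entries against the documentation for your Solaris release and disk driver before using them.

```python
def sd_max_throttle(num_luns):
    """256 / <number of luns>, rounded down, never below 1."""
    return max(256 // num_luns, 1)


def sd_io_time(num_controllers):
    """120 / <number of controllers>, rounded down, never below 1."""
    return max(120 // num_controllers, 1)


def etc_system_lines(num_luns, num_controllers):
    """Format the two values in /etc/system 'set module:variable=value' style."""
    return [
        "set sd:sd_max_throttle=%d" % sd_max_throttle(num_luns),
        "set sd:sd_io_time=%d" % sd_io_time(num_controllers),
    ]


# Hypothetical host with 4 luns behind 2 controllers.
for line in etc_system_lines(4, 2):
    print(line)
```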

Tune Something (continued)

Filesystem Parameters
– Maxcontig – maximum number of contiguous blocks.
    Should be <MAXPHYS> / <block size>. Set it really high if you aren’t sure. It is just a ceiling.
– Direct/Async IO & cache – Follow your application specs. If you don’t have app specs, try different combinations. Large, sequential writes should NOT be double-cached. Async is usually best, but there are no guarantees from app to app.
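
The maxcontig rule is one division. A minimal sketch with a hypothetical function name, using byte units for both inputs:

```python
def maxcontig(maxphys_bytes, block_bytes):
    """Maxcontig = <MAXPHYS> / <block size>, in filesystem blocks."""
    return maxphys_bytes // block_bytes


# MAXPHYS at 1M with an 8K filesystem block size.
print(maxcontig(1024 * 1024, 8 * 1024))
```

With MAXPHYS at 1M and 8K blocks this gives 128 blocks. Since the value is only a ceiling, rounding up is harmless.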

Tune Something (continued)

Filesystem Parameters
– noatime/dfratime – Why waste time updating inode access time parameters? They will be updated the next time some change happens to the file. Do you really need to know in between? If you do, fine, but this is extra overhead.
– Forcedirectio – Don’t cache writes. Good for large, sequential writes.

Tune Something (continued)

Filesystem Searching
– Many people like a small number of large filesystems because space management is easier
– Filesystems are also starting points for searches
– Searches are done using inodes
– Try not to have too many inodes in one filesystem

Tune Something (continued)

Driver (HBA, Veritas, etc.) Parameters
– Investigate conf files in /kernel/drv
– Check limits on transfer sizes (e.g. vol_maxio for Veritas). These should usually be set to 1M per controller.
– Check settings/limits for things like direct/async IO and cache. Make sure they fall in line with the rest of your configuration

Tune Something (continued)

Driver (HBA, Veritas, etc.) Parameters
– Parameters for block shifting if you are using DMP (e.g. Veritas’ dmp_pathswitch_blks_shift should be 15).
– lun_queue_depth – limits the number of queued IO requests per lun.
    Sun says 25. EMC says 32. Emulex says 20 (but their default is 30).
    This is very confusing. Anything between 20 and 32 is probably good?
    Well, it should really be <sd_max_throttle>.

Tune Something (continued)

Others.
– We could have a one week class.
– The previous parameters follow the 90/10 rule and give you the most bang for the buck.
    10% of the parameters will give you 90% of the benefits.
    This list is more like 3%, but still yields about 90% of the benefits

Tune Something (continued)

What about Windows?
– Sorry, not much we can do
  Can’t tune the kernel for Disk IO like you can for Network IO
  Can’t tune NTFS
  At the mercy of Microsoft’s “Best Fit”
– HBA drivers do have parameters that can be tuned in a config file or in the registry

RAMSAN

Do IO on RAM, not on disk
– Memory is much faster than disk!
  Random memory access outruns sequential disk access
– Bottleneck shifts from 320 MBps (haha!) disk to 4 Gbps Fibre Channel adapter
  Want more than 4 Gbps? Just get more HBAs
  What can your system bus(es) handle?
– No need to optimize transfer size, stripe, etc.

RAMSAN (continued)

Problem – data is lost when power is cycled
– Most RAMSANs have battery backup and flush to disk when power is lost
– Data is also flushed to disk throughout the day when performance levels are low
– Only blocks that have a new value are flushed to disk
  Block 1 is 0 and is flushed to disk
  Block 1 is updated to 1
  Block 1 is updated to 0
  Flush cycle runs, but block 1 doesn’t need to be copied to disk
  Major performance improvement over similar cache monitors
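
The block 1 example above can be sketched in a few lines of Python. This is only a toy model of the idea (compare the value in RAM with the value last written to disk, and copy only on a difference). The class name RamDisk and its methods are made up for illustration and say nothing about how any real RAMSAN product is implemented.

```python
class RamDisk:
    """Toy RAM-backed store that flushes only blocks whose value differs from disk."""

    def __init__(self):
        self.ram = {}         # block number -> current value in memory
        self.disk = {}        # block number -> last value written to disk
        self.disk_writes = 0  # number of physical block copies performed

    def write(self, block, value):
        # Writes land in RAM only; nothing touches disk until flush().
        self.ram[block] = value

    def flush(self):
        """Copy to disk only the blocks whose RAM value differs from the disk copy."""
        flushed = []
        for block, value in self.ram.items():
            if block not in self.disk or self.disk[block] != value:
                self.disk[block] = value
                self.disk_writes += 1
                flushed.append(block)
        return flushed


store = RamDisk()
store.write(1, 0)
print(store.flush())  # [1]  block 1 is new, so it is copied to disk
store.write(1, 1)
store.write(1, 0)
print(store.flush())  # []   value matches the disk copy again, nothing is copied
```

A cache that only tracks a "dirty" flag would have copied block 1 a second time here, which is the difference the slide is pointing at.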

RAMSAN (continued)

A leading product – TMS Tera-RamSan
– www.superssd.com
– 3,200,000 IOps
– 24 GBps
– Super High Dollar
– Everyone gets some PDFs

RAMSAN (continued)

Solid State Disks by:
– TMS
– Solid Data Systems
– Dynamic Solutions
– Infiniband

BREAK

See you in 10 minutes

Section 5

IO Calculator
Wrap Up

Disk IO Performance Calculator

Spreadsheet of Performance Equations and automated formulas

Allows you to plug-n-play numbers and gauge the performance impacts

Helps determine what you need to get the bottom line throughput you are looking for

Helps determine the number of disks you can use per controller

Disk IO Performance Calculator (continued)

Works for both large IO and small IO

Contains examples to provide a better understanding of how different IO components impact each other.

Let’s See The Calculator

                                                     Large         Small          More        Better        Better  Replay Large
                                                  Transfer      Transfer         Disks           RPM          Seek      Transfer
                                                      Size          Size                                                    Size
KB Written                                      1048576.00    1048576.00    1048576.00    1048576.00    1048576.00    1048576.00
# of Writes                                        4350.00     122314.00     122314.00     122314.00     122314.00       4350.00
Avg Transfer Size (KB)                              241.05          8.57          8.57          8.57          8.57        241.05
# of Controllers                                      2.00          2.00          2.00          2.00          2.00          2.00
Burst Controller Speed (KBps)                    204800.00     204800.00     204800.00     204800.00     204800.00     204800.00
Consistent Controller Speed (KBps)               163840.00     163840.00     163840.00     163840.00     163840.00     163840.00
Controller Overhead (ms)                              0.30          0.30          0.30          0.30          0.30          0.30
Controller Transfer Time (ms)                         1.48          0.34          0.34          0.34          0.34          1.48
Controller IOps                                    1354.09       5850.36       5850.36       5850.36       5850.36       1354.09
Consistent Controller Transfer Rate (KBps)       326404.98      50154.06      50154.06      50154.06      50154.06     326404.98

# of Disks                                           12.00         12.00         36.00         36.00         36.00         36.00
RAID Factor                                           0.80          0.80          0.80          0.80          0.80          0.80
Disk Burst Speed (KBps)                          327680.00     327680.00     327680.00     327680.00     327680.00     327680.00
Consistent Disk Speed (KBps)                     196608.00     196608.00     196608.00     196608.00     196608.00     196608.00
Avg Seek Time (ms)                                    6.00          6.00          6.00          6.00          3.00          3.00
RPM                                               10000.00      10000.00      10000.00      15000.00      15000.00      15000.00
Rotational Delay (ms)                                 3.00          3.00          3.00          2.00          2.00          2.00
IO Time (ms)                                          9.74          9.03          9.03          8.03          5.03          5.74
Disk IOps                                           986.07       1063.57       3190.72       3588.27       5730.02       5021.24
Consistent Disk Transfer Rate (KBps)             237693.73       9117.84      27353.51      30761.56      49122.42    1210380.31

Optimal Disks per Controller                          8.24         33.00         33.00         29.35         18.38          4.85
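
The slide shows only the spreadsheet's results, not its formulas. The Python sketch below uses relationships inferred by working backward from the numbers in the table. It is not the author's spreadsheet, and the function name calculate and its parameter names are made up for illustration. Under those assumptions it reproduces the first (Large Transfer Size) column to two decimal places.

```python
def calculate(kb_written, writes, controllers, controller_burst_kbps,
              controller_overhead_ms, disks, raid_factor, disk_burst_kbps,
              avg_seek_ms, rpm):
    """Reproduce one calculator column; the formulas are inferred from the table."""
    avg_transfer_kb = kb_written / writes

    # Controller side: fixed overhead plus time to move one transfer at burst speed.
    controller_ms = (controller_overhead_ms
                     + avg_transfer_kb / controller_burst_kbps * 1000)
    controller_iops = controllers * 1000 / controller_ms

    # Disk side: seek + half a rotation + time to move one transfer at burst speed.
    rotational_delay_ms = 60000 / rpm / 2
    io_ms = (avg_seek_ms + rotational_delay_ms
             + avg_transfer_kb / disk_burst_kbps * 1000)
    disk_iops = disks * raid_factor * 1000 / io_ms

    return {
        "avg_transfer_kb": avg_transfer_kb,
        "controller_iops": controller_iops,
        "controller_kbps": controller_iops * avg_transfer_kb,
        "io_time_ms": io_ms,
        "disk_iops": disk_iops,
        "disk_kbps": disk_iops * avg_transfer_kb,
        # IOps one controller can handle divided by IOps one disk can deliver.
        "disks_per_controller": (controller_iops / controllers) / (disk_iops / disks),
    }


# Inputs taken from the "Large Transfer Size" column of the table.
large = calculate(1048576, 4350, 2, 204800, 0.30, 12, 0.80, 327680, 6.0, 10000)
for name, value in large.items():
    print(f"{name:22s} {value:12.2f}")
```

Changing one argument at a time (disks, rpm, avg_seek_ms, writes) walks through the other columns in the same way the following slides do.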

Large Transfer Size v. Small Transfer Size

986 IOps v. 1,064 IOps
238 MBps v. 9 MBps
8 disks / controller v. 33 disks / controller

12 Disks v. 36 Disks (Small Transfer Size)

1,064 IOps v. 3,191 IOps
9 MBps v. 27 MBps
33 disks / controller v. 33 disks / controller

10K RPM v. 15K RPM (36 Disks, Small Transfer Size)

3,191 IOps v. 3,588 IOps
27 MBps v. 31 MBps
33 disks / controller v. 29 disks / controller

6ms Seek v. 3ms Seek (15K RPM, 36 Disks, Small Transfer)

3,588 IOps v. 5,730 IOps
31 MBps v. 49 MBps
29 disks / controller v. 18 disks / controller
About as good as it gets.
– 3ms Seek, 15K RPM
– Yet 36 disks on two controllers only push 49 MBps due to small (normal) transfer size

Back to Large Transfer Size (3 ms Seek, 15K RPM, 36 Disks)

5,730 IOps v. 5,021 IOps
49 MBps v. 1,210 MBps
18 disks / controller v. 5 disks / controller
1.2 GBps is pretty good
– But 36 disks * 160 MBps = 5.6 GBps
  Again, only in ideal test conditions
  Max Transfer Size on every transfer
  No OS/Filesystem overhead

Speed v. IOps

Notice we never came close to the speed threshold (multiply number of disks by consistent speed) for the disks before maxing out IOps

Notice that we did come close on two controllers with the large transfer size. If you push that much IO, you do need more controllers, but notice how big that number is
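
As a rough check of both observations, the sketch below divides the achieved transfer rates from the last calculator column (large transfers, 36 disks, two controllers) by the speed threshold, taken here as the number of units multiplied by the consistent speed. The helper name utilization is made up for illustration, and the input figures are copied from the calculator table.

```python
KBPS_PER_DISK = 196608.0        # "Consistent Disk Speed (KBps)" row
KBPS_PER_CONTROLLER = 163840.0  # "Consistent Controller Speed (KBps)" row


def utilization(achieved_kbps, units, kbps_per_unit):
    """Fraction of the speed threshold (units * consistent speed) actually used."""
    return achieved_kbps / (units * kbps_per_unit)


# Achieved rates from the "Replay Large Transfer Size" column.
disk_share = utilization(1210380.31, 36, KBPS_PER_DISK)
controller_share = utilization(326404.98, 2, KBPS_PER_CONTROLLER)
print(f"disks: {disk_share:.1%}, controllers: {controller_share:.1%}")
# disks: 17.1%, controllers: 99.6%
```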

Large IO Requires A Large Transfer Size

Large IO requires large (not necessarily fast) individual transfers

You have to tune your transfer size

Avoid fragmentation
– Use good stripe sizes
– Use good block sizes

Now Let’s Really See The Calculator

Refer To The Spreadsheet
– Everyone gets their own copy
– What tests do you want to run? Follow Along.
– Feel free to contact the developer at any time
  Charles Pfeiffer, CRT Sr. Consultant
  – (888) 235-8916
  – [email protected]

Summary

You don’t get the label spec in throughput. Not even close!

Throughput is the opposite of response time!

RAID decreases per-disk performance!
– Make up for it with more disks

Summary (continued)

Striping a stripe requires careful planning
– The wrong stripe size will decrease performance

Big money disk systems don’t necessarily have big benefits
– The range from high-quality to low-quality isn’t that severe
– Quantity tends to win out over quality in disks

Make your vendor agree to reasonable expectations!
– Use the IO Calculator!

This Presentation

This document is not for commercial re-use or distribution without the consent of the author

Neither CRT nor the author guarantees this document to be error-free

Submit questions/corrections/comments to the author:
– Charles Pfeiffer, [email protected]

BREAK

See you in 10 minutes

Are We Done Yet?

Final Q&A

Contact Me
– 804.901.3992
– [email protected]