Database I/O Mechanisms
Performance and persistence
Richard Banville, Fellow, OpenEdge Development, Progress Software
© 2013 Progress Software Corporation. All rights reserved.
Agenda
1 Database I/O Types
2 User Data I/O
3 Recovery Data I/O
4 Other I/O
5 Summary
File Write I/O for File Types
Logical vs Physical
• Database request vs OS I/O
• Database I/O vs O/S I/O
Physical I/O always uses file system cache (no raw I/O)
Buffered vs unbuffered I/O
• Unbuffered I/O considered durable after write system call
– Recovery data with integrity
– User data with -directio
• Buffered I/O requires a file system sync for durability
– Recovery data with no integrity
– User data
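The two durability models above can be sketched in a few lines of Python. This is only an illustration of the general technique, not OpenEdge code: the function names are hypothetical, os.fsync() stands in for the sync call, and O_DSYNC is assumed to be the closest portable analogue of a write that is durable when the system call returns.

```python
import os


def write_buffered_then_sync(path: str, blocks: list[bytes]) -> None:
    """Buffered model: writes land in the file system cache and only
    become durable once the explicit sync at the end completes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for block in blocks:
            os.write(fd, block)   # returns once the data is in the F/S cache
        os.fsync(fd)              # one sync makes everything written durable
    finally:
        os.close(fd)


def write_synchronous(path: str, blocks: list[bytes]) -> None:
    """Synchronous model: every write is durable when the call returns."""
    # O_DSYNC is not defined on every platform; fall back to a sync per write.
    dsync = getattr(os, "O_DSYNC", 0)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | dsync, 0o600)
    try:
        for block in blocks:
            os.write(fd, block)
            if not dsync:
                os.fsync(fd)
    finally:
        os.close(fd)
```

Both functions leave the same bytes on disk; they differ only in when the cost of durability is paid, which is the trade-off the following slides discuss.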
OpenEdge I/O & The File System
[Diagram: the database buffer pool, BI buffers, and AI buffers live in process shared memory. All I/O to the .d, .b, and .a files goes via the file system cache in system memory, then through multi-level caches to the physical disk devices.]
OpenEdge Data I/O & The File System
[Diagram: database buffer pool (process shared memory), file system cache (system memory), multi-level caches, and disk devices holding the .d files.]
Buffered I/O to F/S cache
• F/S decides when to write to disk device
• Disk device decides when to write to physical disk
• At checkpoint, made durable via fdatasync() / FlushFileBuffers()
– Required for crash recovery and BI space reuse to work properly
Promon Checkpoints:
Flushes  Duration  Sync Time
0        0.20      0.02
4        0.20      0.04
4        0.17      0.02
2        0.22      0.03
OpenEdge -directio I/O & The File System
-directio
• Unbuffered I/O through the F/S cache
– Not raw I/O to the disk device
• Each I/O sync'd to the disk device
• Operational effects
– No need to sync at checkpoint
– Write I/O more expensive
– Additional cost to page writers
Promon Checkpoints:
Flushes  Duration  Sync Time
0        0.16      0.00
2        0.18      0.00
11       0.16      0.00
0        0.18      0.00
OpenEdge -directio I/O Performance
How could the more expensive writes of -directio improve performance?
• APWs absorb the additional cost
– Provided they do all the writing without adding OLTP contention
• Lower checkpoint costs
– Each I/O sync'd to the disk device
– No sync needed during checkpoint
– Higher throughput due to shorter pauses
• May help on an inadequate file system
• Less useful for
– Well-tuned deployments
– Properly sized systems
– When buffers flushed at checkpoint
OpenEdge Recovery I/O & The File System
Unbuffered I/O to F/S cache
• Each I/O sync'd to the disk device
• For .bi, called “reliable I/O”
BI blocks written when:
• BIW notices a full block in the output buffer
• APW writes a data block with a BI dependency
• Broker notices an aged commit (-Mf)
• User can't find an empty BI block to store update notes
• User must perform a checkpoint
[Diagram: BI buffers (process shared memory), file system cache (system memory), multi-level caches, and disk devices holding the .d and .b files.]
OpenEdge Recovery: Making it unreliable
Never in production
Specific maintenance only
-r: BI writes are buffered (un-reliable) to F/S
All change notes recorded
• Rollback will work
• Crash recovery likely to work
• Recovery from OS crash will most likely fail
• idxbuild some index, !“some; !”
*** An earlier -r session crashed, the database may be damaged. (514)
OpenEdge Recovery: Making it more unreliable
Never ever in production
Specific maintenance only
-i: no-integrity
• BI writes are buffered
• No data dependency check (!WAL)
• No F/S sync at checkpoint
• No record of purely physical notes
• Rollback might work
• OS, DB crash, abnormal termination
– Must restore from backup
** Your database cannot be repaired. You must restore a backup copy. (510)
Buffer Pool I/O
[Diagram: database buffer pool (-B, -B2) holding blocks such as 4, 160, 32, 128, 64, … and 2, 144, 192, 112, 80, …, managed with LRU and LRU2 buffer eviction policies.]
Database buffer lookup
If not found via hash table lookup
• Incur O/S read I/O – “page-in”
• But where do you read into?
• 1 buffer pool cache
• 1 buffer pool hash table
• Multiple LRU replacement chains
Buffer Pool I/O
[Diagram: the same buffer pool (-B, -B2), with each buffer on the LRU and LRU2 chains marked C or D (clean or dirty).]
Database buffer lookup
Start at LRU end of buffer replacement chain
• Look for first “non-dirty” buffer (to avoid write)
• Can’t find one after 10 tries?
– “Page-out” least recently used buffer (O/S write I/O) “LRU writes”
– May force (multiple) BI/AI writes, usually partial writes!
– “Page-in” your block to available buffer (O/S read I/O)
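The lookup and eviction steps above can be sketched as a toy model. This is an illustration under stated assumptions, not the OpenEdge implementation: a single LRU chain is modelled with an OrderedDict, the class and attribute names are hypothetical, a dict stands in for O/S I/O, and the BI/AI writes that a page-out may force are left out.

```python
from collections import OrderedDict

MAX_CLEAN_SEARCH = 10  # the "10 tries" from the slide


class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # block number -> contents (stand-in for O/S I/O)
        self.buffers = OrderedDict()  # block number -> [contents, dirty]; first item = LRU end
        self.os_reads = 0
        self.lru_writes = 0

    def _evict(self):
        # Start at the LRU end and look for the first non-dirty buffer.
        victim = None
        for tries, (blk, (_, dirty)) in enumerate(self.buffers.items()):
            if tries >= MAX_CLEAN_SEARCH:
                break
            if not dirty:
                victim = blk
                break
        if victim is None:
            # No clean buffer found: page out the least recently used one.
            victim = next(iter(self.buffers))
            self.disk[victim] = self.buffers[victim][0]
            self.lru_writes += 1
        del self.buffers[victim]

    def get(self, blk, modify=None):
        if blk in self.buffers:                  # found via hash lookup
            self.buffers.move_to_end(blk)
        else:                                    # not found: page-in
            if len(self.buffers) >= self.capacity:
                self._evict()
            self.buffers[blk] = [self.disk[blk], False]
            self.os_reads += 1
        entry = self.buffers[blk]
        if modify is not None:
            entry[0], entry[1] = modify, True    # buffer becomes dirty
        return entry[0]
```

With a pool of three buffers where only the oldest is dirty, a fourth block evicts the oldest clean buffer and costs no write; when every candidate is dirty, the request pays for an LRU write before its own read.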
Data Read I/O Tuning
Avoiding read I/O
• Large buffer pool (-B)
• Utilize alternate buffer pool (-B2)
• Improve queries; Avoid table scans; Cache data locally
• Private “read-only” buffers (-Bp), for utilities too!
Increase the pool when read I/O is unacceptable for a properly tuned application
Too many buffers may cause O/S paging
• Decrease file system cache
• Avoid non-essential activities on production server
• Consider buying more memory
[Diagram: I/O between the database buffer pool (-B & -B2 buffers) and the DB.]
Increase performance by decreasing I/O
Data I/O Performance Monitoring - Promon
Promon R&D => Performance indicators
Promon R&D => Buffer cache
• O/S reads and O/S writes
• Flushed at checkpoint
• LRU Writes
• APW enqueues*
What about buffer pool hit ratio % (BHR)?
• Too easily skewed by bad queries
• Not a fine enough metric (hits / requests)
– 270,000 database read requests / second
– Buffer hit ratio % of 98
– Still means 5,400 O/S Read I/Os per second!
– Fast F/S access still 75x slower than -B
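The arithmetic behind the 5,400 figure is simply the miss percentage applied to the request rate. A small sketch (the function name is hypothetical, and integer percentages are used to keep the arithmetic exact):

```python
def os_reads_per_second(requests_per_second: int, hit_ratio_percent: int) -> int:
    """O/S reads implied by a buffer hit ratio: every miss is a read I/O."""
    miss_percent = 100 - hit_ratio_percent
    return requests_per_second * miss_percent // 100


print(os_reads_per_second(270_000, 98))  # 5400
print(os_reads_per_second(270_000, 99))  # 2700
```

Moving from 98% to 99% looks like a one-point change in the ratio but halves the read I/O, which is why the raw O/S read count is the more useful metric.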
A low BHR indicates a poorly tuned system
A high BHR does not denote a well tuned system
Data Write I/O Tuning
Avoiding write I/O
• Large buffer pool lessens forced “page-outs”
• Improve queries in the application
• Reduce checkpoint frequency (see next section)
• Run with APWs (Have someone else do it!)
– Avoids user and server writes
– Decreases LRU writes (forced “page-outs”)
– Reduces checkpoint time
– Performs DB buffer pool I/O
– May flush AI and BI data
Increase performance by decreasing I/O
Asynchronous Page Writer Activities
[Diagram: APW activities, numbered #1 to #3, over the checkpoint queue, the LRU chains of the primary -B and alternate -B2 buffer pools (buffers marked C or D), and the APW queue of dirty buffers, with WAL ordering between BI and DB writes.]
Forced BI write only if cluster > 95% full
New adaptive mechanism for checkpoint processing
Avoids buffers flushed
10.2B FCS
Asynchronous Page Writer Performance
[Diagram: APW, checkpoint queue, LRU chains of the primary -B and alternate -B2 buffer pools, APW queue, BI, and DB.]
Promon R&D => Page Writers
• APW queue writes
• Checkpoint queue writes
• Buffers scanned
• Scan writes
Tuning
• Increase the number of APWs until 0 blocks are flushed at checkpoint
• Decrease if partial BI writes increase
• Increasing BI cluster size can avoid:
– partial BI writes
– forcing BI writes (95% full less of the time)
• Typically need more if running with Direct I/O
BI Buffer Pool
[Diagram: the BI buffer pool with -bibufs 10: a free list (Free(a) to Free(e)), a modified queue, and the current output buffer that receives new notes (actions) during forward processing; rollback processing uses the current input buffer and the backout buffers. All of them are backed by the BI file.]
BI Buffer Pool – Recording a change
[Diagram: the BI buffer pool (-bibufs 10) during forward processing, with the user writing new notes to the current output buffer and the BIW servicing the modified queue.]
Empty buffer waits
Busy buffer waits
BIB latch contention
• -bwdelay in ms (30ms)
• Nap time when nothing dirty
• Not much positive tuning effect
BI Buffer Pool – Forced Write I/O
[Diagram: each modified data block in the buffer pool carries an associated BI note dependency counter (based on fill %). Under WAL, when an APW or user writes such a data block to the DB, for example from the checkpoint queue, the BI notes it depends on must be written first, which can force a BI write.]
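The write-ahead rule behind a forced BI write can be sketched as follows. This is a simplified model with assumptions of its own: the dependency is tracked here as a note sequence number rather than the fill-based counter the slide mentions, and all class, function, and field names are hypothetical.

```python
class BiLog:
    def __init__(self):
        self.next_note = 0      # number of notes recorded so far
        self.durable_upto = 0   # notes numbered 1..durable_upto are on disk
        self.forced_writes = 0

    def add_note(self):
        """Record a change note; return the dependency for the data block."""
        self.next_note += 1
        return self.next_note

    def force(self, upto):
        """Write BI buffers out until note number `upto` is on disk."""
        if upto > self.durable_upto:
            self.durable_upto = upto
            self.forced_writes += 1   # typically a partial BI block write


def write_data_block(block, bi, disk):
    # WAL: the notes describing a change must reach disk before the data block.
    bi.force(block["bi_dependency"])
    disk[block["number"]] = block["contents"]
```

A data block whose notes are already on disk costs only its own write; one whose notes are still in the BI buffers pays for a forced BI write first.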
BI Buffer Pool – Write I/O
[Diagram: the BI buffer pool (-bibufs 10), with the broker and users writing BI buffers to the BI file.]
Is it OK to buffer modified BI blocks?
• YES
Is it OK to buffer committed BI data?
• Delayed commit (-Mf) is up to you!
Delayed commit (durability)
• Based on the -Mf value, the broker may flush BI buffers to disk for aged transaction ends
• -Mf default: 3
Increasing -Mf Pros/Cons:
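A minimal sketch of the delayed-commit idea, assuming that -Mf is a time limit in seconds and that the broker periodically checks the age of the oldest commit whose BI notes are still only in memory. The class and method names are hypothetical; the slides do not describe the broker's actual logic.

```python
class DelayedCommit:
    def __init__(self, mf_seconds=3.0):
        self.mf_seconds = mf_seconds
        self.oldest_unflushed_commit = None  # time of the oldest buffered commit
        self.flushes = 0

    def commit(self, now):
        """A transaction ends; its notes stay in the BI buffers for now."""
        if self.oldest_unflushed_commit is None:
            self.oldest_unflushed_commit = now

    def broker_tick(self, now):
        """Flush the BI buffers if a buffered commit has aged past -Mf."""
        pending = self.oldest_unflushed_commit
        if pending is not None and now - pending >= self.mf_seconds:
            self.flushes += 1
            self.oldest_unflushed_commit = None
            return True
        return False
```

In this model a larger limit batches more commits into each flush, at the price of a longer window in which a committed transaction could be lost in a crash.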
BI Buffer Pool – Change rollback
[Diagram: rollback processing reads BI blocks into the current input buffer and the backout buffers, alongside the free list, modified queue, and current output buffer used by forward processing.]
• 1 shared input buffer
• Multiple private backout buffers
BI Buffer Pool – Change rollback
Rollback processing
– Read I/O to find notes
– Write I/O when undoing
Promon:
• BI Reads
• Input buffer hits
• Output buffer hits
• Mod buffer hits
• BO buffer hits
Tuning the BI Buffer Pool
Run BIW
Promon: 5. BI Log Activity
Empty buffer waits – all buffers full
• Increase -bibufs (online)
• -aibufs >= -bibufs
• Start with -bibufs 150
Partial (forced) writes
• -Mf expired
– Increase if not risk averse
• Too many APWs
• Tune checkpoint processing
Busy buffer waits – buffer busy – OK
Log force waits/writes – 2PC commit
Monitoring BI Activity & Performance Summary
Activity
Forward activity
• Total BI writes
• Records (notes) written
• Clusters closed
Undo
• Total BI reads
• Notes read
• Input buffer hits
• Output buffer hits
• Mod buffer hits
• BO buffer hits
Performance
OK waits & writes
• Busy buffer waits
• BIW writes
Bad waits & writes
• Empty buffer waits
• Partial writes
• Forced writes (2PC)
• Flushed at checkpoint
• Checkpoint duration (wait)
Checkpoint Processing
Quiet DB
• Database changes halted
• Page writers continue
Flush bibufs
• Output, Mod buffers
• May cause 1 partial write
Scan buffer pool
• Write bufs on chkpt queue
• Dirty bufs added to chkpt queue
• “Fuzzy” checkpoint
• Hopefully flushed prior to next chkpt
Flush aibufs
• Output, Mod buffers
• May cause 1 partial write
Sync File System
• F/S Sync system call
• No more sync delay
Resume database activity
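The sequence above can be sketched as a toy model. It only mirrors the order of the steps as listed on the slide; the Database class, its fields, and the use of plain sets and lists are assumptions made for illustration, not OpenEdge internals.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    dirty: set = field(default_factory=set)              # dirty data blocks in -B
    checkpoint_queue: list = field(default_factory=list) # marked at previous checkpoint
    bi_buffers: list = field(default_factory=list)
    ai_buffers: list = field(default_factory=list)
    updates_paused: bool = False
    fs_syncs: int = 0


def checkpoint(db: Database) -> int:
    """Run one checkpoint; return the number of buffers flushed during it."""
    db.updates_paused = True                 # quiet DB: changes halted
    db.bi_buffers.clear()                    # flush bibufs (output, mod buffers)
    flushes = 0
    for block in db.checkpoint_queue:        # still dirty since the last checkpoint
        if block in db.dirty:
            db.dirty.discard(block)          # written while updates are paused
            flushes += 1
    db.checkpoint_queue = sorted(db.dirty)   # fuzzy: schedule for the next checkpoint
    db.ai_buffers.clear()                    # flush aibufs
    db.fs_syncs += 1                         # sync file system
    db.updates_paused = False                # resume database activity
    return flushes
```

If page writers clean every queued buffer before the next checkpoint, the function returns 0, which corresponds to the "0 blocks flushed at checkpoint" tuning goal.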
Promon Checkpoint Data
                  --------- Database Writes ---------
No.  Time      …  CPT Q  Scan  APW Q  Flushes  (Cont.)
27   10:23:12  …  0      384   52     0        …
26   10:22:46  …  0      381   381    3        …
25   10:22:18  …  0      380   380    2        …
24   10:21:50  …  201    158   158    0        …
APW Specific Activity…
CPT Q: # data buffers APW wrote from checkpoint queue (from prev chkpt)
Scan: # data buffers APW wrote while scanning -B
APW Q: # data buffers APW wrote from APW Q
– Dirty buffers added to APW Q from -B LRU eviction
Flushes:
• Number of database blocks written during checkpoint
– Very costly operation (db updates paused)
– Should add ai/bi flushes
• Marked from previous checkpoint
• Avoid with APWs and larger cluster sizes
Promon Checkpoint Data
                  ----- New Columns -----
No.  Time      …  Duration  Sync Time
27   10:23:12  …  0.12      0.04
26   10:22:46  …  0.11      0.03
25   10:22:18  …  0.11      0.04
24   10:21:50  …  0.13      0.04
Duration:
• Time to process checkpoint including:
– Write chkpt queue, buffer pool scan, bi/ai flush, F/S Sync
Sync Time: Amount of time in seconds it took for fdatasync() or FlushFileBuffers()
• Limit file system cache size and flush frequency
• Faster disks for data files
• Avoid with -directio (but this increases the cost of all write I/Os)
Tuning Checkpoint Processing
Physical
BI truncate
• Values in KB
• -bi (cluster size in KB)
• -biblocksize (block size in KB)
• Before-image block size set to 8 or 16 KB
• Followed by sync command
proutil <db> -C truncate bi -biblocksize 8 -bi 8192
proutil <db> -C bigrow 8 -r
Runtime
• BI bufs
• BIW
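For scripted maintenance, the two proutil command lines from this slide can be assembled programmatically. The sketch below only builds and prints the argument lists and does not run anything; the helper names, the parameter names, and the database name "mydb" are hypothetical, and the values are the ones shown on the slide.

```python
import shlex


def bi_truncate_command(db, bi_block_kb=8, bi_cluster_kb=8192):
    """Argument list for truncating the BI file with new block and cluster sizes."""
    return ["proutil", db, "-C", "truncate", "bi",
            "-biblocksize", str(bi_block_kb), "-bi", str(bi_cluster_kb)]


def bi_grow_command(db, amount=8):
    """Argument list for pre-growing the BI file after the truncate."""
    return ["proutil", db, "-C", "bigrow", str(amount), "-r"]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    print(shlex.join(bi_truncate_command("mydb")))
    print(shlex.join(bi_grow_command("mydb")))
```

Keeping the commands as lists makes it straightforward to hand them to subprocess.run() later, once they have been checked against the documentation for the installed release.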
Summary: Recovery Subsystem
AI/BI buffers
• No LRU replacement mechanism
• Database changes recorded in order
• Forward processing causes BI write I/O
• Rollback may cause read I/O
– Backout Buffers (BOB) help rollback contention
Checkpoints
• Buffers flushed during checkpoint
Page writers
• BIW/AIW processing
• APW processing
Database Extend
Maintenance cost
Performance
Concurrency
Frequency
Database extend
• Storage area locked - no other extends
• Writes performed 16K at a time
• Extend by 64 blocks or cluster size
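To get a feel for the cost of one extend, the figures above can be combined into a quick calculation. The 8 KB database block size is an assumption chosen for the example, as is the simplification that the whole extension is written in 16K writes; the function name is hypothetical.

```python
def extend_cost(block_size_kb, blocks=64, write_kb=16):
    """Return (KB added, number of write calls) for one database extend."""
    total_kb = block_size_kb * blocks
    writes = -(-total_kb // write_kb)   # ceiling division
    return total_kb, writes


print(extend_cost(8))   # (512, 32)
print(extend_cost(4))   # (256, 16)
```

With 8 KB blocks, each extend adds 512 KB through 32 writes while the storage area is locked against other extends.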
Recovery extend (AI/BI)
• Acquire space from F/S
• Unbuffered write
• BI grow after truncate
Performance improvements
• F/S interaction for extent create in 11.3
• BI extend, format & grow in 11.3
Monitoring I/O With Promon R&D
2. Activity Displays ...
   1. Summary
   3. Buffer Cache
   4. Page Writers
   5. BI Log / 6. AI Log
   8. I/O Operations by Type
   9. I/O Operations by File
Database Accesses vs File I/O
• Database writes
• O/S Writes
3. Other Displays…
   1. Performance Indicators
   2. I/O Operations by Process
   4. Checkpoints
   5. I/O Operations by User by Table
   6. I/O Operations by User by Index
Summary
I/O Types
• Always uses file system cache (no raw I/O)
• Buffered vs unbuffered I/O
• User data files: .d’s and recovery files: .ai, .bi, .tl
Data and recovery I/O
• Checkpoint process
• Page writers (APW, BIW, AIW)
Performance
• Monitor via promon, VSTs and OS tools
• Tuning tips
Questions?
October 6–9, 2013 • Boston #PRGS13
www.progress.com/exchange-pug