© 2013 SAP AG or an SAP affiliate company. All rights reserved. 2
Shadow paging
Shadow paging is a copy-on-write technique that avoids in-place updates.
When a page is modified, a shadow page is allocated.
Because an old and a new version of a page may exist on disk at the same time, a mapping from logical page number to physical location is necessary (the converter).
Why shadow paging?
It provides atomicity and durability at page level (two of the ACID properties that make DB transactions reliable).
The transition from one valid state to the next is done by savepointing.
Converter (1)
The converter maps logical page numbers to physical page numbers (volume index + offset).
It is implemented as a tree:
– Inner nodes just point to their children
– Leaf nodes contain mapping information
Restart page points to converter root
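The converter lookup described above can be illustrated with a minimal sketch. It assumes a two-level tree with a fixed fanout; the names `Converter`, `Inner`, `Leaf` and `FANOUT` are hypothetical and not taken from the slides, and the real converter layout is not specified here.

```python
FANOUT = 128  # hypothetical number of entries per node


class Leaf:
    """Leaf node: holds the actual mapping information."""

    def __init__(self):
        # slot -> (volume_index, block_no), or None if unmapped
        self.slots = [None] * FANOUT


class Inner:
    """Inner node: only points to its children."""

    def __init__(self):
        self.children = [None] * FANOUT


class Converter:
    """Maps logical page numbers to physical locations via a small tree."""

    def __init__(self):
        self.root = Inner()  # the restart page would point to this root

    def set(self, logical_pno, physical):
        child, slot = divmod(logical_pno, FANOUT)
        if self.root.children[child] is None:
            self.root.children[child] = Leaf()
        self.root.children[child].slots[slot] = physical

    def lookup(self, logical_pno):
        child, slot = divmod(logical_pno, FANOUT)
        leaf = self.root.children[child]
        return None if leaf is None else leaf.slots[slot]


conv = Converter()
conv.set(4711, (1, 82))  # logical page 4711 -> volume 1, block 82
conv.set(4712, (1, 9))
print(conv.lookup(4711))
```

With this fanout the sketch covers logical page numbers below 128 * 128; a real tree would grow additional levels as needed.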
Converter (2)
[Figure: data volumes containing the anchor page, the restart page, converter index pages, converter leaf pages and data pages]
Shadow paging (1) Initial State
Converter page (Version 1):
Log PNo   Volume Index + Block No
4711      1 / 82
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712. Data volume: A(1), R(1), C(1), 4711(1), 4712(1). Redo log and undo log.]
Shadow paging (2) Modify Data Page
updatePage 4711
- Update page content
- Clear mapping in converter page
- Write redo + undo log entries
Converter page (Version 1):
Log PNo   Volume Index + Block No
4711      -
4712      1 / 9
[Figure: timeline. Resource container: R, C and 4711 marked as modified (*); A and 4712 unmarked. Data volume unchanged: A(1), R(1), C(1), 4711(1), 4712(1). Redo log and undo log with log entries.]
Shadow paging (3a) Savepoint: Flush modified Data Pages
updatePage 4711, then savepoint version 2
- Assign new physical page number for modified data pages
- Flush modified data pages
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R and C still marked as modified (*); 4711 no longer marked. Data volume: A(1), R(1), C(1), 4711(1), 4711(2), 4712(1). Redo log and undo log with log entries.]
Shadow paging (3b) Savepoint: Flush modified Converter Pages
updatePage 4711, then savepoint version 2
- Flush modified converter pages
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: only R still marked as modified (*). Data volume: A(1), R(1), C(1), C(2), 4711(1), 4711(2), 4712(1). Redo log and undo log with log entries.]
Shadow paging (3c) Savepoint: Write Restart Page
updatePage 4711, then savepoint version 2
- Write restart page with information about converter root and current log position
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712, none marked as modified. Data volume: A(1), R(1), R(2), C(1), C(2), 4711(1), 4711(2), 4712(1). Redo log and undo log with log entries.]
Shadow paging (3d) Savepoint: Write Anchor Page
updatePage 4711, then savepoint version 2
- Update anchor page and write to disk atomically
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712. Data volume: A(2), R(1), R(2), C(1), C(2), 4711(1), 4711(2), 4712(1). Redo log and undo log with log entries.]
Shadow paging (3e) Savepoint: Free Shadow Pages
updatePage 4711, then savepoint version 2
- Free pages from last savepoint cycle
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712. Data volume: A(2), R(2), C(2), 4711(2), 4712(1). Redo log and undo log with log entries.]
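The savepoint steps (2) to (3e) can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the class `ShadowStore` and all of its members are hypothetical, the converter is a flat dictionary rather than a tree, the flush of converter pages, restart page and anchor page is collapsed into one assignment, and redo/undo logging is left out entirely.

```python
class ShadowStore:
    """Toy copy-on-write page store with savepoints (no logging)."""

    def __init__(self):
        self.disk = {}        # physical block no -> page content
        self.anchor = {}      # converter as of the last savepoint
        self.converter = {}   # current in-memory converter
        self.dirty = {}       # modified pages held in memory
        self.next_block = 0
        self.version = 1

    def update_page(self, pno, content):
        # (2) update the page in memory and clear its converter mapping
        self.dirty[pno] = content
        self.converter[pno] = None

    def read(self, pno):
        if pno in self.dirty:
            return self.dirty[pno]
        return self.disk[self.converter[pno]]

    def savepoint(self):
        shadows = []
        for pno, content in self.dirty.items():
            old_block = self.anchor.get(pno)
            if old_block is not None:
                shadows.append(old_block)
            # (3a) assign a new physical block and flush the page there
            block = self.next_block
            self.next_block += 1
            self.disk[block] = content
            self.converter[pno] = block
        self.dirty.clear()
        # (3b)-(3d) converter, restart page and anchor page switch,
        # modelled here as a single atomic assignment
        self.anchor = dict(self.converter)
        self.version += 1
        # (3e) free the shadow pages of the previous savepoint cycle
        for block in shadows:
            del self.disk[block]

    def crash_and_restart(self):
        # Everything not covered by the last savepoint is lost.
        self.dirty.clear()
        self.converter = dict(self.anchor)
```

Because the old block stays untouched until the anchor has switched, a crash at any earlier point leaves the previous savepoint state fully readable.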
Shadow paging (4) Commit/Rollback
updatePage 4711, then savepoint version 2
- Commit: Delete undo log entry
- Rollback: Apply undo log entry to restore previous version and delete it afterwards
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712. Data volume: A(2), R(2), C(2), 4711(2), 4712(1). Redo log and undo log with log entries.]
Shadow paging (5a) Restart after emergency shutdown
updatePage 4711, then savepoint version 2
Crash before commit/rollback
- Restart with latest converter version
- Apply undo log (written before savepoint)
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712. Data volume: A(2), R(2), C(2), 4711(2), 4712(1). Redo log and undo log with log entries.]
Shadow paging (5b) Restart after emergency shutdown
updatePage 4711, then savepoint version 2
Crash after commit/rollback
- Restart with latest converter version
- Apply redo log (written after savepoint) for committed transactions
- Apply undo log (written before savepoint) for rolled-back transactions
Converter page (Version 2):
Log PNo   Volume Index + Block No
4711      1 / 67
4712      1 / 9
[Figure: timeline. Resource container: R, C, A, 4711, 4712. Data volume: A(2), R(2), C(2), 4711(2), 4712(1). Redo log with a log entry and a commit/rollback entry; undo log with a log entry.]
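The two restart cases can be illustrated with a short sketch. It assumes page-granular log records that carry whole page values and transactions that never touch the same page concurrently; the function `restart` and its record layout are hypothetical and only serve as an illustration.

```python
def restart(savepoint_pages, redo_after_savepoint, undo_log, committed):
    """Rebuild page contents after a crash.

    savepoint_pages:      logical pno -> content as of the last savepoint
    redo_after_savepoint: list of (txn, pno, new_content), in log order
    undo_log:             list of (txn, pno, old_content), in log order
    committed:            set of transaction ids known to be committed
    """
    # Restart with the latest converter version, i.e. the savepoint image.
    pages = dict(savepoint_pages)

    # Apply redo records written after the savepoint for committed work.
    for txn, pno, new_content in redo_after_savepoint:
        if txn in committed:
            pages[pno] = new_content

    # Apply undo records, newest first, for everything not committed.
    for txn, pno, old_content in reversed(undo_log):
        if txn not in committed:
            pages[pno] = old_content

    return pages


# Case 5a: T1 changed page 4711 before the savepoint and never committed.
image = {4711: "new", 4712: "b"}
undo = [("T1", 4711, "old")]
print(restart(image, [], undo, committed=set()))
```

In case 5a the savepoint image already contains the uncommitted change, which is why the undo entry written before the savepoint is needed to remove it.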
Savepoint Phases
Write changed pages in parallel (up to 3 times)
Acquire lock to prevent modification of pages
Determine log position
Remember open transactions
Copy modified pages and trigger write
Increase savepoint version
Release lock
Wait for I/O requests to finish
Write anchor page
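The ordering of the savepoint phases can be sketched as follows. This is a hedged, synchronous illustration: `SavepointManager` and its fields are hypothetical, writes are plain dictionary updates instead of parallel I/O, and the pre-write rounds take the lock briefly only to snapshot the dirty set, which is a simplification of this sketch rather than something the slide states.

```python
import threading


class SavepointManager:
    def __init__(self):
        self.lock = threading.Lock()   # blocks page modification
        self.pages = {}                # logical pno -> in-memory content
        self.dirty = set()
        self.version = 1
        self.log_position = 0
        self.open_transactions = set()
        self.written = {}              # stand-in for the data volume
        self.anchor = None

    def modify(self, pno, content):
        with self.lock:
            self.pages[pno] = content
            self.dirty.add(pno)
            self.log_position += 1

    def _snapshot_dirty(self):
        snapshot = {pno: self.pages[pno] for pno in self.dirty}
        self.dirty.clear()
        return snapshot

    def savepoint(self):
        # Pre-write changed pages (up to 3 rounds) while writers continue.
        for _ in range(3):
            with self.lock:
                snapshot = self._snapshot_dirty()
            if not snapshot:
                break
            self.written.update(snapshot)

        # Critical phase: kept short, page modification is blocked.
        with self.lock:
            log_position = self.log_position           # determine log position
            open_txns = set(self.open_transactions)    # remember open transactions
            snapshot = self._snapshot_dirty()          # copy modified pages
            self.version += 1                          # increase savepoint version
            version = self.version

        # After releasing the lock: finish the writes, then write the anchor.
        self.written.update(snapshot)
        self.anchor = (version, log_position, open_txns)
```

The point of the pre-write rounds is that most dirty pages are already on disk before the lock is taken, so the locked phase only has to deal with the remainder.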
L2 Delta Persistency Data Structures Overview
[Figure: table container with main fragment and delta fragment; delta column fragments Δ Col Frag C1, C2 and C3, each with DATA and DICT; persistent column descriptors PersCol Desc C1, C2 and C3; persistent descriptors (Pers Desc); data page chain, dictionary page chain and MVCC page chain; MVCC object]
Container Based Persistency
L2 Delta Persistency PAX Format Data Pages
Page layout:
• Generic LP Header and DataPage Fixed Hdr: # of columns, # of rows, first row position, etc.
• (n) Column Info Blocks: bit-size encoding, data type, offset within the page. One block per column.
• Row IDs: materialized RowID (can be optimized when all RowIDs are contiguous)
• Value ID blocks (col 0, col 1, col 2, col 3, columns 4…n-1, col (n)): blocks of n-bit packed value IDs per column. Each block contains the same number of rows, but the blocks differ in size due to differences in encoding.
L2 Delta Persistency Column Data Array
Column data array delta persistency implemented using PAX pages
• Data page uses PAX (Partition Attributes Across) format
• Keeps complete rows (i.e., same number of values for each column) on a given page
• Physical data placement grouped by columns
• PAX format chosen to optimize storage for large number of small ERP tables
• Single page per table (for small number of rows)
In-memory contiguous column data array (aka index vector) survives
• Key decision to preserve OLAP performance
• Keeps AttributeEngine access methods simple: no need for access methods over PAX pages
L2 Delta Persistency Column Data Array
Data Page Population
• Asynchronous population of data pages from in-memory data vector
• Flushing of data page not tied to transaction commit
• Flushing synchronized with database savepoint
• Data pages evicted as soon as they are full (low memory utilization)
N-Bit Encoding Rollover
• Affected in-memory column data array is re-encoded
• Re-encoding of affected and subsequent data pages
• Lazy copy to data page: nothing to re-encode on the page if data not copied yet
Delta Store Loading
• In-memory column data arrays populated from data pages
• Data pages evicted (except for the last page) afterwards
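N-bit packing of value IDs and the rollover to a wider encoding can be illustrated with a short sketch. The helpers `bits_needed`, `pack`, `unpack` and the class `PackedColumn` are hypothetical; the sketch re-packs the whole array on every append for brevity, which a real column data array would avoid.

```python
def bits_needed(value_id):
    """Smallest bit width that can represent value_id (at least 1)."""
    return max(1, value_id.bit_length())


def pack(value_ids, nbits):
    """Pack value IDs back to back, nbits each, into little-endian bytes."""
    word = 0
    for i, value_id in enumerate(value_ids):
        if value_id >> nbits:
            raise ValueError("value ID does not fit into nbits")
        word |= value_id << (i * nbits)
    nbytes = (len(value_ids) * nbits + 7) // 8
    return word.to_bytes(nbytes, "little")


def unpack(data, nbits, count):
    """Inverse of pack: extract count value IDs of nbits each."""
    word = int.from_bytes(data, "little")
    mask = (1 << nbits) - 1
    return [(word >> (i * nbits)) & mask for i in range(count)]


class PackedColumn:
    """Column data array that widens its encoding when needed."""

    def __init__(self):
        self.nbits = 1
        self.count = 0
        self.data = b""

    def append(self, value_id):
        ids = unpack(self.data, self.nbits, self.count)
        need = bits_needed(value_id)
        if need > self.nbits:
            # Rollover: the new value ID needs more bits, so every
            # existing entry is re-encoded with the wider width.
            self.nbits = need
        ids.append(value_id)
        self.data = pack(ids, self.nbits)
        self.count += 1
```

A rollover touches every previously encoded value, which is why the slide lists re-encoding of the affected and subsequent data pages as part of the cost.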
L2 Delta Persistency Data Page Directory and Page Chain
Compact page directory (array) is always in memory
Fully populated pages are flushed to disk and not resident (most of the chain)
Pages are resident until all rows they represent have been inserted (usually just the tail of the chain)
L2 Delta Persistency Dictionary
Two Types of Dictionaries
• Value in Array (ViA)
  • Small fixed-size types (value length <= 16 bytes)
  • Value stored directly in the dictionary’s value array
  • Value array (transient) is vector<T>
• Pointer in Array (PiA)
  • Strings (both VARCHAR and CHAR; fixed size not leveraged)
  • Value stored in physical blocks; not compressed
  • Pointer to the value stored in the dictionary’s value array
  • Pointer points to a string block:
    • 1st 1, 2 or 4 bytes of the value are the length
    • 1st 2 bits in length indicate 1, 2 or 4 bytes
  • Value array (transient) is vector<char *>
[Figure: a ViA value array indexed by ValueID that holds values (v) directly, and a PiA value array indexed by ValueID that holds pointers (p) to values (v) stored in a block]
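The variable-width length prefix of a string block can be illustrated with a small sketch. The exact bit layout is not given on the slide, so this assumes that the top two bits of the first byte act as a tag (0, 1, 2 for a 1-, 2- or 4-byte length field) and that the length field is big-endian; `encode` and `decode` are hypothetical names.

```python
def encode(value: bytes) -> bytes:
    """Prefix value with a 1-, 2- or 4-byte length field."""
    n = len(value)
    if n < 1 << 6:
        header = n.to_bytes(1, "big")                # tag 00, 6-bit length
    elif n < 1 << 14:
        header = ((1 << 14) | n).to_bytes(2, "big")  # tag 01, 14-bit length
    elif n < 1 << 30:
        header = ((2 << 30) | n).to_bytes(4, "big")  # tag 10, 30-bit length
    else:
        raise ValueError("value too long for this sketch")
    return header + value


def decode(block: bytes, offset: int = 0):
    """Return (value, offset of the next entry) for the entry at offset."""
    tag = block[offset] >> 6
    width = {0: 1, 1: 2, 2: 4}[tag]
    raw = int.from_bytes(block[offset:offset + width], "big")
    length = raw & ((1 << (8 * width - 2)) - 1)      # strip the 2 tag bits
    start = offset + width
    return block[start:start + length], start + length


# Two values stored back to back in one block.
block = encode(b"ab") + encode(b"cde")
first, next_offset = decode(block)
second, _ = decode(block, next_offset)
print(first, second)
```

Short strings pay only one byte of overhead in this layout.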
L2 Delta Persistency Dictionary
Dictionary persistency implemented using dictionary pages
• Pages subdivided into blocks
• A block stores a chunk of dictionary values for a single column
• Value ordering
  • Implicit – for ViA dictionary
  • Explicit via logical value pointer – for PiA dictionary
L2 Delta Persistency ViA Dictionary
ViA Page
• No or little fragmentation
• ViA pages evicted when full
• Value ID based on implicit ordering
• Value copy maintained in the value vector
[Figure: a ViA page starting with PGH, followed by blocks BH C1, BH C5, BH C1, BH C2, BH C4, …, with an area labelled Fragmentation; transient dictionaries for C1, C2 and C5]
L2 Delta Persistency PiA Dictionary
PiA dictionary values are stored on PiA pages.
• Values of different columns are interleaved in a single block per page
• Value placement in PiA pages is unordered
Explicit value ordering is achieved using logical pointers.
• Logical pointers are implicitly ordered in a block
• Blocks containing logical pointers use ViA pages (which are evicted as they become full)
L2 Delta Persistency PiA Dictionary
• ERP Dictionary Data Distribution
Data Type    # Occurrences   % of All Occurrences
NVARCHAR     741779          89 %
DECIMAL      67918           8 %
INTEGERS     11443           1.3 %
VARBINARY    7751            0.92 %
SMALLINT     4250            0.5 %
DOUBLE       2275            0.27 %
CLOB         1384            0.16 %
BLOB         742             0.08 %
Varchar Size         % of All Occurrences
NVARCHAR(1-4)        52 %
NVARCHAR(5-10)       23 %
NVARCHAR(11-20)      10 %
NVARCHAR(21-30)      6 %
NVARCHAR(31-40)      5 %
NVARCHAR(41-50)      1 %
NVARCHAR(51-100)     2 %
NVARCHAR(100-200)    0.5 %
PiA dictionary storage design considers ERP dictionary data distribution
• Optimizations for small strings (<=7 bytes)
L2 Delta Persistency DML Runtime
[Figure: write operations (INSERT/UPDATE/DELETE) on the delta fragment; columns C1, C2 to Cn with data array, dictionary store, dict index and inverted index; data log record; redo logs with log entries; undo entries; data volume]
L2 Delta Persistency Recovery
• Delta Store recovery happens as part of the DB Recovery
• At restart, delta fragment’s persistent state is reverted to what it was at the last savepoint table (via converter table switch)
• Any pages (data, dictionary, and MVCC) written after the savepoint are lost
• Replay of redo log records recovers the DB state (and table’s delta store as well) to last committed transaction
• Still-open transactions after recovery are closed and their UNDO executed
• AE is not fully available during recovery
• Recovery must be self contained in the UT layer
• Replay of redo log needs fully instantiated dictionary
• Column data array needs to be instantiated as well
• This is why backing array for dictionary value vector and column data array are moved to UT layer
[Figure: data volume with MVCC, delta fragment and dict data; redo logs with log entries; undo entries]
L2 Delta Persistency Recovery
• When the first redo log record for a table is hit, the delta fragment is instantiated from the on-disk image of the delta store
[Figure: DB recovery; data volume with MVCC, delta fragment and dict data; redo logs with log entries; undo entries; in-memory data object with a delta fragment holding Δ Col Frag C1, C2 to Cn (each with IV and DICT) and in-memory MVCC info]
L2 Delta Persistency Recovery
• After Delta Fragment is instantiated, log records for the table can be replayed over UT
• Incomplete transactions rolled back using undo files
[Figure: DB recovery replays redo log records into the in-memory data object (delta fragment with Δ Col Frag C1, C2 to Cn, each with IV and DICT, and in-memory MVCC info); dirty pages written out when full; data volume with MVCC, delta fragment and dict data; redo logs with log entries; undo entries]
L2 Delta Lock-less Structures
Legacy Implementation
• DMLs execute concurrently, but are blocked at the data structure access level
• Data structure level locking hurts OLTP performance
• A write to an index vector locks the entire vector (unlike a classical page-based scheme, which requires only the affected page to be locked)
L2 Delta uses lock-less structures
• Lock-less versioned vectors
  • Column data array, dictionary value vector, etc.
• Lock-less B-tree and/or lock-less hash map for dictionary index
• No locking of structures even for write
• Most coarse locks from AE already removed
• AE has several other locks
  • Most of them will be removed, but some may survive
  • Existing code has silent assumptions
[Figure: attribute C1 with index vector, dict value vector, dict index and inverted index]
Unified Table Container Control Structure Versioning Example (Simplified)
Operations:
• Add Page 3
• Start reader
• Add Page 4
• Add Page 5
  • Clone vector
  • Clone table
  • Set vector
  • Set anchor
  • Link page
• Read data
• Add Page 6
• End reader
Old version dropped
[Figure: anchor, Table with Δ Page Vector, cloned Table’ with Δ Page Vector’, pages 1 to 6, a reader whose reference count goes from Ref# 1 to Ref# 0; areas labelled Transient, Persistent and Meta]
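The sequence of operations in the versioning example can be sketched as follows. This only illustrates the clone-and-switch order and the reader reference count; it is not a lock-free implementation, since a real one would need atomic operations for the anchor switch and the counters. `PageVector`, `Table` and `Container` are hypothetical names.

```python
class PageVector:
    """Fixed-capacity vector of page references."""

    def __init__(self, capacity, pages=()):
        pages = list(pages)
        self.slots = pages + [None] * (capacity - len(pages))
        self.size = len(pages)


class Table:
    """Control structure version; readers pin it via refs."""

    def __init__(self, vector):
        self.vector = vector
        self.refs = 0


class Container:
    def __init__(self, capacity=4):
        self.anchor = Table(PageVector(capacity))
        self.dropped = []  # old versions that are no longer referenced

    def add_page(self, page):
        table = self.anchor
        vector = table.vector
        if vector.size == len(vector.slots):
            # Vector is full: clone vector, clone table, set vector,
            # then set the anchor with a single reference assignment.
            new_vector = PageVector(2 * len(vector.slots), vector.slots)
            new_table = Table(new_vector)
            self.anchor = new_table
            if table.refs == 0:
                self.dropped.append(table)
            vector = new_vector
        vector.slots[vector.size] = page  # link page
        vector.size += 1

    def start_reader(self):
        table = self.anchor
        table.refs += 1
        return table

    def end_reader(self, table):
        table.refs -= 1
        if table.refs == 0 and table is not self.anchor:
            self.dropped.append(table)  # old version dropped


c = Container(capacity=4)
for page in (1, 2, 3):
    c.add_page(page)
reader = c.start_reader()
c.add_page(4)
c.add_page(5)          # triggers the clone; reader keeps the old version
c.add_page(6)
c.end_reader(reader)   # reference count reaches 0, old version is dropped
```

The reader never blocks the writer: it keeps working on the version it pinned, and that version is reclaimed only once the last reader has left.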