HANA Persistence Shadow Pages


© 2013 SAP AG or an SAP affiliate company. All rights reserved.

Shadow paging

Shadow paging is a copy-on-write technique to avoid in-place updates

When a page is modified, a shadow page is allocated

Because an old and a new version of a page may exist on disk at the same time, a mapping from logical page number to physical location is necessary (the converter)

Why shadow paging?

Provides atomicity and durability at page level (two of the ACID properties that make DB transactions reliable)

The transition from one valid state to the next is done by savepointing


Converter (1)

Converter

Maps logical page numbers to physical page numbers (volume index + offset)

Is implemented as a tree

– Inner nodes just point to their children

– Leaf nodes contain mapping information

Restart page points to converter root
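The converter lookup can be illustrated with a minimal sketch. It assumes a two-level tree (one inner node, many leaves) and a made-up leaf fan-out; the class names, `FANOUT`, and the method names are hypothetical and not taken from HANA. The sample mappings are the ones used in the walkthrough slides.

```python
from typing import Dict, Optional, Tuple

PhysicalAddress = Tuple[int, int]  # (volume index, block number)
FANOUT = 1024  # hypothetical number of mappings per leaf page


class ConverterLeaf:
    """Leaf node: contains the actual mapping information."""

    def __init__(self) -> None:
        self.entries: Dict[int, PhysicalAddress] = {}


class ConverterRoot:
    """Inner node: only points to its children."""

    def __init__(self) -> None:
        self.children: Dict[int, ConverterLeaf] = {}

    def set(self, logical: int, addr: PhysicalAddress) -> None:
        leaf = self.children.setdefault(logical // FANOUT, ConverterLeaf())
        leaf.entries[logical % FANOUT] = addr

    def lookup(self, logical: int) -> Optional[PhysicalAddress]:
        leaf = self.children.get(logical // FANOUT)
        if leaf is None:
            return None
        return leaf.entries.get(logical % FANOUT)


converter = ConverterRoot()
converter.set(4711, (1, 82))
converter.set(4712, (1, 9))
print(converter.lookup(4711))
```

A real converter would have more levels and would itself live on pages that are shadowed at each savepoint; the sketch only shows the logical-to-physical lookup.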


Converter (2)

Data Volumes

[Figure: data volumes. Legend: Anchor Page, Restart Page, Converter Index Page, Converter Leaf Page, Data Page.]

Shadow paging (1): Initial State

Converter page (Version 1)

Log PNo    Volume Index + Block No
4711       1 / 82
4712       1 / 9

[Figure: timeline at the initial state. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(1), 4711(1), 4712(1), C(1), R(1). Redo Log and Undo Log: no entries shown.]

Shadow paging (2): Modify Data Page

updatePage 4711
- Update page content
- Clear mapping in converter page
- Write redo + undo log entries

Converter page (Version 1)

Log PNo    Volume Index + Block No
4711       -
4712       1 / 9

[Figure: timeline. Resource Container: *R, *C, A, …, *4711, …, 4712. Data Volume: A(1), 4711(1), 4712(1), C(1), R(1). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (3a): Savepoint: Flush modified Data Pages

Savepoint version 2
- Assign new physical page number for modified data pages
- Flush modified data pages

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711. Resource Container: *R, *C, A, …, 4711, …, 4712. Data Volume: A(1), 4711(1), 4711(2), 4712(1), C(1), R(1). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (3b): Savepoint: Flush modified Converter Pages

Savepoint version 2
- Flush modified converter pages

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711. Resource Container: *R, C, A, …, 4711, …, 4712. Data Volume: A(1), 4711(1), 4711(2), 4712(1), C(2), C(1), R(1). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (3c): Savepoint: Write Restart Page

Savepoint version 2
- Write restart page with information about converter root and current log position

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(1), R(2), 4711(1), 4711(2), 4712(1), C(2), C(1), R(1). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (3d): Savepoint: Write Anchor Page

Savepoint version 2
- Update anchor page and write to disk atomically

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(2), R(2), 4711(1), 4711(2), 4712(1), C(2), C(1), R(1). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (3e): Savepoint: Free Shadow Pages

Savepoint version 2
- Free pages from last savepoint cycle

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(2), R(2), 4711(2), 4712(1), C(2). Redo Log: one log entry. Undo Log: one log entry.]
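The savepoint walkthrough (modify, flush to a new block, switch the mapping, free the shadowed block) can be sketched as a toy model. It is a simplification under stated assumptions: the converter is a flat dictionary, the "disk" is an in-memory dictionary, and the restart and anchor pages are collapsed into a single reference switch. `ShadowStore` and all of its members are hypothetical names.

```python
class ShadowStore:
    """Toy copy-on-write page store; not HANA's implementation."""

    def __init__(self):
        self.disk = {}              # block number -> page content
        self.converter = {}         # logical page -> block (last savepoint)
        self.dirty = {}             # logical page -> modified in-memory content
        self.savepoint_version = 1
        self._next_block = 0

    def _allocate(self):
        block = self._next_block
        self._next_block += 1
        return block

    def update_page(self, logical, content):
        # The modification stays in memory; the on-disk version is untouched.
        self.dirty[logical] = content

    def read_persistent(self, logical):
        return self.disk[self.converter[logical]]

    def savepoint(self):
        new_converter = dict(self.converter)
        shadowed = []
        for logical, content in self.dirty.items():
            block = self._allocate()       # assign a new physical page
            self.disk[block] = content     # flush the modified data page
            if logical in new_converter:
                shadowed.append(new_converter[logical])
            new_converter[logical] = block
        # Stand-in for writing converter, restart and anchor pages:
        # one switch makes the new state the valid one.
        self.converter = new_converter
        self.savepoint_version += 1
        for block in shadowed:             # free pages of the previous cycle
            del self.disk[block]
        self.dirty.clear()


store = ShadowStore()
store.update_page(4711, "version 1")
store.savepoint()
old_block = store.converter[4711]
store.update_page(4711, "version 2")
before = store.read_persistent(4711)   # still the last savepoint's content
store.savepoint()
after = store.read_persistent(4711)
```

Until the switch happens, the previous mapping and its blocks remain intact, which is what lets a restart fall back to the last complete savepoint.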

Shadow paging (4): Commit/Rollback

- Commit: Delete undo log entry
- Rollback: Apply undo log entry to restore previous version and delete it afterwards

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711 and savepoint version 2. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(2), R(2), 4711(2), 4712(1), C(2). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (5a): Restart after emergency shutdown

Crash before commit/rollback
- Restart with latest converter version
- Apply undo log (written before savepoint)

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711 and savepoint version 2. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(2), R(2), 4711(2), 4712(1), C(2). Redo Log: one log entry. Undo Log: one log entry.]

Shadow paging (5b): Restart after emergency shutdown

Crash after commit/rollback
- Restart with latest converter version
- Apply redo log (written after savepoint) for committed transactions
- Apply undo log (written before savepoint) for rolled-back transactions

Converter page (Version 2)

Log PNo    Volume Index + Block No
4711       1 / 67
4712       1 / 9

[Figure: timeline after updatePage 4711, savepoint version 2, and commit/rollback. Resource Container: R, C, A, …, 4711, …, 4712. Data Volume: A(2), R(2), 4711(2), 4712(1), C(2). Redo Log and Undo Log: log entries.]
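The two restart cases can be sketched as one small function. This is a simplified model, not the actual recovery code: log records are assumed to be `(transaction, page, value)` tuples, the set of committed transactions is assumed to be known from the log, and `recover` is a hypothetical name.

```python
def recover(savepoint_pages, undo_log, redo_log, committed):
    """Rebuild page contents after a crash (toy model).

    savepoint_pages: page -> content as of the last savepoint
    undo_log: (txn, page, before_image) records written before the savepoint
    redo_log: (txn, page, after_image) records written after the savepoint
    committed: set of transactions known to have committed
    """
    pages = dict(savepoint_pages)      # restart with latest converter version
    for txn, page, after_image in redo_log:
        if txn in committed:           # redo only committed work
            pages[page] = after_image
    for txn, page, before_image in reversed(undo_log):
        if txn not in committed:       # undo open or rolled-back work
            pages[page] = before_image
    return pages


savepoint = {4711: "new", 4712: "x"}   # T1's change is in the savepoint
undo = [("T1", 4711, "old")]

# (5a) crash before commit/rollback: T1's change is undone
crash_before = recover(savepoint, undo, [], committed=set())

# (5b) crash after T1 committed; T2 changed 4712 after the savepoint
crash_after = recover(savepoint, undo, [("T2", 4712, "y")],
                      committed={"T1", "T2"})
```

Because the savepoint may contain changes of transactions that were still open, the undo entries written before the savepoint are needed even though the data volume itself is consistent.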

Savepoint Phases

Write changed pages in parallel (up to 3 times)

Acquire lock to prevent modification of pages

Determine log position

Remember open transactions

Copy modified pages and trigger write

Increase savepoint version

Release lock

Wait for IO-requests to finish

Write anchor page
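The phase order above can be sketched in a few lines. This is a single-threaded illustration under assumptions: `SavepointSketch`, its fields, and the dictionary standing in for the data volume are hypothetical, and a real engine would need finer-grained synchronisation than one lock.

```python
import threading


class SavepointSketch:
    MAX_PREFLUSH_ROUNDS = 3  # "up to 3 times" on the slide

    def __init__(self):
        self.lock = threading.Lock()   # held to prevent page modification
        self.dirty = {}                # page -> modified content
        self.volume = {}               # stand-in for the data volume
        self.version = 1
        self.log_position = 0
        self.open_transactions = set()
        self.anchor = None

    def modify(self, page, content):
        with self.lock:
            self.dirty[page] = content
            self.log_position += 1

    def savepoint(self):
        # Write changed pages while modifications are still allowed.
        for _ in range(self.MAX_PREFLUSH_ROUNDS):
            if not self.dirty:
                break
            batch, self.dirty = self.dirty, {}
            self.volume.update(batch)
        # Critical phase: no page may be modified.
        with self.lock:
            log_position = self.log_position            # determine log position
            open_txns = set(self.open_transactions)     # remember open transactions
            remaining = dict(self.dirty)                # copy modified pages
            self.dirty.clear()
            self.version += 1                           # increase savepoint version
        # Lock released: finish I/O, then write the anchor page.
        self.volume.update(remaining)
        self.anchor = {"version": self.version,
                       "log_position": log_position,
                       "open_transactions": open_txns}


sp = SavepointSketch()
sp.modify(4711, "a")
sp.modify(4712, "b")
sp.open_transactions.add("T1")
sp.savepoint()
```

Flushing most pages before taking the lock keeps the critical phase short, since only pages changed during the earlier rounds remain to be copied.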

Delta Persistency

L2 Delta Persistency: Data Structures Overview

[Figure: data structures overview. Labels: Table Container; Main Fragment; Delta Fragment; Δ Col Frag C1, C2, C3, each with DATA and DICT; PersCol Desc C1, C2, C3; Pers Desc (six instances); Data Page Chain; Dictionary Page Chain; MVCC Page Chain; MVCC Object.]

Container Based Persistency

L2 Delta Persistency: PAX Format Data Pages

Page layout, in order:
• Generic LP Header
• DataPage Fixed Hdr: # of columns, # of rows, first row position, etc.
• (n) Column Info Blocks: bit-size encoding, data type, offset within the page. One block per column.
• Row IDs: materialized RowID (can be optimized when all RowIDs are contiguous)
• Value ID blocks (col 0 value ids, col 1 vids, col 2 vids, col 3 vids, value ID blocks for columns 4…n-1, col (n) vids): blocks of n-bit packed value IDs per column. Each block contains the same number of rows, but the blocks differ in size due to differences in encoding.
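To make "n-bit packed value IDs" concrete, here is a minimal packing sketch. The bit order (first value in the lowest bits, little-endian bytes) is an assumption chosen for simplicity, and `bits_needed`, `pack`, and `unpack` are hypothetical helper names; the on-page layout used by HANA is not specified in these slides.

```python
from typing import List


def bits_needed(max_value_id: int) -> int:
    """Smallest bit width that can represent every ID up to max_value_id."""
    return max(1, max_value_id.bit_length())


def pack(value_ids: List[int], bits: int) -> bytes:
    """Pack value IDs into a contiguous block using `bits` bits each."""
    acc = 0
    for i, vid in enumerate(value_ids):
        if vid < 0 or vid >> bits:
            raise ValueError(f"value ID {vid} does not fit in {bits} bits")
        acc |= vid << (i * bits)
    nbytes = (len(value_ids) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")


def unpack(block: bytes, bits: int, count: int) -> List[int]:
    """Inverse of pack(); `count` is the number of rows in the block."""
    acc = int.from_bytes(block, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


ids = [0, 5, 3, 7, 1]
width = bits_needed(max(ids))
packed = pack(ids, width)
restored = unpack(packed, width, len(ids))
```

Two columns holding the same number of rows end up with blocks of different sizes whenever their dictionaries need different bit widths, which matches the note about block sizes on the slide.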

L2 Delta Persistency: Column Data Array

Column data array delta persistency implemented using PAX pages
• Data page uses PAX (Partition Attributes Across) format
• Keeps complete rows (i.e., same number of values for each column) on a given page
• Physical data placement grouped by columns
• PAX format chosen to optimize storage for large number of small ERP tables
• Single page per table (for small number of rows)

In-memory contiguous column data array (aka index vector) survives
• Key decision to preserve OLAP performance
• Keeps AttributeEngine access methods simple: no need for access methods over PAX pages

L2 Delta Persistency: Column Data Array

Data Page Population
• Asynchronous population of data pages from in-memory data vector
• Flushing of data page not tied to transaction commit
• Flushing synchronized with database savepoint
• Data pages evicted as soon as they are full (low memory utilization)

N-Bit Encoding Rollover
• Affected in-memory column data array is re-encoded
• Re-encoding of affected and subsequent data pages
• Lazy copy to data page: nothing to re-encode on the page if data not copied yet

Delta Store Loading
• In-memory column data arrays populated from data pages
• Data pages evicted (except for the last page) afterwards
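An n-bit encoding rollover can be sketched with a small growable vector. This is an illustration only: `NBitVector` is a hypothetical class that keeps the packed data in one Python integer, and it models just the in-memory re-encoding, not the re-encoding of data pages.

```python
class NBitVector:
    """Toy n-bit packed vector that widens its encoding on demand."""

    def __init__(self):
        self.bits = 1      # current bit width per value ID
        self.count = 0
        self.acc = 0       # packed representation held as one big integer

    def to_list(self):
        mask = (1 << self.bits) - 1
        return [(self.acc >> (i * self.bits)) & mask
                for i in range(self.count)]

    def _reencode(self, new_bits):
        # Rollover: every stored value ID is rewritten at the new width.
        values = self.to_list()
        self.bits = new_bits
        self.acc = 0
        for i, v in enumerate(values):
            self.acc |= v << (i * new_bits)

    def append(self, value_id):
        if value_id.bit_length() > self.bits:
            self._reencode(value_id.bit_length())
        self.acc |= value_id << (self.count * self.bits)
        self.count += 1


vec = NBitVector()
for vid in [0, 1, 1]:
    vec.append(vid)
width_before = vec.bits
vec.append(2)          # needs 2 bits: triggers a rollover
vec.append(9)          # needs 4 bits: triggers another rollover
```

The rollover cost grows with the number of values already stored, which is why copying to data pages lazily (so that fewer pages need re-encoding) is attractive.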

L2 Delta Persistency: Data Page Directory and Page Chain

Compact page directory (array) is always in memory

Fully populated pages are flushed to disk and not resident (most of the chain)

Pages are resident until all rows they represent have been inserted (usually just the tail of the chain)


L2 Delta Persistency: Dictionary

Two Types of Dictionaries

• Value in Array (ViA)
  • Small fixed-size types (value length <= 16 bytes)
  • Value stored directly in the dictionary’s value array
  • Value array (transient) is vector<T>

• Pointer in Array (PiA)
  • Strings (both VARCHAR and CHAR; fixed size not leveraged)
  • Value stored in physical blocks; not compressed
  • Pointer to the value stored in the dictionary’s value array
  • Pointer points to a string block:
    • First 1, 2 or 4 bytes of the value are the length
    • First 2 bits of the length indicate whether it occupies 1, 2 or 4 bytes
  • Value array (transient) is vector<char *>

[Figure: for ViA, a value array indexed by ValueID that holds the values (v) directly; for PiA, a value array indexed by ValueID that holds pointers (p) to values (v) stored in a block.]
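The variable-width length prefix of a PiA string block entry can be sketched as follows. The slides only state that the first two bits select a 1, 2 or 4 byte length; the concrete tag values (`00`, `01`, `10`), the big-endian byte order, and the function names below are assumptions made for illustration.

```python
from typing import Tuple


def encode_block_value(value: bytes) -> bytes:
    """Prefix `value` with a 1, 2 or 4 byte length; top 2 bits are the tag."""
    n = len(value)
    if n < 1 << 6:
        header = n.to_bytes(1, "big")                       # tag 00
    elif n < 1 << 14:
        header = ((0b01 << 14) | n).to_bytes(2, "big")      # tag 01
    elif n < 1 << 30:
        header = ((0b10 << 30) | n).to_bytes(4, "big")      # tag 10
    else:
        raise ValueError("value too long for this sketch")
    return header + value


def decode_block_value(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Return (value, offset of the next entry)."""
    tag = buf[offset] >> 6
    width = {0b00: 1, 0b01: 2, 0b10: 4}[tag]
    raw = int.from_bytes(buf[offset:offset + width], "big")
    length = raw & ((1 << (8 * width - 2)) - 1)   # strip the 2 tag bits
    start = offset + width
    return buf[start:start + length], start + length


short = encode_block_value(b"abc")
long_entry = encode_block_value(b"x" * 100)
block = short + long_entry
first, next_offset = decode_block_value(block)
second, end_offset = decode_block_value(block, next_offset)
```

With this layout, strings shorter than 64 bytes cost a single length byte, which fits the observation later in the deck that most ERP strings are very short.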

L2 Delta Persistency: Dictionary

Dictionary persistency implemented using dictionary pages
• Pages subdivided into blocks
• A block stores a chunk of dictionary values for a single column
• Value ordering
  • Implicit: for ViA dictionary
  • Explicit via logical value pointer: for PiA dictionary

L2 Delta Persistency: ViA Dictionary

• No or little fragmentation
• ViA pages evicted when full
• Value ID based on implicit ordering
• Value copy maintained in the value vector

[Figure: a ViA page. Labels: PGH; BH C1; BH C5; BH C1; BH C2; BH C4; Fragmentation. Transient dictionaries: C1, C2, C5.]

L2 Delta Persistency: PiA Dictionary

PiA dictionary values are stored on PiA pages.
• Values of different columns are interleaved in a single block per page
• Value placement in PiA pages is unordered

Explicit value ordering is achieved using logical pointers
• Logical pointers are implicitly ordered in a block
• Blocks containing logical pointers use ViA pages (which are evicted as they become full)

L2 Delta Persistency: PiA Dictionary

• ERP Dictionary Data Distribution

Varchar Size         % of All Occurrences
NVARCHAR(1-4)        52 %
NVARCHAR(5-10)       23 %
NVARCHAR(11-20)      10 %
NVARCHAR(21-30)      6 %
NVARCHAR(31-40)      5 %
NVARCHAR(41-50)      1 %
NVARCHAR(51-100)     2 %
NVARCHAR(100-200)    0.5 %

Data Type    # Occurrences    % of All Occurrences
NVARCHAR     741779           89 %
DECIMAL      67918            8 %
INTEGERS     11443            1.3 %
VARBINARY    7751             0.92 %
SMALLINT     4250             0.5 %
DOUBLE       2275             0.27 %
CLOB         1384             0.16 %
BLOB         742              0.08 %

PiA dictionary storage design considers ERP dictionary data distribution
• Optimizations for small strings (<=7 bytes)

L2 Delta Persistency: DML Runtime

[Figure: write operations (INSERT/UPDATE/DELETE) on the Delta Fragment. Labels: Column C1, Column C2, …, Column Cn; Data Array; Dictionary store; Dict Index; Inverted Index; Data Log Record; Redo Logs (Log Entry); Undo (UndoEntry); Data Volume.]

L2 Delta Persistency: Recovery

• Delta Store recovery happens as part of the DB Recovery
• At restart, delta fragment’s persistent state is reverted to what it was at the last savepoint table (via converter table switch)
• Any pages (data, dictionary, and MVCC) written after the savepoint are lost
• Replay of redo log records recovers the DB state (and table’s delta store as well) to last committed transaction
• Still-open transactions after recovery are closed and their UNDO executed

• AE is not fully available during recovery
• Recovery must be self contained in the UT layer
  • Replay of redo log needs fully instantiated dictionary
  • Column data array needs to be instantiated as well
  • This is why backing array for dictionary value vector and column data array are moved to UT layer

[Figure: Data Volume holding the Delta Fragment (Data, Dict, MVCC); Redo Logs (Log Entry); Undo (UndoEntry).]

L2 Delta Persistency: Recovery

• When first redo log record for a table is hit, delta fragment is instantiated from on disk image of delta store

[Figure: DB Recovery. Data Volume holding the Delta Fragment (Data, Dict, MVCC); Redo Logs (Log Entry); Undo (UndoEntry); Data Object with Delta Fragment: Δ Col Frag C1, C2, …, Cn, each with IV and DICT; In-Memory MVCC Info.]

L2 Delta Persistency: Recovery

• After Delta Fragment is instantiated, log records for the table can be replayed over UT
• Incomplete transactions rolled back using undo files

[Figure: DB Recovery. Data Volume holding the Delta Fragment (Data, Dict, MVCC); Redo Logs (Log Entry); Undo (UndoEntry); Data Object with Delta Fragment: Δ Col Frag C1, C2, …, Cn, each with IV and DICT; In-Memory MVCC Info. Annotations: Replay Redo Log Record; Dirty pages written out when full.]

L2 Delta: Lock-less Structures

Lock-less Structures

Legacy Implementation
• DMLs execute concurrently, but are blocked at the data structure access level
• Data structure level locking hurts OLTP performance
• A write to an index vector locks the entire vector (unlike a classical page-based scheme, which requires only the affected page to be locked)

L2 Delta uses lock-less structures
• Lock-less versioned vectors
  • Column data array, dictionary value vector, etc.
• Lock-less B-tree and/or lock-less hash map for dictionary index
• No locking of structures even for write
• Most coarse locks from AE already removed
• AE has several other locks
  • Most of them will be removed, but some may survive
  • Existing code has silent assumptions

[Figure: Attribute C1 with Index Vector, Dict Value Vector, Dict Index, and Inverted Index.]

Unified Table Container: Control Structure Versioning Example (Simplified)

Operations:
• Add Page 3
• Start reader
• Add Page 4
• Add Page 5
  • Clone vector
  • Clone table
  • Set vector
  • Set anchor
  • Link page
• Read data
• Add Page 6
• End reader
  • Old version dropped

[Figure: Anchor, Table and Table’, Δ Page Vector and Δ Page Vector’, Page 1 to Page 6, a Reader, reference counts (Ref# 1, Ref# 0), and regions labelled Transient, Persistent, and Meta.]
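The versioning example can be approximated with a small copy-on-write sketch. It is deliberately simplified and makes several assumptions: it clones the page vector on every append (the slide clones only for Page 5), it is single-threaded and uses plain counters where a lock-less implementation would use atomic operations, and `Version` and `VersionedPageVector` are hypothetical names.

```python
class Version:
    """One immutable snapshot of the page vector."""

    def __init__(self, pages):
        self.pages = tuple(pages)
        self.refs = 0          # number of readers pinning this version
        self.dropped = False


class VersionedPageVector:
    def __init__(self, pages=()):
        self.anchor = Version(pages)   # the anchor points at the current version

    def start_reader(self):
        version = self.anchor
        version.refs += 1
        return version

    def end_reader(self, version):
        version.refs -= 1
        if version.refs == 0 and version is not self.anchor:
            version.dropped = True     # old version dropped with its last reader

    def add_page(self, page):
        old = self.anchor
        self.anchor = Version(old.pages + (page,))   # clone, then set anchor
        if old.refs == 0:
            old.dropped = True


vec = VersionedPageVector(["Page 1", "Page 2"])
vec.add_page("Page 3")
reader = vec.start_reader()
vec.add_page("Page 4")
vec.add_page("Page 5")
seen_by_reader = reader.pages          # reader still sees its own version
pinned = not reader.dropped
vec.add_page("Page 6")
vec.end_reader(reader)
```

Writers never modify a version that a reader may be traversing; they publish a new one through the anchor, so readers need no lock and old versions are reclaimed once unreferenced.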