Advanced Database Topics
Copyright © Ellis Cohen 2002-2005
Transactions, Failure & Recovery
These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them, please see http://www.openlineconsult.com/db
© Ellis Cohen 2002-2005 2
Topics
Transactions & Commit
Abort & Rollback
Nested Transactions & Savepoints
Transactions, Failure & Recovery
Server Page Caching
Ensuring Atomicity & Durability with Shadow Paging
Ensuring Atomicity & Durability with Undo Logging
Redo Logging
Undo/Redo Logging
Ensuring Longer-Term Durability
Handling Consistency Failure
© Ellis Cohen 2002-2005 3
ACID Properties of Transactions
Atomicity *
All of the updates of a transaction are done or none are done

Consistency *
Each transaction leaves the database in a consistent state (preferably via consistency predicates)

Isolation
Each transaction, when executed concurrently with other transactions, should have the same effect as if executed by itself

Durability *
Once a transaction has successfully committed, its changes to the database should be permanent
© Ellis Cohen 2002-2005 4
Transactions and Commit
© Ellis Cohen 2002-2005 5
Transaction
Logical unit of work that must be either entirely carried out or aborted

Example: a sequence of SQL commands, grouped together, e.g. in an SQL*Plus script

If only part of the transaction were carried out, the database could be left in an inconsistent state
© Ellis Cohen 2002-2005 6
Example SQL*Plus Script

This script moves money from one account to another. Parameters:
&srcacct - The account to move money from
&dstacct - The account to move money to
&amt - The amount of money to be moved

UPDATE checking
SET balance = balance - &amt
WHERE acctid = &srcacct;

UPDATE checking
SET balance = balance + &amt
WHERE acctid = &dstacct;
Suppose a crash occurs right here, between the two UPDATEs!
© Ellis Cohen 2002-2005 7
Transactions & COMMIT
Transaction starts → Modify → Modify → Modify → Modify → Transaction commits (modifications persisted to DB)
• Each modification is visible to the SQL commands executed after it in the same transaction
• But the modification is not actually persisted to the database until the transaction commits
• So, if a crash occurs in the middle of a transaction, after some modifications have been done, the DB acts as if the modifications never happened!
All SQL commands are performed within a transaction
The transaction ensures these are done atomically
© Ellis Cohen 2002-2005 8
Uncommitted & Committed Transactions
start
transaction
modify modify modify COMMIT
Modifications persisted to DB
start
transaction
modify modify
Modifications not persisted
© Ellis Cohen 2002-2005 9
SQL*Plus Commit Example
SQL> set autocommit off
SQL> UPDATE checking SET balance = balance - &amt WHERE acctid = &srcacct;
SQL> UPDATE checking SET balance = balance + &amt WHERE acctid = &dstacct;
SQL> COMMIT;
Transaction started automatically at first update if not already in progress
© Ellis Cohen 2002-2005 10
Starting Transactions

The COMMIT command ends a transaction. How do transactions start?
• Most databases start a new transaction automatically
  – on the first access to the DB within a session, and
  – on the first access following a COMMIT
• Some databases have a START TRANSACTION command (to support complex nested transactions)
© Ellis Cohen 2002-2005 11
Transactions & DB Requests
Middle Tier → Data Tier

Cross-Request Transactions: the middle tier issues individual requests:
UPDATE …
UPDATE …
COMMIT
UPDATE …
COMMIT

Within-Request Transactions: a single request executes a stored procedure:
PROCEDURE StoredProc IS
BEGIN
  UPDATE …
  UPDATE …
  COMMIT;
  UPDATE …
  COMMIT;
END;
© Ellis Cohen 2002-2005 12
Automatic Commit

Updates may persist even when COMMIT is not explicitly called
• Most databases support — either on the server or just through the client-side API — an autocommit mode which automatically does a commit after execution of each request made to the database. This is often the default.
• Most databases automatically COMMIT when a client cleanly closes their connection to the database.
• Most databases (including Oracle) do not allow DDL statements (e.g. CREATE TABLE) to be part of a larger transaction, and automatically do a commit before and after executing a DDL statement.
© Ellis Cohen 2002-2005 13
Java Commit Example

Connection conn = …;
conn.setAutoCommit( false );
movemoney( conn, 30479, 61925, 2000 );
…

//---------------------------------------
static void movemoney( Connection conn,
                       int srcacct, int dstacct, float amt )
    throws SQLException
{
  Statement stmt = conn.createStatement();
  String sqlstr = "update checking" +
      " set balance = balance - " + amt +
      " where acctid = " + srcacct;
  stmt.executeUpdate( sqlstr );

  sqlstr = "update checking" +
      " set balance = balance + " + amt +
      " where acctid = " + dstacct;
  stmt.executeUpdate( sqlstr );

  conn.commit();
}
© Ellis Cohen 2002-2005 14
Abort & Rollback
© Ellis Cohen 2002-2005 15
Abort

Aborting a transaction undoes the effects of the transaction -- it is as if the transaction never started
Transactions are aborted in 3 ways:
1. The system crashes: All active transactions are aborted
2. An uncorrectable error occurs while executing the transaction
3. The transaction explicitly aborts (this is called a ROLLBACK)
A transaction completes when it either commits or aborts
© Ellis Cohen 2002-2005 16
Rollback
Rollback aborts a transaction
• SQL*Plus: ROLLBACK
• Java: conn.rollback()
© Ellis Cohen 2002-2005 17
Commit vs Rollback
start transaction → modify → modify → modify → ROLLBACK

start transaction → modify → modify → modify → COMMIT
© Ellis Cohen 2002-2005 18
Explicit Rollback
SQL> COMMIT;
SQL> UPDATE Emps SET job = 'COOK';
SQL> UPDATE Emps SET sal = sal + 200;
SQL> ROLLBACK;
After the ROLLBACK, the state is exactly as it was following the COMMIT.
It is as if the two UPDATEs never happened!
With AUTOCOMMIT OFF
© Ellis Cohen 2002-2005 19
Explicit Rollback in Java
{
  Connection conn = …;
  conn.setAutoCommit( false );
  Statement stmt = conn.createStatement();
  String sqlstr = …;
  stmt.executeUpdate( sqlstr );
  …
  if (…)
    conn.commit();
  else
    conn.rollback();
}
© Ellis Cohen 2002-2005 20
Rollback Past Commit
Rollback rolls the state back to the beginning of the transaction.
Why not allow some form of rollback that goes back to some earlier point?
© Ellis Cohen 2002-2005 21
Commit Semantics & Compensating Transactions
Because commits are durable!

When a transaction commits, the user or application is notified that the commit succeeded and can't be undone, and may take other actions outside the database based on that
– Display output to a user
– Send a message to another process
– Launch nuclear missile
Some systems allow a compensating transaction to be associated with a transaction when it commits. The compensating transaction can be executed to "undo" the effects of the associated committed transaction (possibly within some time limit)
– Output a retraction
– Send a compensating message
– Destroy the nuclear missile
© Ellis Cohen 2002-2005 22
Nested Transactions & Savepoints
© Ellis Cohen 2002-2005 23
Nested Transactions

Transactions can nest

start transaction   modify   modify   modify

Only the outermost transaction can commit and persist data
A nested transaction can control the degree of rollback
Nested transactions in SQL are implemented using SAVEPOINTs
© Ellis Cohen 2002-2005 24
Savepoints

SAVEPOINT <name>
Explicitly starts a new named nested transaction

ROLLBACK TO SAVEPOINT <name>
Rolls back to the state at the start of the named nested transaction

RELEASE SAVEPOINT <name>
Releases the savepoint and its associated transaction [not supported by Oracle]
(Setting a savepoint with the same name as an existing savepoint releases the existing one)

COMMIT
Releases all savepoints within the outermost transaction & commits
start transaction → set savepoint a → set savepoint b → set savepoint c → rollback to b → commit
© Ellis Cohen 2002-2005 25
Using Savepoints
Set a savepoint to try something that is quick but doesn't always work, e.g. access to some remote database that is not always available

On failure, back up to the savepoint (undoing any changes to the DB you have made) and try a slower but more reliable technique
© Ellis Cohen 2002-2005 26
Alternative Path in PL/SQL
BEGIN
  DoUsefulSetup( … );
  BEGIN
    SAVEPOINT RetryPoint;
    DoQuickUnreliableUpdates( … );
  EXCEPTION
    WHEN OTHERS THEN
      ROLLBACK TO RetryPoint;
      DoSlowReliableUpdates( … );
  END;
END;
© Ellis Cohen 2002-2005 27
Alternative Path in Java
Connection conn = …;
Statement stmt = conn.createStatement();

DoUsefulSetup( … );
Savepoint spRetry = null;
try {
  spRetry = conn.setSavepoint( "RetryPoint" );
  DoQuickUnreliableUpdates( … );
}
catch( Exception e ) {
  conn.rollback( spRetry );
  DoSlowReliableUpdates( … );
}
© Ellis Cohen 2002-2005 28
Statement-Level Transactions

Every SQL statement executes within a nested transaction

A statement can fail
– E.g. due to violation of an integrity constraint, e.g. check( enddate > startdate )

Result of statement failure:
– The statement is rolled back. If an update statement would update 100 records, but updating the 11th record causes failure of an integrity constraint, the 10 previously updated records are rolled back to their old state
– In embedded SQL, it then raises an exception, which can eventually cause the outermost transaction to abort if not caught

Result of statement success:
– The statement-level transaction is released
© Ellis Cohen 2002-2005 29
Autonomous Nested Transactions
When a transaction fails, all modifications made during that transaction are undone.

That may not be what you want!
– Suppose you want to add an audit record (to the EmpsAudit table) every time someone tries to update the Emps table.
– You want to add that audit record even if the operation which updates Emps is ultimately rolled back.

Solution: Add the audit record inside an autonomous nested transaction.
– Autonomous transactions can durably commit inside of a parent transaction
– If the parent transaction is aborted after the nested autonomous transaction commits, modifications made inside the autonomous transaction will NOT be undone.
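In Oracle this is declared with PRAGMA AUTONOMOUS_TRANSACTION; the behavior itself can be sketched with a toy in-memory model (all class and method names here are illustrative, not a real database API). The parent transaction buffers its updates, while audit records commit autonomously and therefore survive a parent rollback:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model: a parent transaction buffers updates to Emps,
// while audit records commit autonomously (immediately and durably).
public class AutonomousAuditDemo {
    static Map<Integer, Integer> emps = new HashMap<>();      // committed Emps data
    static Map<Integer, Integer> pending = new HashMap<>();   // parent txn's buffered updates
    static List<String> auditLog = new ArrayList<>();         // autonomously committed audit records

    static void updateSal(int empno, int sal) {
        // The audit record commits inside its own (autonomous) transaction,
        // so it survives even if the parent transaction later aborts.
        auditLog.add("attempted update of emp " + empno);
        pending.put(empno, sal);                              // buffered in the parent txn
    }

    static void commitParent()   { emps.putAll(pending); pending.clear(); }
    static void rollbackParent() { pending.clear(); }         // audit entries are NOT undone

    public static void main(String[] args) {
        emps.put(7369, 800);
        updateSal(7369, 900);
        rollbackParent();                                     // parent txn aborts
        System.out.println(emps.get(7369));                   // 800: the update was undone
        System.out.println(auditLog.size());                  // 1: the audit record survives
    }
}
```

From JDBC, the same effect is usually obtained by doing the audit insert on a second Connection with its own commit.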
© Ellis Cohen 2002-2005 30
Transactions, Failures
& Recovery
© Ellis Cohen 2002-2005 31
Types of Failures

Transaction Failure
Transaction aborts for some reason (e.g. an uncaught exception)
Potentially violates: Atomicity

System Failure
Processor / system crash; main memory lost, disk ok
Potentially violates: Atomicity

Media Failure & Catastrophes
Disk or controller error / head crash
Potentially violates: Durability

User/Program Errors & Sabotage
Loss or corruption of data
Potentially violates: Consistency
© Ellis Cohen 2002-2005 32
Failures & Recovery
Atomicity-Related Failures
Return all data changed by a transaction to its state at the beginning of the transaction

Durability-Related Failures
Depends on keeping a backup of the data; recover the state of the data from the backup

Consistency-Related Failures
Recover affected data (as for a durability failure), and deal with cascading effects of committed transactions that modified or depended upon incorrect data. Very difficult to deal with; we won't deal with these in general
© Ellis Cohen 2002-2005 33
Shadow Copying

A primitive recovery mechanism to ensure Atomicity & (limited) Durability. Assumes 1 transaction at a time.

Initially there's the Main DB, and db_ptr (also on disk) holds its disk address.

Then, when a transaction starts:
A copy of the Main DB is made: the Current DB Copy
The transaction is executed using the Current DB Copy
In effect, the Main DB becomes a "shadow copy"

[diagram: db_ptr points at the Main DB; the transaction works on the Current DB Copy]

How is a ROLLBACK handled?
© Ellis Cohen 2002-2005 34
Failure and Shadow Copying

On Commit:
1) Force cached pages out to the Current DB Copy
2) Change db_ptr to point to the Current DB Copy
3) Discard the old Main DB

Crash before (2): As if the transaction never started
Crash after (2): Transaction state is completely updated

A single atomic operation (changing the on-disk db_ptr) moves the system from one consistent state to another
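The db_ptr mechanism above can be sketched as a toy in-memory model (illustrative names, not a real DBMS API): the transaction works against a copy of the whole DB, and commit is the single pointer assignment:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of shadow copying: the transaction runs against a copy of the DB,
// and commit is the single atomic act of swinging db_ptr to that copy.
public class ShadowCopy {
    Map<String, Integer> mainDb = new HashMap<>();  // the DB that db_ptr points at
    Map<String, Integer> current;                   // the transaction's working copy

    void begin()                { current = new HashMap<>(mainDb); }  // copy the whole DB
    void write(String k, int v) { current.put(k, v); }                // updates hit the copy
    void commit()               { mainDb = current; }                 // atomic db_ptr swap
    void rollback()             { current = null; }                   // just discard the copy

    public static void main(String[] args) {
        ShadowCopy db = new ShadowCopy();
        db.begin(); db.write("x", 1); db.commit();
        db.begin(); db.write("x", 2); db.rollback();    // or: a crash before the swap
        System.out.println(db.mainDb.get("x"));         // 1: the aborted write never happened
    }
}
```

A crash anywhere before the pointer swap leaves mainDb untouched, which is exactly the atomicity argument on the slide.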
© Ellis Cohen 2002-2005 35
Shadow Copy in Practice

– It takes too much space to make a copy of the entire DB
– It is too slow to make an entire copy of the DB for each transaction
Perhaps we will find the ideas of shadow copying useful later on …
© Ellis Cohen 2002-2005 36
Server Page Caching
© Ellis Cohen 2002-2005 37
Disk Structure

A disk block is the same size as a memory page
© Ellis Cohen 2002-2005 38
Disk Block Organization

Divide the database into disk blocks (which correspond to memory pages). A block is 1 or more contiguous disk sectors.

Generally, either:
A block holds 1 or more complete rows (i.e. tuples), usually from the same table
  – No row straddles a block
  – The block has contiguous rows, or a row directory which keeps track of the offsets of the rows in the block

Or (for long rows):
A row spans 1 or more blocks (internal chaining)
  – No block holds pieces of 2 or more rows

Really large fields (LOBs) are stored separately.
© Ellis Cohen 2002-2005 39
Addressing Tuples

Every tuple in a database is addressed by a ROWID, which indicates where it may be found.

For example, in the ROWID 3049625973, the leading digits identify a specific database block, and the trailing digits identify a slot in the block's row directory.
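As a purely hypothetical illustration of this split (the slide does not specify the digit boundary; assume here that the low 3 decimal digits are the slot number):

```java
// Illustrative ROWID arithmetic: assume (hypothetically) that the low 3
// decimal digits are the slot in the block's row directory and the
// remaining high-order digits are the block #.
public class RowId {
    static long block(long rowid) { return rowid / 1000; }
    static long slot(long rowid)  { return rowid % 1000; }

    public static void main(String[] args) {
        long rowid = 3049625973L;                 // the example ROWID from the slide
        System.out.println(block(rowid));         // 3049625
        System.out.println(slot(rowid));          // 973
    }
}
```

Real systems (e.g. Oracle) pack file, block, and slot numbers into a binary ROWID rather than decimal digits, but the block#/slot# decomposition is the same idea.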
© Ellis Cohen 2002-2005 40
Migration & Forward Chaining
An update may increase the size of a tuple so much that it can no longer fit in the same block, so we have to move it to another block.
But we want the tuple to still be identified by its ROWID, which refers to the old block
The data for the row in the old block is replaced by a forwarding id -- the ROWID of the row in the new block
© Ellis Cohen 2002-2005 41
Block Access & Update

To read any row in a block, the block is read into core memory (if not already there) [may also prefetch adjacent blocks]

To insert/delete/update a row in a block:
1) the block is read into core memory (if needed)
2) the page is modified
3) the page is eventually written back to disk
© Ellis Cohen 2002-2005 42
Blocks & Pages
Frequently, the DB block size (the smallest unit of data transfer between the DB disk memory and core memory) is chosen to be the same size as a virtual memory page
We will use the terms page and block interchangeably.
© Ellis Cohen 2002-2005 43
Server Page Caching

After a read or update, the page may be cached (i.e. retained) in the DB server's memory. If the page is still in memory the next time it is needed, there is no need to read it from disk.

When the cache is full, room is made for a new page by replacing some other page. Most metadata tables are always in the cache.
© Ellis Cohen 2002-2005 44
Memory & Disk Specs

130 GB Disk
512 bytes/sector, 256 sectors/track
65K tracks/head, 16 heads/disk (8 platters)
1M tracks/disk, 256M sectors/disk
10 ms max seek time, 1 ms track-to-track
4 ms avg latency
Sustainable data transfer rate: 65 Mbps (4K bits per sector ≈ 60 µs/sector)

Average time to check 2K bytes from disk:
seek + latency + transfer + core check times
= 0-10 ms + 4 ms + .25 ms + 1 µs ≈ 4.25-14.25 ms

Disk/Core ratio = ~10 ms / 1 µs = 10,000:1
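The arithmetic above can be reproduced directly (figures taken from the slide; reading 2K bytes means transferring four 512-byte sectors):

```java
// Reproduces the slide's arithmetic for reading and checking 2K bytes
// (four 512-byte sectors): seek + rotational latency + transfer + in-core check.
public class DiskTiming {
    static double accessMs(double seekMs) {
        double latencyMs  = 4.0;          // average rotational latency
        double transferMs = 4 * 0.060;    // ~60 microseconds per 512B sector at 65 Mbps
        double checkMs    = 0.001;        // ~1 microsecond to check the bytes in core
        return seekMs + latencyMs + transferMs + checkMs;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", accessMs(0));   // best case (no seek), ~4.24 ms
        System.out.printf("%.2f%n", accessMs(10));  // worst case (max seek), ~14.24 ms
    }
}
```

The disk terms dominate by roughly four orders of magnitude over the in-core check, which is the 10,000:1 ratio the slide quotes.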
© Ellis Cohen 2002-2005 45
Page Caching & Virtual Memory
• OS allocates DB a fixed (perhaps changeable) # of pages of disk and memory which the DB managescan unnecessarily constrain memory management
• Persistent DB state stored in ordinary files, and the page cache is in virtual memorycauses duplication of effort
if VM page is backed to disk
• OS and DB storage management are integratedOS (e.g. Mach) has a file mapping API which can
be used by the database
© Ellis Cohen 2002-2005 46
Dirty and Active Pages

Dirty Page
A page that has been modified, and has not been written back to disk (since it was modified).
A clean page either
• has not been modified since it was read, or
• has not been modified since it was last written back to disk

Active Page
A page that has been accessed by a transaction that has not yet completed (i.e. committed or aborted)
© Ellis Cohen 2002-2005 47
Page States

1. Clean Inactive: same contents as on disk; every transaction that used it has completed
2. Clean Active: same contents as on disk; some transaction that used it is active
3. Dirty Inactive: page has been modified, but not written back to disk; every transaction that used it has completed
4. Dirty Active: page has been modified, but not written back to disk; some transaction that used it is active

Consider the page that has been least recently used. Which of these states could it be in? (Consider the states in the order indicated)
© Ellis Cohen 2002-2005 48
LRU Page States

Clean Inactive: the transactions which used this page finished a long time ago; any modifications were written out
Clean Active: a transaction using this page started a long time ago, but has not yet finished; any modifications were written out
Dirty Inactive: the transactions which used this page finished a long time ago; modifications not written out
Dirty Active: a transaction using this page started a long time ago, but has not yet finished; modifications not written out

Why are Dirty Inactive pages a problem for Durability?
© Ellis Cohen 2002-2005 49
Forcing

What happens to a dirty page when the transaction which modified it commits?

FORCE: It is written back to disk.
Necessary for durability unless some other mechanism is available.
Effect: no dirty inactive pages

NO-FORCE: The page is not written back on commit.
Avoids overhead at commit time.
If the system crashes after a transaction commits, and the page is not written back, how is durability ensured?
© Ellis Cohen 2002-2005 50
The Replacement Problem
What if a page needs to be loaded into memory, but cache memory is full?
We need to replace some page with the new page.
Which page should we replace?
© Ellis Cohen 2002-2005 51
Replacement Algorithms
LRU: Choose the page which has been used least recently. Based on the (often true) notion that pages used most recently will most likely be used again in the near future.
Clock Algorithm: Approximates LRU, but is more efficient. Cycle through the pages in order. Choose the next page in order that was not used since that page was considered in the previous cycle.
Also, first replace pages read in during full table scans (in fact, if the table is large, throw out earlier pages read when scanning later pages)
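The clock algorithm can be sketched as follows (a minimal illustrative cache, not a real DB buffer manager): each frame carries a "used" bit set on access, and the hand sweeps forward, clearing set bits and evicting the first frame whose bit is already clear:

```java
// Minimal clock (second-chance) replacement sketch: each frame has a
// "used" bit set on access; the hand skips over (and clears) used frames
// and evicts the first frame found with the bit clear.
public class ClockCache {
    int[] page;        // page # held in each frame (-1 = empty)
    boolean[] used;    // reference bit per frame
    int hand = 0;

    ClockCache(int frames) {
        page = new int[frames];
        used = new boolean[frames];
        java.util.Arrays.fill(page, -1);
    }

    // Access page p; returns the evicted page #, or -1 if nothing was evicted.
    int access(int p) {
        for (int i = 0; i < page.length; i++)
            if (page[i] == p) { used[i] = true; return -1; }   // cache hit
        while (used[hand]) {               // give recently used frames a second chance
            used[hand] = false;
            hand = (hand + 1) % page.length;
        }
        int victim = page[hand];           // reference bit clear: evict this frame
        page[hand] = p;
        used[hand] = true;
        hand = (hand + 1) % page.length;
        return victim;
    }

    public static void main(String[] args) {
        ClockCache c = new ClockCache(2);
        c.access(1); c.access(2); c.access(1);
        System.out.println(c.access(3));   // 1: both bits were set; the sweep
    }                                      // clears them and evicts frame 0
}
```

Note this only approximates LRU, and a real buffer manager would add the table-scan bias the slide mentions (evicting scan pages first).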
© Ellis Cohen 2002-2005 52
Cost of Writing Dirty Pages
Suppose the page chosen for replacement is dirty
– we need to first write the dirty page back to disk (which impacts performance)
– before a newly read page can replace it in the cache
Why is this so?
Is there a way to improve the performance?
© Ellis Cohen 2002-2005 53
Pre-Write LRU Dirty Pages
Use a separate Cleaner Process to find dirty pages which have not been used recently and write them back to disk
– The disk scheduler doesn’t need to write them back immediately, but when it is most efficient to do so
The dirty page is not immediately replaced, it just becomes clean (instead of dirty).
– This allows the replacement algorithm to always find a clean page (not used recently) to replace, without needing to wait for it to be written back
But what should the Replacement Algorithm or the Cleaner Process do when it considers an active dirty page?
© Ellis Cohen 2002-2005 54
Stealing

STEAL: May choose a dirty active page to clean/replace.
– What is the danger if you do NOT write out the page? (What if the transaction that modified the page commits?)
– What is the danger if you DO write out the page? (What if the transaction that modified the page aborts?)

NO-STEAL: Skip over dirty active pages
– If there are no clean pages, forces some transaction to abort.
– If there are few clean pages, transactions may thrash (continually reread pages which have recently been replaced)

You can always choose a dirty inactive page; just write it out first
© Ellis Cohen 2002-2005 55
Effect of Processor Crash

Active transaction, no modified pages on disk: as if the transaction never happened
Active transaction, some or all modified pages on disk (possible with STEALing): Atomicity Failure
Committed transaction, no modified pages on disk (possible with NO-FORCE): Durability Failure
Committed transaction, some modified pages on disk: Atomicity and Durability Failure
Committed transaction, all modified pages on disk: Transaction saved successfully!

Problem even if you FORCE & don't STEAL: a crash in the middle of commit, while forcing pages to disk, may leave only some modified pages on disk

Is there a recovery mechanism based on shadow copying which can solve this problem?
© Ellis Cohen 2002-2005 56
Using Shadow Copies

At commit time, we first use shadow copying for all the dirty pages. We then change the database so it points to those pages instead of the original pages (how do we do that atomically?)

Assuming we can make that work, can we allow page stealing?
© Ellis Cohen 2002-2005 57
Ensuring Atomicity and Durability
with Shadow Paging
© Ellis Cohen 2002-2005 58
Page Tables

[diagram: the main page table ptr points at a page table whose entries map logical blocks A-G to physical blocks scattered across the disk]

A database can be organized using a page table. The table maps the LOGICAL block # (which is used in ROWIDs) to the PHYSICAL block # (where the block actually lives on the disk).
© Ellis Cohen 2002-2005 59
Commit-Time Shadow Paging

[diagram: the main page table ptr points at the original page table for pages A-G; transaction T's commit-time page table copy points at new copies B', D', F' of the pages T dirtied, and at the original copies of the rest]

At COMMIT time of transaction T:
1. A commit-time copy of the page table is made
2. T's dirty pages (B, D, F) are forced to disk (but DO NOT overwrite the originals), and the commit-time page table copy is changed to point to the new modified copies
3. The main page table ptr is switched to point to the commit-time page table. THIS IS WHEN THE COMMIT HAPPENS!
4. The old copies of B, D, F and the old page table are freed
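Steps 1-3 can be sketched with a toy model (illustrative names; the "disk" here is an append-only list of physical pages, so old copies are never overwritten):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of commit-time shadow paging: "disk" is an append-only pool of
// physical pages; the page table maps logical page # -> physical page #.
// Commit installs a fresh page table whose dirty entries point at new copies.
public class ShadowPaging {
    List<String> disk = new ArrayList<>();          // physical pages (never overwritten)
    List<Integer> pageTable = new ArrayList<>();    // what "main page table ptr" points at
    Map<Integer, String> dirty = new HashMap<>();   // the transaction's cached dirty pages

    int alloc(String data) { disk.add(data); return disk.size() - 1; }

    void create(String data) { pageTable.add(alloc(data)); }

    String read(int logical) {
        String d = dirty.get(logical);              // prefer the cached dirty version
        return d != null ? d : disk.get(pageTable.get(logical));
    }

    void write(int logical, String data) { dirty.put(logical, data); }

    void commit() {                                 // steps 1-3 from the slide
        List<Integer> copy = new ArrayList<>(pageTable);       // 1. copy the page table
        for (Map.Entry<Integer, String> e : dirty.entrySet())
            copy.set(e.getKey(), alloc(e.getValue()));         // 2. force dirty pages to NEW blocks
        pageTable = copy;                                      // 3. atomic pointer switch = COMMIT
        dirty.clear();                              // (step 4, freeing old copies, is omitted)
    }
}
```

A crash before the pointer switch in commit() leaves pageTable pointing at the old physical pages, which are still intact on "disk".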
© Ellis Cohen 2002-2005 60
Shadow Paging Issues
1. Can we support stealing with shadow paging?
2. The page table is too big to copy on every transaction. How can we improve performance?
3. When a tuple is updated, what pages are changed?
© Ellis Cohen 2002-2005 61
Stolen Page Map

If T is using a dirty page that needs to be replaced or cleaned, write it to disk (without overwriting the original), and note it in T's private stolen page map.

If T needs to access that page again, look for it in the stolen page map before looking in the main page map.

When T commits, use T's stolen page map to help build T's commit-time page table copy.

Two transactions are unable to modify (different rows on) the same page. This requires page-level locking (discussed in the next lecture)
© Ellis Cohen 2002-2005 62
Multi-Level Page Tables

[diagram: a two-level page table; the main page table ptr points at a root table of page-table pages PT0 … PT99, each mapping a range of pages (e.g. PT4 maps P400 … P499); transaction T's commit-time page table copies only the root and the page-table pages whose entries changed, pointing at new copies P401' and P499']
© Ellis Cohen 2002-2005 63
Auxiliary Affected Pages

What pages are affected when a tuple is updated?
• The page containing the original tuple
• If the update makes the tuple so large that there is no room for it in the old page, it is moved to a new page (a forwarded page)
  – If so, the corresponding page of the page table is affected as well.
• If any of the updated fields are indexed, then the corresponding index entry for the tuple will have to be moved (e.g. deleted from its old position in the B+ tree, and inserted at the new position).
  – Both the page of the old entry and the page of the new entry will be affected
  – Adding a new entry to an index page may cause that page to be split, which will then affect the corresponding page of the page table
  – Removing an entry from an index page may cause that page to be combined with an adjacent page, which also affects the corresponding page of the page table.
• Pages containing the portions of the page table hierarchy used to reference those pages!
© Ellis Cohen 2002-2005 64
Shadow Paging Characteristics

Ensures Atomicity & Durability
– Requires Forcing (dirty pages must be written back at commit time) to ensure durability
– Allows Stealing (dirty pages can be written back before the transaction commits, though not overwritten)
– Assumes that if one transaction modifies a page, no other transaction can read or modify it (i.e. page-level locking)

Mechanism
– Uses a page table (on disk) which keeps a list of all pages in the DB
– Keeps a shadow copy of each page written
– Does shadow copying of the page table
– Main result: at commit time, moves the system instantly from one consistent state to another
© Ellis Cohen 2002-2005 65
Pages Changed by Multiple Transactions

What if the same tuple is changed by two concurrent transactions?
– Assume this doesn't happen.
– In the next lecture, we will talk about concurrency control mechanisms which prevent this

What if two different tuples on the same page are changed by concurrent transactions? This is a real problem with shadow paging. Either
– Allow one transaction at a time to use a page (using page locks), or
– Don't actually make the changes to the page until just before commit (using intention lists)
© Ellis Cohen 2002-2005 66
Problems with Shadow Paging

• Commit Bottleneck
Only one transaction can commit at a time, if we want the page table to be correct (how might this be fixed?)

• Limits on Concurrency
Can't have different transactions modify independent parts of pages (could be addressed by deferred modification and intention lists)

• Cost of Shadowing
Overhead of allocating and freeing shadow copies

• Data Fragmentation
For read efficiency, you want logically adjacent data to be kept physically adjacent (e.g. using extents). For write efficiency, this implies in-place updating, not shadow copying (could possibly be addressed by ongoing defragmentation [+ sorting] in the background)
© Ellis Cohen 2002-2005 67
Overview of Logging (the Alternative to Shadow Paging)

Main features
– Uses a log to support recovery (the log itself may span multiple pages)
– No shadowing; uses in-place updating
– Can track modifications at the row (rather than the page) level
– No page tables, but still depends on server page caching

Three approaches
Undo-Only Logging (Backward Recovery): allows stealing, but still requires force on commit
Redo-Only Logging (Forward Recovery): avoids force on commit, but no stealing
Combined (Undo/Redo) Logging: avoids force on commit, and allows stealing
© Ellis Cohen 2002-2005 68
Ensuring Atomicity and Durability
with Undo Logging
© Ellis Cohen 2002-2005 69
Backward Recovery with Undo Logs

Mechanism
On every modification made to any tuple in the database, append an Undo Log entry to an Undo Log.
On Transaction Abort: use the Undo Log to undo all modifications made by the aborted transaction, in backwards order.
Crash Recovery: abort all uncommitted transactions

Characteristics
Requires Forcing (dirty pages must be written back at commit time) -- there is no way to redo on crash
Allows Stealing (dirty pages can be written back before the transaction commits), because the changes are undo-able

All the advantages of shadow paging, with none of the disadvantages
© Ellis Cohen 2002-2005 70
Describing Modifications

Suppose the Emps table contains

(ROWID) EMPNO ENAME  JOB       MGR  HIREDATE   SAL  COMM DEPTNO
------- ----- ------ --------- ---- ---------- ---- ---- ------
3479000 7369  SMITH  CLERK     7902 17-DEC-80   800        20
3479001 7499  ALLEN  SALESMAN  7698 20-FEB-81  1600  300   30
3479002 7521  WARD   SALESMAN  7698 22-FEB-81  1250  500   30
3479003 7566  JONES  DEPTMGR   7839 02-APR-81  2975        20
3479004 7654  MARTIN SALESMAN  7698 28-SEP-81  1250 1400   30
3479005 7698  BLAKE  DEPTMGR   7839 01-MAY-81  2850        30
3479006 7782  CLARK  DEPTMGR   7839 09-JUN-81  2450        10
3479007 7788  SCOTT  ANALYST   7566 19-APR-87  3000        20
3479008 7839  KING   PRESIDENT      17-NOV-81  5000        10
3479009 7844  TURNER SALESMAN  7698 08-SEP-81  1500    0   30
3479010 7876  ADAMS  CLERK     7788 23-MAY-87  1100        20
3479011 7900  JAMES  CLERK     7698 03-DEC-81   950        30
3479012 7902  FORD   ANALYST   7566 03-DEC-81  3000        20
3479013 7934  MILLER CLERK     7782 23-JAN-82  1300        10

Transaction T3 executes
UPDATE Emps SET sal = sal + 100 WHERE deptno = 10

What changes were made to which tuples?
© Ellis Cohen 2002-2005 71
Tuple Modifications

The following changes were made:
Tuple 3479006: sal 2450 → 2550
Tuple 3479008: sal 5000 → 5100
Tuple 3479013: sal 1300 → 1400

Suppose:
The operation was executed,
The pages containing these tuples were modified (in the server page cache),
Those pages were written out (due to STEALing),
Then Transaction T3 was ABORTed

What is the minimum information we would need to know about the affected tuples to undo the effects of the operation?
© Ellis Cohen 2002-2005 72
Tuple Before State

We need to know that:
Tuple 3479006: sal was 2450
Tuple 3479008: sal was 5000
Tuple 3479013: sal was 1300

For each tuple that was updated, we need to know what the value was for each modified field before the operation.

This is the information that is written into the undo log. Many systems write the contents of the entire tuple before the operation -- this is called the before image.

What do we need to know to undo a DELETE or INSERT?
© Ellis Cohen 2002-2005 73
Undoing INSERT & DELETE

To undo an INSERT
We just need to record the ROWID of the tuple, so we can delete it

To undo a DELETE
We need to record the ROWID of the tuple plus the entire contents of the tuple, so we can re-insert it!

What do we need to record when we do other operations, e.g. CREATE TABLE or DROP TABLE?
© Ellis Cohen 2002-2005 74
Logging System Operations

In an RDB, all system state (e.g. which tables are created, what their fields are, etc.) is stored in metadata tables.

Any system operation (e.g. CREATE TABLE, DROP TABLE) is implemented by modifying the metadata tables.

We just log those modifications, just as we log modifications to tuples in user tables!
© Ellis Cohen 2002-2005 75
Separate vs Integrated Logging

Separate Logging
Some systems use a separate undo log for every transaction or for every thread. This may affect performance if different logs are on different tracks of the same disk. BUT: it is very easy to abort a single transaction. Just walk backwards through that transaction's log; every log entry is for the transaction being aborted.

Integrated Logging
In an integrated log, all log entries are appended to a single log, which interleaves entries from multiple transactions. Each entry must identify the associated transaction. To undo a transaction, it is necessary to locate the log entries for that transaction. Typically, each entry points to the previous entry for the same transaction, and there is an entry which identifies the START of the transaction.
© Ellis Cohen 2002-2005 76
Modification Entries for an Integrated Undo Log

Executed by transaction T3:
INSERT INTO Depts VALUES( 30, 'Accounting' )
DELETE Depts WHERE deptno = 67
UPDATE Depts SET dname = 'Gift' WHERE deptno = 23

Resulting log entries (Transaction, Operation, ROWID, Before State):
T3 Insert 3049625973
T3 Delete 3049218695   (before state: 67 'Marketing')
T3 Update 3049218696   (before state: 23 'Sales')

An UNDO log efficiently stores the information needed to restore modified pages to their old state. Just like keeping shadow pages, but more efficient!
© Ellis Cohen 2002-2005 77
Implementing Abort

• Traverse the integrated log (starting at the end and going backwards) to find all the entries for that transaction
  NOTE: this is more efficient if the entries for each transaction are linked together
• For each such modification log entry, restore the before state
  NOTE: if the page/block the entry refers to was stolen, it will first need to be read back into the cache

Logs are APPEND-ONLY. This makes them much more efficient to implement. So, Abort does NOT delete the undo entries after using them to implement an abort.

What if a different transaction has modified a different tuple on the same page as a change which is undone?
Why not find the start of the transaction & undo going forwards?
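This abort procedure can be sketched with a minimal in-memory model (illustrative names, not a real recovery manager): modifications update the "table" in place and append before-state entries to a single interleaved log; abort walks the log backwards, restoring only that transaction's before states:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an integrated, append-only undo log: each entry records the
// transaction, the rowid, and the before image; abort walks the log backwards.
public class UndoLog {
    record Entry(String txn, long rowid, String before) {}  // before == null for an INSERT

    Map<Long, String> table = new HashMap<>();              // rowid -> tuple (in-place updates)
    List<Entry> log = new ArrayList<>();                    // one log, all txns interleaved

    void insert(String txn, long rowid, String tuple) {
        log.add(new Entry(txn, rowid, null));               // undo of INSERT: just delete
        table.put(rowid, tuple);
    }

    void update(String txn, long rowid, String tuple) {
        log.add(new Entry(txn, rowid, table.get(rowid)));   // record the before image
        table.put(rowid, tuple);
    }

    void delete(String txn, long rowid) {
        log.add(new Entry(txn, rowid, table.get(rowid)));   // need the full tuple to re-insert
        table.remove(rowid);
    }

    void abort(String txn) {                                // backwards, restoring before states
        for (int i = log.size() - 1; i >= 0; i--) {
            Entry e = log.get(i);
            if (!e.txn().equals(txn)) continue;             // other txns' entries are untouched
            if (e.before() == null) table.remove(e.rowid());
            else table.put(e.rowid(), e.before());
        }
        // Note: the log itself is append-only; entries are never removed.
    }
}
```

Because entries pinpoint individual rowids, undoing one transaction leaves other transactions' changes to the same pages intact, exactly as the next slide argues.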
© Ellis Cohen 2002-2005 78
Pages Changed by Multiple Transactions

What if the same tuple is changed by two concurrent transactions? What if a tuple modified by the aborted transaction was read by another transaction?
– Assume this doesn't happen.
– There are separate concurrency control mechanisms which prevent this

What if two different tuples on the same page are changed by concurrent transactions?
– This is NOT a problem for logging.
– Log entries pinpoint a specific tuple on a page, which can be undone leaving other tuples on the same page modified.
© Ellis Cohen 2002-2005 79
Auxiliary Affected Pages

Modifying a tuple on a page may cause modification to many other pages
– forwarded pages (oversized updates)
– index pages
– table directory pages (i.e. which pages hold data for a table)

Two approaches
– Explicit: Add entries to the log for each of these modified pages as well. After all, these represent changes that will have to be undone as well.
– Implicit: Do not add entries to the log for changes other than to the tuple itself. Changes to other affected pages can be done automatically as part of undoing the change to the tuple.
Physiological Logging
Our undo log uses "physiological" log entries
• They physically indicate the block of the tuple that was modified (the block # of the ROWID)
• They logically provide information needed to restore the tuple to the state prior to the modification
To undo an INSERT, you only need the fact that it was an Insert along with its logical position in the block (the slot # of the ROWID), because you will undo the INSERT by freeing the contents of that slot.
To undo a DELETE, you need to know all the values of the deleted tuple as well.
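The entry kinds just described can be modeled with a small structure (a sketch with hypothetical field names; the database is modeled as a flat ROWID → tuple map, and the before-images shown are the ones from the undo/redo example later in these slides):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UndoEntry:
    txn: str                         # e.g. 'T3'
    op: str                          # 'INSERT', 'DELETE', or 'UPDATE'
    rowid: int                       # physical block # plus logical slot #
    before: Optional[tuple] = None   # old tuple; needed for DELETE/UPDATE only

def undo(entry, db):
    """Restore the BEFORE state of the tuple the entry points at."""
    if entry.op == 'INSERT':
        db[entry.rowid] = None           # free the slot: no old values required
    else:
        db[entry.rowid] = entry.before   # reinstall the saved old tuple

# T3's three statements produce three undo entries:
t3_log = [
    UndoEntry('T3', 'INSERT', 3049625973),
    UndoEntry('T3', 'DELETE', 3049218695, before=(67, 'Marketing')),
    UndoEntry('T3', 'UPDATE', 3049218696, before=(23, 'Sales')),
]
```

Note that the INSERT entry carries no values at all: its ROWID alone tells the undo routine which slot to free.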
Write Ahead Logging (WAL)
Suppose there is a crash
– Before the commit of a transaction is complete
– After a page modified by the transaction has been written out (at commit time or due to stealing)
Use the undo log to ensure atomicity: undo the changes made to the page.
But only if the undo log is already on the disk!
Write Ahead Logging:
Before writing out a page, force out the undo log (or at least the parts of the undo log which have entries that refer to that page, implicitly or explicitly).
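The rule can be sketched as a guard in the buffer manager (hypothetical names; a real log forces out whole blocks, not individual entries):

```python
class Log:
    """Append-only log whose prefix [0, flushed) is already on disk (a sketch)."""
    def __init__(self):
        self.entries = []
        self.flushed = 0                    # count of entries persisted so far

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1        # log position of the new entry

    def force(self, upto):
        # Persist everything through position `upto`; modeled by a marker.
        self.flushed = max(self.flushed, upto + 1)

def steal_page(page_no, last_log_pos, log, cache, disk):
    """Write a dirty page out, honoring WAL: the log entries describing the
    page (through its last one, at `last_log_pos`) must hit disk first."""
    if last_log_pos >= log.flushed:
        log.force(last_log_pos)             # write-ahead: log before data
    disk[page_no] = cache[page_no]
```

The ordering is the whole point: if the crash comes between the two writes, the undo information is guaranteed to be on disk, so the stolen page can be rolled back.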
Transaction Entries for an Integrated Undo Log
T# Start
appended to the log when transaction T# starts (if a transaction's entries are linked together, this is not needed; START is implied by an entry with a NULL backwards link)
T# CommitComplete
appended to the log after all pages modified by transaction T# have been forced out.
T# AbortComplete
appended to the log after all pages which have been undone for transaction T# have been forced out.
How are these transactional entries used along with the modification entries to recover from a crash?
Log Forcing
A COMMIT appends CommitComplete to the log (after all its modified pages have been written out), and then forces the log out. That's when the COMMIT is actually complete.
Backward Recovery
• Traverse the entire log (starting at the end and going backwards)
• Skip over a modification entry if
  – its transaction's CommitComplete entry has been encountered (all its modified pages have been forced out; it doesn't need to be undone)
  – its transaction's AbortComplete entry has been encountered (all pages it modified have already been undone and forced out; they don't need to be undone again)
• Otherwise, perform the undo action for the modification entry
Why does the entire log have to be traversed? What could you do to avoid that?
If a crash occurs in the midst of a transaction, some modifications will be undone that were never persisted.
Why is that true? Is that a problem?
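The traversal above, sketched over a toy log (hypothetical entry encoding; scanning backwards guarantees a completion entry is seen before that transaction's modifications):

```python
def backward_recovery(log, db):
    """Undo every modification whose transaction never completed.

    Entries are ('COMMITCOMPLETE', txn), ('ABORTCOMPLETE', txn), or
    ('MOD', txn, rowid, before). The database is a flat ROWID -> value map.
    """
    done = set()
    for entry in reversed(log):
        kind = entry[0]
        if kind in ('COMMITCOMPLETE', 'ABORTCOMPLETE'):
            done.add(entry[1])              # skip this txn's entries later
        elif kind == 'MOD' and entry[1] not in done:
            _, _, rowid, before = entry
            db[rowid] = before              # undo the incomplete change
```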
Checkpoint Entries
When a crash occurs, all transactions which have not completed (forced out a CommitComplete or AbortComplete entry) must be undone.
But a transaction might have started a long, long time ago, made a modification, but not made any other modifications since then. We have to look through the entire log to find entries for these transactions.
Solution: Regularly add Checkpoint entries
Add a Checkpoint entry to the log at regular intervals with a list of all the active transactions.
During crash recovery, stop traversing the log when a Checkpoint entry is found where all the active transactions listed have completed (i.e. their CommitComplete or AbortComplete entries have already been encountered).
How do Start entries allow even earlier stopping?
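The stopping rule can be sketched as follows (hypothetical encoding; a checkpoint entry carries the list of transactions active when it was written). Every entry before such a checkpoint belongs to a transaction that was either listed in it (and has completed) or had already completed, so nothing earlier needs undoing:

```python
def scan_stop_point(log):
    """Return the log index where a backward recovery scan may stop:
    the first checkpoint (from the end) all of whose listed active
    transactions have since completed."""
    done = set()
    for i in range(len(log) - 1, -1, -1):
        kind = log[i][0]
        if kind in ('COMMITCOMPLETE', 'ABORTCOMPLETE'):
            done.add(log[i][1])
        elif kind == 'CHECKPOINT' and set(log[i][1]) <= done:
            return i    # everything earlier belongs to completed transactions
    return 0            # no such checkpoint: scan the whole log
```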
Undoing Un-persisted Changes
Suppose Transaction T3 executes
  UPDATE Emps SET sal = sal + 100 WHERE deptno = 10
and the following sequence of events occurs:
1. The operation updates tuples with ROWIDs 3479006, 3479008, 3479013 (in the server page cache)
2. UNDO entries for the operation are written to the log
3. The log is forced out
4. The page containing tuple 3479008 is written out
5. The system crashes
When the system recovers, it will go through the log and execute the UNDO entries for 3479006, 3479008, 3479013, even though the changes for 3479006 and 3479013 were never persisted.
Undo just restores the BEFORE state. If the change being undone was never persisted, at worst this has no effect. (Implicit changes must be handled a little more carefully.)
Idempotence
If a crash occurs in the midst of aborting a transaction or recovering from a previous crash, some actions that have already been undone will be undone again. Why is that true? Is that a problem?
After undoing some of the actions, the pages of some of the restored tuples could be written back (due to STEALing, as usual).
Re-undoing these is not a problem, because, at worst, we are re-restoring the BEFORE state.
So UNDO of physiological logs is idempotent. (Doing it additional times has no effect.)
Redo Logging
Forward Recovery with Redo Logs
Mechanism
On every modification made to any tuple in the database, append a Redo Log entry to a Redo Log.
On Transaction Abort: Discard pages dirtied by the transaction from the server page cache; use the Redo Log to redo other modifications made to those pages.
Crash Recovery: Use the Redo Log to redo modifications to pages of committed transactions that were not forced to disk.
Characteristics
Forcing Not Required (dirty pages need not be written back at commit time) because redo-able
No Stealing (dirty pages CANNOT be written back before transaction commits) since there is no way to undo on abort
Redo Log Modification Entries
Transaction, Operation, ROWID, After State
Executed by transaction T3:
  INSERT INTO Depts VALUES (30, 'Accounting')
  DELETE FROM Depts WHERE deptno = 67
  UPDATE Depts SET dname = 'Gift' WHERE deptno = 23
The resulting entries (the values recorded are the after states):
  T3 Insert 3049625973  30 'Accounting'
  T3 Delete 3049218695
  T3 Update 3049218696  23 'Gift'
These are physiological log entries
Implementing Abort
Invalidate all pages modified by the transaction.
Starting at the beginning of the integrated log, and traversing forward: find all log entries for uncommitted transactions that affect the invalidated pages and redo them (as well as implicit changes to auxiliary affected pages).
There are ways to speed this up, but still, this can be slow.
Transaction Entries for Redo Logs
T# Commit
appended to the log when a request is made to commit the transaction
Log Forcing
A COMMIT appends a Commit entry to the log when a commit request is made, and then forces the log out. That's when the COMMIT is actually complete.
The only transaction log entry needed for a REDO log is Commit.
Forward Recovery
• [Analysis Phase] Traverse the log backwards to find all committed transactions (easier if all Commit entries are linked together)
• [Redo Phase] Then traverse the entire log (starting at the beginning and going forwards)
  – Redo every modification entry of a committed transaction, bringing the necessary block/page into the cache if it is not already there.
    • This may redo changes which have already been persisted. Not a problem, since redoing a change that was already made cannot hurt.
  – Redoing an entry makes the modification to the cached page. Since there is no forcing, these will eventually be written to disk just as during regular operation.
It is really only necessary to redo modifications made to a page after the page was last persisted.
How can this be arranged?
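The two phases can be sketched as follows (hypothetical encoding; the entries carry AFTER states, since this is a redo log):

```python
def forward_recovery(log, db):
    """Analysis phase: find committed transactions (backward scan).
    Redo phase: replay their modifications (forward scan).

    Entries are ('COMMIT', txn) or ('MOD', txn, rowid, after).
    """
    committed = {e[1] for e in reversed(log) if e[0] == 'COMMIT'}
    for entry in log:
        if entry[0] == 'MOD' and entry[1] in committed:
            _, _, rowid, after = entry
            db[rowid] = after    # reinstall the AFTER state (idempotent)
    return db
```

Replaying a change that was already on disk simply overwrites the value with itself, which is why no harm is done by starting from the beginning of the log.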
Log Sequence Numbers (LSN's)
The entries in the log can be numbered (1, 2, … ). These are called log sequence numbers or LSN's.
Every time a page is modified, the LSN of the corresponding log entry is placed in the page, and is written out to disk along with the page.
A redo log entry only needs to be redone if its LSN is greater than that of the page it is on.
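The LSN test can be sketched like this (hypothetical page layout; real pages keep the LSN in their header):

```python
def redo_if_needed(entry, page):
    """Apply a redo entry only if its LSN exceeds the LSN stamped on the page.

    `entry` is (lsn, slot_no, after); `page` is {'lsn': int, 'slots': dict}.
    """
    lsn, slot_no, after = entry
    if lsn > page['lsn']:           # page was persisted before this change
        page['slots'][slot_no] = after
        page['lsn'] = lsn           # stamp the page with the newest applied LSN
        return True
    return False                    # change already reflected on disk: skip
```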
Unwritten Dirty Pages
Pages are never forced out
– After a commit, a dirty page can be written out
– However, another transaction could start reading it (or might already be reading it), which would prevent it from being written out until that transaction completed.
– Using LRU or clock replacement, a dirty page that is continually used might never be written out. (We could prevent new transactions from using long-time dirty pages.)
We have no way of knowing how far back in the log is the first modification made – by a committed transaction – to a page that was not saved, especially if there are no explicit log entries for auxiliary affected pages.
That's why we have to start redoing from the very beginning of the log. We'd like to find a way to avoid that.
Use Fuzzy Checkpointing
At regular intervals, just write a "fuzzy" Checkpoint entry, which includes
– a link to the previous checkpoint entry
– a list of inactive dirty pages along with the transaction that dirtied each one of them
– a list of transactions which have committed since the previous checkpoint
Explain crash recovery based on this checkpoint information
Fuzzy Checkpoint Recovery
• Traverse backwards through the log to the last checkpoint, keeping track of transactions with Commit entries.
• Traverse backwards through the checkpoints, adding to the list of committed transactions as you go.
• Stop traversing when you get to a checkpoint which has no page/transaction pairs that match any in the last checkpoint. That's the most recent point at which we know that all active dirty pages were eventually saved.
• Start redoing from that point forwards.
Undo/Redo Logging
Undo/Redo Logging
Characteristics
Forcing Not Required (dirty pages need not be written back at commit time) because redo-able
Allows Stealing (dirty pages can be written back before transaction commits) because undo-able
Mechanism
On every modification made to any tuple in the database, append an Undo/Redo Log entry to an Undo/Redo Log.
On Transaction Abort: use the Log to undo all modifications made by the aborted transaction, in backwards order.
Crash Recovery: First Redo all changes to ensure durability, then Undo changes made by uncommitted transactions to ensure atomicity (as in ARIES).
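The ARIES-style ordering, repeat history with redo and then roll back the losers, can be sketched as follows (hypothetical encoding; real ARIES adds LSNs, a dirty page table, and compensation log records):

```python
def undo_redo_recovery(log, db):
    """Redo ALL modifications forwards (durability), then undo those of
    uncommitted 'loser' transactions backwards (atomicity).

    Entries are ('COMMIT', txn) or ('MOD', txn, rowid, before, after).
    """
    committed = {e[1] for e in log if e[0] == 'COMMIT'}
    for e in log:                            # redo phase: repeat history
        if e[0] == 'MOD':
            db[e[2]] = e[4]                  # reinstall AFTER state
    for e in reversed(log):                  # undo phase: roll back losers
        if e[0] == 'MOD' and e[1] not in committed:
            db[e[2]] = e[3]                  # restore BEFORE state
    return db
```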
Undo/Redo Log Modification Entries
Executed by transaction T3:
  INSERT INTO Depts VALUES (30, 'Accounting')
  DELETE FROM Depts WHERE deptno = 67
  UPDATE Depts SET dname = 'Gift' WHERE deptno = 23
The resulting entries (before and after states recorded as needed):
  T3 Insert 3049625973                  30 'Accounting'   (after state)
  T3 Delete 3049218695  67 'Marketing'                    (before state)
  T3 Update 3049218696  23 'Sales'      23 'Gift'         (before / after)
These are physiological log entries
Logical Log Entries
Logical Log Entry: based on OPERATIONS, not tuples
Redo Logical Entry: logs the actual SQL statement
Undo Logical Entry: logs a compensating SQL statement
Undo/Redo Logical Entry: logs both
If the SQL statement is
  INSERT INTO Depts VALUES (30, 'Accounting')
the compensating SQL statement is
  DELETE FROM Depts WHERE deptno = 30
Logical Log Entries are often used for backup, replication, and recovery from inconsistency. They can be used cautiously for undo/redo, since SQL statements are not generally idempotent.
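For the INSERT case, the compensating statement can be derived mechanically from the inserted key (a hypothetical helper; undoing a DELETE or UPDATE would also need the before-image of the row):

```python
def compensate_insert(table, key_col, key_val):
    """Undo of an INSERT is a DELETE of the inserted row by its key."""
    return f"DELETE FROM {table} WHERE {key_col} = {key_val}"
```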
Ensuring Longer-Term Durability
Storage Stability
• Volatile storage
  Main memory
• Semi-stable storage
  Ordinary disk memory
• Stable storage
  Storage that survives failure
  – Redundant RAID levels (e.g. Mirroring, Parity)
  – Relative to degree of failure or catastrophe
Approaches to Ensuring Durability
Stable Storage
  Redundant RAID Levels
Non-Local Replication
  Distributed Replicated Data
Archiving
  Regular (Fuzzy) Backup; may be used with a local redundant log
Remote Logging
  Send log records to be maintained on a remote machine
Remote Logging Issues
Frequency of Sending Changes
  – Continuously
  – At Regular Intervals
  – At Commit
Format of Changes
  – Operations (logical redo log entries)
  – Values or Deltas (physiological redo log entries)
Commit
  – Just Communicate Commit (1-safe)
  – Jointly Commit (2-safe)
Both are special cases of data replication.
Recovery with Remote Backup
1. Use the backup to restore the primary disk (or a hot spare)
2. The backup machine takes over as the primary machine (at least until the primary disk is restored)
Handling Consistency Failure
Enforcing Consistency
How do database applications enforce consistency?
• Constant Monitoring
  – Using constraints, assertions, triggers or application code
  – Prevent/abort operations that lead to inconsistent states, try to correct the problem, or immediately notify the DBA
• Interval-Based
  – At (regular) intervals, check that the system is in a consistent state. If not, correct it, or notify the DBA.
• Ignore
  – Hope that nothing bad happens. If it does, scramble …
Result of Consistency Failures (due to User Error or Sabotage)
[Diagram: tables tbl1–tbl4 over time, from T1 (erroneous change committed) to T2 (erroneous change discovered)]
Erroneous changes which are discovered later can propagate errors widely. It can be quite a while before an erroneous change is discovered.
Why is Consistency Failure Recovery Hard?
• Need to roll back state from T2 to T1, undoing all changes
  – Use the log to roll the system back to just before the error
  – Must compensate for external side-effects, e.g. send report, launch missile
• Need to roll forward and redo committed transactions, other than the erroneous changes
  – Can't use physiological log entries, because old/new values may no longer match restored values (from tbl2 and then propagated elsewhere)
  – Could use logical log entries, which log the operations done (with parameters and perhaps with system values, e.g. time)
Operation Levels
Using an operation log to roll forwards implies that the DB operations executed would be the same, even if the state were different.
  UPDATE …
  UPDATE …
  COMMIT
  UPDATE …
  COMMIT
An application or a user operation contains multiple DB operations (within multiple transactions) and uses the current state to decide which operations to execute.
A replayed application might be in a completely different state (since T1 was not executed) and execute a completely different sequence of DB operations.
Rolling forward from T1 really requires a log of the higher-level user operations or applications executed (and even those might differ if the state were different).