Advanced Database Topics
Copyright © Ellis Cohen 2002-2005
Transactions, Failure & Recovery
These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them, please see http://www.openlineconsult.com/db
© Ellis Cohen 2002-2005 2
Topics
Transactions & Commit
Abort & Rollback
Nested Transactions & Savepoints
Transactions, Failure & Recovery
Server Page Caching
Ensuring Atomicity & Durability with Shadow Paging
Ensuring Atomicity & Durability with Undo Logging
Redo Logging
Undo/Redo Logging
Ensuring Longer-Term Durability
Handling Consistency Failure
© Ellis Cohen 2002-2005 3
ACID Properties of Transactions
Atomicity *
All of the updates of a transaction are done or none are done

Consistency *
Each transaction leaves the database in a consistent state (preferably via consistency predicates)

Isolation
Each transaction, when executed concurrently with other transactions, should have the same effect as if executed by itself

Durability *
Once a transaction has successfully committed, its changes to the database should be permanent
© Ellis Cohen 2002-2005 4
Transactions and Commit
© Ellis Cohen 2002-2005 5
Transaction
Logical unit of work that must be either entirely carried out or aborted

Example: a sequence of SQL commands, grouped together, e.g. in an SQL*Plus script

If only part of the transaction were carried out, the database could be left in an inconsistent state
© Ellis Cohen 2002-2005 6
Example SQL*Plus Script

This script moves money from one account to another. Parameters:
&srcacct - The account to move money from
&dstacct - The account to move money to
&amt - The amount of money to be moved

UPDATE checking
SET balance = balance - &amt
WHERE acctid = &srcacct;

UPDATE checking
SET balance = balance + &amt
WHERE acctid = &dstacct;
Suppose a crash occurs right here, between the two UPDATEs!
© Ellis Cohen 2002-2005 7
Transactions & COMMIT
Transaction starts → Modify → Modify → Modify → Modify → Transaction commits (modifications persisted to DB)
• Each modification is visible to the SQL commands executed after it in the same transaction
• But the modification is not actually persisted to the database until the transaction commits
• So, if a crash occurs in the middle of a transaction, after some modifications have been done, the DB acts as if the modifications never happened!
All SQL commands are performed within a transaction
The transaction ensures these are done atomically
© Ellis Cohen 2002-2005 8
Uncommitted & Committed Transactions
start
transaction
modify modify modify COMMIT
Modifications persisted to DB
start
transaction
modify modify
Modifications not persisted
© Ellis Cohen 2002-2005 9
SQL*Plus Commit Example
SQL> set autocommit off
SQL> UPDATE checking SET balance = balance - &amt WHERE acctid = &srcacct;
SQL> UPDATE checking SET balance = balance + &amt WHERE acctid = &dstacct;
SQL> COMMIT;
Transaction started automatically at first update if not already in progress
© Ellis Cohen 2002-2005 10
Starting Transactions

The COMMIT command ends a transaction. How do transactions start?
• Most databases start a new transaction automatically
  – on the first access to the DB within a session, and
  – on the first access following a COMMIT
• Some databases have a START TRANSACTION command (to support complex nested transactions)
© Ellis Cohen 2002-2005 11
Transactions & DB Requests
Middle Tier → Data Tier

Cross-Request Transactions: the middle tier issues individual requests:
UPDATE …
UPDATE …
COMMIT
UPDATE …
COMMIT

Within-Request Transactions: a single request executes a stored procedure:
PROCEDURE StoredProc IS
BEGIN
  UPDATE …
  UPDATE …
  COMMIT;
  UPDATE …
  COMMIT;
END;
© Ellis Cohen 2002-2005 12
Automatic Commit

Updates may persist even when COMMIT is not explicitly called
• Most databases support — either on the server or just through the client-side API — an autocommit mode which automatically does a commit after execution of each request made to the database. This is often the default.
• Most databases automatically COMMIT when a client cleanly closes their connection to the database.
• Most databases (including Oracle) do not allow DDL statements (e.g. CREATE TABLE) to be part of a larger transaction, and automatically do a commit before and after executing a DDL statement.
© Ellis Cohen 2002-2005 13
Java Commit Example

Connection conn = …;
conn.setAutoCommit( false );
movemoney( conn, 30479, 61925, 2000 );
…

//---------------------------------------
static void movemoney( Connection conn,
                       int srcacct, int dstacct, float amt )
    throws SQLException
{
  Statement stmt = conn.createStatement();
  String sqlstr = "update checking" +
      " set balance = balance - " + amt +
      " where acctid = " + srcacct;
  stmt.executeUpdate( sqlstr );

  sqlstr = "update checking" +
      " set balance = balance + " + amt +
      " where acctid = " + dstacct;
  stmt.executeUpdate( sqlstr );

  conn.commit();
}
© Ellis Cohen 2002-2005 14
Abort & Rollback
© Ellis Cohen 2002-2005 15
Abort

Aborting a transaction undoes the effects of the transaction -- it is as if the transaction never started
Transactions are aborted in 3 ways:
1. The system crashes: All active transactions are aborted
2. An uncorrectable error occurs while executing the transaction
3. The transaction explicitly aborts (this is called a ROLLBACK)
A transaction completes when it either commits or aborts
© Ellis Cohen 2002-2005 16
Rollback
Rollback aborts a transaction
• SQL*Plus: ROLLBACK
• Java: conn.rollback()
© Ellis Cohen 2002-2005 17
Commit vs Rollback
start transaction → modify → modify → modify → ROLLBACK

start transaction → modify → modify → modify → COMMIT
© Ellis Cohen 2002-2005 18
Explicit Rollback
SQL> COMMIT;
SQL> UPDATE Emps SET job = 'COOK';
SQL> UPDATE Emps SET sal = sal + 200;
SQL> ROLLBACK;
After the ROLLBACK, the state is exactly as it was following the COMMIT.
It is as if the two UPDATEs never happened!
With AUTOCOMMIT OFF
© Ellis Cohen 2002-2005 19
Explicit Rollback in Java
{
  Connection conn = …;
  conn.setAutoCommit( false );
  Statement stmt = conn.createStatement();
  String sqlstr = …;
  stmt.executeUpdate( sqlstr );
  …
  if (…)
    conn.commit();
  else
    conn.rollback();
}
© Ellis Cohen 2002-2005 20
Rollback Past Commit
Rollback rolls the state back to the beginning of the transaction.
Why not allow some form of rollback that goes back to some earlier point?
© Ellis Cohen 2002-2005 21
Commit Semantics & Compensating Transactions
Because commits are durable!

When a transaction commits, the user or application is notified that the commit succeeded and can't be undone, and may take other actions outside the database based on that
– Display output to a user
– Send a message to another process
– Launch nuclear missile
Some systems allow a compensating transaction to be associated with a transaction when it commits. The compensating transaction can be executed to "undo" the effects of the associated committed transaction (possibly within some time limit)
– Output a retraction
– Send a compensating message
– Destroy the nuclear missile
© Ellis Cohen 2002-2005 22
Nested Transactions & Savepoints
© Ellis Cohen 2002-2005 23
Nested Transactions

Transactions can nest

start transaction   modify   modify   modify

Only the outermost transaction can commit and persist data
A nested transaction can control the degree of rollback
Nested transactions in SQL are implemented using SAVEPOINTs
© Ellis Cohen 2002-2005 24
Savepoints

SAVEPOINT <name>
Explicitly starts a new named nested transaction

ROLLBACK TO SAVEPOINT <name>
Rolls back to the state at the start of the named nested transaction

RELEASE SAVEPOINT <name>
Releases the savepoint and its associated transaction [not supported by Oracle]
(Setting a savepoint with the same name as an existing savepoint releases the existing one)

COMMIT
Releases all savepoints within the outermost transaction & commits
start transaction → set savepoint a → set savepoint b → set savepoint c → rollback to b → commit
© Ellis Cohen 2002-2005 25
Using Savepoints
Set a savepoint to try something that is quick but doesn't always work, e.g. access to some remote database that is not always available

On failure, back up to the savepoint (undoing any changes to the DB you have made) and try a slower but more reliable technique
© Ellis Cohen 2002-2005 26
Alternative Path in PL/SQL
BEGIN
  DoUsefulSetup( … );
  BEGIN
    SAVEPOINT RetryPoint;
    DoQuickUnreliableUpdates( … );
  EXCEPTION
    WHEN OTHERS THEN
      ROLLBACK TO RetryPoint;
      DoSlowReliableUpdates( … );
  END;
END;
© Ellis Cohen 2002-2005 27
Alternative Path in Java
Connection conn = …;
Statement stmt = conn.createStatement();

DoUsefulSetup( … );
Savepoint spRetry = null;
try {
  spRetry = conn.setSavepoint( "RetryPoint" );
  DoQuickUnreliableUpdates( … );
}
catch( Exception e ) {
  conn.rollback( spRetry );
  DoSlowReliableUpdates( … );
}
© Ellis Cohen 2002-2005 28
Statement-Level Transactions

Every SQL statement executes within a nested transaction

A statement can fail
– E.g. due to violation of an integrity constraint, e.g. check( enddate > startdate )

Result of statement failure:
– The statement is rolled back. If an update statement would update 100 records, but updating the 11th record causes failure of an integrity constraint, the 10 previously updated records are rolled back to their old state
– In embedded SQL, it then raises an exception, which can eventually cause the outermost transaction to abort if not caught

Result of statement success:
– The statement-level transaction is released
© Ellis Cohen 2002-2005 29
Autonomous Nested Transactions
When a transaction fails, all modifications made during that transaction are undone.

That may not be what you want!
– Suppose you want to add an audit record (to the EmpsAudit table) every time someone tries to update the Emps table.
– You want to add that audit record even if the operation which updates Emps is ultimately rolled back.

Solution: Add the audit record inside an autonomous nested transaction.
– Autonomous transactions can durably commit inside of a parent transaction
– If the parent transaction is aborted after the nested autonomous transaction commits, modifications made inside the autonomous transaction will NOT be undone.
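In Oracle this is declared with PRAGMA AUTONOMOUS_TRANSACTION; the behavior itself can be sketched with a toy in-memory model (all class and method names here are illustrative, not a real database API). The parent transaction buffers its updates, while audit records commit autonomously and therefore survive a parent rollback:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model: a parent transaction buffers updates to Emps,
// while audit records commit autonomously (immediately and durably).
public class AutonomousAuditDemo {
    static Map<Integer, Integer> emps = new HashMap<>();      // committed Emps data
    static Map<Integer, Integer> pending = new HashMap<>();   // parent txn's buffered updates
    static List<String> auditLog = new ArrayList<>();         // autonomously committed audit records

    static void updateSal(int empno, int sal) {
        // The audit record commits inside its own (autonomous) transaction,
        // so it survives even if the parent transaction later aborts.
        auditLog.add("attempted update of emp " + empno);
        pending.put(empno, sal);                              // buffered in the parent txn
    }

    static void commitParent()   { emps.putAll(pending); pending.clear(); }
    static void rollbackParent() { pending.clear(); }         // audit entries are NOT undone

    public static void main(String[] args) {
        emps.put(7369, 800);
        updateSal(7369, 900);
        rollbackParent();                                     // parent txn aborts
        System.out.println(emps.get(7369));                   // 800: the update was undone
        System.out.println(auditLog.size());                  // 1: the audit record survives
    }
}
```

From JDBC, the same effect is usually obtained by doing the audit insert on a second Connection with its own commit.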
© Ellis Cohen 2002-2005 30
Transactions, Failures
& Recovery
© Ellis Cohen 2002-2005 31
Types of Failures

Transaction Failure
Transaction aborts for some reason (e.g. an uncaught exception)
Potentially violates: Atomicity

System Failure
Processor / system crash; main memory lost, disk ok
Potentially violates: Atomicity

Media Failure & Catastrophes
Disk or controller error / head crash
Potentially violates: Durability

User/Program Errors & Sabotage
Loss or corruption of data
Potentially violates: Consistency
© Ellis Cohen 2002-2005 32
Failures & Recovery
Atomicity-Related Failures
Return all data changed by a transaction to its state at the beginning of the transaction

Durability-Related Failures
Depends on keeping a backup of the data; recover the state of the data from the backup

Consistency-Related Failures
Recover affected data (as for a durability failure), and deal with cascading effects of committed transactions that modified or depended upon incorrect data. Very difficult to deal with; we won't deal with these in general
© Ellis Cohen 2002-2005 33
Shadow Copying

A primitive recovery mechanism to ensure Atomicity & (limited) Durability. Assumes 1 transaction at a time.

Initially there's the Main DB, and db_ptr (also on disk) holds its disk address.

Then, when a transaction starts:
A copy of the Main DB is made: the Current DB Copy
The transaction is executed using the Current DB Copy
In effect, the Main DB becomes a "shadow copy"

[diagram: db_ptr points at the Main DB; the transaction works on the Current DB Copy]

How is a ROLLBACK handled?
© Ellis Cohen 2002-2005 34
Failure and Shadow Copying

On Commit:
1) Force cached pages out to the Current DB Copy
2) Change db_ptr to point to the Current DB Copy
3) Discard the old Main DB

Crash before (2): As if the transaction never started
Crash after (2): Transaction state is completely updated

A single atomic operation (changing the on-disk db_ptr) moves the system from one consistent state to another
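The db_ptr mechanism above can be sketched as a toy in-memory model (illustrative names, not a real DBMS API): the transaction works against a copy of the whole DB, and commit is the single pointer assignment:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of shadow copying: the transaction runs against a copy of the DB,
// and commit is the single atomic act of swinging db_ptr to that copy.
public class ShadowCopy {
    Map<String, Integer> mainDb = new HashMap<>();  // the DB that db_ptr points at
    Map<String, Integer> current;                   // the transaction's working copy

    void begin()                { current = new HashMap<>(mainDb); }  // copy the whole DB
    void write(String k, int v) { current.put(k, v); }                // updates hit the copy
    void commit()               { mainDb = current; }                 // atomic db_ptr swap
    void rollback()             { current = null; }                   // just discard the copy

    public static void main(String[] args) {
        ShadowCopy db = new ShadowCopy();
        db.begin(); db.write("x", 1); db.commit();
        db.begin(); db.write("x", 2); db.rollback();    // or: a crash before the swap
        System.out.println(db.mainDb.get("x"));         // 1: the aborted write never happened
    }
}
```

A crash anywhere before the pointer swap leaves mainDb untouched, which is exactly the atomicity argument on the slide.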
© Ellis Cohen 2002-2005 35
Shadow Copy in Practice

– It takes too much space to make a copy of the entire DB
– It is too slow to make an entire copy of the DB for each transaction
Perhaps we will find the ideas of shadow copying useful later on …
© Ellis Cohen 2002-2005 36
Server Page Caching
© Ellis Cohen 2002-2005 37
Disk Structure

A disk block is the same size as a memory page
© Ellis Cohen 2002-2005 38
Disk Block Organization

Divide the database into disk blocks (which correspond to memory pages). A block is 1 or more contiguous disk sectors.

Generally, either:
A block holds 1 or more complete rows (i.e. tuples), usually from the same table
  – No row straddles a block
  – The block has contiguous rows, or a row directory which keeps track of the offsets of the rows in the block

Or (for long rows):
A row spans 1 or more blocks (internal chaining)
  – No block holds pieces of 2 or more rows

Really large fields (LOBs) are stored separately.
© Ellis Cohen 2002-2005 39
Addressing Tuples

Every tuple in a database is addressed by a ROWID, which indicates where it may be found.

For example, in the ROWID 3049625973, the leading digits identify a specific database block, and the trailing digits identify a slot in the block's row directory.
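As a purely hypothetical illustration of this split (the slide does not specify the digit boundary; assume here that the low 3 decimal digits are the slot number):

```java
// Illustrative ROWID arithmetic: assume (hypothetically) that the low 3
// decimal digits are the slot in the block's row directory and the
// remaining high-order digits are the block #.
public class RowId {
    static long block(long rowid) { return rowid / 1000; }
    static long slot(long rowid)  { return rowid % 1000; }

    public static void main(String[] args) {
        long rowid = 3049625973L;                 // the example ROWID from the slide
        System.out.println(block(rowid));         // 3049625
        System.out.println(slot(rowid));          // 973
    }
}
```

Real systems (e.g. Oracle) pack file, block, and slot numbers into a binary ROWID rather than decimal digits, but the block#/slot# decomposition is the same idea.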
© Ellis Cohen 2002-2005 40
Migration & Forward Chaining
An update may increase the size of a tuple so much that it can no longer fit in the same block, so we have to move it to another block.
But we want the tuple to still be identified by its ROWID, which refers to the old block
The data for the row in the old block is replaced by a forwarding id -- the ROWID of the row in the new block
© Ellis Cohen 2002-2005 41
Block Access & Update

To read any row in a block, the block is read into core memory (if not already there) [may also prefetch adjacent blocks]

To insert/delete/update a row in a block:
1) the block is read into core memory (if needed)
2) the page is modified
3) the page is eventually written back to disk
© Ellis Cohen 2002-2005 42
Blocks & Pages
Frequently, the DB block size (the smallest unit of data transfer between the DB disk memory and core memory) is chosen to be the same size as a virtual memory page
We will use the terms page and block interchangeably.
© Ellis Cohen 2002-2005 43
Server Page Caching

After a read or update, the page may be cached (i.e. retained) in the DB server's memory. If the page is still in memory the next time it is needed, there is no need to read it from disk.

When the cache is full, room is made for a new page by replacing some other page. Most metadata tables are always in the cache.
© Ellis Cohen 2002-2005 44
Memory & Disk Specs

130 GB Disk
512 bytes/sector, 256 sectors/track
65K tracks/head, 16 heads/disk (8 platters)
1M tracks/disk, 256M sectors/disk
10 ms max seek time, 1 ms track-to-track
4 ms avg latency
Sustainable data transfer rate: 65 Mbps (4K bits per sector ≈ 60 µs/sector)

Average time to check 2K bytes from disk:
seek + latency + transfer + core check times
= 0-10 ms + 4 ms + .25 ms + 1 µs ≈ 4.25-14.25 ms

Disk/Core ratio = ~10 ms / 1 µs = 10,000:1
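The arithmetic above can be reproduced directly (figures taken from the slide; reading 2K bytes means transferring four 512-byte sectors):

```java
// Reproduces the slide's arithmetic for reading and checking 2K bytes
// (four 512-byte sectors): seek + rotational latency + transfer + in-core check.
public class DiskTiming {
    static double accessMs(double seekMs) {
        double latencyMs  = 4.0;          // average rotational latency
        double transferMs = 4 * 0.060;    // ~60 microseconds per 512B sector at 65 Mbps
        double checkMs    = 0.001;        // ~1 microsecond to check the bytes in core
        return seekMs + latencyMs + transferMs + checkMs;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", accessMs(0));   // best case (no seek), ~4.24 ms
        System.out.printf("%.2f%n", accessMs(10));  // worst case (max seek), ~14.24 ms
    }
}
```

The disk terms dominate by roughly four orders of magnitude over the in-core check, which is the 10,000:1 ratio the slide quotes.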
© Ellis Cohen 2002-2005 45
Page Caching & Virtual Memory
• OS allocates DB a fixed (perhaps changeable) # of pages of disk and memory which the DB managescan unnecessarily constrain memory management
• Persistent DB state stored in ordinary files, and the page cache is in virtual memorycauses duplication of effort
if VM page is backed to disk
• OS and DB storage management are integratedOS (e.g. Mach) has a file mapping API which can
be used by the database
© Ellis Cohen 2002-2005 46
Dirty and Active Pages

Dirty Page
A page that has been modified, and has not been written back to disk (since it was modified).
A clean page either
• has not been modified since it was read, or
• has not been modified since it was last written back to disk

Active Page
A page that has been accessed by a transaction that has not yet completed (i.e. committed or aborted)
© Ellis Cohen 2002-2005 47
Page States

1. Clean Inactive: same contents as on disk; every transaction that used it has completed
2. Clean Active: same contents as on disk; some transaction that used it is active
3. Dirty Inactive: page has been modified, but not written back to disk; every transaction that used it has completed
4. Dirty Active: page has been modified, but not written back to disk; some transaction that used it is active

Consider the page that has been least recently used. Which of these states could it be in? (Consider the states in the order indicated)
© Ellis Cohen 2002-2005 48
LRU Page States

Clean Inactive: the transactions which used this page finished a long time ago; any modifications were written out
Clean Active: a transaction using this page started a long time ago, but has not yet finished; any modifications were written out
Dirty Inactive: the transactions which used this page finished a long time ago; modifications not written out
Dirty Active: a transaction using this page started a long time ago, but has not yet finished; modifications not written out

Why are Dirty Inactive pages a problem for Durability?
© Ellis Cohen 2002-2005 49
Forcing

What happens to a dirty page when the transaction which modified it commits?

FORCE: It is written back to disk.
Necessary for durability unless some other mechanism is available.
Effect: no dirty inactive pages

NO-FORCE: The page is not written back on commit.
Avoids overhead at commit time.
If the system crashes after a transaction commits, and the page is not written back, how is durability ensured?
© Ellis Cohen 2002-2005 50
The Replacement Problem
What if a page needs to be loaded into memory, but cache memory is full?
We need to replace some page with the new page.
Which page should we replace?
© Ellis Cohen 2002-2005 51
Replacement Algorithms
LRU: Choose the page which has been used least recently. Based on the (often true) notion that pages used most recently will most likely be used again in the near future.
Clock Algorithm: Approximates LRU, but is more efficient. Cycle through the pages in order. Choose the next page in order that was not used since that page was considered in the previous cycle.
Also, first replace pages read in during full table scans (in fact, if the table is large, throw out earlier pages read when scanning later pages)
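The clock algorithm can be sketched as follows (a minimal illustrative cache, not a real DB buffer manager): each frame carries a "used" bit set on access, and the hand sweeps forward, clearing set bits and evicting the first frame whose bit is already clear:

```java
// Minimal clock (second-chance) replacement sketch: each frame has a
// "used" bit set on access; the hand skips over (and clears) used frames
// and evicts the first frame found with the bit clear.
public class ClockCache {
    int[] page;        // page # held in each frame (-1 = empty)
    boolean[] used;    // reference bit per frame
    int hand = 0;

    ClockCache(int frames) {
        page = new int[frames];
        used = new boolean[frames];
        java.util.Arrays.fill(page, -1);
    }

    // Access page p; returns the evicted page #, or -1 if nothing was evicted.
    int access(int p) {
        for (int i = 0; i < page.length; i++)
            if (page[i] == p) { used[i] = true; return -1; }   // cache hit
        while (used[hand]) {               // give recently used frames a second chance
            used[hand] = false;
            hand = (hand + 1) % page.length;
        }
        int victim = page[hand];           // reference bit clear: evict this frame
        page[hand] = p;
        used[hand] = true;
        hand = (hand + 1) % page.length;
        return victim;
    }

    public static void main(String[] args) {
        ClockCache c = new ClockCache(2);
        c.access(1); c.access(2); c.access(1);
        System.out.println(c.access(3));   // 1: both bits were set; the sweep
    }                                      // clears them and evicts frame 0
}
```

Note this only approximates LRU, and a real buffer manager would add the table-scan bias the slide mentions (evicting scan pages first).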
© Ellis Cohen 2002-2005 52
Cost of Writing Dirty Pages
Suppose the page chosen for replacement is dirty
– we need to first write the dirty page back to disk (which impacts performance)
– before a newly read page can replace it in the cache
Why is this so?
Is there a way to improve the performance?
© Ellis Cohen 2002-2005 53
Pre-Write LRU Dirty Pages
Use a separate Cleaner Process to find dirty pages which have not been used recently and write them back to disk
– The disk scheduler doesn’t need to write them back immediately, but when it is most efficient to do so
The dirty page is not immediately replaced, it just becomes clean (instead of dirty).
– This allows the replacement algorithm to always find a clean page (not used recently) to replace, without needing to wait for it to be written back
But what should the Replacement Algorithm or the Cleaner Process do when it considers an active dirty page?
© Ellis Cohen 2002-2005 54
Stealing

STEAL: May choose a dirty active page to clean/replace.
– What is the danger if you do NOT write out the page? (What if the transaction that modified the page commits?)
– What is the danger if you DO write out the page? (What if the transaction that modified the page aborts?)

NO-STEAL: Skip over dirty active pages
– If there are no clean pages, forces some transaction to abort.
– If there are few clean pages, transactions may thrash (continually reread pages which have recently been replaced)

You can always choose a dirty inactive page; just write it out first
© Ellis Cohen 2002-2005 55
Effect of Processor Crash

Active transaction, no modified pages on disk: as if the transaction never happened
Active transaction, some or all modified pages on disk (possible with STEALing): Atomicity Failure
Committed transaction, no modified pages on disk (possible with NO-FORCE): Durability Failure
Committed transaction, some modified pages on disk: Atomicity and Durability Failure
Committed transaction, all modified pages on disk: Transaction saved successfully!

Problem even if you FORCE & don't STEAL: a crash in the middle of commit, while forcing pages to disk, may leave only some modified pages on disk

Is there a recovery mechanism based on shadow copying which can solve this problem?
© Ellis Cohen 2002-2005 56
Using Shadow Copies

At commit time, we first use shadow copying for all the dirty pages. We then change the database so it points to those pages instead of the original pages (how do we do that atomically?)

Assuming we can make that work, can we allow page stealing?
© Ellis Cohen 2002-2005 57
Ensuring Atomicity and Durability
with Shadow Paging
© Ellis Cohen 2002-2005 58
Page Tables

[diagram: the main page table ptr points at a page table whose entries map logical blocks A-G to physical blocks scattered across the disk]

A database can be organized using a page table. The table maps the LOGICAL block # (which is used in ROWIDs) to the PHYSICAL block # (where the block actually lives on the disk).
© Ellis Cohen 2002-2005 59
Commit-Time Shadow Paging

[diagram: the main page table ptr points at the original page table for pages A-G; transaction T's commit-time page table copy points at new copies B', D', F' of the pages T dirtied, and at the original copies of the rest]

At COMMIT time of transaction T:
1. A commit-time copy of the page table is made
2. T's dirty pages (B, D, F) are forced to disk (but DO NOT overwrite the originals), and the commit-time page table copy is changed to point to the new modified copies
3. The main page table ptr is switched to point to the commit-time page table. THIS IS WHEN THE COMMIT HAPPENS!
4. The old copies of B, D, F and the old page table are freed
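Steps 1-3 can be sketched with a toy model (illustrative names; the "disk" here is an append-only list of physical pages, so old copies are never overwritten):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of commit-time shadow paging: "disk" is an append-only pool of
// physical pages; the page table maps logical page # -> physical page #.
// Commit installs a fresh page table whose dirty entries point at new copies.
public class ShadowPaging {
    List<String> disk = new ArrayList<>();          // physical pages (never overwritten)
    List<Integer> pageTable = new ArrayList<>();    // what "main page table ptr" points at
    Map<Integer, String> dirty = new HashMap<>();   // the transaction's cached dirty pages

    int alloc(String data) { disk.add(data); return disk.size() - 1; }

    void create(String data) { pageTable.add(alloc(data)); }

    String read(int logical) {
        String d = dirty.get(logical);              // prefer the cached dirty version
        return d != null ? d : disk.get(pageTable.get(logical));
    }

    void write(int logical, String data) { dirty.put(logical, data); }

    void commit() {                                 // steps 1-3 from the slide
        List<Integer> copy = new ArrayList<>(pageTable);       // 1. copy the page table
        for (Map.Entry<Integer, String> e : dirty.entrySet())
            copy.set(e.getKey(), alloc(e.getValue()));         // 2. force dirty pages to NEW blocks
        pageTable = copy;                                      // 3. atomic pointer switch = COMMIT
        dirty.clear();                              // (step 4, freeing old copies, is omitted)
    }
}
```

A crash before the pointer switch in commit() leaves pageTable pointing at the old physical pages, which are still intact on "disk".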
© Ellis Cohen 2002-2005 60
Shadow Paging Issues
1. Can we support stealing with shadow paging?
2. The page table is too big to copy on every transaction. How can we improve performance?
3. When a tuple is updated, what pages are changed?
© Ellis Cohen 2002-2005 61
Stolen Page Map

If T is using a dirty page that needs to be replaced or cleaned, write it to disk (without overwriting the original), and note it in T's private stolen page map.

If T needs to access that page again, look for it in the stolen page map before looking in the main page map.

When T commits, use T's stolen page map to help build T's commit-time page table copy.

Two transactions are unable to modify (different rows on) the same page. This requires page-level locking (discussed in the next lecture)
© Ellis Cohen 2002-2005 62
Multi-Level Page Tables

[diagram: a two-level page table; the main page table ptr points at a root table of page-table pages PT0 … PT99, each mapping a range of pages (e.g. PT4 maps P400 … P499); transaction T's commit-time page table copies only the root and the page-table pages whose entries changed, pointing at new copies P401' and P499']
© Ellis Cohen 2002-2005 63
Auxiliary Affected Pages

What pages are affected when a tuple is updated?
• The page containing the original tuple
• If the update makes the tuple so large that there is no room for it in the old page, it is moved to a new page (a forwarded page)
  – If so, the corresponding page of the page table is affected as well.
• If any of the updated fields are indexed, then the corresponding index entry for the tuple will have to be moved (e.g. deleted from its old position in the B+ tree, and inserted at the new position).
  – Both the page of the old entry and the page of the new entry will be affected
  – Adding a new entry to an index page may cause that page to be split, which will then affect the corresponding page of the page table
  – Removing an entry from an index page may cause that page to be combined with an adjacent page, which also affects the corresponding page of the page table.
• Pages containing the portions of the page table hierarchy used to reference those pages!
© Ellis Cohen 2002-2005 64
Shadow Paging Characteristics

Ensures Atomicity & Durability
– Requires Forcing (dirty pages must be written back at commit time) to ensure durability
– Allows Stealing (dirty pages can be written back before the transaction commits, though not overwritten)
– Assumes that if one transaction modifies a page, no other transaction can read or modify it (i.e. page-level locking)

Mechanism
– Uses a page table (on disk) which keeps a list of all pages in the DB
– Keeps a shadow copy of each page written
– Does shadow copying of the page table
– Main result: at commit time, moves the system instantly from one consistent state to another
© Ellis Cohen 2002-2005 65
Pages Changed by Multiple Transactions

What if the same tuple is changed by two concurrent transactions?
– Assume this doesn't happen.
– In the next lecture, we will talk about concurrency control mechanisms which prevent this

What if two different tuples on the same page are changed by concurrent transactions? This is a real problem with shadow paging. Either
– Allow one transaction at a time to use a page (using page locks), or
– Don't actually make the changes to the page until just before commit (using intention lists)
© Ellis Cohen 2002-2005 66
Problems with Shadow Paging

• Commit Bottleneck
Only one transaction can commit at a time, if we want the page table to be correct (how might this be fixed?)

• Limits on Concurrency
Can't have different transactions modify independent parts of pages (could be addressed by deferred modification and intention lists)

• Cost of Shadowing
Overhead of allocating and freeing shadow copies

• Data Fragmentation
For read efficiency, you want logically adjacent data to be kept physically adjacent (e.g. using extents). For write efficiency, this implies in-place updating, not shadow copying (could possibly be addressed by ongoing defragmentation [+ sorting] in the background)
© Ellis Cohen 2002-2005 67
Overview of Logging (the Alternative to Shadow Paging)

Main features
– Uses a log to support recovery (the log itself may span multiple pages)
– No shadowing; uses in-place updating
– Can track modifications at the row (rather than the page) level
– No page tables, but still depends on server page caching

Three approaches
Undo-Only Logging (Backward Recovery): allows stealing, but still requires force on commit
Redo-Only Logging (Forward Recovery): avoids force on commit, but no stealing
Combined (Undo/Redo) Logging: avoids force on commit, and allows stealing
© Ellis Cohen 2002-2005 68
Ensuring Atomicity and Durability
with Undo Logging
© Ellis Cohen 2002-2005 69
Backward Recovery with Undo Logs

Mechanism
On every modification made to any tuple in the database, append an Undo Log entry to an Undo Log.
On Transaction Abort: use the Undo Log to undo all modifications made by the aborted transaction, in backwards order.
Crash Recovery: abort all uncommitted transactions

Characteristics
Requires Forcing (dirty pages must be written back at commit time) -- there is no way to redo on crash
Allows Stealing (dirty pages can be written back before the transaction commits), because the changes are undo-able

All the advantages of shadow paging, with none of the disadvantages
© Ellis Cohen 2002-2005 70
Describing Modifications

Suppose the Emps table contains

(ROWID) EMPNO ENAME  JOB       MGR  HIREDATE   SAL  COMM DEPTNO
------- ----- ------ --------- ---- ---------- ---- ---- ------
3479000 7369  SMITH  CLERK     7902 17-DEC-80   800        20
3479001 7499  ALLEN  SALESMAN  7698 20-FEB-81  1600  300   30
3479002 7521  WARD   SALESMAN  7698 22-FEB-81  1250  500   30
3479003 7566  JONES  DEPTMGR   7839 02-APR-81  2975        20
3479004 7654  MARTIN SALESMAN  7698 28-SEP-81  1250 1400   30
3479005 7698  BLAKE  DEPTMGR   7839 01-MAY-81  2850        30
3479006 7782  CLARK  DEPTMGR   7839 09-JUN-81  2450        10
3479007 7788  SCOTT  ANALYST   7566 19-APR-87  3000        20
3479008 7839  KING   PRESIDENT      17-NOV-81  5000        10
3479009 7844  TURNER SALESMAN  7698 08-SEP-81  1500    0   30
3479010 7876  ADAMS  CLERK     7788 23-MAY-87  1100        20
3479011 7900  JAMES  CLERK     7698 03-DEC-81   950        30
3479012 7902  FORD   ANALYST   7566 03-DEC-81  3000        20
3479013 7934  MILLER CLERK     7782 23-JAN-82  1300        10

Transaction T3 executes
UPDATE Emps SET sal = sal + 100 WHERE deptno = 10

What changes were made to which tuples?
© Ellis Cohen 2002-2005 71
Tuple Modifications

The following changes were made:
Tuple 3479006: sal 2450 → 2550
Tuple 3479008: sal 5000 → 5100
Tuple 3479013: sal 1300 → 1400

Suppose:
The operation was executed,
The pages containing these tuples were modified (in the server page cache),
Those pages were written out (due to STEALing),
Then Transaction T3 was ABORTed

What is the minimum information we would need to know about the affected tuples to undo the effects of the operation?
© Ellis Cohen 2002-2005 72
Tuple Before State

We need to know that:
Tuple 3479006: sal was 2450
Tuple 3479008: sal was 5000
Tuple 3479013: sal was 1300

For each tuple that was updated, we need to know what the value was for each modified field before the operation.

This is the information that is written into the undo log. Many systems write the contents of the entire tuple before the operation -- this is called the before image.

What do we need to know to undo a DELETE or INSERT?
© Ellis Cohen 2002-2005 73
Undoing INSERT & DELETE

To undo an INSERT
We just need to record the ROWID of the tuple, so we can delete it

To undo a DELETE
We need to record the ROWID of the tuple plus the entire contents of the tuple, so we can re-insert it!

What do we need to record when we do other operations, e.g. CREATE TABLE or DROP TABLE?
© Ellis Cohen 2002-2005 74
Logging System Operations

In an RDB, all system state (e.g. which tables are created, what their fields are, etc.) is stored in metadata tables.

Any system operation (e.g. CREATE TABLE, DROP TABLE) is implemented by modifying the metadata tables.

We just log those modifications, just as we log modifications to tuples in user tables!
© Ellis Cohen 2002-2005 75
Separate vs Integrated Logging

Separate Logging
Some systems use a separate undo log for every transaction or for every thread. This may affect performance if different logs are on different tracks of the same disk. BUT: it is very easy to abort a single transaction. Just walk backwards through that transaction's log; every log entry is for the transaction being aborted.

Integrated Logging
In an integrated log, all log entries are appended to a single log, which interleaves entries from multiple transactions. Each entry must identify the associated transaction. To undo a transaction, it is necessary to locate the log entries for that transaction. Typically, each entry points to the previous entry for the same transaction, and there is an entry which identifies the START of the transaction.
© Ellis Cohen 2002-2005 76
Modification Entries for an Integrated Undo Log

Executed by transaction T3:
INSERT INTO Depts VALUES( 30, 'Accounting' )
DELETE Depts WHERE deptno = 67
UPDATE Depts SET dname = 'Gift' WHERE deptno = 23

Resulting log entries (Transaction, Operation, ROWID, Before State):
T3 Insert 3049625973
T3 Delete 3049218695   (before state: 67 'Marketing')
T3 Update 3049218696   (before state: 23 'Sales')

An UNDO log efficiently stores the information needed to restore modified pages to their old state. Just like keeping shadow pages, but more efficient!
© Ellis Cohen 2002-2005 77
Implementing Abort

• Traverse the integrated log (starting at the end and going backwards) to find all the entries for that transaction
  NOTE: this is more efficient if the entries for each transaction are linked together
• For each such modification log entry, restore the before state
  NOTE: if the page/block the entry refers to was stolen, it will first need to be read back into the cache

Logs are APPEND-ONLY. This makes them much more efficient to implement. So, Abort does NOT delete the undo entries after using them to implement an abort.

What if a different transaction has modified a different tuple on the same page as a change which is undone?
Why not find the start of the transaction & undo going forwards?
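This abort procedure can be sketched with a minimal in-memory model (illustrative names, not a real recovery manager): modifications update the "table" in place and append before-state entries to a single interleaved log; abort walks the log backwards, restoring only that transaction's before states:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an integrated, append-only undo log: each entry records the
// transaction, the rowid, and the before image; abort walks the log backwards.
public class UndoLog {
    record Entry(String txn, long rowid, String before) {}  // before == null for an INSERT

    Map<Long, String> table = new HashMap<>();              // rowid -> tuple (in-place updates)
    List<Entry> log = new ArrayList<>();                    // one log, all txns interleaved

    void insert(String txn, long rowid, String tuple) {
        log.add(new Entry(txn, rowid, null));               // undo of INSERT: just delete
        table.put(rowid, tuple);
    }

    void update(String txn, long rowid, String tuple) {
        log.add(new Entry(txn, rowid, table.get(rowid)));   // record the before image
        table.put(rowid, tuple);
    }

    void delete(String txn, long rowid) {
        log.add(new Entry(txn, rowid, table.get(rowid)));   // need the full tuple to re-insert
        table.remove(rowid);
    }

    void abort(String txn) {                                // backwards, restoring before states
        for (int i = log.size() - 1; i >= 0; i--) {
            Entry e = log.get(i);
            if (!e.txn().equals(txn)) continue;             // other txns' entries are untouched
            if (e.before() == null) table.remove(e.rowid());
            else table.put(e.rowid(), e.before());
        }
        // Note: the log itself is append-only; entries are never removed.
    }
}
```

Because entries pinpoint individual rowids, undoing one transaction leaves other transactions' changes to the same pages intact, exactly as the next slide argues.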
© Ellis Cohen 2002-2005 78
Pages Changed by Multiple Transactions

What if the same tuple is changed by two concurrent transactions? What if a tuple modified by the aborted transaction was read by another transaction?
– Assume this doesn't happen.
– There are separate concurrency control mechanisms which prevent this

What if two different tuples on the same page are changed by concurrent transactions?
– This is NOT a problem for logging.
– Log entries pinpoint a specific tuple on a page, which can be undone leaving other tuples on the same page modified.
© Ellis Cohen 2002-2005 79
Auxiliary Affected Pages

Modifying a tuple on a page may cause modification to many other pages
– forwarded pages (oversized updates)
– index pages
– table directory pages (i.e. which pages hold data for a table)

Two approaches
– Explicit: Add entries to the log for each of these modified pages as well. After all, these represent changes that will have to be undone as well.
– Implicit: Do not add entries to the log for changes other than to the tuple itself. Changes to other affected pages can be done automatically as part of undoing the change to the tuple.
Physiological Logging
Our undo log uses "physiological" log entries
• They physically indicate the block of the tuple that was modified (the block # of the ROWID)
• They logically provide information needed to restore the tuple to the state prior to the modification
To undo an INSERT, you only need the fact that it was an Insert along with its logical position in the block (the slot # of the ROWID), because you will undo the INSERT by freeing the contents of that slot.
To undo a DELETE, you need to know all the values of the deleted tuple as well.
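The entry kinds just described can be modeled with a small structure (a sketch with hypothetical field names; the database is modeled as a flat ROWID → tuple map, and the before-images shown are the ones from the undo/redo example later in these slides):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UndoEntry:
    txn: str                         # e.g. 'T3'
    op: str                          # 'INSERT', 'DELETE', or 'UPDATE'
    rowid: int                       # physical block # plus logical slot #
    before: Optional[tuple] = None   # old tuple; needed for DELETE/UPDATE only

def undo(entry, db):
    """Restore the BEFORE state of the tuple the entry points at."""
    if entry.op == 'INSERT':
        db[entry.rowid] = None           # free the slot: no old values required
    else:
        db[entry.rowid] = entry.before   # reinstall the saved old tuple

# T3's three statements produce three undo entries:
t3_log = [
    UndoEntry('T3', 'INSERT', 3049625973),
    UndoEntry('T3', 'DELETE', 3049218695, before=(67, 'Marketing')),
    UndoEntry('T3', 'UPDATE', 3049218696, before=(23, 'Sales')),
]
```

Note that the INSERT entry carries no values at all: its ROWID alone tells the undo routine which slot to free.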
Write Ahead Logging (WAL)
Suppose there is a crash
– Before the commit of a transaction is complete
– After a page modified by the transaction has been written out (at commit time or due to stealing)
Use the undo log to ensure atomicity: undo the changes made to the page.
But only if the undo log is already on the disk!
Write Ahead Logging:
Before writing out a page, force out the undo log (or at least the parts of the undo log which have entries that refer to that page, implicitly or explicitly).
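The rule can be sketched as a guard in the buffer manager (hypothetical names; a real log forces out whole blocks, not individual entries):

```python
class Log:
    """Append-only log whose prefix [0, flushed) is already on disk (a sketch)."""
    def __init__(self):
        self.entries = []
        self.flushed = 0                    # count of entries persisted so far

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1        # log position of the new entry

    def force(self, upto):
        # Persist everything through position `upto`; modeled by a marker.
        self.flushed = max(self.flushed, upto + 1)

def steal_page(page_no, last_log_pos, log, cache, disk):
    """Write a dirty page out, honoring WAL: the log entries describing the
    page (through its last one, at `last_log_pos`) must hit disk first."""
    if last_log_pos >= log.flushed:
        log.force(last_log_pos)             # write-ahead: log before data
    disk[page_no] = cache[page_no]
```

The ordering is the whole point: if the crash comes between the two writes, the undo information is guaranteed to be on disk, so the stolen page can be rolled back.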
Transaction Entries for an Integrated Undo Log
T# Start
appended to the log when transaction T# starts (if a transaction's entries are linked together, this is not needed; START is implied by an entry with a NULL backwards link)
T# CommitComplete
appended to the log after all pages modified by transaction T# have been forced out.
T# AbortComplete
appended to the log after all pages which have been undone for transaction T# have been forced out.
How are these transactional entries used along with the modification entries to recover from a crash?
Log Forcing
A COMMIT appends CommitComplete to the log (after all its modified pages have been written out), and then forces the log out. That's when the COMMIT is actually complete.
Backward Recovery
• Traverse the entire log (starting at the end and going backwards)
• Skip over a modification entry if
  – its transaction's CommitComplete entry has been encountered (all its modified pages have been forced out; it doesn't need to be undone)
  – its transaction's AbortComplete entry has been encountered (all pages it modified have already been undone and forced out; they don't need to be undone again)
• Otherwise, perform the undo action for the modification entry
Why does the entire log have to be traversed? What could you do to avoid that?
If a crash occurs in the midst of a transaction, some modifications will be undone that were never persisted.
Why is that true? Is that a problem?
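The traversal above, sketched over a toy log (hypothetical entry encoding; scanning backwards guarantees a completion entry is seen before that transaction's modifications):

```python
def backward_recovery(log, db):
    """Undo every modification whose transaction never completed.

    Entries are ('COMMITCOMPLETE', txn), ('ABORTCOMPLETE', txn), or
    ('MOD', txn, rowid, before). The database is a flat ROWID -> value map.
    """
    done = set()
    for entry in reversed(log):
        kind = entry[0]
        if kind in ('COMMITCOMPLETE', 'ABORTCOMPLETE'):
            done.add(entry[1])              # skip this txn's entries later
        elif kind == 'MOD' and entry[1] not in done:
            _, _, rowid, before = entry
            db[rowid] = before              # undo the incomplete change
```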
Checkpoint Entries
When a crash occurs, all transactions which have not completed (forced out a CommitComplete or AbortComplete entry) must be undone.
But a transaction might have started a long, long time ago, made a modification, but not made any other modifications since then. We have to look through the entire log to find entries for these transactions.
Solution: Regularly add Checkpoint entries
Add a Checkpoint entry to the log at regular intervals with a list of all the active transactions.
During crash recovery, stop traversing the log when a Checkpoint entry is found where all the active transactions listed have completed (i.e. their CommitComplete or AbortComplete entries have already been encountered).
How do Start entries allow even earlier stopping?
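The stopping rule can be sketched as follows (hypothetical encoding; a checkpoint entry carries the list of transactions active when it was written). Every entry before such a checkpoint belongs to a transaction that was either listed in it (and has completed) or had already completed, so nothing earlier needs undoing:

```python
def scan_stop_point(log):
    """Return the log index where a backward recovery scan may stop:
    the first checkpoint (from the end) all of whose listed active
    transactions have since completed."""
    done = set()
    for i in range(len(log) - 1, -1, -1):
        kind = log[i][0]
        if kind in ('COMMITCOMPLETE', 'ABORTCOMPLETE'):
            done.add(log[i][1])
        elif kind == 'CHECKPOINT' and set(log[i][1]) <= done:
            return i    # everything earlier belongs to completed transactions
    return 0            # no such checkpoint: scan the whole log
```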
Undoing Un-persisted Changes
Suppose Transaction T3 executes
  UPDATE Emps SET sal = sal + 100 WHERE deptno = 10
and the following sequence of events occurs:
1. The operation updates tuples with ROWIDs 3479006, 3479008, 3479013 (in the server page cache)
2. UNDO entries for the operation are written to the log
3. The log is forced out
4. The page containing tuple 3479008 is written out
5. The system crashes
When the system recovers, it will go through the log and execute the UNDO entries for 3479006, 3479008, 3479013, even though the changes for 3479006 and 3479013 were never persisted.
Undo just restores the BEFORE state. If the change being undone was never persisted, at worst this has no effect. (Implicit changes must be handled a little more carefully.)
Idempotence
If a crash occurs in the midst of aborting a transaction or recovering from a previous crash, some actions that have already been undone will be undone again. Why is that true? Is that a problem?
After undoing some of the actions, the pages of some of the restored tuples could be written back (due to STEALing, as usual).
Re-undoing these is not a problem, because, at worst, we are re-restoring the BEFORE state.
So UNDO of physiological logs is idempotent. (Doing it additional times has no effect.)
Redo Logging
Forward Recovery with Redo Logs
Mechanism
On every modification made to any tuple in the database, append a Redo Log entry to a Redo Log.
On Transaction Abort: Discard pages dirtied by the transaction from the server page cache; use the Redo Log to redo other modifications made to those pages.
Crash Recovery: Use the Redo Log to redo modifications to pages of committed transactions that were not forced to disk.
Characteristics
Forcing Not Required (dirty pages need not be written back at commit time) because redo-able
No Stealing (dirty pages CANNOT be written back before transaction commits) since there is no way to undo on abort
Redo Log Modification Entries
Transaction, Operation, ROWID, After State
Executed by transaction T3:
  INSERT INTO Depts VALUES (30, 'Accounting')
  DELETE FROM Depts WHERE deptno = 67
  UPDATE Depts SET dname = 'Gift' WHERE deptno = 23
The resulting entries (the values recorded are the after states):
  T3 Insert 3049625973  30 'Accounting'
  T3 Delete 3049218695
  T3 Update 3049218696  23 'Gift'
These are physiological log entries
Implementing Abort
Invalidate all pages modified by the transaction.
Starting at the beginning of the integrated log, and traversing forward: find all log entries for uncommitted transactions that affect the invalidated pages and redo them (as well as implicit changes to auxiliary affected pages).
There are ways to speed this up, but still, this can be slow.
Transaction Entries for Redo Logs
T# Commit
appended to the log when a request is made to commit the transaction
Log Forcing
A COMMIT appends a Commit entry to the log when a commit request is made, and then forces the log out. That's when the COMMIT is actually complete.
The only transaction log entry needed for a REDO log is Commit.
Forward Recovery
• [Analysis Phase] Traverse the log backwards to find all committed transactions (easier if all Commit entries are linked together)
• [Redo Phase] Then traverse the entire log (starting at the beginning and going forwards)
  – Redo every modification entry of a committed transaction, bringing the necessary block/page into the cache if it is not already there.
    • This may redo changes which have already been persisted. Not a problem, since redoing a change that was already made cannot hurt.
  – Redoing an entry makes the modification to the cached page. Since there is no forcing, these will eventually be written to disk just as during regular operation.
It is really only necessary to redo modifications made to a page after the page was last persisted.
How can this be arranged?
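The two phases can be sketched as follows (hypothetical encoding; the entries carry AFTER states, since this is a redo log):

```python
def forward_recovery(log, db):
    """Analysis phase: find committed transactions (backward scan).
    Redo phase: replay their modifications (forward scan).

    Entries are ('COMMIT', txn) or ('MOD', txn, rowid, after).
    """
    committed = {e[1] for e in reversed(log) if e[0] == 'COMMIT'}
    for entry in log:
        if entry[0] == 'MOD' and entry[1] in committed:
            _, _, rowid, after = entry
            db[rowid] = after    # reinstall the AFTER state (idempotent)
    return db
```

Replaying a change that was already on disk simply overwrites the value with itself, which is why no harm is done by starting from the beginning of the log.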
Log Sequence Numbers (LSN's)
The entries in the log can be numbered (1, 2, … ). These are called log sequence numbers or LSN's.
Every time a page is modified, the LSN of the corresponding log entry is placed in the page, and is written out to disk along with the page.
A redo log entry only needs to be redone if its LSN is greater than that of the page it is on.
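The LSN test can be sketched like this (hypothetical page layout; real pages keep the LSN in their header):

```python
def redo_if_needed(entry, page):
    """Apply a redo entry only if its LSN exceeds the LSN stamped on the page.

    `entry` is (lsn, slot_no, after); `page` is {'lsn': int, 'slots': dict}.
    """
    lsn, slot_no, after = entry
    if lsn > page['lsn']:           # page was persisted before this change
        page['slots'][slot_no] = after
        page['lsn'] = lsn           # stamp the page with the newest applied LSN
        return True
    return False                    # change already reflected on disk: skip
```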
Unwritten Dirty Pages
Pages are never forced out
– After a commit, a dirty page can be written out
– However, another transaction could start reading it (or might already be reading it), which would prevent it from being written out until that transaction completed.
– Using LRU or clock replacement, a dirty page that is continually used might never be written out. (We could prevent new transactions from using long-time dirty pages.)
We have no way of knowing how far back in the log is the first modification made – by a committed transaction – to a page that was not saved, especially if there are no explicit log entries for auxiliary affected pages.
That's why we have to start redoing from the very beginning of the log. We'd like to find a way to avoid that.
Use Fuzzy Checkpointing
At regular intervals, just write a "fuzzy" Checkpoint entry, which includes
– a link to the previous checkpoint entry
– a list of inactive dirty pages along with the transaction that dirtied each one of them
– a list of transactions which have committed since the previous checkpoint
Explain crash recovery based on this checkpoint information
Fuzzy Checkpoint Recovery
• Traverse backwards through the log to the last checkpoint, keeping track of transactions with Commit entries.
• Traverse backwards through the checkpoints, adding to the list of committed transactions as you go.
• Stop traversing when you get to a checkpoint which has no page/transaction pairs that match any in the last checkpoint. That's the most recent point at which we know that all active dirty pages were eventually saved.
• Start redoing from that point forwards.
Undo/Redo Logging
Undo/Redo Logging
Characteristics
Forcing Not Required (dirty pages need not be written back at commit time) because redo-able
Allows Stealing (dirty pages can be written back before transaction commits) because undo-able
Mechanism
On every modification made to any tuple in the database, append an Undo/Redo Log entry to an Undo/Redo Log.
On Transaction Abort: use the Log to undo all modifications made by the aborted transaction, in backwards order.
Crash Recovery: First Redo all changes to ensure durability, then Undo changes made by uncommitted transactions to ensure atomicity (as in ARIES).
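The ARIES-style ordering, repeat history with redo and then roll back the losers, can be sketched as follows (hypothetical encoding; real ARIES adds LSNs, a dirty page table, and compensation log records):

```python
def undo_redo_recovery(log, db):
    """Redo ALL modifications forwards (durability), then undo those of
    uncommitted 'loser' transactions backwards (atomicity).

    Entries are ('COMMIT', txn) or ('MOD', txn, rowid, before, after).
    """
    committed = {e[1] for e in log if e[0] == 'COMMIT'}
    for e in log:                            # redo phase: repeat history
        if e[0] == 'MOD':
            db[e[2]] = e[4]                  # reinstall AFTER state
    for e in reversed(log):                  # undo phase: roll back losers
        if e[0] == 'MOD' and e[1] not in committed:
            db[e[2]] = e[3]                  # restore BEFORE state
    return db
```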
Undo/Redo Log Modification Entries
Executed by transaction T3:
  INSERT INTO Depts VALUES (30, 'Accounting')
  DELETE FROM Depts WHERE deptno = 67
  UPDATE Depts SET dname = 'Gift' WHERE deptno = 23
The resulting entries (before and after states recorded as needed):
  T3 Insert 3049625973                  30 'Accounting'   (after state)
  T3 Delete 3049218695  67 'Marketing'                    (before state)
  T3 Update 3049218696  23 'Sales'      23 'Gift'         (before / after)
These are physiological log entries
Logical Log Entries
Logical Log Entry: based on OPERATIONS, not tuples
Redo Logical Entry: logs the actual SQL statement
Undo Logical Entry: logs a compensating SQL statement
Undo/Redo Logical Entry: logs both
If the SQL statement is
  INSERT INTO Depts VALUES (30, 'Accounting')
the compensating SQL statement is
  DELETE FROM Depts WHERE deptno = 30
Logical Log Entries are often used for backup, replication, and recovery from inconsistency. They can be used cautiously for undo/redo, since SQL statements are not generally idempotent.
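For the INSERT case, the compensating statement can be derived mechanically from the inserted key (a hypothetical helper; undoing a DELETE or UPDATE would also need the before-image of the row):

```python
def compensate_insert(table, key_col, key_val):
    """Undo of an INSERT is a DELETE of the inserted row by its key."""
    return f"DELETE FROM {table} WHERE {key_col} = {key_val}"
```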
Ensuring Longer-Term Durability
Storage Stability
• Volatile storage
  Main memory
• Semi-stable storage
  Ordinary disk memory
• Stable storage
  Storage that survives failure
  – Redundant RAID levels (e.g. Mirroring, Parity)
  – Relative to degree of failure or catastrophe
Approaches to Ensuring Durability
Stable Storage
  Redundant RAID Levels
Non-Local Replication
  Distributed Replicated Data
Archiving
  Regular (Fuzzy) Backup; may be used with a local redundant log
Remote Logging
  Send log records to be maintained on a remote machine
Remote Logging Issues
Frequency of Sending Changes
  – Continuously
  – At Regular Intervals
  – At Commit
Format of Changes
  – Operations (logical redo log entries)
  – Values or Deltas (physiological redo log entries)
Commit
  – Just Communicate Commit (1-safe)
  – Jointly Commit (2-safe)
Both are special cases of data replication.
Recovery with Remote Backup
1. Use the backup to restore the primary disk (or a hot spare)
2. The backup machine takes over as the primary machine (at least until the primary disk is restored)
Handling Consistency Failure
Enforcing Consistency
How do database applications enforce consistency?
• Constant Monitoring
  – Using constraints, assertions, triggers or application code
  – Prevent/abort operations that lead to inconsistent states, try to correct the problem, or immediately notify the DBA
• Interval-Based
  – At (regular) intervals, check that the system is in a consistent state. If not, correct it, or notify the DBA.
• Ignore
  – Hope that nothing bad happens. If it does, scramble …
Result of Consistency Failures (due to User Error or Sabotage)
[Diagram: tables tbl1–tbl4 over time, from T1 (erroneous change committed) to T2 (erroneous change discovered)]
Erroneous changes which are discovered later can propagate errors widely. It can be quite a while before an erroneous change is discovered.
Why is Consistency Failure Recovery Hard?
• Need to roll back state from T2 to T1, undoing all changes
  – Use the log to roll the system back to just before the error
  – Must compensate for external side-effects, e.g. send report, launch missile
• Need to roll forward and redo committed transactions, other than the erroneous changes
  – Can't use physiological log entries, because old/new values may no longer match restored values (from tbl2 and then propagated elsewhere)
  – Could use logical log entries, which log the operations done (with parameters and perhaps with system values, e.g. time)
Operation Levels
Using an operation log to roll forwards implies that the DB operations executed would be the same, even if the state were different.
  UPDATE …
  UPDATE …
  COMMIT
  UPDATE …
  COMMIT
An application or a user operation contains multiple DB operations (within multiple transactions) and uses the current state to decide which operations to execute.
A replayed application might be in a completely different state (since T1 was not executed) and execute a completely different sequence of DB operations.
Rolling forward from T1 really requires a log of the higher-level user operations or applications executed (and even those might differ if the state were different).