CS346: Advanced Databases Graham Cormode [email protected]. uk Transaction Processing


Outline

Chapter: “Introduction to Transaction Processing Concepts and Theory” in Elmasri and Navathe

¨ Introduce the concepts of transactions and concurrency control
¨ ACID properties: atomicity, consistency, isolation, durability
¨ System logging, commit points, and failure recovery
¨ Schedules, serializability, and conflicts
¨ Transactions in SQL

CS346 Advanced Databases

Why?

¨ Transaction processing is an important part of databases
– Jim Gray won the Turing Award for his work on transactions
¨ A meeting of theory and practice
– Theory explains how to produce effective sequences of transactions
– Simple protocols to schedule transactions in practice
¨ An introduction to the topic of concurrency control and locking
– Of importance in distributed systems and managing distributed data

Transaction Processing

¨ Transaction Processing Systems:
– Airline reservations
– Banking/credit card processing
– Ecommerce / online purchasing / auctions
– Stock markets
¨ Common requirements
– Many concurrent users making concurrent requests
– High availability, fast response time
– Don’t sell the same item (seat, share) to two different people!
¨ Based on the idea of atomic transactions
– A transaction either succeeds or is declined

Transaction Processing Concepts

¨ Many examples of databases so far look like single-user systems
¨ In practice, most databases are multiuser
– Hundreds or thousands (or more) of users submitting transactions
¨ Make use of the wide availability of parallelism in modern systems
– Multithreaded execution per core
– Multiple cores per CPU
– Multiple CPUs per system (cluster)
¨ Will study management of concurrent access to shared resources
– Here, data items are the shared resources

Transactions

¨ Transaction: the logical unit of database processing
– Includes at least one access operation: insertion, deletion, modification, or retrieval
– May form part of a program, or be specified via SQL
¨ A complex program may be broken into many basic transactions
– Programmer may explicitly specify the start and end of a transaction
¨ Distinguish between read-only and read-write transactions
– Read-only transactions seem easier, but still need a consistent view of the data

Databases and items

¨ Transaction processing adopts a very simple model of a database
– Here: a database is a collection of named data items
¨ The granularity of the database is the size of a single data item
– Can work at the level of a single database record
– A higher level item: a single disk block or whole file
– Lower level items: individual field (attribute) of a record
– Data item may correspond to a basic concept, e.g. a seat on a flight
¨ Each data item has a unique name (identifier) used internally
– E.g. the disk block address, which is not used by the programmer

Basic Data Operations

¨ With this simplified database model, the basic operations are
– Read(X): read the item named X into local memory
– Write(X): write the item named X from local memory
¨ These cover the various substeps of data access
– Map from the name (X) to the relevant disk block containing X
– Moving data to/from disk via OS calls and buffers etc.
– Managing cache memory to speed up operations
¨ Transactions are formed by a sequence of read/write operations

Example Transactions

¨ All operations of a transaction must complete successfully for the transaction to be successful
¨ Read (write) set: the set of items read (written) by a transaction
– Read set (a) = {X, Y}. Read set (b) = {X}
– Write set (a) = {X, Y}. Write set (b) = {X}
¨ Need concurrency control and recovery
– What happens if we try to run (a) and (b) at the same time?


Need for concurrency control

¨ E.g. airline booking: don’t sell the same seat twice!
– Previous transaction (a): move N reservations from X to Y
– Transaction (b): reserve M seats on the flight corresponding to X
¨ Bad things can happen with concurrency due to interleaving
– Lost updates
– Temporary updates
– Incorrect aggregation
– Unrepeatable reads

Lost Updates

¨ Lost updates: when two transactions are interleaved
– If T1 and T2 run as shown, the update to X from T1 is lost
¨ E.g. if X = 80 to begin, N = 5 and M = 4, this order results in X = 84 rather than X = 79
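A minimal sketch of the arithmetic in this example. It assumes T1 subtracts N from X and T2 adds M to X, and that T2 reads X before T1 writes it; the slide's figure is not reproduced here, so the exact interleaving is an assumption.

```python
def run_interleaved(x=80, n=5, m=4):
    """T2 reads X before T1 writes, so T1's update is overwritten."""
    db = {"X": x}
    t1_local = db["X"] - n   # T1: read X, compute X - N
    t2_local = db["X"] + m   # T2: read the same old X, compute X + M
    db["X"] = t1_local       # T1 writes 75
    db["X"] = t2_local       # T2 writes 84, losing T1's update
    return db["X"]


def run_serial(x=80, n=5, m=4):
    """T1 runs to completion, then T2."""
    db = {"X": x}
    db["X"] = db["X"] - n
    db["X"] = db["X"] + m
    return db["X"]


print(run_interleaved(), run_serial())
```

The interleaved run ends with X = 84, while running T1 and then T2 gives X = 79.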


Temporary Update (Dirty Read)

¨ Happens if a value is read in the middle of a transaction that fails
– If a transaction (T1) fails, it is rolled back to the previous state
– Meanwhile, another transaction may read (and update) the intermediate value
¨ Value of X read by T2 is dirty data as it has not been committed
– Hence this is sometimes called the dirty read problem
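A small sketch of a dirty read, reusing the illustrative values X = 80, N = 5, M = 4 from the lost update slide. The function name and the step order are assumptions for illustration.

```python
def dirty_read_demo(x=80, n=5, m=4):
    db = {"X": x}
    before = db["X"]            # T1 notes the old value so it can roll back
    db["X"] = before - n        # T1 writes X but has not committed
    seen_by_t2 = db["X"]        # T2 reads the uncommitted (dirty) value
    db["X"] = before            # T1 fails and is rolled back
    t2_result = seen_by_t2 + m  # T2 computes from a value that was never committed
    return db["X"], seen_by_t2, t2_result


print(dirty_read_demo())
```

After the rollback X is 80 again, but T2 is working from 75, a value that no committed state of the database ever contained.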


Incorrect Summary

¨ One transaction computes an aggregate while another updates
– Can include some values before update, others after update
¨ Generates a result that doesn’t correspond to before or after
– In the example, the correct result is the same before and after T1

Unrepeatable read

¨ Concurrency can cause problems even for read-only transactions
– Suppose a transaction reads item X twice at different times
– The value of X is changed by another transaction in between
– The first transaction gets different values for the same item!
– Can arise in booking transactions: check availability, then update

Transaction recovery

¨ Transactions should be “all or nothing”: called atomic
– A transaction either completes successfully (and correctly): commit
– Or has no effect on the database or other transactions: abort
¨ If a transaction fails after some operations, it must be undone
– Roll back the earlier operations
– Many possible reasons for transaction failure

Reasons for transaction failure

1. Computer failure: disk error, memory read error, crash
2. Transaction/system error: divide by zero, integer overflow
– May also have out-of-bounds parameters, program bug
3. Local errors or exceptions during the transaction
– E.g., can’t find the referenced item
– E.g., insufficient funds for balance transfer
4. Concurrency control enforcement
– System may decide to abort a transaction to ensure correctness
– May need to abort to resolve “deadlock” between transactions
5. Disk failure: data on disk has been corrupted
6. Physical problems: fire, theft, flood, operator error
– PEBCAK: Problem exists between chair and keyboard

Transaction states

¨ To ensure transaction atomicity, the system needs to track the state
– The recovery manager needs to keep track of each operation
¨ Transactions can be in one of a number of states
– Active state: after starting execution, can read and write
– Partially committed state: after it has finished its operations; need to reach a point where system failure would still leave the data in a consistent state
– Committed state: transaction is completed, a commit point is made
– Failed state: if a check fails or the transaction is aborted; may have to roll back some writes
– Terminated state: the transaction leaves the system
¨ Failed or aborted transactions may be started (afresh) later

System Log

¨ To recover from transaction failures, the system keeps a log
– Track all transaction operations that affect the database
¨ The system log is a sequential, append-only file kept on disk
– So more likely to survive system failure/crash
¨ Use memory to buffer the most recent updates
– Write out buffers to disk when they are full
– Ensure buffers are flushed to disk at a commit point
– Periodically back up the log to archival storage
¨ The log consists of a sequence of log records

System log records

¨ [start_transaction, T]: T is a unique transaction id
¨ [write_item, T, X, old_value, new_value]
– Transaction T affects item X
– Technically, only old_value is needed for rollback
¨ [read_item, T, X]: (read entry not strictly needed for rollback)
– May be included for other purposes e.g. auditing
¨ [commit, T]: T has successfully completed, and can be committed
¨ [abort, T]: T has been aborted
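The record types above can be sketched as tuples appended to a list. This is a toy model only: the in-memory list stands in for the on-disk log, and the helper names and starting values are assumptions chosen for illustration.

```python
log = []                      # stands in for the append-only log file on disk
db = {"X": 80, "Y": 20}       # toy database of named items


def start_transaction(t):
    log.append(("start_transaction", t))


def read_item(t, x):
    log.append(("read_item", t, x))
    return db[x]


def write_item(t, x, new_value):
    # Log the old and new values before changing the item itself.
    log.append(("write_item", t, x, db[x], new_value))
    db[x] = new_value


def commit(t):
    log.append(("commit", t))


start_transaction("T1")
write_item("T1", "X", read_item("T1", "X") - 5)
commit("T1")
for record in log:
    print(record)
```

Each write record carries both old_value and new_value, which is what makes the undo and redo steps of recovery possible.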


Failure recovery

¨ Recovering from failure means either undoing or redoing steps
¨ Undo: undo each WRITE operation
– Trace backwards through the log and write the old_value
¨ Redo: repeat each WRITE operation using new_value
– Needed if a failure means the writes may not have all completed
– Ensures that all operations have been applied successfully
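Undo and redo can be sketched as two passes over the log, one backwards and one forwards. The tuple layout follows the [write_item, T, X, old_value, new_value] record; the function names and sample values are assumptions for illustration.

```python
def undo(db, log, t):
    """Scan the log backwards, restoring old_value for each write by t."""
    for record in reversed(log):
        if record[0] == "write_item" and record[1] == t:
            _, _, item, old_value, _ = record
            db[item] = old_value


def redo(db, log, t):
    """Scan the log forwards, reapplying new_value for each write by t."""
    for record in log:
        if record[0] == "write_item" and record[1] == t:
            _, _, item, _, new_value = record
            db[item] = new_value


log = [
    ("start_transaction", "T1"),
    ("write_item", "T1", "X", 80, 75),
    ("write_item", "T1", "X", 75, 70),
]
db = {"X": 70}
undo(db, log, "T1")
print(db)   # X restored to its value before T1
redo(db, log, "T1")
print(db)   # X set to T1's final value again
```

Undo must run backwards so that the oldest old_value is restored last; redo must run forwards so that the newest new_value is applied last.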


Commit Points

¨ Commit points mark successful completion of transactions
– All operations of transaction T have been executed successfully
– AND the effect of all operations is recorded in the log
¨ The transaction is then committed and is permanently recorded
– Write a [commit, T] record in the log
¨ If a system failure occurs:
– Find all transactions T that have started but not committed
– Roll back their associated operations to undo their effect
– May have to redo some transactions to ensure correctness
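The recovery steps after a system failure can be sketched as follows. This is a simplified illustration, not a production recovery algorithm: the function name, the log contents, and the undo-then-redo order are assumptions, and it presumes transactions did not overwrite each other's uncommitted items.

```python
def recover(db, log):
    """Undo writes of uncommitted transactions, then redo committed ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    started = {rec[1] for rec in log if rec[0] == "start_transaction"}
    # Undo pass: backwards through the log, restoring old values.
    for rec in reversed(log):
        if rec[0] == "write_item" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    # Redo pass: forwards through the log, reapplying new values.
    for rec in log:
        if rec[0] == "write_item" and rec[1] in committed:
            db[rec[2]] = rec[4]
    return started - committed   # transactions that were rolled back


log = [
    ("start_transaction", "T1"),
    ("write_item", "T1", "X", 80, 75),
    ("start_transaction", "T2"),
    ("write_item", "T2", "Y", 20, 25),
    ("commit", "T1"),
]
# State found on disk after the crash: T1's write was lost, T2's write landed.
db = {"X": 80, "Y": 25}
rolled_back = recover(db, log)
print(db, rolled_back)
```

T1 has a commit record, so its write is redone; T2 has none, so its write is undone.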


ACID properties of transaction processing

¨ Atomicity: a transaction is an atomic unit of processing
– It is either performed completely, or not at all
– Controlled by the transaction recovery subsystem of the DBMS
¨ Consistency: transactions should preserve database consistency
– If a transaction is done fully, it should keep the DB in a consistent state
– Responsibility of the programmers, integrity constraints
¨ Isolation: effect should be independent of other transactions
– It should be as if it is the only transaction executing
– Enforced by the concurrency control system
¨ Durability: changes made must persist in the database
– Changes made by a transaction should not be lost by any failure
– Enforced by the transaction recovery subsystem

Schedules of operations

¨ The order of execution of operations is called the schedule
– Schedule S orders the operations of n transactions T1, T2, ... Tn
– Operations from different transactions can be interleaved
– Operations from the same transaction must be in order
– S is a total order: for any two operations, one is before the other
¨ The main concern is the interleaving of read and write operations
– Notation: b, r, w, e, c, a for begin, read, write, end, commit, abort
– Can omit begin and end without loss of clarity
– Use transaction id (number) as a subscript for each operation
– E.g. S = r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)

Conflicts

¨ Two operations in a schedule conflict if:
– They belong to different transactions
– They access the same item X
– At least one operation is a write_item(X)
¨ Example: S = r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)
¨ r1(X) and w2(X) are in conflict; r2(X) and w1(X) are in conflict
¨ r1(X) and r2(X) do not conflict with each other (why?)
¨ w2(X) and w1(Y) do not conflict (why?)
¨ r1(X) and w1(X) do not conflict (why?)
¨ Two operations conflict if swapping them results in a different outcome
– E.g. swapping r1(X) and w2(X) can change the value of X read by T1
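The three conditions can be checked mechanically. The sketch below parses the slide notation with a regular expression and lists every conflicting pair; the parser and function names are illustrative assumptions, not part of the course material.

```python
import re


def parse(schedule):
    """Turn 'r1(X); w2(X)' into [('r', 1, 'X'), ('w', 2, 'X')]."""
    pattern = r"([rw])\s*(\d+)\s*\(\s*(\w+)\s*\)"
    return [(k, int(t), x) for k, t, x in re.findall(pattern, schedule)]


def conflicts(ops):
    """All ordered pairs of operations that conflict."""
    pairs = []
    for i, (k1, t1, x1) in enumerate(ops):
        for k2, t2, x2 in ops[i + 1:]:
            # Different transactions, same item, at least one write.
            if t1 != t2 and x1 == x2 and "w" in (k1, k2):
                pairs.append(((k1, t1, x1), (k2, t2, x2)))
    return pairs


S = "r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)"
for first, second in conflicts(parse(S)):
    print(first, "conflicts with", second)
```

For the example schedule this reports the two pairs named on the slide, plus w1(X) with w2(X), which also meets all three conditions.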


Complete Schedule

¨ A schedule S of n transactions is a complete schedule if:
– The operations in S are exactly those of T1, T2, ... Tn, including a commit or abort operation as the last in each transaction (so there are no active transactions at the end of the schedule)
– Every pair of operations from the same transaction Ti is in the same order in S as it is in Ti
– The order of every pair of conflicting operations is specified in S (the order of nonconflicting operations need not be specified, hence S can be a partial order on the operations)
¨ In live systems, schedules are rarely complete
– New transactions are always starting
¨ Define the committed projection of a schedule S, C(S)
– Only the operations in S belonging to committed transactions

Recoverability of Schedules

¨ Some schedules are easier to recover from than others
– Attempt to characterize schedules that are easily recoverable
¨ Recoverable schedule: once T is committed, never have to undo T
– Helps ensure the durability property
– Nonrecoverable schedules should not be allowed by the DBMS
¨ Formal definition for schedule S to be recoverable:
– T reads from transaction T’ if item X is first written by T’ and later read by T
– AND T’ must not have been aborted before T reads X
– S is recoverable if no transaction T in S commits until all transactions T’ that T reads from have committed

Recoverable schedules

¨ There is always a way to recover a recoverable schedule
– But it may still be quite complex to do so
¨ Example: S = r1(X); r2(X); w1(X); r1(Y); w2(X); c2; w1(Y); c1;
– Schedule S is recoverable by the previous definition
– Note: S suffers from lost updates: this does not affect recoverability
¨ The following schedule is not recoverable – why?
– r1(X); w1(X); r2(X); r1(Y); w2(X); c2; a1
¨ Possible fixes to the schedule:
– Postpone the commit c2: r1(X); w1(X); r2(X); r1(Y); w2(X); w1(Y); c1; c2
– Abort both: r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2
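The recoverability definition can be turned into a small checker and run on the schedules above. This is a sketch under simplifying assumptions: the parser and function names are made up here, and a read is attributed to the most recent writer of the item, without restoring earlier values when a writer aborts.

```python
import re


def parse(schedule):
    """Parse r/w/c/a operations, e.g. 'w1(X); c1' -> [('w',1,'X'), ('c',1,None)]."""
    pattern = r"([rwca])\s*(\d+)(?:\s*\(\s*(\w+)\s*\))?"
    return [(k, int(t), x or None) for k, t, x in re.findall(pattern, schedule)]


def is_recoverable(schedule):
    last_writer = {}    # item -> transaction that most recently wrote it
    reads_from = {}     # T -> set of transactions T has read from
    committed, aborted = set(), set()
    for kind, t, item in parse(schedule):
        if kind == "w":
            last_writer[item] = t
        elif kind == "r":
            src = last_writer.get(item)
            if src is not None and src != t and src not in aborted:
                reads_from.setdefault(t, set()).add(src)
        elif kind == "a":
            aborted.add(t)
        elif kind == "c":
            # T may commit only after everything it read from has committed.
            if not reads_from.get(t, set()) <= committed:
                return False
            committed.add(t)
    return True


print(is_recoverable("r1(X); r2(X); w1(X); r1(Y); w2(X); c2; w1(Y); c1"))
print(is_recoverable("r1(X); w1(X); r2(X); r1(Y); w2(X); c2; a1"))
print(is_recoverable("r1(X); w1(X); r2(X); r1(Y); w2(X); w1(Y); c1; c2"))
```

The second schedule fails because T2 reads X from T1 and then commits while T1 is still active; postponing c2 until after c1 fixes it.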


Cascading rollback and strict schedule

¨ Cascading rollback is when an uncommitted transaction has to be rolled back because it read from a transaction that failed
– E.g. T2 in the previous example r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2
– Try to avoid cascading rollback: it can be quite time consuming
¨ Can define a cascadeless schedule:
– Every transaction only reads from committed transactions
– E.g. delay the r2(X) until after c1: r1(X); w1(X); r1(Y); w1(Y); c1; r2(X); w2(X); c2
¨ A strict schedule is the most restrictive type
– Don’t read or write X until the last transaction to write X commits (or aborts)
– Simple to undo writes: just restore the old value of X
– If not strict, undoing the write of an aborted transaction is not enough
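The cascadeless and strict conditions differ only in whether writes are also blocked. A minimal sketch, assuming the slide notation for schedules; the helper names are hypothetical, and only the most recent writer of each item is tracked.

```python
import re


def parse(schedule):
    pattern = r"([rwca])\s*(\d+)(?:\s*\(\s*(\w+)\s*\))?"
    return [(k, int(t), x or None) for k, t, x in re.findall(pattern, schedule)]


def _respects(schedule, strict):
    last_writer = {}    # item -> transaction that most recently wrote it
    finished = set()    # transactions that have committed or aborted
    for kind, t, item in parse(schedule):
        if kind in ("c", "a"):
            finished.add(t)
            continue
        writer = last_writer.get(item)
        pending = writer is not None and writer != t and writer not in finished
        # Cascadeless forbids reading a pending write; strict also forbids
        # overwriting it.
        if pending and (kind == "r" or strict):
            return False
        if kind == "w":
            last_writer[item] = t
    return True


def is_cascadeless(schedule):
    return _respects(schedule, strict=False)


def is_strict(schedule):
    return _respects(schedule, strict=True)


print(is_cascadeless("r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2"))
print(is_strict("r1(X); w1(X); r1(Y); w1(Y); c1; r2(X); w2(X); c2"))
print(is_cascadeless("w1(X); w2(X); c1; c2"), is_strict("w1(X); w2(X); c1; c2"))
```

The last schedule is an illustrative one (not from the slides) that is cascadeless but not strict: T2 overwrites X before T1 commits, yet never reads it.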


Relation between concepts

¨ Ordering: Strict ⊂ Cascadeless ⊂ Recoverable (every strict schedule is cascadeless, and every cascadeless schedule is recoverable)
– Strict: Don’t read or write X after T has written X, until T commits
– Cascadeless: Transactions only read from committed transactions
– Recoverable: Transactions commit only after the transactions they have read from commit

Serializability

¨ Recoverability did not consider correctness (isolation)
– Serializability is concerned with this property
¨ There are some simple approaches to serializability
– Consider two transactions T1 and T2 submitted at the same time
– Either do T1 entirely, before T2, or vice versa
– Not great: limits throughput, can cause blocking

Serial and non-serial schedules

¨ A schedule S is serial if, for every transaction T in S, all operations of T are executed consecutively, with no operations of other transactions interleaved; otherwise it is nonserial

Serial Schedules

¨ Only one transaction is active at a time in a serial schedule
– If transactions are independent, every serial schedule is correct
¨ Serial schedules limit concurrency by prohibiting interleaving
– Must wait for all I/O to finish... very slow
– Serial schedules are considered unacceptable in practice
¨ Accept schedules that are equivalent to serial ones in effect
– Which schedules on the previous slide are equivalent to serial?
– Which suffer from the lost update problem?
¨ Serializable schedule: one that is equivalent to a serial one
– Consider this to be our definition of “correctness”
¨ Need to define equivalence of schedules!
– There are several alternative definitions with different properties

Conflict serializable

¨ Result equivalent: if they produce the same final state
– May happen by chance given a particular initial state
¨ Conflict equivalent is the most commonly used definition
– The order of any two conflicting operations is the same in both
– S is conflict serializable if it is conflict equivalent to some serial schedule S’
– That is, the nonconflicting operations can be reordered to make S’

A & D are conflict equivalent:
• r2(X) follows w1(X) in both
• r1(Y); w1(Y) in D doesn’t conflict with T2, so they can be moved earlier

Two ops in a schedule conflict if:
• They are in different transactions
• They access the same item X
• At least one op is a write_item(X)

Testing for conflict serializability

¨ Create a serialization graph of the read and write operations
– A directed graph with nodes T1, ... Tn
– Directed edge (Tj → Tk) if an op in Tj precedes a conflicting op in Tk
– S is conflict serializable if and only if its serialization graph has no cycles

Serialization Graph

¨ Graphs for schedules A, B, C, D:
– Showing the name of the item causing the edges as its label
¨ Can create an equivalent serial schedule S’ from the graph of S
– When there is an edge (Ti → Tj), Ti must appear before Tj in S’
– Else, resolve ordering arbitrarily [make total order from partial order]
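The graph test can be sketched in a few lines: build the edges from conflicting pairs, then repeatedly pick a transaction with no incoming edge. The function names and the two sample schedules are illustrative assumptions; ties are broken by transaction number, which is one arbitrary way to turn the partial order into a total one.

```python
import re


def parse(schedule):
    pattern = r"([rw])\s*(\d+)\s*\(\s*(\w+)\s*\)"
    return [(k, int(t), x) for k, t, x in re.findall(pattern, schedule)]


def precedence_edges(ops):
    """Edge (Tj, Tk) if an op of Tj precedes a conflicting op of Tk."""
    edges = set()
    for i, (k1, t1, x1) in enumerate(ops):
        for k2, t2, x2 in ops[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (k1, k2):
                edges.add((t1, t2))
    return edges


def serial_order(schedule):
    """Return an equivalent serial order of transactions, or None if cyclic."""
    ops = parse(schedule)
    edges = precedence_edges(ops)
    remaining = {t for _, t, _ in ops}
    order = []
    while remaining:
        # Transactions with no incoming edge from a remaining transaction.
        ready = sorted(t for t in remaining
                       if not any(a in remaining and b == t for a, b in edges))
        if not ready:
            return None          # every remaining node is on a cycle
        order.append(ready[0])
        remaining.remove(ready[0])
    return order


print(serial_order("r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)"))
print(serial_order("r1(X); w1(X); r2(X); w2(X); r1(Y); w1(Y)"))
```

The first schedule has edges in both directions between T1 and T2, so no serial order exists; the second has only the edge T1 → T2, so it is conflict equivalent to running T1 then T2.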


Serializability Example


Serializability

¨ In practice, it is hard to check for serializability
– Interleaving of concurrent operations is controlled by the OS
– DBMS can’t specify the exact order of execution of parallel tasks
– Don’t want to check for serializability after the fact
¨ Instead, design schedules based on protocols
– These ensure that all realized schedules are serializable
¨ Not feasible to mark start and end of schedules in live systems
– Consider only the committed projection of a schedule, i.e. the operations from committed transactions
¨ Most common technique is two-phase locking (2PL) [later]
– Prevent transactions that could interfere with each other
– Other protocols: timestamp ordering, optimistic concurrency control

View Equivalence

¨ View equivalence is a weaker notion than conflict equivalence
– Based on the view of the data witnessed by each schedule
¨ Schedules S and S’ are said to be view equivalent if:
– Both S and S’ include all operations of the same set of transactions
– For any operation ri(X) in S, if the read value of X was written by operation wj(X), the same condition must hold for S’
– If wk(Y) is the last operation to write to Y in S, then wk(Y) must also be the last to write to Y in S’
¨ Read operations see the same view in both schedules
– The final write is the same in both, so the same state is reached
¨ S is view serializable if it is view equivalent to a serial schedule

View and Conflict Serializability

¨ The constrained write assumption (no blind writes):
– CWA: every wi(X) in Ti is preceded by ri(X)
– Implies computation of the new value of X depends on the old value
– A blind write is when X is written without reading it first
¨ View and conflict serializability coincide if CWA holds
¨ Unconstrained write assumption: blind writes are allowed
¨ E.g. r1(X); w2(X); w1(X); w3(X); c1; c2; c3
– is view equivalent to the serial schedule r1(X); w1(X); c1; w2(X); c2; w3(X); c3, so it is view serializable
– is not conflict equivalent to any serial schedule, so it is not conflict serializable
¨ Testing for view serializability is NP-hard
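For small schedules, view serializability can be checked by brute force over all serial orders. The sketch below is an illustration with made-up function names; it tries every permutation of the transactions, so its running time grows factorially, in line with the hardness result above.

```python
import re
from itertools import permutations


def parse(schedule):
    # Commit and abort operations are ignored: only reads and writes matter here.
    pattern = r"([rw])\s*(\d+)\s*\(\s*(\w+)\s*\)"
    return [(k, int(t), x) for k, t, x in re.findall(pattern, schedule)]


def view_of(ops):
    """Which write each read sees, and the final writer of each item."""
    last_writer = {}   # item -> (transaction, index of the write within it)
    seen = {}          # transaction -> number of its operations seen so far
    reads = set()
    for kind, t, item in ops:
        n = seen.get(t, 0)
        seen[t] = n + 1
        if kind == "r":
            reads.add((t, n, last_writer.get(item)))
        else:
            last_writer[item] = (t, n)
    return reads, last_writer


def view_serial_order(schedule):
    """Return a serial order that is view equivalent to schedule, or None."""
    ops = parse(schedule)
    target = view_of(ops)
    for order in permutations(sorted({t for _, t, _ in ops})):
        serial = [op for t in order for op in ops if op[1] == t]
        if view_of(serial) == target:
            return list(order)
    return None


print(view_serial_order("r1(X); w2(X); w1(X); w3(X); c1; c2; c3"))
print(view_serial_order("r1(X); r2(X); w1(X); w2(X)"))
```

The first call finds the order T1, T2, T3 for the blind-write example. The second schedule (an illustrative lost-update pattern) has no view-equivalent serial order.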


Venn diagram of schedules

http://en.wikipedia.org/wiki/Schedule_%28computer_science%29


Other types of schedule equivalence

¨ In some situations, can relax the definition of equivalence
– Such as debit-credit transactions (e.g. bank account updates)
– All transactions add to or subtract from the value of a data item
– Can have correct schedules that are not serializable, because addition and subtraction commute
¨ Consider two transactions that want to move money
– T1: r1(X); X := X – 10; w1(X); r1(Y); Y := Y + 10; w1(Y);
– T2: r2(Y); Y := Y – 20; w2(Y); r2(X); X := X + 20; w2(X);
¨ Schedule S: r1(X); w1(X); r2(Y); w2(Y); r1(Y); w1(Y); r2(X); w2(X)
– S is not serializable but is correct because of transaction semantics
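A quick simulation shows that schedule S reaches the same final state as a serial run. The interpreter below is a hypothetical sketch: each write stores the value last read by that transaction plus the transaction's fixed amount, and the starting balances of 100 are an assumption.

```python
import re

# Amount each transaction adds to each item (from the definitions of T1, T2).
DELTA = {1: {"X": -10, "Y": +10}, 2: {"Y": -20, "X": +20}}


def run(schedule, start):
    db = dict(start)
    local = {}   # (transaction, item) -> value that transaction last read
    for kind, t, item in re.findall(r"([rw])(\d+)\((\w+)\)", schedule):
        t = int(t)
        if kind == "r":
            local[(t, item)] = db[item]
        else:
            db[item] = local[(t, item)] + DELTA[t][item]
    return db


START = {"X": 100, "Y": 100}
S = "r1(X); w1(X); r2(Y); w2(Y); r1(Y); w1(Y); r2(X); w2(X)"
SERIAL = "r1(X); w1(X); r1(Y); w1(Y); r2(Y); w2(Y); r2(X); w2(X)"
print(run(S, START), run(SERIAL, START))
```

Both runs end with X = 110 and Y = 90, even though the serialization graph of S has a cycle (T1 → T2 on X, T2 → T1 on Y).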


Transaction Support in SQL

¨ SQL allows the definition of atomic transactions
¨ No explicit begin_transaction, but each transaction must end with COMMIT or ROLLBACK
¨ The access mode of the transaction can be specified
– READ ONLY or READ WRITE (default)
¨ The diagnostic area keeps error data on the n previous statements
¨ The isolation level defines how strict to be with transactions
– SERIALIZABLE (default)
– Lower levels: REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED
– Lower levels may allow a transaction to read a value that has not been committed, or read a value twice in a transaction and get two different values

Sample SQL transaction

¨ So you will know what one looks like:
EXEC SQL whenever sqlerror go to UNDO;
EXEC SQL SET TRANSACTION
    READ WRITE
    DIAGNOSTICS SIZE 5
    ISOLATION LEVEL SERIALIZABLE;
EXEC SQL INSERT INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY)
    VALUES ('Robert','Smith','991004321',2,35000);
EXEC SQL UPDATE EMPLOYEE
    SET SALARY = SALARY * 1.1
    WHERE DNO = 2;
EXEC SQL COMMIT;
GOTO THE_END;
UNDO: EXEC SQL ROLLBACK;
THE_END: ...
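The same commit-or-rollback pattern can be tried out with Python's standard sqlite3 module. This is only an analogue of the embedded SQL above, not a translation: the table definition and the function name are assumptions, and SQLite does not use the SET TRANSACTION options shown on the slide.

```python
import sqlite3


def hire_and_raise(conn):
    """Insert an employee and raise salaries as one atomic transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY) "
                "VALUES (?, ?, ?, ?, ?)",
                ("Robert", "Smith", "991004321", 2, 35000),
            )
            conn.execute("UPDATE EMPLOYEE SET SALARY = SALARY * 1.1 WHERE DNO = 2")
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
# Hypothetical schema: SSN as primary key so a repeated insert fails.
conn.execute(
    "CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, SSN TEXT PRIMARY KEY, "
    "DNO INTEGER, SALARY REAL)"
)
ok = hire_and_raise(conn)      # succeeds and commits
again = hire_and_raise(conn)   # duplicate SSN: the whole transaction rolls back
print(ok, again)
```

The second call fails on the INSERT, so its UPDATE never runs and the committed data from the first call is left unchanged.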


Summary

¨ Saw the concepts of transactions and concurrency control
¨ ACID properties: atomicity, consistency, isolation, durability
¨ System logging, commit points, and failure recovery
¨ Schedules, serializability, and conflicts
¨ Multiple definitions of serializability, and checking
¨ Transactions in SQL

¨ Chapter: “Introduction to Transaction Processing Concepts and Theory” in Elmasri and Navathe