Gray & Reuter Log 10a: 1

Log Manager

Jim Gray
Microsoft, Gray@Microsoft.com

Andreas Reuter
International University, Andreas.Reuter@i-u.de

        Mon        Tue           Wed          Thur             Fri
9:00    Overview   TP mons       Log          Files & Buffers  B-tree
11:00   Faults     Lock Theory   ResMgr       COM+             Access Paths
1:30    Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
3:30    T Models   Queues        Adv TM       Replication      Benchmark
7:00    Party      Workflow      Cyberbrick   Party


Log Concept

• Log is a history of all changes to the state.

• Log + old state gives new state

• Log + new state gives old state (not in this picture)

• Log is a sequential file.

• Complete log is the complete history

• Current state is just a "cache" of the log records.
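A minimal sketch of the first two bullets, assuming each log record carries both the old and the new value of one slot of the state. The names `change_rec`, `redo_all`, and `undo_all` are illustrative, not from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* One log record: which slot of the state changed, and its value before and after. */
typedef struct {
    size_t slot;
    long   old_value;
    long   new_value;
} change_rec;

/* Log + old state gives new state: apply new values in log order. */
void redo_all(long *state, const change_rec *recs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* the log must describe the state it is applied to */
        assert(state[recs[i].slot] == recs[i].old_value);
        state[recs[i].slot] = recs[i].new_value;
    }
}

/* Log + new state gives old state: apply old values in reverse log order. */
void undo_all(long *state, const change_rec *recs, size_t n)
{
    for (size_t i = n; i > 0; i--)
        state[recs[i - 1].slot] = recs[i - 1].old_value;
}
```

Running `redo_all` and then `undo_all` over the same records returns the state to where it started, which is the sense in which the current state is just a cache of the log.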

[Figure: nightly batch runs from Sunday Master through Wednesday Master. Labels: Archive; Monday, Tuesday, Wednesday Transactions; Monday, Tuesday, Wednesday Night Batch Run; Monday, Tuesday, Wednesday Master.]


How Log is Used

• Recovery from faults: a redundant copy of the state and transitions.

• Security audits: who did what to whom. Often too low-level for this.

• Performance Monitor & Accounting: but only records changes (not reads).

• ISSUES: Who should be allowed to read the log? It is a security hole. Must authorize access on a per-record basis.


The Log Manager in the Scheme of Things

Interesting thing is the cycle: need log to recover archive to recover log.
Break the cycle with a bootstrap file.

[Figure: the Log Manager among its neighbors. Labels: Transaction Manager, Lock Manager, Buffer Manager, Media Manager, SQL & Other Resource Managers, Archive Manager, File Manager, Operating System File System.]


Log Is a Sequential File

Encapsulation of the log: it is a shared resource.
Startup: Log manager holds startup info for all others.
Careful writes: Log manager provides a
    • High performance
    • Very reliable
    • Semi-infinite
    • Archived
sequential file.

Some RMs keep private logs anyway (notably PORTABLE DB systems).
Then the user or system has to manage multiple logs.


The Log Table

Log table is a sequential set (relation).
Log records have a standard part and then a log body.
Often want to query the table via one attribute or another: RMID, TRID, timestamp.

create domain LSN unsigned integer(64); -- log sequence number (file #, rba)
create domain RMID unsigned integer;    -- resource manager identifier
create domain TRID char(12);            -- transaction identifier
create table log_table (
    lsn              LSN,        -- the record’s log sequence number
    prev_lsn         LSN,        -- the lsn of the previous record in log
    timestamp        TIMESTAMP,  -- time log record was created
    resource_manager RMID,       -- resource mgr that wrote this record
    trid             TRID,       -- id of transaction that wrote this record
    tran_prev_lsn    LSN,        -- prev log record of this transaction (or 0)
    body             varchar,    -- log data: rm understands it
    primary key (lsn),           -- lsn is primary key
    foreign key (prev_lsn)       -- previous log record in this table
        references log_table(lsn),
    foreign key (tran_prev_lsn)  -- transaction's prev log rec also in table
        references log_table(lsn)
) entry sequenced;               -- inserts go at end of file


Log is complete history

Log anchor points at chain of each transaction.
May maintain other chains.
Log records map to sequence of N-plexed files.
Old files are archived.
Eventually, archive files are discarded (weeks, months, never).

[Figure: labels: Log Anchor (trid, max_lsn, min_lsn...); Log Table (lsn, prev_lsn, resource_mgr, trid, tran_prev_lsn, body); A files; B files; Archive.]


The Log LSN

Each log record has a logical sequence number.

This number (LSN for Log Sequence Number) plays a key role in many algorithms.

Key property MONOTONICITY:

If action A happened after action B then

LSN(A) > LSN(B).
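One way to get monotonicity cheaply, sketched here under an assumption: the table definition describes the LSN as a 64-bit (file #, rba) pair, and this sketch assumes a 32/32 split with the file number in the high half, so plain integer comparison orders records. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t LSN;   /* assumed layout: file # in high 32 bits, rba in low 32 bits */

LSN lsn_make(uint32_t file_no, uint32_t rba)
{
    return ((LSN)file_no << 32) | rba;
}

uint32_t lsn_file(LSN lsn) { return (uint32_t)(lsn >> 32); }
uint32_t lsn_rba(LSN lsn)  { return (uint32_t)(lsn & 0xFFFFFFFFu); }

/* LSN of the record that follows a record of rec_len bytes in the same file. */
LSN lsn_next(LSN lsn, uint32_t rec_len)
{
    /* the rba must not overflow into the file number */
    assert(rec_len > 0 && lsn_rba(lsn) <= UINT32_MAX - rec_len);
    return lsn + rec_len;
}
```

With this layout a later record in the same file, or any record in a later file, always compares greater.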


Reading The Log

long log_read_lsn( LSN lsn,                    /* lsn of record to be read */
                   log_record_header * header, /* header fields of record to be read */
                   long offset,                /* offset into body to start read */
                   pointer buffer,             /* buffer to receive log data */
                   long n);                    /* length of buffer */

LSN log_max_lsn(void); /* returns the current maximum lsn of the log table. */

Read with C (see next slide) or SQL:

long sql_count( RMID rmid)      /* count log records written by this rmid */
{   long rec_count;             /* count of records */
    exec sql SELECT count (*)   /* ask sql to scan log counting records */
             INTO :rec_count    /* written by the calling resource mgr and */
             FROM log_table     /* place count in the rec_count */
             WHERE resource_manager = :rmid;
    return rec_count;           /* return the answer. */
}

Reading the Log: SQL is easier than C

long c_count( RMID rmid)            /* count log records written by this rmid */
{   log_record_header header;       /* structure to receive log record header */
    LSN lsn;                        /* log sequence number of next log rec */
    char buffer[1];                 /* null buffer to receive log record body. */
    long rec_count = 0;             /* count of records */
    long n = 1;                     /* size of log body returned */
    if (!log_open(READ)) panic();   /* open the log (authorization check) */
    lsn = log_max_lsn( );           /* get most recent lsn */
    while (lsn != NullLSN)          /* scan backward through the log */
    {   n = log_read_lsn( lsn,      /* lsn of record to be read */
                          &header,  /* log record header fields */
                          0L, buffer, 1L ); /* log rec body ignored. */
        if (header.rmid == rmid)    /* if record written by this RMID then */
            rec_count = rec_count + 1; /* increment count */
        lsn = header.prev_lsn;      /* go to previous LSN. */
    }                               /* loop over LSNs */
    logtable_close( );              /* close log table */
    return rec_count;               /* return the answer. */
}


Writing The Log

Add a log record; log manager fills in header:
    LSN log_insert( char * buffer, long n); /* log body is buffer[0..n-1] */

Force log up to a certain LSN to persistent storage:
    LSN log_flush( LSN lsn, Boolean lazy);
    (lazy waits for a batch write or timeout == boxcar)

Note: many real interfaces allow some of:
    empty buffer: to allow RM to fill it in (avoids data copies)
    incremental copy: build the "buffer" in steps.
    gather: take log data from many buffers.

Few offer SQL access to the log.


Summary Of Log Structure And Verbs

Operations:
    Open/Close
    Read(LSN), Insert(body), Flush(LSN)
    SQL read operations.

[Figure: labels: Log Table (header, body); A file; B file; durable storage; log pages in buffer pool; log page header; end of durable log; current end of log; pages written in next write; empty page in buffer pool.]


Log Anchor Logging and Locking

Log records never updated: only inserted and read.
So no locks needed on log.
Semaphore (or something) needed on "end" of log to manage space/growth/LSN for inserts.

typedef struct {
    filename   tablename;       /* name of log table */
    struct log_files;           /* A & B file prefix names & active file # */
    xsemaphore lock;            /* semaphore regulates log write */
    LSN        prev_lsn;        /* LSN of most recent write */
    LSN        lsn;             /* LSN of next record */
    LSN        durable_lsn;     /* max lsn in durable storage */
    LSN        TM_anchor_lsn;   /* lsn of trans mgr's last ckpt */
    struct {                    /* array of open log parts */
        long partno;            /* partition number */
        int  os_fnum;           /* operating system file # */
    } part [MAXOPENS];
} log_anchor;


Making Optimistic Log Reads Work

Log is duplexed. Log manager reads only one copy of the page.
What if the "other" copy has more data?
Trick:
    Read BOTH copies of FIRST and LAST page in log.
    Other pages have "full" flag and a timestamp.
    IF not full or timestamp < prev_timestamp THEN
        read other page and take highest timestamp.

Torn log pages:
    Log page consists of disk sectors (512B).
    Write may only write some sectors.
    How detect missing fragments?
    1. Checksum?
    2. Byte stuffing: stuff a “parity” byte on each page
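A sketch of option 1 (checksum) for torn-page detection. Everything here is an assumption for illustration: an 8-sector page, a checksum field at the front of the page, and FNV-1a as the checksum. A partial write is modeled as copying only the first few sectors.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE      512
#define SECTORS_PER_PAGE 8
#define PAGE_SIZE        (SECTOR_SIZE * SECTORS_PER_PAGE)

typedef struct {
    uint32_t      checksum;                           /* covers body only */
    unsigned char body[PAGE_SIZE - sizeof(uint32_t)];
} log_page;

/* FNV-1a over the page body (any reasonable checksum would do here). */
uint32_t page_checksum(const log_page *p)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof p->body; i++) {
        h ^= p->body[i];
        h *= 16777619u;
    }
    return h;
}

/* Call before writing: record the checksum of the finished page. */
void page_seal(log_page *p) { p->checksum = page_checksum(p); }

/* Call after reading: a mismatch means some sectors are from another write. */
int page_is_torn(const log_page *p) { return p->checksum != page_checksum(p); }

/* Model a write that reaches the disc for only the first `sectors` sectors. */
void partial_write(log_page *disk, const log_page *mem, int sectors)
{
    assert(sectors >= 0 && sectors <= SECTORS_PER_PAGE);
    memcpy(disk, mem, (size_t)sectors * SECTOR_SIZE);
}
```

A torn page detected this way is a cue to fall back to the other copy of the duplexed log.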


Log Insert

Log semaphore covers:
    Incrementing LSN
    Finding the log end
    Filling in the page(s)
    Allocating space on a page, perhaps allocating new pages.

LSN log_insert( char * buffer, long n)  /* insert a log record with body buffer[0..n-1] */
{   LSN lsn;                            /* lsn of the new record */
    long rec_len = sizeof(log_record_header) + n; /* header + body (assumed record layout) */
    /* Acquire the log lock (an exclusive semaphore on the log) */
    Xsem_get(&log_anchor.lock);         /* lock the log end in exclusive mode */
    lsn = log_anchor.lsn;               /* make a copy of the record’s lsn. */
    /* find page and allocate space in it. */
    /* fill in log record header & body */
    /* update the anchors */
    log_anchor.prev_lsn = lsn;          /* most recent write is this record */
    log_anchor.lsn.rba = log_anchor.lsn.rba + rec_len; /* log anchor lsn points past this record */
    Xsem_give(&log_anchor.lock);        /* unlock the log end */
    return lsn;                         /* return lsn of record just inserted */
}


Log Write Demon

Log semaphore can be a hotspot, so:
    No IO under semaphore.
    Allocation (OS requests) and archiving are done in advance.
    Flush to persistent storage (disc) is done asynchronously.
    Demons driven by timers and by events (requests).
    Demons need not touch end-of-log semaphore.

[Figure: labels: application programs, resource managers, log code; log data in shared memory and on disc; log daemon to flush (carefully write) log pages as needed; log daemon to allocate new log files as needed.]


Careful Writes

If partial pages may be written, then a subsequent write may invalidate the previous write.
Standard technique:
    Serial Writes: write one page, then write the second page.
    Problem: ~ 1/2 disc bandwidth, 2x delay.

Ping-Pong technique:
    Never overwrite good page: Ping-Pong between I and I+1.
    When complete, assure that page I has final data.
    Never worse than serial write, generally 2x better.

Also note the careful techniques for optimistic reads and torn pages.

[Figure: Parallel Ping-Pong Writes. Labels: New Log; Disc Page i; Disc Page i+1.]


Group Commit (Boxcaring)

Batch processing of log writes.
If receive 1,000 log force requests/second, why not just execute 50 writes per second, each covering 20 requests?
    Response time will be the same (~20ms).
    IOs will be 20x fewer.
    CPU will be ~ 10x smaller (10x fewer dispatches, 20x fewer OS IO).

Without it, systems are limited to about
    50 tps with no ping-pong
    100 tps with ping-pong.

With it, systems are limited to disc bandwidth >> 10 ktps.
Group commit threshold can be set automatically.
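The arithmetic above can be checked with a small simulation. This is a sketch under simple assumptions: force requests arrive at a fixed interval, the first request to find no boxcar open starts one, and the boxcar departs (one IO) after a fixed window. The function name and parameters are illustrative.

```c
#include <assert.h>

/* Count the IOs needed to serve n_requests log-force requests when every
 * request that arrives while a boxcar is open rides along with it.
 * window_us == 0 models no group commit: one IO per request. */
long boxcar_ios(long n_requests, long interarrival_us, long window_us)
{
    assert(interarrival_us > 0 && window_us >= 0);
    long ios = 0;
    long departs_at = -1;                 /* no boxcar open yet */
    for (long i = 0; i < n_requests; i++) {
        long t = i * interarrival_us;     /* arrival time of request i */
        if (departs_at < 0 || t >= departs_at) {
            ios++;                        /* this request opens a new boxcar */
            departs_at = t + window_us;
        }
    }
    return ios;
}
```

With 1,000 requests one millisecond apart and a 20 ms window, this gives 50 IOs instead of 1,000, the 20x reduction claimed on the slide.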


WADS: Giving the Log Disc Zero Latency

Log disc is dedicated, so only has rotational latency.
Reserve some cylinders on the disc as scratch.
For each write:
    Write at current position on next track (zero latency).
    When have a full track (or two) of log data:
        consolidate the writes in RAM,
        do a single LARGE write (100KB = 1 rotation) to the log.
        Cost of this is seek + rotation ~ 20ms.

This reserved area is called the Write Ahead Data Set (WADS).
At restart:
    read cylinders
    gather recent log data
    rewrite end of log.

RAID write cache makes this obsolete (if it works).


Log: Normal Use

Transaction UNDO During Normal Operation
Transaction log anchor: needed during normal operation.
    Points to most recent log rec of that transaction.
    Follow the transaction prev_lsn chain.

EASY!
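A sketch of the chain walk, assuming an in-memory log where the record with LSN k sits at index k-1 and LSN 0 means "no previous record". The struct and function names are illustrative; a real resource manager would apply the undo of each record where this sketch just collects its LSN.

```c
#include <assert.h>
#include <stddef.h>

#define NULL_LSN 0UL

typedef struct {
    unsigned long lsn;            /* this record's lsn */
    int           trid;           /* transaction that wrote it */
    unsigned long tran_prev_lsn;  /* previous record of the same transaction, or NULL_LSN */
} log_rec;

/* Starting from the transaction's most recent record (from its log anchor),
 * follow tran_prev_lsn back to the start. Returns the number of records
 * visited; their LSNs are stored newest-first in out[]. */
size_t undo_chain(const log_rec *recs, size_t n_recs, unsigned long last_lsn,
                  unsigned long *out, size_t max_out)
{
    size_t count = 0;
    unsigned long lsn = last_lsn;
    while (lsn != NULL_LSN && lsn <= n_recs && count < max_out) {
        const log_rec *r = &recs[lsn - 1];
        assert(r->lsn == lsn);        /* index and lsn must agree in this toy log */
        out[count++] = r->lsn;        /* a real RM would undo this record here */
        lsn = r->tran_prev_lsn;
    }
    return count;
}
```

Only the transaction's own records are touched, so undo cost is proportional to that transaction's work and not to the size of the log.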


The Log Anchor: Where It All Starts

REDO/UNDO at System / RM Restart.
Need to bootstrap the most recent log state.
Log manager is the first to restart.
Helps Transaction Manager recover.
Transaction manager helps Resource managers recover.
Alternate design (each RM has its own log).

All this depends on rebuilding the log anchor.

[Figure: The Log. Labels: Log Anchor; Transaction Manager Checkpoint Record; Resource Manager Checkpoint Records; Previous Transaction Manager Checkpoint Record.]


Preparing For Restart: Careful Write of Log Anchor

Use the "standard" careful write techniques:
    Put the anchor in a special well-known place(s).
    Ping-Pong to 2 or more copies.
    Timestamp each copy.
    N-plex the copies on devices with independent failures.
    Align copies so that writes are "atomic".
    Accept most recent copy on pessimistic reads.

Now TM and RMs can bootstrap: their anchors are in the log.
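A sketch of the anchor read and write rules, with the copies held in an array. The `valid` flag is a stand-in for whatever check (checksum, sector alignment) tells a complete copy from a torn one; the struct and function names are illustrative.

```c
#include <assert.h>

typedef struct {
    unsigned long timestamp;   /* when this copy was written */
    unsigned long anchor_lsn;  /* the anchor contents (just one field here) */
    int           valid;       /* stand-in for a checksum / torn-write check */
} anchor_copy;

/* Pessimistic read: examine every copy and accept the most recent valid one.
 * Returns its index, or -1 if no copy is usable. */
int anchor_read(const anchor_copy *copies, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (copies[i].valid &&
            (best < 0 || copies[i].timestamp > copies[best].timestamp))
            best = i;
    return best;
}

/* Ping-pong write: never overwrite the most recent valid copy, so a write
 * that fails part-way still leaves that copy intact. Returns the slot used. */
int anchor_write(anchor_copy *copies, int n, unsigned long now, unsigned long lsn)
{
    assert(n >= 2);
    int newest = anchor_read(copies, n);
    int target = (newest + 1) % n;     /* newest == -1 selects slot 0 */
    copies[target].valid = 0;          /* write in progress */
    copies[target].timestamp = now;
    copies[target].anchor_lsn = lsn;
    copies[target].valid = 1;          /* write complete */
    return target;
}
```

If the newest copy turns out to be torn at restart, the read falls back to the older copy, which costs a longer scan forward but not correctness.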


Finding the End of the Log

Find the anchor.
If using WADS, go to the WADS area and write log end.
Else scan forward from the most recent log-anchor lsn:
    Read optimistic all full pages.
    At 1/2 full page or bad page, read pessimistic.
    Now have end of log.
Finish 1/2 finished record at end of log and give to TM.

[Figure: two page sequences. Labels: Pages; End of log; Half-finished record; Invalid Page.]


Archiving The Log And "Old" Transactions

What if transaction/RM low water mark is 1-month old?
    Abort?
    Copy aside: copy the undo/redo log records to a side file.
    Copy forward: copy the undo/redo log records forward in the file.
    Dynamic log: copy undo records aside (so can online-undo if needed).
All advance the low water mark.


Archiving the Log Online

[Figure: Staggered Allocation of Log Tables on Secondary Storage. Labels: Log (parts 1, 2, 3); Archive.]


The Safety Spectrum

Just UNDO:
    transactional storage (no durable log).
Just Online Restart:
    keep simplexed durable log.
Online plus Off-line Archive (no single point of failure):
    periodic copies of data
    duplex log
Electronic vaulting:
    archive copies and duplexing are done to a remote site
    via fast communications links (or Federal Express).


Multiple Logs?

Transaction Manager has a log (DECdtm, MS-DTC,…)

Transaction Monitor has a log (CICS, Tuxedo, ACMS,...)

Each DB instance (3 Oracle, 2 Informix, 4 Rdb) has a log.
Some have 3 logs: UNDO, REDO, SNAPSHOT.

Cons:
    Lots of tapes/files.
    Lots of IOs at commit.
    Lots of things to break.

Pros:
    Portable
    Performance (in the 1 RM case)

You decide.


Client/Server Logging

One server design (can be process pair):
    Well known log server in the net.
    Client sends a BATCH of log records to the server.
    Gets back an LSN.
    Uses "local" LSNs for his objects.
    Log servers can be N-plexed processes.

Multi-server design:
    Client forms a quorum (majority of servers).
    Client sends log batch to all, gets back N LSNs.
    If less than majority, client must poll ALL N servers.
    Servers synchronize their "logical" logs as "sum" of physical logs (need a majority).
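The majority rule in the multi-server design can be sketched as below. This covers only the client-side acknowledgment count, with server availability passed in as an array; the names are illustrative and the server-side log synchronization is not modeled.

```c
#include <assert.h>

/* A quorum is a strict majority: more than half of the servers. */
int is_majority(int acks, int n_servers)
{
    return acks > n_servers / 2;
}

/* Send one batch to every server; server_up[i] != 0 means server i
 * acknowledged it. Stores the ack count in *acks_out and returns 1 if a
 * majority holds the batch, 0 if the client must poll all N servers. */
int log_batch_send(const int *server_up, int n_servers, int *acks_out)
{
    assert(n_servers > 0);
    int acks = 0;
    for (int i = 0; i < n_servers; i++)
        if (server_up[i])
            acks++;
    *acks_out = acks;
    return is_majority(acks, n_servers);
}
```

Because any two majorities share at least one server, a batch acknowledged by a majority is seen by every later majority.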


Summary

• Log is a sequential file

• Contains entire history of DB

• Many tricks to write it efficiently and carefully

• Many tricks to archive and recover it