Lecture 21: Distributed Systems

Metadata Journaling

• 1/2. Data write: Write data to final location; wait for completion (the wait is optional; see below for details).
• 1/2. Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.
• 3. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed.
• 4. Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.
• 5. Free: Later, mark the transaction free in the journal superblock.

Checkpoint

• In journaling
  • Write the contents of the update to their final locations within the file system.

• In LFS
  • Checkpoint regions are located at a special fixed position on disk.
  • A checkpoint region contains the addresses of all imap blocks, the current time, the address of the last segment written, etc.
  • What if we checkpoint too often?

LFS review

• Good for …?
• Bad for …?

Disk after Creating Two Files

Garbage Collection

• Pick M segments, compact into N (where N < M).
• Mechanism: how do we know whether data in segments is valid?
  • A segment summary lists the inode corresponding to each data block.

• Policy: which segments to compact?
  • A hot segment: the contents are being frequently overwritten.
  • A cold segment: may have a few dead blocks, but the rest of its contents are relatively stable.
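The mechanism bullet above can be sketched in C. This is only an illustration under simplifying assumptions: `summary_entry`, `inode`, `block_is_live`, and `PTRS_PER_INODE` are hypothetical names, the summary entry also records the block's offset within the file, and an in-memory inode array stands in for the lookup through the imap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_INODE 4   /* illustrative size, not from the lecture */

/* One segment-summary entry: which inode and file offset a data block belongs to. */
struct summary_entry {
    uint32_t inum;    /* owning inode number */
    uint32_t offset;  /* block index within the file */
};

/* Simplified inode: only direct block pointers (disk addresses). */
struct inode {
    uint32_t ptrs[PTRS_PER_INODE];
};

/* A block at disk address `addr` is live only if its inode still points at it. */
bool block_is_live(const struct inode *inodes, size_t ninodes,
                   struct summary_entry e, uint32_t addr)
{
    assert(inodes != NULL);
    if (e.inum >= ninodes || e.offset >= PTRS_PER_INODE)
        return false;   /* summary refers to nothing we know about */
    return inodes[e.inum].ptrs[e.offset] == addr;
}
```

A cleaner would call `block_is_live` for every block in a candidate segment and copy only the live ones into the compacted segments.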

Recovery

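• In journaling
  • If the crash happens before the journal commit (step 3) completes, skip the pending update.
  • If the crash happens after the commit completes but before the checkpoint finishes, the committed transactions are replayed.

• In LFS
  • Identify the newest consistent checkpoint region.
  • Roll-forward: scan BEYOND the last checkpoint to recover as much data as possible.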
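The journaling half of recovery can be sketched as a scan over the log. This is a toy model, not a real file system's on-disk format: the record layout, the names `log_rec`, `is_committed`, and `recover`, and the assumption that transactions appear one after another in the log are all illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum rec_type { TXB, UPDATE, TXE };

struct log_rec {
    enum rec_type type;
    int txid;
    int block;   /* UPDATE only: final on-disk block number */
    int value;   /* UPDATE only: new contents (one int stands in for a block) */
};

/* Returns 1 if transaction `txid`, which begins at log[start], has a TxE record. */
static int is_committed(const struct log_rec *log, size_t n, size_t start, int txid)
{
    for (size_t i = start; i < n; i++)
        if (log[i].type == TXE && log[i].txid == txid)
            return 1;
    return 0;
}

/* Replays committed transactions in log order; returns how many were replayed. */
int recover(const struct log_rec *log, size_t n, int *disk, size_t ndisk)
{
    int replayed = 0;
    assert(log != NULL && disk != NULL);
    for (size_t i = 0; i < n; i++) {
        if (log[i].type != TXB)
            continue;
        int txid = log[i].txid;
        if (!is_committed(log, n, i, txid))
            continue;   /* crash hit before commit: skip the pending update */
        for (size_t j = i + 1; j < n && log[j].type != TXE; j++) {
            if (log[j].type == UPDATE && log[j].txid == txid &&
                (size_t)log[j].block < ndisk)
                disk[log[j].block] = log[j].value;
        }
        replayed++;
    }
    return replayed;
}
```

Replaying a transaction that was already checkpointed is harmless here because each update simply rewrites the same value to the same block.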

Data Integrity and Protection

Disk Failure Modes

• Fail-stop as assumed by RAID

• Silent faults:
  • Latent-sector errors (LSEs): a disk sector (or group of sectors) has been damaged in some way
  • Block corruption

Handling Latent Sector Errors

• How to detect:
  • A storage system tries to access a block, and the disk returns an error

• How to fix:
  • Use whatever redundancy mechanism it has to return the correct data

Detecting Corruption: The Checksum

• Common Checksum Functions
  • XOR
  • Fletcher checksum
  • Cyclic redundancy check (CRC)
• Collision is possible
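Two of the checksum functions listed above are small enough to sketch directly. The function names `xor_checksum` and `fletcher16` and the byte-at-a-time XOR granularity are choices made for this example; real systems often XOR larger words and use wider Fletcher variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR checksum: fold the block one byte at a time. */
uint8_t xor_checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        sum ^= buf[i];
    return sum;
}

/* Fletcher-16: two running sums modulo 255; s2 makes the result order-sensitive. */
uint16_t fletcher16(const uint8_t *buf, size_t len)
{
    uint16_t s1 = 0, s2 = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        s1 = (uint16_t)((s1 + buf[i]) % 255);
        s2 = (uint16_t)((s2 + s1) % 255);
    }
    return (uint16_t)((s2 << 8) | s1);
}
```

The collision bullet is easy to see with XOR: the byte sequences `{0x01, 0x02}` and `{0x02, 0x01}` have the same XOR checksum, while Fletcher-16 distinguishes them because its second sum depends on position.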

Misdirected Writes

• Arises in disk and RAID controllers which write the data to disk correctly, except in the wrong location
• Fix: store a physical identifier (physical ID), such as the disk and block number, alongside each checksum

Lost Writes

• Occur when the device informs the upper layer that a write has completed, but in fact it is never persisted.
• Do any of our strategies from above (e.g., basic checksums, or physical ID) help to detect lost writes?
• Solutions:
  • Perform a write verify or read-after-write
  • Some systems add a checksum elsewhere in the system to detect lost writes.
  • ZFS includes a checksum in each file system inode and indirect block for every block included within a file

Scrubbing

• When do these checksums actually get checked?

• Many systems utilize disk scrubbing:
  • Periodically read through every block of the system
  • Check whether checksums are still valid
  • Schedule scans on a nightly or weekly basis

Overhead of Checksumming

• Space
  • Small

• Time
  • Noticeable
  • CPU overhead
  • I/O overhead

Distributed Systems

OSTEP Definition

• Def: more than 1 machine

• Examples:
  • client/server: web server and web client
  • cluster: PageRank computation

• Other courses
  • Networking
  • Distributed Systems

Why Go Distributed?

• More compute power

• More storage capacity

• Fault tolerance

• Data sharing

New Challenges

• System failure: need to worry about partial failure.

• Communication failure: links unreliable

• Performance

• Security

Communication

• All communication is inherently unreliable.

• Need to worry about:
  • bit errors
  • packet loss
  • node/link failure

Overview

• Raw messages

• Reliable messages

• OS abstractions
  • virtual memory
  • global file system

• Programming-language abstractions
  • remote procedure call

Raw Messages: UDP

• API:
  • reads and writes over socket file descriptors
  • messages sent from/to ports to target a process on a machine

• Provide minimal reliability features:
  • messages may be lost
  • messages may be reordered
  • messages may be duplicated
  • only protection: checksums

Raw Messages: UDP

• Advantages
  • lightweight
  • some applications make better reliability decisions themselves (e.g., video conferencing programs)

• Disadvantages
  • more difficult to write applications correctly

Reliable Messages Strategy

• Using software, build reliable, logical connections over unreliable connections.

• Strategies:
  • acknowledgment

ACK

• Sender knows message was received.

Sender                      Receiver
[send message]
                            [recv message]
                            [send ack]
[recv ack]

ACK

• Sender misses ACK... What to do?

Sender                      Receiver
[send message]



Reliable Messages Strategy

• Strategies:
  • acknowledgment
  • timeout

ACK

Sender                      Receiver
[send message]
[start timer]
... waiting for ack ...
[timer goes off]
[send message]
                            [recv message]
                            [send ack]
[recv ack]


Timeout: Issue 1

• How long to wait?
  • Too long: system feels unresponsive!
  • Too short: messages needlessly re-sent!

• Messages may have been dropped due to an overloaded server. Aggressive clients worsen this.
• One strategy: be adaptive!
  • Adjust time based on how long acks usually take.
  • For each missing ack, wait longer between retries.

Timeout: Issue 2

• What does a lost ack really mean?
  • Maybe the receiver did not get the message
  • Maybe the receiver got the message, but the ack was not delivered successfully

• ACK: message received exactly once
• No ACK: message received at most once
• Proposed Solution
  • Sender could send an AckAck so receiver knows whether to retry sending an Ack
  • Sound good?



Reliable Messages Strategy

• Strategies:
  • acknowledgment
  • timeout
  • remember sent messages

Receiver Remembers Messages

Sender                      Receiver
[send message]
                            [recv message]
                            [send ack]
[timeout]
[send message]
                            [ignore message]
                            [send ack]
[recv ack]

Solutions

• Solution 1: remember every message ever sent.

• Solution 2: sequence numbers
  • give each message a seq number
  • receiver knows all messages before N have been seen
  • receiver remembers messages received after N

TCP

• Most popular protocol based on seq nums.

• Also buffers messages so they arrive in order

• Timeouts are adaptive.

Overview

• Raw messages

• Reliable messages

• OS abstractions
  • virtual memory
  • global file system

• Programming-language abstractions
  • remote procedure call

Virtual Memory

• Inspiration: threads share memory

• Idea: processes on different machines share memory
• Strategy:
  • a bit like the swapping we saw before
  • instead of swapping to disk, swap to another machine
  • sometimes multiple copies may be in memory on different machines

Virtual Memory Problems

• What if a machine crashes?
  • mapping disappears in other machines
  • how to handle?

• Performance?
  • when to prefetch?
  • loads/stores expected to be fast

• DSM (distributed shared memory) is not in wide use today.

Global File System

• Advantages
  • file access is already expected to be slow
  • use common API
  • no need to modify applications (sorta true)

• Disadvantages
  • doesn’t always make sense, e.g., for a video app

RPC: Remote Procedure Call

• What could be easier than calling a function?

• Strategy: create wrappers so calling a function on another machine feels just like calling a local function.

• This abstraction is very common in industry.

RPC

Machine A

int main(...) {
    int x = foo("hello");   // looks like an ordinary local call
}

// client wrapper
int foo(char *msg) {
    send msg to B
    recv msg from B
}

Machine B

int foo(char *msg) {
    ...
}

// server wrapper
void foo_listener() {
    while (1) {
        recv, call foo
    }
}

RPC Tools

• RPC packages help with this with two components.

• (1) Stub generation
  • create wrappers automatically

• (2) Runtime library
  • thread pool
  • socket listeners call functions on server

Client Stub Steps

• Create a message buffer
• Pack the needed information into the message buffer
• Send the message to the destination RPC server
• Wait for the reply
• Unpack return code and other arguments
• Return to the caller

Server Stub Steps

• Unpack the message
• Call into the actual function
• Package the results
• Send the reply
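The client and server stub steps can be traced in a single-process sketch. Everything here is hypothetical scaffolding: `add` is an example remote function, `transport_call` stands in for the network send and wait by calling the server stub directly, and arguments are copied in host byte order, which is only safe because both "machines" are the same process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct message { uint8_t bytes[8]; size_t len; };

/* The actual function, living on the "server". */
static int32_t add(int32_t a, int32_t b) { return a + b; }

/* Server stub: unpack the message, call the function, package the result. */
static void server_stub(const struct message *req, struct message *reply)
{
    int32_t a, b, result;
    memcpy(&a, req->bytes, sizeof a);
    memcpy(&b, req->bytes + sizeof a, sizeof b);
    result = add(a, b);
    memcpy(reply->bytes, &result, sizeof result);
    reply->len = sizeof result;
}

/* Stand-in for send + wait-for-reply; a real stub would use the network. */
static void transport_call(const struct message *req, struct message *reply)
{
    server_stub(req, reply);
}

/* Client stub: called like a local add(), but goes through messages. */
int32_t add_rpc(int32_t a, int32_t b)
{
    struct message req, reply;
    int32_t result;
    memcpy(req.bytes, &a, sizeof a);              /* pack arguments */
    memcpy(req.bytes + sizeof a, &b, sizeof b);
    req.len = sizeof a + sizeof b;
    transport_call(&req, &reply);                 /* send, then wait for reply */
    assert(reply.len == sizeof result);
    memcpy(&result, reply.bytes, sizeof result);  /* unpack return value */
    return result;
}
```

From the caller's point of view, `add_rpc(2, 3)` reads exactly like a local call, which is the abstraction the stubs exist to provide.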

Wrapper Generation

• Wrappers must do conversions:
  • client arguments to message
  • message to server arguments
  • server return to message
  • message to client return

• Need uniform endianness (wrappers do this).

• Conversion is called marshaling/unmarshaling, or serializing/deserializing.

Stub Generation

• Many tools will automatically generate wrappers:
  • rpcgen
  • thrift
  • protobufs

• Programmer fills in generated stubs.

Wrapper Generation: Pointers

• Why are pointers problematic?
  • The address passed from the client will not be valid on the server.

• Solutions?
  • smart RPC package: follow pointers
  • distribute generic data structs with RPC package

Runtime Library

• Naming: how to locate a remote service

• How to serve calls?
  • usually with a thread pool

• What underlying protocol to use?
  • usually UDP

• Some RPC packages enable asynchronous RPC

RPC over UDP

• Strategy: use function return as implicit ACK.

• Piggybacking technique.

• What if the function takes a long time?
  • then send a separate ACK

Conclusion

• Many communication abstractions are possible:
  • Raw messages (UDP)
  • Reliable messages (TCP)
  • Virtual memory (OS)
  • Global file system (OS)
  • Function calls (RPC)

Next

• NFS
• AFS
