Coda Server Internals Peter J Braam


Coda Server Internals

Peter J Braam


Contents

  • Data structure overview
  • Volumes
  • Vnodes
  • Inodes


Data Structure Overview

Object | Purpose | Resides where
-------|---------|--------------
Inodes | File contents | /vicep* partitions
Volumes, vnodes, directory contents, ACLs, reslogs | Metadata & directory contents | RVM
Volinfo records | Volume location | VLDB, VRDB: RW db files
VSGDB, pdb records, tokens | Security | VSGDB, .pdb, .tk files: dynamic RO db files
Servers/SCM, partitions, startup flags, skipvolumes, LOG & DATA & DB locators | Configuration data | Static data


RVM layout (coda_globals.h)

  • Already_initialized (int)
  • struct VolHead[MAXVOLS]
  • struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
  • short SmallVnodeIndex
  • … same for large …
  • MaxVolId (unsigned long)
  • Remainder is dynamically allocated
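
The layout above can be pictured as a single struct at the start of the recoverable segment. This is only a sketch: the struct name, the array field name VolumeList and the two constant values are assumptions made for illustration, and VolHead is left empty here.

```cpp
#include <cstddef>

// Hypothetical values: the slide names these constants but does not give them.
constexpr std::size_t MAXVOLS     = 1024;
constexpr std::size_t SM_FREESIZE = 512;

struct VolHead { };          // per-volume header and data, elided in this sketch
struct VnodeDiskObject;      // vnode as stored in RVM, elided in this sketch

// Fixed-size prefix of the recoverable segment.
// The struct name and the VolumeList field name are hypothetical.
struct RecoverableGlobals {
    int              Already_initialized;
    VolHead          VolumeList[MAXVOLS];
    VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE];
    short            SmallVnodeIndex;
    // ... same pair of fields for large vnodes ...
    unsigned long    MaxVolId;
};

// Number of volume slots that exist before any dynamic allocation.
constexpr std::size_t volumeSlots() {
    return sizeof(RecoverableGlobals::VolumeList) / sizeof(VolHead);
}
```

Everything outside this fixed prefix would be reached through pointers into dynamically allocated recoverable memory.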


Volume zoo (volume.h, camprivate.h)

  • RVM structures: VolumeData, VolHead, VolumeHeader, VolumeDiskData
  • VM structures: Volume, VolumeInfo, ……..


A volume in RVM

  • VolHead contains:
      • VolumeHeader: stamp, id, parentid, type
      • VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists (same for big)
  • The pointers in VolumeData point to rvm malloced data


VolumeDiskData (rvm)

Lots of stuff:

  • Identity & location: partition, name
  • Runtime info: use, inService, blessed, salvaged
  • Vnode related: next uniquefier
  • Version vector
  • Resolution flags, pointer to recov_vol_log
  • Quota
  • Resource usage: filecount, diskused etc.


Volumes in VM

  • struct Volumes sit in VolHash with copies of RVM data structures
  • Salvage before “attaching” to VolHash
  • Model of operation (FS):
      • GetVolume: copy out from RVM
      • Do your mods in VM
      • PutVolume does RVM transaction
  • Model of operation (Volutil): operate on RVM


Volumes in Venus RPC’s

  • One RPC: GetVolInfo, used for mount point traversal
  • Only relates to:
      • volume location database
      • volume replication database
      • VSGDB
  • Could sit in separate Volume Location Server


Vnodes (cvnode.h)

  • Small & large: large for directories
      • difference is ACL at back of large vnodes
  • Inode field:
      • small vnodes: points to diskfile inode number
      • large vnodes: is RVM address of dir inode
  • Contain important small structure: vv_t
  • Pointers to reslog entries
  • VM: cvnode’s with hash table, freelists etc.
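
The two readings of the inode field can be sketched as a tagged union. This is an illustration only: the type names, the union and the helper functions are hypothetical and are not taken from cvnode.h.

```cpp
#include <cassert>
#include <cstdint>

enum VnodeClass { SmallVnode, LargeVnode };   // large vnodes are directories

struct DirInode;   // directory data kept in RVM, elided in this sketch

// Hypothetical sketch: one field with two readings, selected by vnode class.
struct VnodeSketch {
    VnodeClass cls;
    union {
        std::uint32_t diskInodeNumber;   // small vnode: inode number of the disk file
        DirInode     *dirInode;          // large vnode: RVM address of the dir inode
    } inode;
};

inline bool isDirectory(const VnodeSketch &v) { return v.cls == LargeVnode; }

inline std::uint32_t fileInode(const VnodeSketch &v) {
    assert(!isDirectory(v));   // only small vnodes carry a disk inode number
    return v.inode.diskInodeNumber;
}
```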


Vnodes in RVM

  • RVM: VnodeDiskinfo (rvm_malloced)
  • vnodes sit on rec_smolists
      • each link points to a DiskVnode
      • lists link vnodes with identical vnodenumbers but different uniquefiers
      • new vnodes grabbed from FreeLists (index.cc, recov{a,b,c}.cc)
  • volumes have arrays of rec_smolists, which grow when they are full
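
A toy model of the lookup described above, in ordinary memory rather than RVM. The class name, the direct use of the vnode number as array index and the growth policy are assumptions made for the sketch; only the idea of one list per vnode number, searched by uniquefier, comes from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

struct DiskVnode {
    unsigned long vnodeNumber;
    unsigned long uniquefier;
};

// Toy stand-in for a volume's array of rec_smolists (ordinary memory, not RVM).
class VnodeLists {
public:
    explicit VnodeLists(std::size_t initialLists) : lists_(initialLists) {}

    // Assumption: the list index is the vnode number itself.
    void insert(const DiskVnode &v) {
        assert(find(v.vnodeNumber, v.uniquefier) == nullptr);   // no duplicate fids
        if (v.vnodeNumber >= lists_.size())
            lists_.resize(2 * v.vnodeNumber + 1);   // grow the array when it is full
        lists_[v.vnodeNumber].push_back(v);
    }

    // Walk the list for this vnode number and match on the uniquefier.
    const DiskVnode *find(unsigned long vnodeNumber, unsigned long uniquefier) const {
        if (vnodeNumber >= lists_.size()) return nullptr;
        const std::list<DiskVnode> &l = lists_[vnodeNumber];
        for (auto it = l.begin(); it != l.end(); ++it)
            if (it->uniquefier == uniquefier) return &*it;
        return nullptr;
    }

    std::size_t listCount() const { return lists_.size(); }

private:
    std::vector<std::list<DiskVnode> > lists_;
};
```

Each lookup indexes the array, walks a list and then follows the link to the vnode itself, which is where a chain of several pointer dereferences comes from.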


Vnodes in action

  • Model:
      • GetFSObj calls GetVnode
      • work is done
      • PutFSObjects calls:
          • rvm_begin_transaction
          • ReplaceVnode - copies data from VM to RVM
          • rvm_end_transaction
  • Getting a vnode takes 3 pointer derefs, possibly 3 page faults vs. 1 for local file systems.
  • Is this necessary? Probably not. Cure it: yes!


Directories (rvm)

  • DirInode
      • page table and “copy on write” refcount
  • DirPages
      • 2048 bytes each, build up the directory
      • divided into 64 32-byte blobs
      • Hash table for fast name lookups
      • Blob Freelist
      • Array of free blobs per page
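
The page arithmetic is easy to check in code. The sketch below only encodes the sizes given on the slide; the helper functions are hypothetical, and the hash table and free-blob bookkeeping that live inside real pages are not modelled.

```cpp
#include <cstddef>

constexpr std::size_t BLOB_SIZE      = 32;    // bytes per blob
constexpr std::size_t BLOBS_PER_PAGE = 64;    // blobs per DirPage
constexpr std::size_t DIR_PAGE_SIZE  = 2048;  // bytes per DirPage

struct Blob    { unsigned char bytes[BLOB_SIZE]; };
struct DirPage { Blob blobs[BLOBS_PER_PAGE]; };

static_assert(sizeof(DirPage) == DIR_PAGE_SIZE,
              "64 blobs of 32 bytes fill one 2048-byte page");

// Hypothetical helpers: locate a blob given its number within the whole directory.
constexpr std::size_t pageOf(std::size_t blobNumber) { return blobNumber / BLOBS_PER_PAGE; }
constexpr std::size_t slotOf(std::size_t blobNumber) { return blobNumber % BLOBS_PER_PAGE; }

// Blobs needed to hold `bytes` bytes of entry data, rounding up.
constexpr std::size_t blobsFor(std::size_t bytes) {
    return (bytes + BLOB_SIZE - 1) / BLOB_SIZE;
}
```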


Directories

  • More than one vnode can point to directory (copy on write)
  • VM: hash table of DirHandles
      • point to VM contiguous copy of dir
      • point to DirInode
      • have a lock etc.
  • Model: as for volumes & vnodes
  • Critique: too baroque
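
Copy on write with a refcount can be sketched as follows. This is a minimal illustration in ordinary heap memory, not the server's RVM code; the struct and function names are hypothetical, and a vector of names stands in for the page table.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy directory inode: the entries stand in for the page table.
struct DirInode {
    int refcount;
    std::vector<std::string> entries;
};

struct DirVnode {
    DirInode *dir;
};

// Let a second vnode point at the same directory data.
inline DirVnode shareDir(DirVnode &src) {
    ++src.dir->refcount;
    DirVnode copy = { src.dir };
    return copy;
}

// Before a write: if the data is shared, give this vnode a private copy.
inline void prepareForWrite(DirVnode &v) {
    assert(v.dir->refcount >= 1);
    if (v.dir->refcount == 1) return;   // sole owner, write in place
    DirInode *priv = new DirInode(*v.dir);
    priv->refcount = 1;
    --v.dir->refcount;
    v.dir = priv;
}

inline void addEntry(DirVnode &v, const std::string &name) {
    prepareForWrite(v);
    v.dir->entries.push_back(name);
}

// Drop this vnode's reference and free the data when nobody else uses it.
inline void releaseDir(DirVnode &v) {
    if (--v.dir->refcount == 0) delete v.dir;
    v.dir = nullptr;
}
```

The writer pays for the copy only when the directory is actually shared, while readers of the other vnode keep seeing the old contents.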


Files

  • Vnode references file by InodeNumber
  • Files are copy on write
  • There are “FileInodes” like dir inodes, but they are held in external DB or in inode itself
  • Server always reads/writes whole files (could be exploited)


Volinit and salvage

  • Set up volume hash table, serverlist, DiskPartitionList
  • Cycle through partitions, check each for list of inodes:
      • every inode has a vnode
      • every vnode has a directory name
      • every directory name has a vnode
  • Put volume in a VM hash table
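
The three salvage invariants can be written as a small consistency check. This is a simplified sketch: the VolumeScan type and its fields are hypothetical, and it ignores details such as the root vnode or directories whose data lives in RVM rather than in a partition inode.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

typedef unsigned long InodeNo;
typedef unsigned long VnodeNo;

// Hypothetical flattened view of one volume as a salvager might see it.
struct VolumeScan {
    std::set<InodeNo>              inodes;       // inodes found on the partition
    std::map<VnodeNo, InodeNo>     vnodeInode;   // each vnode and the inode it references
    std::map<std::string, VnodeNo> dirEntries;   // each directory name and its vnode
};

// Collect a description of every violated invariant; empty means consistent.
inline std::vector<std::string> checkVolume(const VolumeScan &v) {
    std::vector<std::string> problems;

    std::set<InodeNo> referenced;
    for (auto it = v.vnodeInode.begin(); it != v.vnodeInode.end(); ++it)
        referenced.insert(it->second);
    std::set<VnodeNo> named;
    for (auto it = v.dirEntries.begin(); it != v.dirEntries.end(); ++it)
        named.insert(it->second);

    for (auto it = v.inodes.begin(); it != v.inodes.end(); ++it)
        if (referenced.count(*it) == 0) problems.push_back("inode without vnode");
    for (auto it = v.vnodeInode.begin(); it != v.vnodeInode.end(); ++it)
        if (named.count(it->first) == 0) problems.push_back("vnode without directory name");
    for (auto it = v.dirEntries.begin(); it != v.dirEntries.end(); ++it)
        if (v.vnodeInode.count(it->second) == 0) problems.push_back("directory name without vnode");
    return problems;
}
```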


Server connection info

  • Array of HostEntry (a “venus”)
      • Contains a linked list of connections
      • Contains a callback connection id
  • Connection setup
      • first binding creates a host & callback conn
      • new binding creates a new connection and verifies callback
      • in RPC2_NewBinding & ViceNewConnectFS


Callbacks

  • Hashtable of FileEntries: each contains
      • Fid
      • number of users
      • linked list of callbacks
  • Callbacks: point to HostEntry
  • Ops:
      • RPC: BreakCallBack
      • Local: placing, delete, deleteVenus
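
A toy version of the callback table and its operations. It is a sketch under several assumptions: an ordered map stands in for the hash table, an integer host id stands in for the pointer to a HostEntry, "number of users" is taken to mean the number of hosts holding a callback, and breaking callbacks is assumed to skip the host that made the change. All class and method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <vector>

typedef int HostId;   // stand-in for a pointer to a HostEntry

struct Fid {
    unsigned long volume, vnode, unique;
    bool operator<(const Fid &o) const {
        if (volume != o.volume) return volume < o.volume;
        if (vnode != o.vnode) return vnode < o.vnode;
        return unique < o.unique;
    }
};

struct FileEntry {
    int users;                      // assumed: number of hosts holding a callback
    std::list<HostId> callbacks;    // one element per callback
    FileEntry() : users(0) {}
};

class CallbackTable {
public:
    // Local op: place a callback for (fid, host).
    void add(const Fid &fid, HostId host) {
        FileEntry &fe = table_[fid];
        for (auto it = fe.callbacks.begin(); it != fe.callbacks.end(); ++it)
            if (*it == host) return;   // already placed
        fe.callbacks.push_back(host);
        ++fe.users;
    }

    // Local op: delete one host's callback on one file.
    void remove(const Fid &fid, HostId host) {
        auto it = table_.find(fid);
        if (it == table_.end()) return;
        std::size_t before = it->second.callbacks.size();
        it->second.callbacks.remove(host);
        it->second.users -= static_cast<int>(before - it->second.callbacks.size());
        assert(it->second.users >= 0);
        if (it->second.callbacks.empty()) table_.erase(it);
    }

    // Local op: delete every callback held by one Venus.
    void removeHost(HostId host) {
        std::vector<Fid> fids;
        for (auto it = table_.begin(); it != table_.end(); ++it) fids.push_back(it->first);
        for (std::size_t i = 0; i < fids.size(); ++i) remove(fids[i], host);
    }

    // Hosts that would receive a BreakCallBack RPC when fid changes.
    std::vector<HostId> breakCallbacks(const Fid &fid, HostId writer) {
        std::vector<HostId> notify;
        auto it = table_.find(fid);
        if (it == table_.end()) return notify;
        for (auto h = it->second.callbacks.begin(); h != it->second.callbacks.end(); ++h)
            if (*h != writer) notify.push_back(*h);
        for (std::size_t i = 0; i < notify.size(); ++i) remove(fid, notify[i]);
        return notify;
    }

    int users(const Fid &fid) const {
        auto it = table_.find(fid);
        return it == table_.end() ? 0 : it->second.users;
    }

private:
    std::map<Fid, FileEntry> table_;
};
```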


Callbacks

  • Connection is non-authenticated. Should be fixed.
  • Session key for CB connection should not expire.
  • Side effect of callback connection is used for BackFetch: bulk transfer of files during reintegration.


RPC processing

  • Venus RPC’s:
      • srvproc.cc - standard file ops
      • srvproc2.cc - standard volume ops
      • codaproc.cc - repair stuff
      • codaproc2.cc - reintegration stuff
  • Volutil RPC’s: vol-your-rpc.cc (in coda-src/volutil)
  • Resolution: below


RPC processing

RPC structure:

  • ValidateParms: validate, hand off COP2, cid
  • GetObject: vm copy, lock objects
  • CheckSemantics: Concurrency, Integrity, Permissions
  • Perform operations: BulkTransfer, UpdateObjects, OutParms
  • PutObject: rvm transactions, inode deletions


vlists

  • GetFSObjects: instantiate a vlist
      • RPC needs list of objects copied from RVM
      • Modification status is held there (did CopyOnWrite kick in etc.)
  • PutObjects
      • rvm_begin_transaction
      • walk through the list: copy, rvm_set_range, unlock
      • rvm_end_transaction
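
The get/modify/put cycle around a vlist can be modelled in a few lines. This is a toy: ToyRvm only mimics the bracketing of a transaction (no logging, no recovery), and every type and function name is hypothetical. The range is declared before the copy, since RVM expects a region to be announced before it is modified.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Vnode { int length; int dataVersion; };

// Toy recoverable store: begin/end only mimic a transaction bracket.
struct ToyRvm {
    std::map<int, Vnode> vnodes;   // keyed by a hypothetical vnode number
    bool inTransaction;
    int  rangesSet;                // counts set-range calls, for illustration
    ToyRvm() : inTransaction(false), rangesSet(0) {}
    void begin() { assert(!inTransaction); inTransaction = true; }
    void setRange(int) { assert(inTransaction); ++rangesSet; }
    void end() { assert(inTransaction); inTransaction = false; }
};

struct VlistEntry {
    int   vnodeNumber;
    Vnode copy;        // VM copy that the RPC works on
    bool  modified;    // modification status is held in the vlist
    bool  locked;
};

typedef std::vector<VlistEntry> Vlist;

// Get step: copy the object out of the store and lock it.
inline void getObject(const ToyRvm &rvm, Vlist &vl, int vnodeNumber) {
    VlistEntry e = { vnodeNumber, rvm.vnodes.at(vnodeNumber), false, true };
    vl.push_back(e);
}

// Put step: one transaction covering every modified entry.
inline void putObjects(ToyRvm &rvm, Vlist &vl) {
    rvm.begin();
    for (std::size_t i = 0; i < vl.size(); ++i) {
        if (vl[i].modified) {
            rvm.setRange(vl[i].vnodeNumber);              // announce the region first
            rvm.vnodes[vl[i].vnodeNumber] = vl[i].copy;   // then copy VM data back
        }
        vl[i].locked = false;                             // unlock
    }
    rvm.end();
    vl.clear();
}
```

Because all changes are applied inside one bracket, the store never holds a state in which only some of the objects touched by an RPC have been updated.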


COP2 handling

  • In COP2 Venus gives final VV to server
  • COP2 messages are sent out by Venus (with some delay), often piggybacked in bulk
  • Server knows about pending COP2 entries in hash table (coppend.cc)
  • Manager thread CopPendingManager
      • Runs every minute
      • Removes entries more than 900 secs old
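
The pending-COP2 bookkeeping amounts to a table of timestamps plus a periodic sweep. The sketch below assumes entries are keyed by a store identifier and uses an ordered map in place of the hash table; the class and method names are hypothetical and this is not the code in coppend.cc. A manager thread would call expire roughly once a minute.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

typedef unsigned long StoreId;   // assumed key: identifier of the update
typedef long Seconds;

const Seconds COP_PENDING_MAX_AGE = 900;   // from the slide

class CopPendingTable {
public:
    // Remember that a COP2 for this update is still outstanding.
    void add(StoreId id, Seconds now) { pending_[id] = now; }

    // The COP2 arrived; returns false if the entry is unknown or already expired.
    bool complete(StoreId id) { return pending_.erase(id) == 1; }

    // One pass of the manager: drop entries more than COP_PENDING_MAX_AGE old.
    std::size_t expire(Seconds now) {
        std::size_t removed = 0;
        for (auto it = pending_.begin(); it != pending_.end(); ) {
            assert(it->second <= now);   // timestamps never lie in the future
            if (now - it->second > COP_PENDING_MAX_AGE) {
                pending_.erase(it++);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::map<StoreId, Seconds> pending_;
};
```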


Cop2 to RVM

  • Data can be:
      • piggybacked on another rpc
      • sent in ViceCop2 rpc
  • Both cases call InternalCop2 (srvproc.cc)
  • InternalCop2 (codaproc.cc):
      • notifies the manager to dequeue
      • gets the FS objects listed for the COP2
      • installs final VV’s into RVM (rvm transaction!)


COP2 Problems

Easy cause of conflicts in replicated volumes when clients access objects in rapid succession. (Can be fixed easily during the writeback caching operation)

Not optimized for singly replicated volume.


Resolution

  • Initiated by client with RPC to coordinator: ViceResolve (codaproc.cc)
  • coordinator sets up connections in VSG (unauthenticated)
  • LockAndFetch (res/reslock, resutil): lock volumes, collect “closure”


Resolution - special cases

  • RegResDirRequired (rvmres/rvmrescoord.cc)
  • check for:
      • unresolved ancestors
      • already inconsistent
      • runts (missing objects)
      • weak equality (identical storeid)
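
Weak equality is easiest to see next to the other outcomes of comparing two version vectors. The sketch below uses the standard element-wise comparison and treats replicas whose counters differ but whose store ids match as weakly equal. The type names, the field layout and the group size are assumptions for illustration, not the server's vv_t.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

const std::size_t MAX_REPLICAS = 8;   // hypothetical replication-group size

struct VersionVector {
    std::array<long, MAX_REPLICAS> versions;   // one update counter per replica
    unsigned long storeId;                     // identifies the last update applied
};

enum VvRelation { VvEqual, VvWeaklyEqual, VvDominates, VvDominated, VvInconsistent };

inline VvRelation compare(const VersionVector &a, const VersionVector &b) {
    bool aAhead = false, bAhead = false;
    for (std::size_t i = 0; i < MAX_REPLICAS; ++i) {
        if (a.versions[i] > b.versions[i]) aAhead = true;
        if (a.versions[i] < b.versions[i]) bAhead = true;
    }
    if (!aAhead && !bAhead) return VvEqual;
    // Counters differ, but both replicas saw the same last update.
    if (a.storeId == b.storeId) return VvWeaklyEqual;
    if (aAhead && !bAhead) return VvDominates;
    if (bAhead && !aAhead) return VvDominated;
    return VvInconsistent;
}
```

In the weakly equal case the data already agrees, so only the version vectors need to be brought back in line.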


RecovDirResolve

  • Phase II: (rvmres/{rescoord,subphase?}.cc)
      • coordinator requests logs from other servers
      • subordinates lock affected dirs, marshal logs
      • coordinator merges logs
  • Phase III:
      • ship merged log to subordinates
      • perform operations on VM copies
      • return results to coordinator


Resolution

  • Phase IV: (is old Phase 3 …)
      • collect results, compute new VV’s
      • ship to subordinates
      • commit results


Comments on resolution

  • Old versions of resolution:
      • OldDirResolve: resolve only runts and weak
      • DirResolve: resolve only in VM
      • Remove these
  • resolve directory has nothing to do with resolution: should be called librepair. Srv uses merely one function in it - repair uses the rest


Volume Log

During FS operations, log entries are created for use during resolution

Different format per operation (rvmres/recov_vollog.cc)

Added to the vlist by SpoolVMLogRecord

Put in RVM at commit time


Repair

  • Venus makes ViceRepair RPC.
  • File and symlink repair: BulkTransfer the object
  • Directory repair: BulkTransfer the repair file and replay operations
  • Venus follows this with a COP2 multi rpc
  • For directory repair Venus invokes asynchronous resolve


Future

  • Good:
      • Design is simple and efficient
      • There is little C++: should eliminate
      • easy to multi-thread
  • Bad:
      • Scalability: ~8GB in practice, ~40GB in theory
      • Data handling is bad: tricky to fix
      • Volume code was & is worst: rewrite