Coda Server Internals
Peter J Braam
Contents
- Data structure overview
- Volumes
- Vnodes
- Inodes
Data Structure Overview
Object                                        Purpose                    Resides where
------                                        -------                    -------------
Inodes                                        File contents              /vicep* partitions
Volumes, Vnodes, Dir contents, ACLs, Reslogs  Meta data & dir contents   RVM
Volinfo records                               Volume location            VLDB, VRDB (RW db files)
VSGDB, pdb records, Tokens                    Security                   VSGDB, .pdb, .tk files (dynamic RO db files)
Servers/SCM, Partitions, Startup flags,       Configuration data         Static data
  Skipvolumes, LOG & DATA & DB locators
RVM layout (coda_globals.h)
- Already_initialized (int)
- struct VolHead[MAXVOLS]
- struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
- short SmallVnodeIndex
- ... the same fields again for large vnodes ...
- MaxVolId (unsigned long)
- The remainder of the segment is dynamically allocated
Volume zoo (volume.h, camprivate.h)
- RVM structures: VolumeData, VolHead, VolumeHeader, VolumeDiskData
- VM structures: Volume, VolumeInfo, ...
A volume in RVM
- VolHead contains a VolumeHeader and a VolumeData
- VolumeHeader: stamp, id, parentid, type
- VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists -- same again for big vnodes --
- These contain pointers to rvm_malloc'ed data
VolumeDiskData (rvm)
Lots of stuff:
- Identity & location: partition, name
- Runtime info: use, inService, blessed, salvaged
- Vnode related: next uniquifier
- Version vector
- Resolution flags, pointer to recov_vol_log
- Quota
- Resource usage: filecount, diskused, etc.
Volumes in VM
- struct Volume entries sit in VolHash with copies of the RVM data structures
- Salvage runs before "attaching" to VolHash
- Model of operation (FS): GetVolume copies out from RVM; modifications are made in VM; PutVolume commits with an RVM transaction
- Model of operation (Volutil): operate on RVM directly
Volumes in Venus RPCs
- One RPC, GetVolInfo, is used for mount point traversal
- It relates only to the volume location database, the volume replication database, and the VSGDB
- It could sit in a separate volume location server
Vnodes (cvnode.h)
- Small & large: large vnodes are for directories; the difference is the ACL at the back of large vnodes
- Inode field: in small vnodes it holds the disk file's inode number; in large vnodes it is the RVM address of the directory inode
- They contain an important small structure, vv_t, plus pointers to reslog entries
- VM: cvnodes with a hash table, free lists, etc.
Vnodes in RVM
- RVM: VnodeDiskObject (rvm_malloc'ed); vnodes sit on rec_smolists
- Each list link points to a DiskVnode; the lists link vnodes with identical vnode numbers but different uniquifiers
- New vnodes are grabbed from the free lists (index.cc, recov{a,b,c}.cc)
- Volumes have arrays of rec_smolists, which grow when they are full
Vnodes in action
- Model: GetFSObj calls GetVnode; the work is done; PutFSObjects calls rvm_begin_transaction, ReplaceVnode (which copies data from VM to RVM), rvm_end_transaction
- Getting a vnode takes 3 pointer dereferences, and possibly 3 page faults, vs. 1 for local file systems
- Is this necessary? Probably not. Should it be cured? Yes!
Directories (rvm)
- DirInode: page table and "copy on write" refcount
- DirPages, 2048 bytes each, build up the directory; each page is divided into 64 32-byte blobs
- Hash table for fast name lookups
- Blob free list: array of free blobs per page
Directories
- More than one vnode can point to a directory (copy on write)
- VM: a hash table of DirHandles, which point to a contiguous VM copy of the directory, point to the DirInode, hold a lock, etc.
- Model: as for volumes & vnodes
- Critique: too baroque
Files
- A vnode references a file by inode number
- Files are copy on write
- There are "FileInodes", like dir inodes, but they are held in an external DB or in the inode itself
- The server always reads/writes whole files (this could be exploited)
Volinit and salvage
- Set up the volume hash table, server list, and DiskPartitionList
- Cycle through the partitions, checking each partition's list of inodes: every inode has a vnode, every vnode has a directory name, every directory name has a vnode
- Put each volume in a VM hash table
Server connection info
- Array of HostEntry structures (one per "venus"); each contains a linked list of connections and a callback connection id
- Connection setup: the first binding creates a host & callback connection; each new binding creates a new connection and verifies the callback, in RPC2_NewBinding & ViceNewConnectFS
Callbacks
- Hash table of FileEntries; each contains a Fid, the number of users, and a linked list of callbacks
- Callbacks point to a HostEntry
- Ops: RPC: BreakCallBack; local: placing, delete, deleteVenus
Callbacks
- The connection is unauthenticated; this should be fixed. The session key for the CB connection should not expire.
- As a side effect, the callback connection is used for BackFetch, the bulk transfer of files during reintegration.
RPC processing
- Venus RPCs: srvproc.cc (standard file ops), srvproc2.cc (standard volume ops), codaproc.cc (repair), codaproc2.cc (reintegration)
- Volutil RPCs: vol-your-rpc.cc (in coda-src/volutil)
- Resolution: see below
RPC processing
RPC structure:
- ValidateParms: validate, hand off COP2, cid
- GetObject: make VM copies, lock objects
- CheckSemantics: concurrency, integrity, permissions
- Perform operations: BulkTransfer, UpdateObjects, OutParms
- PutObject: RVM transactions, inode deletions
vlists
- GetFSObjects: instantiates a vlist; an RPC needs the list of objects copied from RVM; modification status is held there (did CopyOnWrite kick in, etc.)
- PutObjects: rvm_begin_transaction; walk through the list, copy, rvm_set_range, unlock; rvm_end_transaction
COP2 handling
- With COP2, Venus gives the final VV to the server
- COP2s are sent out by Venus (with some delay), often piggybacked in bulk
- The server tracks pending COP2 entries in a hash table (coppend.cc)
- The manager thread CopPendingManager runs every minute and removes entries more than 900 seconds old
COP2 to RVM
- The data can be piggybacked on another RPC or sent in a ViceCop2 RPC
- Both cases call InternalCop2 (codaproc.cc), which notifies the manager to dequeue the entry, gets the FS objects listed in the COP2, and installs the final VVs into RVM (an RVM transaction!)
COP2 Problems
- An easy cause of conflicts in replicated volumes when clients access objects in rapid succession (this can be fixed easily during the writeback caching work)
- Not optimized for singly replicated volumes
Resolution
- Initiated by the client with an RPC to the coordinator: ViceResolve (codaproc.cc)
- The coordinator sets up connections within the VSG (unauthenticated)
- LockAndFetch (res/reslock, resutil): lock volumes, collect the "closure"
Resolution - special cases
- RegResDirRequired (rvmres/rvmrescoord.cc) checks for: unresolved ancestors; already inconsistent objects; runts (missing objects); weak equality (identical storeid)
RecovDirResolve
- Phase II (rvmres/{rescoord,subphase?}.cc): the coordinator requests logs from the other servers; subordinates lock the affected dirs and marshall their logs; the coordinator merges the logs
- Phase III: ship the merged log to the subordinates; perform the operations on VM copies; return results to the coordinator
Resolution
- Phase IV (the old Phase 3): collect results, compute new VVs, ship them to the subordinates, commit the results
Comments on resolution
- Old versions of resolution: OldDirResolve (resolves only runts and weak equality) and DirResolve (resolves only in VM). Remove these.
- The resolve directory has nothing to do with resolution: it should be called librepair. Srv uses merely one function in it; repair uses the rest.
Volume Log
- During FS operations, log entries are created for use during resolution
- There is a different record format per operation (rvmres/recov_vollog.cc)
- Records are added to the vlist by SpoolVMLogRecord
- They are put in RVM at commit time
Repair
- Venus makes a ViceRepair RPC
- File and symlink repair: BulkTransfer the object
- Directory repair: BulkTransfer the repair file and replay the operations
- Venus follows this with a COP2 multi-RPC
- For directory repair, Venus invokes an asynchronous resolve
Future
Good: the design is simple and efficient; there is little C++ (it should be eliminated); easy to multi-thread.
Bad: scalability is ~8GB in practice, ~40GB in theory; data handling is bad and tricky to fix; the volume code was & is the worst: rewrite it.