Future of CephFS
Sage Weil
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[Architecture diagram: apps sit on LIBRADOS and RADOSGW, hosts/VMs on RBD, clients on CEPH FS, all on top of RADOS]
Finally, let's talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing, except that the same drive could be shared by everyone you've ever met (and everyone they've ever met).
[Diagram: the client sends metadata requests to the metadata servers (M) and reads/writes file data directly on the OSDs]
Remember all that meta-data we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
There are multiple MDSs!
Metadata Server
Manages metadata for a POSIX-compliant shared filesystem
Directory hierarchy
File metadata (owner, timestamps, mode, etc.)
Stores metadata in RADOS
Does not serve file data to clients
Only required for shared filesystem
If you aren't running Ceph FS, you don't need to deploy metadata servers.
legacy metadata storage
a scaling disaster
name → inode → block list → data
no inode table locality
fragmentation: inode table, directory
many seeks
difficult to partition
[Diagram: a conventional directory tree: usr, etc, var, home, vmlinuz, passwd, mtab, hosts, lib, include, bin]
ceph fs metadata storage
block lists unnecessary
inode table mostly useless
APIs are path-based, not inode-based
no random table access, sloppy caching
embed inodes inside directories
good locality, prefetching
leverage key/value object
[Diagram: the same tree stored as directory objects (/, usr, etc, var, home), with the inodes for vmlinuz, passwd, mtab, hosts, lib, include, and bin embedded in them]
controlling metadata io
view ceph-mds as cache
reduce reads
dir + inode prefetching
reduce writes
consolidate multiple writes
large journal or log
stripe over objects
two tiers
journal for short term
per-directory for long term
fast failure recovery
[Diagram: the MDS journal (short term) and per-directory objects (long term), both stored in RADOS]
one tree
three metadata servers
??
So how do you have one tree and multiple servers?
load distribution
coarse (static subtree)
preserve locality
high management overhead
fine (hash)
always balanced
less vulnerable to hot spots
destroy hierarchy, locality
can a dynamic approach capture benefits of both extremes?
[Diagram: a spectrum from static subtree partitioning (good locality) through hashing directories to hashing files (good balance)]
If there's just one MDS (which is a terrible idea), it manages metadata for the entire tree.
When the second one comes along, it will intelligently partition the work by taking a subtree.
When the third MDS arrives, it will attempt to split the tree again.
Same with the fourth.
DYNAMIC SUBTREE PARTITIONING
An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called dynamic subtree partitioning.
scalable
arbitrarily partition metadata
adaptive
move work from busy to idle servers
replicate hot metadata
efficient
hierarchical partition preserves locality
dynamic
daemons can join/leave
take over for failed nodes
dynamic subtree partitioning
Dynamic partitioning
many directories
same directory
Failure recovery
Metadata replication and availability
Metadata cluster scaling
client protocol
highly stateful
consistent, fine-grained caching
seamless hand-off between ceph-mds daemons
when client traverses hierarchy
when metadata is migrated between servers
direct access to OSDs for file I/O
an example
mount -t ceph 1.2.3.4:/ /mnt → 3 ceph-mon RT, 2 ceph-mds RT (1 ceph-mds-to-ceph-osd RT)
cd /mnt/foo/bar → 2 ceph-mds RT (2 ceph-mds-to-ceph-osd RT)
ls -al (open, readdir, stat each file, close) → 1 ceph-mds RT (1 ceph-mds-to-ceph-osd RT)
cp * /tmp → N ceph-osd RT
(RT = round trip to ceph-mon, ceph-mds, or ceph-osd)
recursive accounting
ceph-mds tracks recursive directory stats
file sizes
file and directory counts
modification time
virtual xattrs present full stats
efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1 pg2419992 23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko adm 19G 2011-01-21 12:17 luko
drwx--x--- 1 eest adm 14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2 pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
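The recursive stats are exposed as virtual extended attributes on directories. As a rough sketch (not from the slides): a program linked against libcephfs, one of the client options shown a bit later, can read them with ceph_getxattr. The attribute name ceph.dir.rbytes is the conventional one for recursive bytes but may vary by release, and /pomceph is just borrowed from the listing above; both are illustrative.

/* Rough sketch: read CephFS recursive accounting via a virtual xattr.
 * Assumes a reachable cluster with a running MDS; compile with -lcephfs.
 * "ceph.dir.rbytes" = recursive byte count of everything under the dir. */
#include <cephfs/libcephfs.h>
#include <stdio.h>

int main(void)
{
    struct ceph_mount_info *cmount;
    char buf[64] = {0};
    int ret;

    ceph_create(&cmount, NULL);           /* default client id       */
    ceph_conf_read_file(cmount, NULL);    /* /etc/ceph/ceph.conf      */
    if (ceph_mount(cmount, "/") < 0)
        return 1;

    /* the MDS answers this from its recursive stats; no tree walk */
    ret = ceph_getxattr(cmount, "/pomceph", "ceph.dir.rbytes",
                        buf, sizeof(buf) - 1);
    if (ret >= 0)
        printf("recursive size of /pomceph: %s bytes\n", buf);

    ceph_unmount(cmount);
    ceph_shutdown(cmount);
    return 0;
}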
snapshots
volume or subvolume snapshots unusable at petabyte scale
snapshot arbitrary subdirectories
simple interface
hidden '.snap' directory
no special tools
$ mkdir foo/.snap/one        # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776           # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one        # remove snapshot
multiple client implementations
Linux kernel client
mount -t ceph 1.2.3.4:/ /mnt
export (NFS), Samba (CIFS)
ceph-fuse
libcephfs.so
your app
Samba (CIFS)
Ganesha (NFS)
Hadoop (map/reduce)
[Diagram: the kernel client and ceph-fuse on one side; your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop each linking libcephfs on the other]
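For the "your app" box, here is a minimal sketch of what using libcephfs directly looks like. It assumes a cluster described by /etc/ceph/ceph.conf with a running MDS; the /demo path and the message are made up, and error handling is mostly omitted.

/* Rough sketch: an application using libcephfs directly (no kernel mount).
 * Compile with -lcephfs. */
#include <cephfs/libcephfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct ceph_mount_info *cmount;
    const char msg[] = "hello cephfs\n";
    int fd;

    ceph_create(&cmount, NULL);
    ceph_conf_read_file(cmount, NULL);
    if (ceph_mount(cmount, "/") < 0) {
        fprintf(stderr, "mount failed\n");
        return 1;
    }

    ceph_mkdir(cmount, "/demo", 0755);                  /* like mkdir(2) */
    fd = ceph_open(cmount, "/demo/hello.txt", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) {
        ceph_write(cmount, fd, msg, strlen(msg), 0);    /* write at offset 0 */
        ceph_close(cmount, fd);
    }

    ceph_unmount(cmount);
    ceph_shutdown(cmount);
    return 0;
}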
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[Architecture diagram: apps sit on LIBRADOS and RADOSGW, hosts/VMs on RBD, clients on CEPH FS, all on top of RADOS]
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
CEPH FS: NEARLY AWESOME
RADOS, LIBRADOS, RBD, RADOSGW: AWESOME
Ceph FS is feature-complete, but it still lacks the testing, quality assurance, and benchmarking work we feel it needs before we can recommend it for production use.
Path forward
Testing
Various workloads
Multiple active MDSs
Test automation
Simple workload generator scripts
Bug reproducers
Hacking
Bug squashing
Long-tail features
Integrations
Ganesha, Samba, *stacks
librados
object model
pools
1s to 100s
independent namespaces or object collections
replication level, placement policy
objects
bazillions
blob of data (bytes to gigabytes)
attributes (e.g., version=12; bytes to kilobytes)
key/value bundle (bytes to gigabytes)
atomic transactions
client operations sent to the OSD cluster
operate on a single object
can contain a sequence of operations, e.g.
truncate object
write new object data
set attribute
atomicity
all operations commit or do not commit atomically
conditional
'guard' operations can control whether the operation is performed
verify xattr has specific value
assert object is a specific version
allows atomic compare-and-swap etc.
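A minimal sketch of such a guarded, atomic compound operation through the librados C write-op API; the object name "myobject", the "version" xattr, and its values are invented for illustration.

/* Rough sketch: a conditional, atomic compound op via the librados C API.
 * Everything below commits only if the "version" xattr is still "12";
 * on a guard failure nothing is applied and operate returns an error.
 * Compile with -lrados. */
#include <rados/librados.h>
#include <string.h>

int update_object(rados_ioctx_t io)
{
    const char oldv[] = "12", newv[] = "13", data[] = "new contents";
    rados_write_op_t op = rados_create_write_op();
    int ret;

    /* guard: only proceed if xattr "version" == "12" */
    rados_write_op_cmpxattr(op, "version", LIBRADOS_CMPXATTR_OP_EQ,
                            oldv, strlen(oldv));
    rados_write_op_truncate(op, 0);                     /* truncate object */
    rados_write_op_write(op, data, strlen(data), 0);    /* write new data  */
    rados_write_op_setxattr(op, "version", newv, strlen(newv)); /* set attr */

    /* the whole sequence commits atomically, or not at all */
    ret = rados_write_op_operate(op, io, "myobject", NULL, 0);
    rados_release_write_op(op);
    return ret;
}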
key/value storage
store key/value pairs in an object
independent from object attrs or byte data payload
based on Google's leveldb
efficient random and range insert/query/removal
based on BigTable SSTable design
exposed via key/value API
insert, update, remove
individual keys or ranges of keys
avoid read/modify/write cycle for updating complex objects
e.g., file system directory objects
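A sketch of the key/value side of the C API, again with invented object and key names: it updates two entries in a directory-like object without rewriting anything else.

/* Rough sketch: key/value (omap) storage in an object via librados.
 * Mirrors how a file system directory object could map names to inode
 * records. Compile with -lrados. */
#include <rados/librados.h>
#include <string.h>

int store_dir_entries(rados_ioctx_t io)
{
    const char *keys[] = { "passwd", "hosts" };          /* entry names      */
    const char *vals[] = { "inode:100", "inode:102" };   /* made-up payloads */
    size_t lens[] = { strlen(vals[0]), strlen(vals[1]) };
    rados_write_op_t op = rados_create_write_op();
    int ret;

    /* insert/update two keys; other keys and the byte payload are untouched */
    rados_write_op_omap_set(op, keys, vals, lens, 2);
    ret = rados_write_op_operate(op, io, "dir.etc", NULL, 0);
    rados_release_write_op(op);
    return ret;
}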
watch/notify
establish stateful 'watch' on an object
client interest persistently registered with object
client keeps session to OSD open
send 'notify' messages to all watchers
notify message (and payload) is distributed to all watchers
variable timeout
notification on completion
all watchers got and acknowledged the notify
use any object as a communication/synchronization channel
locking, distributed coordination (a la ZooKeeper), etc.
[Sequence diagram: clients #1, #2, and #3 each establish a watch on an object at the OSD and get an ack/commit; client #1 sends a notify, the OSD delivers it to every watcher, each watcher acks, and the notifier receives a completion once all acks have arrived]
watch/notify example
radosgw cache consistency
radosgw instances watch a single object (.rgw/notify)
locally cache bucket metadata
on bucket metadata changes (removal, ACL changes)
write change to relevant bucket object
send notify with bucket name to other radosgw instances
on receipt of notify
invalidate relevant portion of cache
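A minimal sketch of that pattern using the original librados C watch/notify calls (later releases add watch2/notify2 variants). The callback body and the is_writer flag are illustrative only; .rgw/notify is the object named above.

/* Rough sketch: cache invalidation over watch/notify with librados.
 * Every gateway watches one well-known object; whoever changes bucket
 * metadata notifies it, and all watchers invalidate their caches.
 * Compile with -lrados. */
#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

/* runs on each watcher when someone notifies the object */
static void invalidate_cb(uint8_t opcode, uint64_t ver, void *arg)
{
    (void)opcode; (void)ver; (void)arg;
    printf("notify received: invalidating local bucket metadata cache\n");
}

int run(rados_ioctx_t io, int is_writer)
{
    uint64_t handle;
    const char msg[] = "bucket changed";

    /* register interest in the shared object */
    rados_watch(io, ".rgw/notify", 0, &handle, invalidate_cb, NULL);

    if (is_writer) {
        /* returns once all watchers have acknowledged (or timed out) */
        rados_notify(io, ".rgw/notify", 0, msg, strlen(msg));
    }

    rados_unwatch(io, ".rgw/notify", handle);
    return 0;
}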
rados classes
dynamically loaded .so
/var/lib/rados-classes/*
implement new object methods using existing methods
part of I/O pipeline
simple internal API
reads
can call existing native or class methods
do whatever processing is appropriate
return data
writes
can call existing native or class methods
do whatever processing is appropriate
generates a resulting transaction to be applied atomically
class examples
grep
read an object, filter out individual records, and return those
sha1
read object, generate fingerprint, return that
images
rotate, resize, crop image stored in object
remove red-eye
crypto
encrypt/decrypt object data with provided key
ideas
distributed key/value table
aggregate many k/v objects into one big 'table'
working prototype exists (thanks, Eleanor!)
ideas
lua rados class
embed lua interpreter in a rados class
ship semi-arbitrary code for operations
json class
parse, manipulate json structures
ideas
rados mailbox (RMB?)
plug librados backend into dovecot, postfix, etc.
key/value object for each mailbox
key = message id
value = headers
object for each message or attachment
watch/notify for delivery notification
hard links?
rare
useful locality properties
intra-directory
parallel inter-directory
on miss, file objects provide per-file backpointers
degenerates to log(n) lookups
optimistic read complexity